├── README.md
├── ReflectionTypeHint.php
├── ReflectionTypeHint_example.php
├── UTF8-CHANGELOG.txt
├── UTF8.php
└── php.ini.error_prepend_string.example

/README.md:
--------------------------------------------------------------------------------
1 | # UTF8 support in PHP5
2 | PHP5 UTF8 is a UTF-8 aware library of functions mirroring PHP's own string functions.
3 | A powerful solution for adding UTF-8 support to your framework or CMS, written in PHP.
4 | This library is an advance on http://sourceforge.net/projects/phputf8 (last updated in 2007).
5 | 
6 | ## Features and benefits
7 | 
8 | 1. Compatibility with the interface of the standard PHP functions that deal with single-byte encodings
9 | 1. Works without the PHP extensions ICONV and MBSTRING; if they are available, they are actively used. The fastest available method is chosen among MBSTRING, ICONV, native PHP code and hacks.
10 | 1. Useful features that are missing from ICONV and MBSTRING
11 | 1. Methods that take and return a string can also take and return null. This is useful for values selected from a database.
12 | 1. Several methods are able to process arrays recursively: `array_change_key_case()`, `convert_from()`, `convert_to()`, `strict()`, `is_utf8()`, `blocks_check()`, `convert_case()`, `lowercase()`, `uppercase()`, `unescape()`
13 | 1. Validation of method parameters against allowed types via reflection (you can disable it)
14 | 1. A single interface and encapsulation; you can inherit from the class and override methods
15 | 1. Test coverage
16 | 1. PHP >= 5.3.x
17 | 
18 | Example:
19 | 
20 |     $s = 'Hello, Привет';
21 |     if (UTF8::is_utf8($s)) echo UTF8::strlen($s);
22 | 
23 | ## Standard PHP functions implemented for UTF-8 encoded strings
24 | 
25 | ### Alphabetical order list
26 | 
27 | 1. `array_change_key_case()`
28 | 1. `chr()` — Converts a UNICODE codepoint to a UTF-8 character
29 | 1. `chunk_split()`
30 | 1. `ltrim()`
31 | 1. `ord()` — Converts a UTF-8 character to a UNICODE codepoint
32 | 1. 
`preg_match_all()` — Calls `preg_match_all()` and converts byte offsets into character offsets for the `PREG_OFFSET_CAPTURE` flag. This happens regardless of whether you use the `/u` modifier.
33 | 1. `range()`
34 | 1. `rtrim()`
35 | 1. `str_pad()`
36 | 1. `str_split()`
37 | 1. `strcasecmp()`
38 | 1. `strcmp()`
39 | 1. `stripos()`
40 | 1. `strlen()`
41 | 1. `strncmp()`
42 | 1. `strpos()`
43 | 1. `strrev()`
44 | 1. `strspn()`
45 | 1. `strtolower()`, `lowercase()` is an alias
46 | 1. `strtoupper()`, `uppercase()` is an alias
47 | 1. `strtr()`
48 | 1. `substr()`
49 | 1. `substr_replace()`
50 | 1. `trim()`
51 | 1. `ucfirst()`
52 | 1. `ucwords()`
53 | 
54 | ## Extra useful functions for UTF-8 encoded strings
55 | 
56 | ### Alphabetical order list:
57 | 
58 | 1. `blocks_check()` — Checks that UTF-8 data falls within the given ranges of the UNICODE standard. A suitable alternative to regular expressions.
59 | 1. `convert_case()` — Converts the letter case of UTF-8 encoded data. Arrays are traversed recursively; only the values of array elements are converted, and the keys are left unchanged.
60 | 1. `convert_files_from()` — Recodes the text files in a specified folder to UTF-8. Binary files, files already encoded in UTF-8, and files that could not be converted are skipped.
61 | 1. `convert_from()` — Encodes data from another character encoding to UTF-8.
62 | 1. `convert_to()` — Encodes data from UTF-8 to another character encoding.
63 | 1. `diactrical_remove()` — Removes combining diacritical marks from text, optionally keeping the information needed to restore them.
64 | 1. `diactrical_restore()` — Restores combining diacritical marks removed by `diactrical_remove()`, provided that their character positions and the number of characters have not changed.
65 | 1. `from_unicode()` — Converts UNICODE codepoints to a UTF-8 string
66 | 1. 
`has_binary()` — Checks whether the data contains characters from the ASCII control character class.
67 | 1. `html_entity_decode()` — Converts all HTML entities to native UTF-8 characters
68 | 1. `html_entity_encode()` — Converts special UTF-8 characters to HTML entities.
69 | 1. `is_ascii()` — Checks whether the data belongs to the ASCII character class.
70 | 1. `is_utf8()` — Returns true if data is valid UTF-8 and false otherwise. For null, integer, float, boolean returns TRUE.
71 | 1. `preg_quote_case_insensitive()` — Makes a regular expression for a case-insensitive match
72 | 1. `str_limit()`, `truncate()` — Truncates UTF-8 encoded text to the given length; the last word is shown whole rather than cut off in the middle. HTML entities are handled correctly.
73 | 1. `strict()` — Strips out device control codes in the ASCII range.
74 | 1. `textarea_rows()` — Calculates the height of the text edit area in a `<textarea>` HTML tag from its value and width.
75 | 1. `to_unicode()` — Converts a UTF-8 string to UNICODE codepoints
76 | 1. `unescape()` — Decodes a string to a UTF-8 string from several formats (which can be mixed)
77 | 1. `unescape_request()` — Corrects the global arrays `$_GET`, `$_POST`, `$_COOKIE`, `$_REQUEST`, `$_FILES` by decoding values from the `%XX` format and the extended `%uXXXX` / `%u{XXXXXX}` format produced, for example, by the outdated JavaScript function `escape()`. Standard PHP5 cannot do this. Recodes `$_GET`, `$_POST`, `$_COOKIE`, `$_REQUEST`, `$_FILES` from the `$charset` encoding to UTF-8, if necessary. As a side effect, it protects against XSS attacks that pass non-printable characters to vulnerable PHP functions. Thus web forms can be sent to the server in two encodings: `$charset` and UTF-8. For example: `?тест[тест]=тест`
78 | If `HTTP_COOKIE` contains parameters with the same name, the last value is taken (as in `QUERY_STRING`), not the first. Creates the `$_POST` array for a non-standard Content-Type, for example, `"Content-Type: application/octet-stream"`. 
Standard PHP5 creates an array for `"Content-Type: application/x-www-form-urlencoded"` and `"Content-Type: multipart/form-data"`.
79 | 
80 | Examples of `unescape()`
81 | 
82 |     '%D1%82%D0%B5%D1%81%D1%82' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #binary (regular)
83 |     '0xD182D0B5D181D182' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #binary (compact)
84 |     '%u0442%u0435%u0441%u0442' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #UCS-2 (U+0 — U+FFFF)
85 |     '%u{442}%u{435}%u{0441}%u{00442}' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #UTF-8 (U+0 — U+FFFFFF)
86 | 
87 | Examples of `unescape_request()`
88 | 
89 |     '%F2%E5%F1%F2' => 'тест' #CP1251 (regular)
90 |     '0xF2E5F1F2' => 'тест' #CP1251 (compact)
91 |     '%D1%82%D0%B5%D1%81%D1%82' => 'тест' #UTF-8 (regular)
92 |     '0xD182D0B5D181D182' => 'тест' #UTF-8 (compact)
93 |     '%u0442%u0435%u0441%u0442' => 'тест' #UCS-2 (U+0 — U+FFFF)
94 |     '%u{442}%u{435}%u{0441}%u{00442}' => 'тест' #UTF-8 (U+0 — U+FFFFFF)
95 | 
96 | # UTF8 support in PHP5
97 | 
98 | ## Features and benefits
99 | 
100 | 1. Compatibility with the interface of the standard PHP functions that deal with single-byte encodings
101 | 1. Works without the PHP extensions ICONV and MBSTRING; if they are available, they are actively used. The fastest available method is chosen among MBSTRING, ICONV, a native PHP implementation and hacks.
102 | 1. Useful features that are missing from ICONV and MBSTRING
103 | 1. Methods that take and return a string can also take and return null. This is convenient for values selected from a database.
104 | 1. Several methods are able to process arrays recursively: `array_change_key_case()`, `convert_from()`, `convert_to()`, `strict()`, `is_utf8()`, `blocks_check()`, `convert_case()`, `lowercase()`, `uppercase()`, `unescape()`
105 | 1. Validation of method input parameters against allowed types via reflection (can be disabled)
106 | 1. A single interface and encapsulation; you can inherit from the class and override methods
107 | 1. Test coverage
108 | 1. 
PHP >= 5.3.x 109 | 110 | Example: 111 | 112 | $s = 'Hello, Привет'; 113 | if (UTF8::is_utf8($s)) echo UTF8::strlen($s); 114 | 115 | Project was exported from http://code.google.com/p/php5-utf8 116 | -------------------------------------------------------------------------------- /ReflectionTypeHint.php: -------------------------------------------------------------------------------- 1 | 'is_int', 48 | 'integer' => 'is_int', 49 | 'digit' => 'ctype_digit', 50 | 'number' => 'ctype_digit', 51 | 'float' => 'is_float', 52 | 'double' => 'is_float', 53 | 'real' => 'is_float', 54 | 'numeric' => 'is_numeric', 55 | 'str' => 'is_string', 56 | 'string' => 'is_string', 57 | 'char' => 'is_string', 58 | 'bool' => 'is_bool', 59 | 'boolean' => 'is_bool', 60 | 'null' => 'is_null', 61 | 'array' => 'is_array', 62 | 'obj' => 'is_object', 63 | 'object' => 'is_object', 64 | 'res' => 'is_resource', 65 | 'resource' => 'is_resource', 66 | 'scalar' => 'is_scalar', #integer, float, string or boolean 67 | 'cb' => 'is_callable', 68 | 'callback' => 'is_callable', 69 | ); 70 | 71 | #calling the methods of this class only statically! 72 | private function __construct() {} 73 | 74 | public static function isValid() 75 | { 76 | if (! assert_options(ASSERT_ACTIVE)) return true; 77 | $bt = self::debugBacktrace(null, 1); 78 | extract($bt); //to $file, $line, $function, $class, $object, $type, $args 79 | if (! $args) return true; #speed improve 80 | $r = new ReflectionMethod($class, $function); 81 | $doc = $r->getDocComment(); 82 | $cache_id = $class. $type. 
$function; 83 | preg_match_all('~ [\r\n]++ [\x20\t]++ \* [\x20\t]++ 84 | @param 85 | [\x20\t]++ 86 | \K #memory reduce 87 | ( [_a-z]++[_a-z\d]*+ 88 | (?>[|/,][_a-z]+[_a-z\d]*)*+ 89 | ) #1 types 90 | [\x20\t]++ 91 | &?+\$([_a-z]++[_a-z\d]*+) #2 name 92 | ~sixSX', $doc, $params, PREG_SET_ORDER); 93 | $parameters = $r->getParameters(); 94 | //d($args, $params, $parameters); 95 | if (count($parameters) > count($params)) 96 | { 97 | $message = 'phpDoc %d piece(s) @param description expected in %s%s%s(), %s given, ' . PHP_EOL 98 | . 'called in %s on line %d ' . PHP_EOL 99 | . 'and defined in %s on line %d'; 100 | $message = sprintf($message, count($parameters), $class, $type, $function, count($params), $file, $line, $r->getFileName(), $r->getStartLine()); 101 | trigger_error($message, E_USER_NOTICE); 102 | } 103 | foreach ($args as $i => $value) 104 | { 105 | if (! isset($params[$i])) return true; 106 | if ($parameters[$i]->name !== $params[$i][2]) 107 | { 108 | $param_num = $i + 1; 109 | $message = 'phpDoc @param %d in %s%s%s() must be named as $%s, $%s given, ' . PHP_EOL 110 | . 'called in %s on line %d ' . PHP_EOL 111 | . 'and defined in %s on line %d'; 112 | $message = sprintf($message, $param_num, $class, $type, $function, $parameters[$i]->name, $params[$i][2], $file, $line, $r->getFileName(), $r->getStartLine()); 113 | trigger_error($message, E_USER_NOTICE); 114 | } 115 | 116 | $hints = preg_split('~[|/,]~sSX', $params[$i][1]); 117 | if (! self::checkValueTypes($hints, $value)) 118 | { 119 | $param_num = $i + 1; 120 | $message = 'Argument %d passed to %s%s%s() must be an %s, %s given, ' . PHP_EOL 121 | . 'called in %s on line %d ' . PHP_EOL 122 | . 'and defined in %s on line %d'; 123 | $message = sprintf($message, $param_num, $class, $type, $function, implode('|', $hints), (is_object($value) ? get_class($value) . ' ' : '') . 
gettype($value), $file, $line, $r->getFileName(), $r->getStartLine()); 124 | trigger_error($message, E_USER_WARNING); 125 | return false; 126 | } 127 | } 128 | return true; 129 | } 130 | 131 | /** 132 | * Return stacktrace. Correctly work with call_user_func*() 133 | * (totally skip them correcting caller references). 134 | * If $return_frame is present, return only $return_frame matched caller, not all stacktrace. 135 | * 136 | * @param string|null $re_ignore example: '~^' . preg_quote(__CLASS__, '~') . '(?![a-zA-Z\d])~sSX' 137 | * @param int|null $return_frame 138 | * @return array 139 | */ 140 | public static function debugBacktrace($re_ignore = null, $return_frame = null) 141 | { 142 | $trace = debug_backtrace(); 143 | 144 | $a = array(); 145 | $frames = 0; 146 | for ($i = 0, $n = count($trace); $i < $n; $i++) 147 | { 148 | $t = $trace[$i]; 149 | if (! $t) continue; 150 | 151 | // Next frame. 152 | $next = isset($trace[$i+1])? $trace[$i+1] : null; 153 | 154 | // Dummy frame before call_user_func*() frames. 155 | if (! isset($t['file']) && $next) 156 | { 157 | $t['over_function'] = $trace[$i+1]['function']; 158 | $t = $t + $trace[$i+1]; 159 | $trace[$i+1] = null; // skip call_user_func on next iteration 160 | } 161 | 162 | // Skip myself frame. 163 | if (++$frames < 2) continue; 164 | 165 | // 'class' and 'function' field of next frame define where this frame function situated. 166 | // Skip frames for functions situated in ignored places. 167 | if ($re_ignore && $next) 168 | { 169 | // Name of function "inside which" frame was generated. 170 | $frame_caller = (isset($next['class']) ? $next['class'] . $next['type'] : '') 171 | . (isset($next['function']) ? $next['function'] : ''); 172 | if (preg_match($re_ignore, $frame_caller)) continue; 173 | } 174 | 175 | // On each iteration we consider ability to add PREVIOUS frame to $a stack. 
176 | if (count($a) === $return_frame) return $t; 177 | $a[] = $t; 178 | } 179 | return $a; 180 | } 181 | 182 | /** 183 | * Checks a value to the allowed types 184 | * 185 | * @param array $types 186 | * @param mixed $value 187 | * @return bool 188 | */ 189 | public static function checkValueTypes(array $types, $value) 190 | { 191 | foreach ($types as $type) 192 | { 193 | $type = strtolower($type); 194 | if (array_key_exists($type, self::$hints) && call_user_func(self::$hints[$type], $value)) return true; 195 | if (is_object($value) && @is_a($value, $type)) return true; 196 | if ($type === 'mixed') return true; 197 | } 198 | return false; 199 | } 200 | } -------------------------------------------------------------------------------- /ReflectionTypeHint_example.php: -------------------------------------------------------------------------------- 1 | myMethod('sss', 75467, new Exception(), true); 24 | -------------------------------------------------------------------------------- /UTF8-CHANGELOG.txt: -------------------------------------------------------------------------------- 1 | 2.3.1 / 2012-03-11 2 | 3 | * UTF8::QUOTATION_MARK_RE new constant added 4 | * UTF8::$html_quotation_mark_table added 5 | * UTF8::ucfirst() improved 6 | * UTF8::ucwords() improved 7 | * UTF8::convert_files_from() improved 8 | * UTF8::array_change_key_case() recursive support added 9 | * UTF8::html_entity_encode() binary support 10 | * UTF8::html_entity_decode() ' entity support 11 | * UTF8::str_limit() syntax error: "preg_relace" instead of "preg_replace" 12 | * Small bugs fixed 13 | 14 | 2.3.0 / 2011-10-06 15 | 16 | * Constants BOM, CHAR_UPPER_RE, CHAR_LOWER_RE, HTML_ENTITY_RE added 17 | * UTF8::has_binary() - new method added 18 | * UTF8::strict() - recursive support added 19 | * UTF8::$char_re renamed to constant CHAR_RE, 20 | UTF8::$diactrical_re renamed to constant DIACTRICAL_RE 21 | * UTF8::unescape_request() - improved, $charset parameter added 22 | * UTF8::unescape() - improved 
and interface changed from
23 |   ($data, $is_rawurlencode = false) to ($data, $is_hex2bin = false, $is_urldecode = true)
24 | * UTF8::autoconvert_request() removed, use UTF8::unescape_request() instead
25 | * UTF8::is_ascii() - recursive support removed (it was ambiguous),
26 |   second parameter added, for non string/int/float always returns FALSE
27 | * UTF8::blocks_check() - for non string/int/float always returns FALSE
28 | * UTF8::str_limit() - small internal improvements
29 | * UTF8::preg_quote_case_insensitive() - speed improved
30 | 
31 | 2.2.2 / 2011-06-24
32 | 
33 | * Case conversion functions improved: native support for converting from all Russian charsets to UTF8 was added
34 | * UTF8::stripos() speed improved
35 | * Constant REPLACEMENT_CHAR added
36 | 
37 | 2.2.1 / 2011-06-08
38 | 
39 | * UTF8::preg_quote_case_insensitive() added
40 | * UTF8::stripos() speed improved
41 | 
42 | 2.2.0 / 2011-06-06
43 | 
44 | * UTF8::strlen(), UTF8::substr(), UTF8::strpos(),
45 |   UTF8::html_entity_encode(), UTF8::html_entity_decode(),
46 |   UTF8::convert_case(), UTF8::lowercase(), UTF8::uppercase() speed improved
47 | * UTF8::stripos(), UTF8::to_unicode(), UTF8::from_unicode() added
48 | * UTF8::strtolower(), UTF8::strtoupper() added as wrappers around UTF8::convert_case()
49 | * Unicode character database updated to 6.0.0 (2010-06-04)
50 | * UTF8::$convert_case_table improved
51 | 
52 | 2.1.3 / 2011-05-31
53 | 
54 | * UTF8::truncate() small bug fixed
55 | 
56 | 2.1.2 / 2011-03-25
57 | 
58 | * The class requires PHP 5.3.x
59 | * UTF8::$char_re deprecated
60 | * Added the UTF8::tests() method, which tests the class methods for correct operation
61 | * Added the methods UTF8::strcmp(), UTF8::strncmp(), UTF8::strcasecmp()
62 | * UTF8::is_utf8(), UTF8::str_limit(), UTF8::str_split() speed improved
63 | * Added a 2nd parameter to UTF8::html_entity_encode()
64 | * Added a 3rd parameter to UTF8::ucwords()
65 | * The methods UTF8::convert_case(), UTF8::lowercase(), UTF8::uppercase() can accept an array as the 1st parameter
66 | * 
Minor improvements in UTF8::strtr()
67 | * The ReflectionTypeHint class was modernized
68 | 
69 | 2.1.1 / 2010-07-19
70 | 
71 | * Added the methods array_change_key_case(), range(), strtr()
72 | * Improved the convert_files_from() method
73 | * Unicode Character Database 5.2.0
74 | * Fixed bugs in trim(), ltrim(), rtrim(), str_pad() that could occur in some cases
75 | 
76 | 2.1.0 / 2010-03-26
77 | 
78 | * Removed the unescape_recursive() method
79 | * Added the convert_files_from() method
80 | * Several methods can now accept an array and traverse it recursively
81 | * Almost all string-processing methods can accept and return NULL
82 | 
83 | 2.0.2 / 2010-02-13
84 | 
85 | * New methods is_ascii(), ltrim(), rtrim(), trim(), str_pad(), strspn()
86 | * Fixed a small bug in str_limit()
87 | * Fixed a bug in the convert_from() and convert_to() methods: they mistakenly returned FALSE
88 |   when given an array containing boolean elements with the value FALSE
89 | 
90 | 2.0.1 / 2010-02-08
91 | 
92 | * Removed the convert_from_cp1259() method, use convert_from('cp1251')
93 | * The convert_from_utf16() method is now private, use convert_from('UTF-16')
94 | * Added the methods convert_to(), diactrical_remove(), diactrical_restore()
95 | * Other minor fixes
96 | 
--------------------------------------------------------------------------------
/UTF8.php:
--------------------------------------------------------------------------------
1 | = 5.3.x
22 | *
23 | * In Russian:
24 | *
25 | * Поддержка UTF-8 в PHP 5.
26 | *
27 | * Возможности и преимущества
28 | * * Совместимость с интерфейсом стандартных PHP функций, работающих с однобайтовыми кодировками
29 | * * Возможность работы без PHP расширений ICONV и MBSTRING, если они есть, то активно используются!
30 | * Используется наиболее быстрый из доступных методов между MBSTRING, ICONV, родной реализацией на PHP и хаками.
31 | * * Полезные функции, отсутствующие в ICONV и MBSTRING 32 | * * Методы, которые принимают и возвращают строку, умеют принимать и возвращать null. 33 | * Это удобно при выборках значений из базы данных. 34 | * * Несколько методов умеют обрабатывать массивы рекурсивно: 35 | * array_change_key_case(), convert_from(), convert_to(), strict(), is_utf8(), blocks_check(), convert_case(), lowercase(), uppercase(), unescape() 36 | * * Проверка у методов входных параметров на допустимые типы через рефлексию (можно отключить) 37 | * * Единый интерфейс и инкапсуляция, можно унаследоваться и переопределить методы 38 | * * Покрытие тестами 39 | * * PHP >= 5.3.x 40 | * 41 | * Example: 42 | * $s = 'Hello, Привет'; 43 | * if (UTF8::is_utf8($s)) echo UTF8::strlen($s); 44 | * 45 | * UTF-8 encoding scheme: 46 | * 2^7 0x00000000 — 0x0000007F 0xxxxxxx 47 | * 2^11 0x00000080 — 0x000007FF 110xxxxx 10xxxxxx 48 | * 2^16 0x00000800 — 0x0000FFFF 1110xxxx 10xxxxxx 10xxxxxx 49 | * 2^21 0x00010000 — 0x001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 50 | * 1-4 bytes length: 2^7 + 2^11 + 2^16 + 2^21 = 2 164 864 51 | * 52 | * If I was a owner of the world, I would leave only 2 encoding: UTF-8 and UTF-32 ;-) 53 | * 54 | * Useful links 55 | * http://ru.wikipedia.org/wiki/UTF8 56 | * http://www.madore.org/~david/misc/unitest/ A Unicode Test Page 57 | * http://www.unicode.org/ 58 | * http://www.unicode.org/reports/ 59 | * http://www.unicode.org/reports/tr10/ Unicode Collation Algorithm 60 | * http://www.unicode.org/Public/UCA/6.0.0/ Unicode Collation Algorithm 61 | * http://www.unicode.org/reports/tr6/ A Standard Compression Scheme for Unicode 62 | * http://www.fileformat.info/info/unicode/char/search.htm Unicode Character Search 63 | * 64 | * @link http://code.google.com/p/php5-utf8/ 65 | * @license http://creativecommons.org/licenses/by-sa/3.0/ 66 | * @author Nasibullin Rinat 67 | * @version 2.3.1 68 | */ 69 | class UTF8 70 | { 71 | /** 72 | * REPLACEMENT CHARACTER (for broken char) 73 | * 74 | * 
@var string 75 | */ 76 | const REPLACEMENT_CHAR = "\xEF\xBF\xBD"; #U+FFFD 77 | 78 | /** 79 | * Byte order mark, http://en.wikipedia.org/wiki/Byte_Order_Mark 80 | * 81 | * @var string 82 | */ 83 | const BOM = "\xEF\xBB\xBF"; 84 | 85 | /** 86 | * Regular expression for a character in UTF-8. 87 | * For engines, which don't support UTF8 mode. 88 | * In PCRE use a dot (".") and the flag /u, it works much faster! 89 | * 90 | * @var string 91 | */ 92 | const CHAR_RE = 93 | '[\x09\x0A\x0D\x20-\x7E] # ASCII strict 94 | # [\x00-\x7F] # ASCII non-strict (including control chars) 95 | | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte 96 | | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs 97 | | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte 98 | | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates 99 | | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3 100 | | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15 101 | | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16 102 | '; 103 | 104 | /** 105 | * Combining diactrical marks (Unicode 5.1). 106 | * \p{M} in PCRE terms. 107 | * For engines, which don't support UTF8 mode. 108 | * 109 | * For example, russian letters in composed form: "Ё" (U+0401), "Й" (U+0419), 110 | * decomposed form: (U+0415 U+0308), (U+0418 U+0306) 111 | * 112 | * @link http://www.unicode.org/charts/PDF/U0300.pdf 113 | * @link http://www.unicode.org/charts/PDF/U1DC0.pdf 114 | * @link http://www.unicode.org/charts/PDF/UFE20.pdf 115 | * @var string 116 | */ 117 | const DIACTRICAL_RE = 118 | ' \xcc[\x80-\xb9]|\xcd[\x80-\xaf] #UNICODE range: U+0300 — U+036F (for letters) 119 | | \xe2\x83[\x90-\xbf] #UNICODE range: U+20D0 — U+20FF (for symbols) 120 | | \xe1\xb7[\x80-\xbf] #UNICODE range: U+1DC0 — U+1DFF (supplement) 121 | | \xef\xb8[\xa0-\xaf] #UNICODE range: U+FE20 — U+FE2F (combining half marks) 122 | '; 123 | 124 | /** 125 | * \p{Lu} in PCRE terms. 126 | * For engines, which don't support UTF8 mode. 
127 | * 128 | * @var string 129 | */ 130 | const CHAR_UPPER_RE = '[\x41-\x5a] 131 | | \xc3[\x80-\x9e] 132 | | \xc4[\x80-\xbf] 133 | | \xc5[\x81-\xbd] 134 | | \xc6[\x81-\xbc] 135 | | \xc7[\x85-\xbe] 136 | | \xc8[\x80-\xb2] 137 | | \xce[\x86-\xab] 138 | | \xcf[\x98-\xae] 139 | | \xd0[\x80-\xaf] 140 | | \xd1[\xa0-\xbe] 141 | | \xd2[\x80-\xbe] 142 | | \xd3[\x81-\xb8] 143 | | \xd4[\x80-\xbf] 144 | | \xd5[\x80-\x96] 145 | | \xe1[\xb8\xb9\xba][\x80-\xbe] 146 | | \xe1\xbb[\x80-\xb8] 147 | | \xe1\xbc[\x88-\xbf] 148 | | \xe1\xbd[\x88-\xaf] 149 | | \xe1[\xbe\xbf][\x88-\xbc] 150 | | \xef\xbc[\xa1-\xba] 151 | '; 152 | 153 | /** 154 | * \p{Ll} in PCRE terms. 155 | * For engines, which don't support UTF8 mode. 156 | * 157 | * @var string 158 | */ 159 | const CHAR_LOWER_RE = '[\x61-\x7a] 160 | | \xc2\xb5 161 | | \xc3[\xa0-\xbf] 162 | | \xc4[\x81-\xbe] 163 | | \xc5[\x80-\xbe] 164 | | \xc6[\x83-\xbf] 165 | | \xc7[\x86-\xbf] 166 | | \xc8[\x81-\xb3] 167 | | \xc9[\x93-\xb5] 168 | | \xca[\x80-\x92] 169 | | \xce[\xac-\xbf] 170 | | \xcf[\x80-\xaf] 171 | | \xd0[\xb0-\xbf] 172 | | \xd1[\x80-\xbf] 173 | | \xd2[\x81-\xbf] 174 | | \xd3[\x82-\xb9] 175 | | \xd4[\x81-\x8f] 176 | | \xd5[\xa1-\xbf] 177 | | \xd6[\x80-\x86] 178 | | \xe1[\xb8\xb9\xba][\x81-\xbf] 179 | | \xe1\xbb[\x81-\xb9] 180 | | \xe1\xbc[\x80-\xb7] 181 | | \xe1\xbd[\x80-\xbd] 182 | | \xe1\xbe[\x80-\xb3] 183 | | \xe1\xbf[\x83-\xb3] 184 | | \xef\xbd[\x81-\x9a] 185 | '; 186 | 187 | /** 188 | * HTML entities, examples: > Ö ˜ " 189 | * 190 | * @var string 191 | */ 192 | const HTML_ENTITY_RE = '&(?> [a-zA-Z][a-zA-Z\d]++ 193 | | \#(?> \d{1,4}+ 194 | | x[\da-fA-F]{2,4}+ 195 | ) 196 | ); 197 | '; 198 | 199 | /** 200 | * Quotation marks. 201 | * For engines, which don't support UTF8 mode. 
202 | * 203 | * @var string 204 | */ 205 | const QUOTATION_MARK_RE = '\x22|\xc2[\xab\xbb]|\xe2\x80[\x98\x99\x9a\x9c\x9d\x9e\xb9\xba]'; 206 | 207 | /** 208 | * 209 | * @var array 210 | */ 211 | public static $html_quotation_mark_table = array( 212 | '"' => "\x22", #U+0022 ["] " quotation mark = APL quote 213 | '«' => "\xc2\xab", #U+00AB [«] left-pointing double angle quotation mark = left pointing guillemet 214 | '»' => "\xc2\xbb", #U+00BB [»] right-pointing double angle quotation mark = right pointing guillemet 215 | '‘' => "\xe2\x80\x98", #U+2018 [‘] left single quotation mark 216 | '’' => "\xe2\x80\x99", #U+2019 [’] right single quotation mark (and apostrophe!) 217 | '‚' => "\xe2\x80\x9a", #U+201A [‚] single low-9 quotation mark 218 | '“' => "\xe2\x80\x9c", #U+201C [“] left double quotation mark 219 | '”' => "\xe2\x80\x9d", #U+201D [”] right double quotation mark 220 | '„' => "\xe2\x80\x9e", #U+201E [„] double low-9 quotation mark 221 | '‹' => "\xe2\x80\xb9", #U+2039 [‹] single left-pointing angle quotation mark 222 | '›' => "\xe2\x80\xba", #U+203A [›] single right-pointing angle quotation mark 223 | ); 224 | 225 | /** 226 | * HTML special chars table 227 | * 228 | * @var array 229 | */ 230 | public static $html_special_chars_table = array( 231 | '"' => "\x22", #U+0022 ["] " quotation mark = APL quote 232 | '&' => "\x26", #U+0026 [&] & ampersand 233 | '<' => "\x3c", #U+003C [<] < less-than sign 234 | '>' => "\x3e", #U+003E [>] > greater-than sign 235 | #' entity is only available in XHTML/HTML5 and not in plain HTML, see http://www.w3.org/TR/xhtml1/#C_16 236 | #''' => "\x27", #U+0027 ['] ' apostrophe 237 | ); 238 | 239 | /** 240 | * @link http://www.fileformat.info/format/w3c/entitytest.htm?sort=Unicode%20Character HTML Entity Browser Test Page 241 | * @var array 242 | */ 243 | public static $html_entity_table = array( 244 | #Latin-1 Entities: 245 | ' ' => "\xc2\xa0", #U+00A0 [ ] no-break space = non-breaking space 246 | '¡' => "\xc2\xa1", #U+00A1 [¡] inverted 
exclamation mark 247 | '¢' => "\xc2\xa2", #U+00A2 [¢] cent sign 248 | '£' => "\xc2\xa3", #U+00A3 [£] pound sign 249 | '¤' => "\xc2\xa4", #U+00A4 [¤] currency sign 250 | '¥' => "\xc2\xa5", #U+00A5 [¥] yen sign = yuan sign 251 | '¦' => "\xc2\xa6", #U+00A6 [¦] broken bar = broken vertical bar 252 | '§' => "\xc2\xa7", #U+00A7 [§] section sign 253 | '¨' => "\xc2\xa8", #U+00A8 [¨] diaeresis = spacing diaeresis 254 | '©' => "\xc2\xa9", #U+00A9 [©] copyright sign 255 | 'ª' => "\xc2\xaa", #U+00AA [ª] feminine ordinal indicator 256 | '«' => "\xc2\xab", #U+00AB [«] left-pointing double angle quotation mark = left pointing guillemet 257 | '¬' => "\xc2\xac", #U+00AC [¬] not sign 258 | '­' => "\xc2\xad", #U+00AD [ ] soft hyphen = discretionary hyphen 259 | '®' => "\xc2\xae", #U+00AE [®] registered sign = registered trade mark sign 260 | '¯' => "\xc2\xaf", #U+00AF [¯] macron = spacing macron = overline = APL overbar 261 | '°' => "\xc2\xb0", #U+00B0 [°] degree sign 262 | '±' => "\xc2\xb1", #U+00B1 [±] plus-minus sign = plus-or-minus sign 263 | '²' => "\xc2\xb2", #U+00B2 [²] superscript two = superscript digit two = squared 264 | '³' => "\xc2\xb3", #U+00B3 [³] superscript three = superscript digit three = cubed 265 | '´' => "\xc2\xb4", #U+00B4 [´] acute accent = spacing acute 266 | 'µ' => "\xc2\xb5", #U+00B5 [µ] micro sign 267 | '¶' => "\xc2\xb6", #U+00B6 [¶] pilcrow sign = paragraph sign 268 | '·' => "\xc2\xb7", #U+00B7 [·] middle dot = Georgian comma = Greek middle dot 269 | '¸' => "\xc2\xb8", #U+00B8 [¸] cedilla = spacing cedilla 270 | '¹' => "\xc2\xb9", #U+00B9 [¹] superscript one = superscript digit one 271 | 'º' => "\xc2\xba", #U+00BA [º] masculine ordinal indicator 272 | '»' => "\xc2\xbb", #U+00BB [»] right-pointing double angle quotation mark = right pointing guillemet 273 | '¼' => "\xc2\xbc", #U+00BC [¼] vulgar fraction one quarter = fraction one quarter 274 | '½' => "\xc2\xbd", #U+00BD [½] vulgar fraction one half = fraction one half 275 | '¾' => "\xc2\xbe", #U+00BE [¾] 
vulgar fraction three quarters = fraction three quarters 276 | '¿' => "\xc2\xbf", #U+00BF [¿] inverted question mark = turned question mark 277 | #Latin capital letter 278 | 'À' => "\xc3\x80", #Latin capital letter A with grave = Latin capital letter A grave 279 | 'Á' => "\xc3\x81", #Latin capital letter A with acute 280 | 'Â' => "\xc3\x82", #Latin capital letter A with circumflex 281 | 'Ã' => "\xc3\x83", #Latin capital letter A with tilde 282 | 'Ä' => "\xc3\x84", #Latin capital letter A with diaeresis 283 | 'Å' => "\xc3\x85", #Latin capital letter A with ring above = Latin capital letter A ring 284 | 'Æ' => "\xc3\x86", #Latin capital letter AE = Latin capital ligature AE 285 | 'Ç' => "\xc3\x87", #Latin capital letter C with cedilla 286 | 'È' => "\xc3\x88", #Latin capital letter E with grave 287 | 'É' => "\xc3\x89", #Latin capital letter E with acute 288 | 'Ê' => "\xc3\x8a", #Latin capital letter E with circumflex 289 | 'Ë' => "\xc3\x8b", #Latin capital letter E with diaeresis 290 | 'Ì' => "\xc3\x8c", #Latin capital letter I with grave 291 | 'Í' => "\xc3\x8d", #Latin capital letter I with acute 292 | 'Î' => "\xc3\x8e", #Latin capital letter I with circumflex 293 | 'Ï' => "\xc3\x8f", #Latin capital letter I with diaeresis 294 | 'Ð' => "\xc3\x90", #Latin capital letter ETH 295 | 'Ñ' => "\xc3\x91", #Latin capital letter N with tilde 296 | 'Ò' => "\xc3\x92", #Latin capital letter O with grave 297 | 'Ó' => "\xc3\x93", #Latin capital letter O with acute 298 | 'Ô' => "\xc3\x94", #Latin capital letter O with circumflex 299 | 'Õ' => "\xc3\x95", #Latin capital letter O with tilde 300 | 'Ö' => "\xc3\x96", #Latin capital letter O with diaeresis 301 | '×' => "\xc3\x97", #U+00D7 [×] multiplication sign 302 | 'Ø' => "\xc3\x98", #Latin capital letter O with stroke = Latin capital letter O slash 303 | 'Ù' => "\xc3\x99", #Latin capital letter U with grave 304 | 'Ú' => "\xc3\x9a", #Latin capital letter U with acute 305 | 'Û' => "\xc3\x9b", #Latin capital letter U with circumflex 306 
| 'Ü' => "\xc3\x9c", #Latin capital letter U with diaeresis 307 | 'Ý' => "\xc3\x9d", #Latin capital letter Y with acute 308 | 'Þ' => "\xc3\x9e", #Latin capital letter THORN 309 | #Latin small letter 310 | 'ß' => "\xc3\x9f", #Latin small letter sharp s = ess-zed 311 | 'à' => "\xc3\xa0", #Latin small letter a with grave = Latin small letter a grave 312 | 'á' => "\xc3\xa1", #Latin small letter a with acute 313 | 'â' => "\xc3\xa2", #Latin small letter a with circumflex 314 | 'ã' => "\xc3\xa3", #Latin small letter a with tilde 315 | 'ä' => "\xc3\xa4", #Latin small letter a with diaeresis 316 | 'å' => "\xc3\xa5", #Latin small letter a with ring above = Latin small letter a ring 317 | 'æ' => "\xc3\xa6", #Latin small letter ae = Latin small ligature ae 318 | 'ç' => "\xc3\xa7", #Latin small letter c with cedilla 319 | 'è' => "\xc3\xa8", #Latin small letter e with grave 320 | 'é' => "\xc3\xa9", #Latin small letter e with acute 321 | 'ê' => "\xc3\xaa", #Latin small letter e with circumflex 322 | 'ë' => "\xc3\xab", #Latin small letter e with diaeresis 323 | 'ì' => "\xc3\xac", #Latin small letter i with grave 324 | 'í' => "\xc3\xad", #Latin small letter i with acute 325 | 'î' => "\xc3\xae", #Latin small letter i with circumflex 326 | 'ï' => "\xc3\xaf", #Latin small letter i with diaeresis 327 | 'ð' => "\xc3\xb0", #Latin small letter eth 328 | 'ñ' => "\xc3\xb1", #Latin small letter n with tilde 329 | 'ò' => "\xc3\xb2", #Latin small letter o with grave 330 | 'ó' => "\xc3\xb3", #Latin small letter o with acute 331 | 'ô' => "\xc3\xb4", #Latin small letter o with circumflex 332 | 'õ' => "\xc3\xb5", #Latin small letter o with tilde 333 | 'ö' => "\xc3\xb6", #Latin small letter o with diaeresis 334 | '÷' => "\xc3\xb7", #U+00F7 [÷] division sign 335 | 'ø' => "\xc3\xb8", #Latin small letter o with stroke = Latin small letter o slash 336 | 'ù' => "\xc3\xb9", #Latin small letter u with grave 337 | 'ú' => "\xc3\xba", #Latin small letter u with acute 338 | 'û' => "\xc3\xbb", #Latin small 
letter u with circumflex 339 | 'ü' => "\xc3\xbc", #Latin small letter u with diaeresis 340 | 'ý' => "\xc3\xbd", #Latin small letter y with acute 341 | 'þ' => "\xc3\xbe", #Latin small letter thorn 342 | 'ÿ' => "\xc3\xbf", #Latin small letter y with diaeresis 343 | #Symbols and Greek Letters: 344 | 'ƒ' => "\xc6\x92", #U+0192 [ƒ] Latin small f with hook = function = florin 345 | 'Α' => "\xce\x91", #Greek capital letter alpha 346 | 'Β' => "\xce\x92", #Greek capital letter beta 347 | 'Γ' => "\xce\x93", #Greek capital letter gamma 348 | 'Δ' => "\xce\x94", #Greek capital letter delta 349 | 'Ε' => "\xce\x95", #Greek capital letter epsilon 350 | 'Ζ' => "\xce\x96", #Greek capital letter zeta 351 | 'Η' => "\xce\x97", #Greek capital letter eta 352 | 'Θ' => "\xce\x98", #Greek capital letter theta 353 | 'Ι' => "\xce\x99", #Greek capital letter iota 354 | 'Κ' => "\xce\x9a", #Greek capital letter kappa 355 | 'Λ' => "\xce\x9b", #Greek capital letter lambda 356 | 'Μ' => "\xce\x9c", #Greek capital letter mu 357 | 'Ν' => "\xce\x9d", #Greek capital letter nu 358 | 'Ξ' => "\xce\x9e", #Greek capital letter xi 359 | 'Ο' => "\xce\x9f", #Greek capital letter omicron 360 | 'Π' => "\xce\xa0", #Greek capital letter pi 361 | 'Ρ' => "\xce\xa1", #Greek capital letter rho 362 | 'Σ' => "\xce\xa3", #Greek capital letter sigma 363 | 'Τ' => "\xce\xa4", #Greek capital letter tau 364 | 'Υ' => "\xce\xa5", #Greek capital letter upsilon 365 | 'Φ' => "\xce\xa6", #Greek capital letter phi 366 | 'Χ' => "\xce\xa7", #Greek capital letter chi 367 | 'Ψ' => "\xce\xa8", #Greek capital letter psi 368 | 'Ω' => "\xce\xa9", #Greek capital letter omega 369 | 'α' => "\xce\xb1", #Greek small letter alpha 370 | 'β' => "\xce\xb2", #Greek small letter beta 371 | 'γ' => "\xce\xb3", #Greek small letter gamma 372 | 'δ' => "\xce\xb4", #Greek small letter delta 373 | 'ε' => "\xce\xb5", #Greek small letter epsilon 374 | 'ζ' => "\xce\xb6", #Greek small letter zeta 375 | 'η' => "\xce\xb7", #Greek small letter eta 376 | 'θ' => 
"\xce\xb8", #Greek small letter theta 377 | 'ι' => "\xce\xb9", #Greek small letter iota 378 | 'κ' => "\xce\xba", #Greek small letter kappa 379 | 'λ' => "\xce\xbb", #Greek small letter lambda 380 | 'μ' => "\xce\xbc", #Greek small letter mu 381 | 'ν' => "\xce\xbd", #Greek small letter nu 382 | 'ξ' => "\xce\xbe", #Greek small letter xi 383 | 'ο' => "\xce\xbf", #Greek small letter omicron 384 | 'π' => "\xcf\x80", #Greek small letter pi 385 | 'ρ' => "\xcf\x81", #Greek small letter rho 386 | 'ς' => "\xcf\x82", #Greek small letter final sigma 387 | 'σ' => "\xcf\x83", #Greek small letter sigma 388 | 'τ' => "\xcf\x84", #Greek small letter tau 389 | 'υ' => "\xcf\x85", #Greek small letter upsilon 390 | 'φ' => "\xcf\x86", #Greek small letter phi 391 | 'χ' => "\xcf\x87", #Greek small letter chi 392 | 'ψ' => "\xcf\x88", #Greek small letter psi 393 | 'ω' => "\xcf\x89", #Greek small letter omega 394 | 'ϑ'=> "\xcf\x91", #Greek small letter theta symbol 395 | 'ϒ' => "\xcf\x92", #Greek upsilon with hook symbol 396 | 'ϖ' => "\xcf\x96", #U+03D6 [ϖ] Greek pi symbol 397 | 398 | '•' => "\xe2\x80\xa2", #U+2022 [•] bullet = black small circle 399 | '…' => "\xe2\x80\xa6", #U+2026 […] horizontal ellipsis = three dot leader 400 | '′' => "\xe2\x80\xb2", #U+2032 [′] prime = minutes = feet (для обозначения минут и футов) 401 | '″' => "\xe2\x80\xb3", #U+2033 [″] double prime = seconds = inches (для обозначения секунд и дюймов). 
402 | '‾' => "\xe2\x80\xbe", #U+203E [‾] overline = spacing overscore 403 | '⁄' => "\xe2\x81\x84", #U+2044 [⁄] fraction slash 404 | '℘' => "\xe2\x84\x98", #U+2118 [℘] script capital P = power set = Weierstrass p 405 | 'ℑ' => "\xe2\x84\x91", #U+2111 [ℑ] blackletter capital I = imaginary part 406 | 'ℜ' => "\xe2\x84\x9c", #U+211C [ℜ] blackletter capital R = real part symbol 407 | '™' => "\xe2\x84\xa2", #U+2122 [™] trade mark sign 408 | 'ℵ' => "\xe2\x84\xb5", #U+2135 [ℵ] alef symbol = first transfinite cardinal 409 | '←' => "\xe2\x86\x90", #U+2190 [←] leftwards arrow 410 | '↑' => "\xe2\x86\x91", #U+2191 [↑] upwards arrow 411 | '→' => "\xe2\x86\x92", #U+2192 [→] rightwards arrow 412 | '↓' => "\xe2\x86\x93", #U+2193 [↓] downwards arrow 413 | '↔' => "\xe2\x86\x94", #U+2194 [↔] left right arrow 414 | '↵' => "\xe2\x86\xb5", #U+21B5 [↵] downwards arrow with corner leftwards = carriage return 415 | '⇐' => "\xe2\x87\x90", #U+21D0 [⇐] leftwards double arrow 416 | '⇑' => "\xe2\x87\x91", #U+21D1 [⇑] upwards double arrow 417 | '⇒' => "\xe2\x87\x92", #U+21D2 [⇒] rightwards double arrow 418 | '⇓' => "\xe2\x87\x93", #U+21D3 [⇓] downwards double arrow 419 | '⇔' => "\xe2\x87\x94", #U+21D4 [⇔] left right double arrow 420 | '∀' => "\xe2\x88\x80", #U+2200 [∀] for all 421 | '∂' => "\xe2\x88\x82", #U+2202 [∂] partial differential 422 | '∃' => "\xe2\x88\x83", #U+2203 [∃] there exists 423 | '∅' => "\xe2\x88\x85", #U+2205 [∅] empty set = null set = diameter 424 | '∇' => "\xe2\x88\x87", #U+2207 [∇] nabla = backward difference 425 | '∈' => "\xe2\x88\x88", #U+2208 [∈] element of 426 | '∉' => "\xe2\x88\x89", #U+2209 [∉] not an element of 427 | '∋' => "\xe2\x88\x8b", #U+220B [∋] contains as member 428 | '∏' => "\xe2\x88\x8f", #U+220F [∏] n-ary product = product sign 429 | '∑' => "\xe2\x88\x91", #U+2211 [∑] n-ary summation 430 | '−' => "\xe2\x88\x92", #U+2212 [−] minus sign 431 | '∗' => "\xe2\x88\x97", #U+2217 [∗] asterisk operator 432 | '√' => "\xe2\x88\x9a", #U+221A [√] square root = radical sign
433 | '∝' => "\xe2\x88\x9d", #U+221D [∝] proportional to 434 | '∞' => "\xe2\x88\x9e", #U+221E [∞] infinity 435 | '∠' => "\xe2\x88\xa0", #U+2220 [∠] angle 436 | '∧' => "\xe2\x88\xa7", #U+2227 [∧] logical and = wedge 437 | '∨' => "\xe2\x88\xa8", #U+2228 [∨] logical or = vee 438 | '∩' => "\xe2\x88\xa9", #U+2229 [∩] intersection = cap 439 | '∪' => "\xe2\x88\xaa", #U+222A [∪] union = cup 440 | '∫' => "\xe2\x88\xab", #U+222B [∫] integral 441 | '∴' => "\xe2\x88\xb4", #U+2234 [∴] therefore 442 | '∼' => "\xe2\x88\xbc", #U+223C [∼] tilde operator = varies with = similar to 443 | '≅' => "\xe2\x89\x85", #U+2245 [≅] approximately equal to 444 | '≈' => "\xe2\x89\x88", #U+2248 [≈] almost equal to = asymptotic to 445 | '≠' => "\xe2\x89\xa0", #U+2260 [≠] not equal to 446 | '≡' => "\xe2\x89\xa1", #U+2261 [≡] identical to 447 | '≤' => "\xe2\x89\xa4", #U+2264 [≤] less-than or equal to 448 | '≥' => "\xe2\x89\xa5", #U+2265 [≥] greater-than or equal to 449 | '⊂' => "\xe2\x8a\x82", #U+2282 [⊂] subset of 450 | '⊃' => "\xe2\x8a\x83", #U+2283 [⊃] superset of 451 | '⊄' => "\xe2\x8a\x84", #U+2284 [⊄] not a subset of 452 | '⊆' => "\xe2\x8a\x86", #U+2286 [⊆] subset of or equal to 453 | '⊇' => "\xe2\x8a\x87", #U+2287 [⊇] superset of or equal to 454 | '⊕' => "\xe2\x8a\x95", #U+2295 [⊕] circled plus = direct sum 455 | '⊗' => "\xe2\x8a\x97", #U+2297 [⊗] circled times = vector product 456 | '⊥' => "\xe2\x8a\xa5", #U+22A5 [⊥] up tack = orthogonal to = perpendicular 457 | '⋅' => "\xe2\x8b\x85", #U+22C5 [⋅] dot operator 458 | '⌈' => "\xe2\x8c\x88", #U+2308 [⌈] left ceiling = APL upstile 459 | '⌉' => "\xe2\x8c\x89", #U+2309 [⌉] right ceiling 460 | '⌊' => "\xe2\x8c\x8a", #U+230A [⌊] left floor = APL downstile 461 | '⌋' => "\xe2\x8c\x8b", #U+230B [⌋] right floor 462 | '⟨' => "\xe2\x8c\xa9", #U+2329 [〈] left-pointing angle bracket = bra 463 | '⟩' => "\xe2\x8c\xaa", #U+232A [〉] right-pointing angle bracket = ket 464 | '◊' => "\xe2\x97\x8a", #U+25CA [◊] lozenge 465 | '♠' => "\xe2\x99\xa0", #U+2660 [♠] black 
spade suit 466 | '♣' => "\xe2\x99\xa3", #U+2663 [♣] black club suit = shamrock 467 | '♥' => "\xe2\x99\xa5", #U+2665 [♥] black heart suit = valentine 468 | '♦' => "\xe2\x99\xa6", #U+2666 [♦] black diamond suit 469 | #Other Special Characters: 470 | 'Œ' => "\xc5\x92", #U+0152 [Œ] Latin capital ligature OE 471 | 'œ' => "\xc5\x93", #U+0153 [œ] Latin small ligature oe 472 | 'Š' => "\xc5\xa0", #U+0160 [Š] Latin capital letter S with caron 473 | 'š' => "\xc5\xa1", #U+0161 [š] Latin small letter s with caron 474 | 'Ÿ' => "\xc5\xb8", #U+0178 [Ÿ] Latin capital letter Y with diaeresis 475 | 'ˆ' => "\xcb\x86", #U+02C6 [ˆ] modifier letter circumflex accent 476 | '˜' => "\xcb\x9c", #U+02DC [˜] small tilde 477 | ' ' => "\xe2\x80\x82", #U+2002 [ ] en space 478 | ' ' => "\xe2\x80\x83", #U+2003 [ ] em space 479 | ' ' => "\xe2\x80\x89", #U+2009 [ ] thin space 480 | '‌' => "\xe2\x80\x8c", #U+200C [‌] zero width non-joiner 481 | '‍' => "\xe2\x80\x8d", #U+200D [‍] zero width joiner 482 | '‎' => "\xe2\x80\x8e", #U+200E [‎] left-to-right mark 483 | '‏' => "\xe2\x80\x8f", #U+200F [‏] right-to-left mark 484 | '–' => "\xe2\x80\x93", #U+2013 [–] en dash 485 | '—' => "\xe2\x80\x94", #U+2014 [—] em dash 486 | '‘' => "\xe2\x80\x98", #U+2018 [‘] left single quotation mark 487 | '’' => "\xe2\x80\x99", #U+2019 [’] right single quotation mark (and apostrophe!) 
488 | '‚' => "\xe2\x80\x9a", #U+201A [‚] single low-9 quotation mark 489 | '“' => "\xe2\x80\x9c", #U+201C [“] left double quotation mark 490 | '”' => "\xe2\x80\x9d", #U+201D [”] right double quotation mark 491 | '„' => "\xe2\x80\x9e", #U+201E [„] double low-9 quotation mark 492 | '†' => "\xe2\x80\xa0", #U+2020 [†] dagger 493 | '‡' => "\xe2\x80\xa1", #U+2021 [‡] double dagger 494 | '‰' => "\xe2\x80\xb0", #U+2030 [‰] per mille sign 495 | '‹' => "\xe2\x80\xb9", #U+2039 [‹] single left-pointing angle quotation mark 496 | '›' => "\xe2\x80\xba", #U+203A [›] single right-pointing angle quotation mark 497 | '€' => "\xe2\x82\xac", #U+20AC [€] euro sign 498 | ); 499 | 500 | /** 501 | * This table maps cp1259 bytes to their UTF-8 encoded Unicode characters. 502 | * The cp1259 map describes the standard Tatar Cyrillic charset and is based on the cp1251 table. 503 | * cp1259 is an outdated single-byte encoding for the Tatar language 504 | * that includes all the Russian letters from cp1251.
505 | * 506 | * @link http://search.cpan.org/CPAN/authors/id/A/AM/AMICHAUER/Lingua-TT-Yanalif-0.08.tar.gz 507 | * @link http://www.unicode.org/charts/PDF/U0400.pdf 508 | * @var array 509 | */ 510 | public static $cp1259_table = array( 511 | #bytes from 0x00 to 0x7F (ASCII) saved as is 512 | "\x80" => "\xd3\x98", #U+04d8 CYRILLIC CAPITAL LETTER SCHWA 513 | "\x81" => "\xd0\x83", #U+0403 CYRILLIC CAPITAL LETTER GJE 514 | "\x82" => "\xe2\x80\x9a", #U+201a SINGLE LOW-9 QUOTATION MARK 515 | "\x83" => "\xd1\x93", #U+0453 CYRILLIC SMALL LETTER GJE 516 | "\x84" => "\xe2\x80\x9e", #U+201e DOUBLE LOW-9 QUOTATION MARK 517 | "\x85" => "\xe2\x80\xa6", #U+2026 HORIZONTAL ELLIPSIS 518 | "\x86" => "\xe2\x80\xa0", #U+2020 DAGGER 519 | "\x87" => "\xe2\x80\xa1", #U+2021 DOUBLE DAGGER 520 | "\x88" => "\xe2\x82\xac", #U+20ac EURO SIGN 521 | "\x89" => "\xe2\x80\xb0", #U+2030 PER MILLE SIGN 522 | "\x8a" => "\xd3\xa8", #U+04e8 CYRILLIC CAPITAL LETTER BARRED O 523 | "\x8b" => "\xe2\x80\xb9", #U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK 524 | "\x8c" => "\xd2\xae", #U+04ae CYRILLIC CAPITAL LETTER STRAIGHT U 525 | "\x8d" => "\xd2\x96", #U+0496 CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER 526 | "\x8e" => "\xd2\xa2", #U+04a2 CYRILLIC CAPITAL LETTER EN WITH HOOK 527 | "\x8f" => "\xd2\xba", #U+04ba CYRILLIC CAPITAL LETTER SHHA 528 | "\x90" => "\xd3\x99", #U+04d9 CYRILLIC SMALL LETTER SCHWA 529 | "\x91" => "\xe2\x80\x98", #U+2018 LEFT SINGLE QUOTATION MARK 530 | "\x92" => "\xe2\x80\x99", #U+2019 RIGHT SINGLE QUOTATION MARK 531 | "\x93" => "\xe2\x80\x9c", #U+201c LEFT DOUBLE QUOTATION MARK 532 | "\x94" => "\xe2\x80\x9d", #U+201d RIGHT DOUBLE QUOTATION MARK 533 | "\x95" => "\xe2\x80\xa2", #U+2022 BULLET 534 | "\x96" => "\xe2\x80\x93", #U+2013 EN DASH 535 | "\x97" => "\xe2\x80\x94", #U+2014 EM DASH 536 | #"\x98" #UNDEFINED 537 | "\x99" => "\xe2\x84\xa2", #U+2122 TRADE MARK SIGN 538 | "\x9a" => "\xd3\xa9", #U+04e9 CYRILLIC SMALL LETTER BARRED O 539 | "\x9b" => "\xe2\x80\xba", #U+203a SINGLE 
RIGHT-POINTING ANGLE QUOTATION MARK 540 | "\x9c" => "\xd2\xaf", #U+04af CYRILLIC SMALL LETTER STRAIGHT U 541 | "\x9d" => "\xd2\x97", #U+0497 CYRILLIC SMALL LETTER ZHE WITH DESCENDER 542 | "\x9e" => "\xd2\xa3", #U+04a3 CYRILLIC SMALL LETTER EN WITH HOOK 543 | "\x9f" => "\xd2\xbb", #U+04bb CYRILLIC SMALL LETTER SHHA 544 | "\xa0" => "\xc2\xa0", #U+00a0 NO-BREAK SPACE 545 | "\xa1" => "\xd0\x8e", #U+040e CYRILLIC CAPITAL LETTER SHORT U 546 | "\xa2" => "\xd1\x9e", #U+045e CYRILLIC SMALL LETTER SHORT U 547 | "\xa3" => "\xd0\x88", #U+0408 CYRILLIC CAPITAL LETTER JE 548 | "\xa4" => "\xc2\xa4", #U+00a4 CURRENCY SIGN 549 | "\xa5" => "\xd2\x90", #U+0490 CYRILLIC CAPITAL LETTER GHE WITH UPTURN 550 | "\xa6" => "\xc2\xa6", #U+00a6 BROKEN BAR 551 | "\xa7" => "\xc2\xa7", #U+00a7 SECTION SIGN 552 | "\xa8" => "\xd0\x81", #U+0401 CYRILLIC CAPITAL LETTER IO 553 | "\xa9" => "\xc2\xa9", #U+00a9 COPYRIGHT SIGN 554 | "\xaa" => "\xd0\x84", #U+0404 CYRILLIC CAPITAL LETTER UKRAINIAN IE 555 | "\xab" => "\xc2\xab", #U+00ab LEFT-POINTING DOUBLE ANGLE QUOTATION MARK 556 | "\xac" => "\xc2\xac", #U+00ac NOT SIGN 557 | "\xad" => "\xc2\xad", #U+00ad SOFT HYPHEN 558 | "\xae" => "\xc2\xae", #U+00ae REGISTERED SIGN 559 | "\xaf" => "\xd0\x87", #U+0407 CYRILLIC CAPITAL LETTER YI 560 | "\xb0" => "\xc2\xb0", #U+00b0 DEGREE SIGN 561 | "\xb1" => "\xc2\xb1", #U+00b1 PLUS-MINUS SIGN 562 | "\xb2" => "\xd0\x86", #U+0406 CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I 563 | "\xb3" => "\xd1\x96", #U+0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I 564 | "\xb4" => "\xd2\x91", #U+0491 CYRILLIC SMALL LETTER GHE WITH UPTURN 565 | "\xb5" => "\xc2\xb5", #U+00b5 MICRO SIGN 566 | "\xb6" => "\xc2\xb6", #U+00b6 PILCROW SIGN 567 | "\xb7" => "\xc2\xb7", #U+00b7 MIDDLE DOT 568 | "\xb8" => "\xd1\x91", #U+0451 CYRILLIC SMALL LETTER IO 569 | "\xb9" => "\xe2\x84\x96", #U+2116 NUMERO SIGN 570 | "\xba" => "\xd1\x94", #U+0454 CYRILLIC SMALL LETTER UKRAINIAN IE 571 | "\xbb" => "\xc2\xbb", #U+00bb RIGHT-POINTING DOUBLE ANGLE 
QUOTATION MARK 572 | "\xbc" => "\xd1\x98", #U+0458 CYRILLIC SMALL LETTER JE 573 | "\xbd" => "\xd0\x85", #U+0405 CYRILLIC CAPITAL LETTER DZE 574 | "\xbe" => "\xd1\x95", #U+0455 CYRILLIC SMALL LETTER DZE 575 | "\xbf" => "\xd1\x97", #U+0457 CYRILLIC SMALL LETTER YI 576 | "\xc0" => "\xd0\x90", #U+0410 CYRILLIC CAPITAL LETTER A 577 | "\xc1" => "\xd0\x91", #U+0411 CYRILLIC CAPITAL LETTER BE 578 | "\xc2" => "\xd0\x92", #U+0412 CYRILLIC CAPITAL LETTER VE 579 | "\xc3" => "\xd0\x93", #U+0413 CYRILLIC CAPITAL LETTER GHE 580 | "\xc4" => "\xd0\x94", #U+0414 CYRILLIC CAPITAL LETTER DE 581 | "\xc5" => "\xd0\x95", #U+0415 CYRILLIC CAPITAL LETTER IE 582 | "\xc6" => "\xd0\x96", #U+0416 CYRILLIC CAPITAL LETTER ZHE 583 | "\xc7" => "\xd0\x97", #U+0417 CYRILLIC CAPITAL LETTER ZE 584 | "\xc8" => "\xd0\x98", #U+0418 CYRILLIC CAPITAL LETTER I 585 | "\xc9" => "\xd0\x99", #U+0419 CYRILLIC CAPITAL LETTER SHORT I 586 | "\xca" => "\xd0\x9a", #U+041a CYRILLIC CAPITAL LETTER KA 587 | "\xcb" => "\xd0\x9b", #U+041b CYRILLIC CAPITAL LETTER EL 588 | "\xcc" => "\xd0\x9c", #U+041c CYRILLIC CAPITAL LETTER EM 589 | "\xcd" => "\xd0\x9d", #U+041d CYRILLIC CAPITAL LETTER EN 590 | "\xce" => "\xd0\x9e", #U+041e CYRILLIC CAPITAL LETTER O 591 | "\xcf" => "\xd0\x9f", #U+041f CYRILLIC CAPITAL LETTER PE 592 | "\xd0" => "\xd0\xa0", #U+0420 CYRILLIC CAPITAL LETTER ER 593 | "\xd1" => "\xd0\xa1", #U+0421 CYRILLIC CAPITAL LETTER ES 594 | "\xd2" => "\xd0\xa2", #U+0422 CYRILLIC CAPITAL LETTER TE 595 | "\xd3" => "\xd0\xa3", #U+0423 CYRILLIC CAPITAL LETTER U 596 | "\xd4" => "\xd0\xa4", #U+0424 CYRILLIC CAPITAL LETTER EF 597 | "\xd5" => "\xd0\xa5", #U+0425 CYRILLIC CAPITAL LETTER HA 598 | "\xd6" => "\xd0\xa6", #U+0426 CYRILLIC CAPITAL LETTER TSE 599 | "\xd7" => "\xd0\xa7", #U+0427 CYRILLIC CAPITAL LETTER CHE 600 | "\xd8" => "\xd0\xa8", #U+0428 CYRILLIC CAPITAL LETTER SHA 601 | "\xd9" => "\xd0\xa9", #U+0429 CYRILLIC CAPITAL LETTER SHCHA 602 | "\xda" => "\xd0\xaa", #U+042a CYRILLIC CAPITAL LETTER HARD SIGN 603 | "\xdb" => 
"\xd0\xab", #U+042b CYRILLIC CAPITAL LETTER YERU 604 | "\xdc" => "\xd0\xac", #U+042c CYRILLIC CAPITAL LETTER SOFT SIGN 605 | "\xdd" => "\xd0\xad", #U+042d CYRILLIC CAPITAL LETTER E 606 | "\xde" => "\xd0\xae", #U+042e CYRILLIC CAPITAL LETTER YU 607 | "\xdf" => "\xd0\xaf", #U+042f CYRILLIC CAPITAL LETTER YA 608 | "\xe0" => "\xd0\xb0", #U+0430 CYRILLIC SMALL LETTER A 609 | "\xe1" => "\xd0\xb1", #U+0431 CYRILLIC SMALL LETTER BE 610 | "\xe2" => "\xd0\xb2", #U+0432 CYRILLIC SMALL LETTER VE 611 | "\xe3" => "\xd0\xb3", #U+0433 CYRILLIC SMALL LETTER GHE 612 | "\xe4" => "\xd0\xb4", #U+0434 CYRILLIC SMALL LETTER DE 613 | "\xe5" => "\xd0\xb5", #U+0435 CYRILLIC SMALL LETTER IE 614 | "\xe6" => "\xd0\xb6", #U+0436 CYRILLIC SMALL LETTER ZHE 615 | "\xe7" => "\xd0\xb7", #U+0437 CYRILLIC SMALL LETTER ZE 616 | "\xe8" => "\xd0\xb8", #U+0438 CYRILLIC SMALL LETTER I 617 | "\xe9" => "\xd0\xb9", #U+0439 CYRILLIC SMALL LETTER SHORT I 618 | "\xea" => "\xd0\xba", #U+043a CYRILLIC SMALL LETTER KA 619 | "\xeb" => "\xd0\xbb", #U+043b CYRILLIC SMALL LETTER EL 620 | "\xec" => "\xd0\xbc", #U+043c CYRILLIC SMALL LETTER EM 621 | "\xed" => "\xd0\xbd", #U+043d CYRILLIC SMALL LETTER EN 622 | "\xee" => "\xd0\xbe", #U+043e CYRILLIC SMALL LETTER O 623 | "\xef" => "\xd0\xbf", #U+043f CYRILLIC SMALL LETTER PE 624 | "\xf0" => "\xd1\x80", #U+0440 CYRILLIC SMALL LETTER ER 625 | "\xf1" => "\xd1\x81", #U+0441 CYRILLIC SMALL LETTER ES 626 | "\xf2" => "\xd1\x82", #U+0442 CYRILLIC SMALL LETTER TE 627 | "\xf3" => "\xd1\x83", #U+0443 CYRILLIC SMALL LETTER U 628 | "\xf4" => "\xd1\x84", #U+0444 CYRILLIC SMALL LETTER EF 629 | "\xf5" => "\xd1\x85", #U+0445 CYRILLIC SMALL LETTER HA 630 | "\xf6" => "\xd1\x86", #U+0446 CYRILLIC SMALL LETTER TSE 631 | "\xf7" => "\xd1\x87", #U+0447 CYRILLIC SMALL LETTER CHE 632 | "\xf8" => "\xd1\x88", #U+0448 CYRILLIC SMALL LETTER SHA 633 | "\xf9" => "\xd1\x89", #U+0449 CYRILLIC SMALL LETTER SHCHA 634 | "\xfa" => "\xd1\x8a", #U+044a CYRILLIC SMALL LETTER HARD SIGN 635 | "\xfb" => "\xd1\x8b", 
#U+044b CYRILLIC SMALL LETTER YERU 636 | "\xfc" => "\xd1\x8c", #U+044c CYRILLIC SMALL LETTER SOFT SIGN 637 | "\xfd" => "\xd1\x8d", #U+044d CYRILLIC SMALL LETTER E 638 | "\xfe" => "\xd1\x8e", #U+044e CYRILLIC SMALL LETTER YU 639 | "\xff" => "\xd1\x8f", #U+044f CYRILLIC SMALL LETTER YA 640 | ); 641 | 642 | /** 643 | * UTF-8 Case lookup table 644 | * 645 | * This lookup table maps upper case letters to their corresponding 646 | * lower case letters in UTF-8 647 | * 648 | * @author Andreas Gohr 649 | * @var array 650 | */ 651 | public static $convert_case_table = array( 652 | #CASE_UPPER => case_lower 653 | "\x41" => "\x61", #A a 654 | "\x42" => "\x62", #B b 655 | "\x43" => "\x63", #C c 656 | "\x44" => "\x64", #D d 657 | "\x45" => "\x65", #E e 658 | "\x46" => "\x66", #F f 659 | "\x47" => "\x67", #G g 660 | "\x48" => "\x68", #H h 661 | "\x49" => "\x69", #I i 662 | "\x4a" => "\x6a", #J j 663 | "\x4b" => "\x6b", #K k 664 | "\x4c" => "\x6c", #L l 665 | "\x4d" => "\x6d", #M m 666 | "\x4e" => "\x6e", #N n 667 | "\x4f" => "\x6f", #O o 668 | "\x50" => "\x70", #P p 669 | "\x51" => "\x71", #Q q 670 | "\x52" => "\x72", #R r 671 | "\x53" => "\x73", #S s 672 | "\x54" => "\x74", #T t 673 | "\x55" => "\x75", #U u 674 | "\x56" => "\x76", #V v 675 | "\x57" => "\x77", #W w 676 | "\x58" => "\x78", #X x 677 | "\x59" => "\x79", #Y y 678 | "\x5a" => "\x7a", #Z z 679 | "\xc3\x80" => "\xc3\xa0", 680 | "\xc3\x81" => "\xc3\xa1", 681 | "\xc3\x82" => "\xc3\xa2", 682 | "\xc3\x83" => "\xc3\xa3", 683 | "\xc3\x84" => "\xc3\xa4", 684 | "\xc3\x85" => "\xc3\xa5", 685 | "\xc3\x86" => "\xc3\xa6", 686 | "\xc3\x87" => "\xc3\xa7", 687 | "\xc3\x88" => "\xc3\xa8", 688 | "\xc3\x89" => "\xc3\xa9", 689 | "\xc3\x8a" => "\xc3\xaa", 690 | "\xc3\x8b" => "\xc3\xab", 691 | "\xc3\x8c" => "\xc3\xac", 692 | "\xc3\x8d" => "\xc3\xad", 693 | "\xc3\x8e" => "\xc3\xae", 694 | "\xc3\x8f" => "\xc3\xaf", 695 | "\xc3\x90" => "\xc3\xb0", 696 | "\xc3\x91" => "\xc3\xb1", 697 | "\xc3\x92" => "\xc3\xb2", 698 | "\xc3\x93" =>
"\xc3\xb3", 699 | "\xc3\x94" => "\xc3\xb4", 700 | "\xc3\x95" => "\xc3\xb5", 701 | "\xc3\x96" => "\xc3\xb6", 702 | "\xc3\x98" => "\xc3\xb8", 703 | "\xc3\x99" => "\xc3\xb9", 704 | "\xc3\x9a" => "\xc3\xba", 705 | "\xc3\x9b" => "\xc3\xbb", 706 | "\xc3\x9c" => "\xc3\xbc", 707 | "\xc3\x9d" => "\xc3\xbd", 708 | "\xc3\x9e" => "\xc3\xbe", 709 | "\xc4\x80" => "\xc4\x81", 710 | "\xc4\x82" => "\xc4\x83", 711 | "\xc4\x84" => "\xc4\x85", 712 | "\xc4\x86" => "\xc4\x87", 713 | "\xc4\x88" => "\xc4\x89", 714 | "\xc4\x8a" => "\xc4\x8b", 715 | "\xc4\x8c" => "\xc4\x8d", 716 | "\xc4\x8e" => "\xc4\x8f", 717 | "\xc4\x90" => "\xc4\x91", 718 | "\xc4\x92" => "\xc4\x93", 719 | "\xc4\x94" => "\xc4\x95", 720 | "\xc4\x96" => "\xc4\x97", 721 | "\xc4\x98" => "\xc4\x99", 722 | "\xc4\x9a" => "\xc4\x9b", 723 | "\xc4\x9c" => "\xc4\x9d", 724 | "\xc4\x9e" => "\xc4\x9f", 725 | "\xc4\xa0" => "\xc4\xa1", 726 | "\xc4\xa2" => "\xc4\xa3", 727 | "\xc4\xa4" => "\xc4\xa5", 728 | "\xc4\xa6" => "\xc4\xa7", 729 | "\xc4\xa8" => "\xc4\xa9", 730 | "\xc4\xaa" => "\xc4\xab", 731 | "\xc4\xac" => "\xc4\xad", 732 | "\xc4\xae" => "\xc4\xaf", 733 | "\xc4\xb2" => "\xc4\xb3", 734 | "\xc4\xb4" => "\xc4\xb5", 735 | "\xc4\xb6" => "\xc4\xb7", 736 | "\xc4\xb9" => "\xc4\xba", 737 | "\xc4\xbb" => "\xc4\xbc", 738 | "\xc4\xbd" => "\xc4\xbe", 739 | "\xc4\xbf" => "\xc5\x80", 740 | "\xc5\x81" => "\xc5\x82", 741 | "\xc5\x83" => "\xc5\x84", 742 | "\xc5\x85" => "\xc5\x86", 743 | "\xc5\x87" => "\xc5\x88", 744 | "\xc5\x8a" => "\xc5\x8b", 745 | "\xc5\x8c" => "\xc5\x8d", 746 | "\xc5\x8e" => "\xc5\x8f", 747 | "\xc5\x90" => "\xc5\x91", 748 | "\xc5\x92" => "\xc5\x93", 749 | "\xc5\x94" => "\xc5\x95", 750 | "\xc5\x96" => "\xc5\x97", 751 | "\xc5\x98" => "\xc5\x99", 752 | "\xc5\x9a" => "\xc5\x9b", 753 | "\xc5\x9c" => "\xc5\x9d", 754 | "\xc5\x9e" => "\xc5\x9f", 755 | "\xc5\xa0" => "\xc5\xa1", 756 | "\xc5\xa2" => "\xc5\xa3", 757 | "\xc5\xa4" => "\xc5\xa5", 758 | "\xc5\xa6" => "\xc5\xa7", 759 | "\xc5\xa8" => "\xc5\xa9", 760 | "\xc5\xaa" => "\xc5\xab", 761 
| "\xc5\xac" => "\xc5\xad", 762 | "\xc5\xae" => "\xc5\xaf", 763 | "\xc5\xb0" => "\xc5\xb1", 764 | "\xc5\xb2" => "\xc5\xb3", 765 | "\xc5\xb4" => "\xc5\xb5", 766 | "\xc5\xb6" => "\xc5\xb7", 767 | "\xc5\xb8" => "\xc3\xbf", 768 | "\xc5\xb9" => "\xc5\xba", 769 | "\xc5\xbb" => "\xc5\xbc", 770 | "\xc5\xbd" => "\xc5\xbe", 771 | "\xc6\x81" => "\xc9\x93", 772 | "\xc6\x82" => "\xc6\x83", 773 | "\xc6\x84" => "\xc6\x85", 774 | "\xc6\x86" => "\xc9\x94", 775 | "\xc6\x87" => "\xc6\x88", 776 | "\xc6\x89" => "\xc9\x96", 777 | "\xc6\x8a" => "\xc9\x97", 778 | "\xc6\x8b" => "\xc6\x8c", 779 | "\xc6\x8e" => "\xc7\x9d", 780 | "\xc6\x8f" => "\xc9\x99", 781 | "\xc6\x90" => "\xc9\x9b", 782 | "\xc6\x91" => "\xc6\x92", 783 | "\xc6\x94" => "\xc9\xa3", 784 | "\xc6\x96" => "\xc9\xa9", 785 | "\xc6\x97" => "\xc9\xa8", 786 | "\xc6\x98" => "\xc6\x99", 787 | "\xc6\x9c" => "\xc9\xaf", 788 | "\xc6\x9d" => "\xc9\xb2", 789 | "\xc6\x9f" => "\xc9\xb5", 790 | "\xc6\xa0" => "\xc6\xa1", 791 | "\xc6\xa2" => "\xc6\xa3", 792 | "\xc6\xa4" => "\xc6\xa5", 793 | "\xc6\xa6" => "\xca\x80", 794 | "\xc6\xa7" => "\xc6\xa8", 795 | "\xc6\xa9" => "\xca\x83", 796 | "\xc6\xac" => "\xc6\xad", 797 | "\xc6\xae" => "\xca\x88", 798 | "\xc6\xaf" => "\xc6\xb0", 799 | "\xc6\xb1" => "\xca\x8a", 800 | "\xc6\xb2" => "\xca\x8b", 801 | "\xc6\xb3" => "\xc6\xb4", 802 | "\xc6\xb5" => "\xc6\xb6", 803 | "\xc6\xb7" => "\xca\x92", 804 | "\xc6\xb8" => "\xc6\xb9", 805 | "\xc6\xbc" => "\xc6\xbd", 806 | "\xc7\x85" => "\xc7\x86", 807 | "\xc7\x88" => "\xc7\x89", 808 | "\xc7\x8b" => "\xc7\x8c", 809 | "\xc7\x8d" => "\xc7\x8e", 810 | "\xc7\x8f" => "\xc7\x90", 811 | "\xc7\x91" => "\xc7\x92", 812 | "\xc7\x93" => "\xc7\x94", 813 | "\xc7\x95" => "\xc7\x96", 814 | "\xc7\x97" => "\xc7\x98", 815 | "\xc7\x99" => "\xc7\x9a", 816 | "\xc7\x9b" => "\xc7\x9c", 817 | "\xc7\x9e" => "\xc7\x9f", 818 | "\xc7\xa0" => "\xc7\xa1", 819 | "\xc7\xa2" => "\xc7\xa3", 820 | "\xc7\xa4" => "\xc7\xa5", 821 | "\xc7\xa6" => "\xc7\xa7", 822 | "\xc7\xa8" => "\xc7\xa9", 823 | "\xc7\xaa" => 
"\xc7\xab", 824 | "\xc7\xac" => "\xc7\xad", 825 | "\xc7\xae" => "\xc7\xaf", 826 | "\xc7\xb2" => "\xc7\xb3", 827 | "\xc7\xb4" => "\xc7\xb5", 828 | "\xc7\xb6" => "\xc6\x95", 829 | "\xc7\xb7" => "\xc6\xbf", 830 | "\xc7\xb8" => "\xc7\xb9", 831 | "\xc7\xba" => "\xc7\xbb", 832 | "\xc7\xbc" => "\xc7\xbd", 833 | "\xc7\xbe" => "\xc7\xbf", 834 | "\xc8\x80" => "\xc8\x81", 835 | "\xc8\x82" => "\xc8\x83", 836 | "\xc8\x84" => "\xc8\x85", 837 | "\xc8\x86" => "\xc8\x87", 838 | "\xc8\x88" => "\xc8\x89", 839 | "\xc8\x8a" => "\xc8\x8b", 840 | "\xc8\x8c" => "\xc8\x8d", 841 | "\xc8\x8e" => "\xc8\x8f", 842 | "\xc8\x90" => "\xc8\x91", 843 | "\xc8\x92" => "\xc8\x93", 844 | "\xc8\x94" => "\xc8\x95", 845 | "\xc8\x96" => "\xc8\x97", 846 | "\xc8\x98" => "\xc8\x99", 847 | "\xc8\x9a" => "\xc8\x9b", 848 | "\xc8\x9c" => "\xc8\x9d", 849 | "\xc8\x9e" => "\xc8\x9f", 850 | "\xc8\xa0" => "\xc6\x9e", 851 | "\xc8\xa2" => "\xc8\xa3", 852 | "\xc8\xa4" => "\xc8\xa5", 853 | "\xc8\xa6" => "\xc8\xa7", 854 | "\xc8\xa8" => "\xc8\xa9", 855 | "\xc8\xaa" => "\xc8\xab", 856 | "\xc8\xac" => "\xc8\xad", 857 | "\xc8\xae" => "\xc8\xaf", 858 | "\xc8\xb0" => "\xc8\xb1", 859 | "\xc8\xb2" => "\xc8\xb3", 860 | "\xce\x86" => "\xce\xac", 861 | "\xce\x88" => "\xce\xad", 862 | "\xce\x89" => "\xce\xae", 863 | "\xce\x8a" => "\xce\xaf", 864 | "\xce\x8c" => "\xcf\x8c", 865 | "\xce\x8e" => "\xcf\x8d", 866 | "\xce\x8f" => "\xcf\x8e", 867 | "\xce\x91" => "\xce\xb1", 868 | "\xce\x92" => "\xce\xb2", 869 | "\xce\x93" => "\xce\xb3", 870 | "\xce\x94" => "\xce\xb4", 871 | "\xce\x95" => "\xce\xb5", 872 | "\xce\x96" => "\xce\xb6", 873 | "\xce\x97" => "\xce\xb7", 874 | "\xce\x98" => "\xce\xb8", 875 | "\xce\x99" => "\xce\xb9", 876 | "\xce\x9a" => "\xce\xba", 877 | "\xce\x9b" => "\xce\xbb", 878 | "\xce\x9c" => "\xc2\xb5", 879 | "\xce\x9d" => "\xce\xbd", 880 | "\xce\x9e" => "\xce\xbe", 881 | "\xce\x9f" => "\xce\xbf", 882 | "\xce\xa0" => "\xcf\x80", 883 | "\xce\xa1" => "\xcf\x81", 884 | "\xce\xa3" => "\xcf\x82", 885 | "\xce\xa4" => "\xcf\x84", 886 
| "\xce\xa5" => "\xcf\x85", 887 | "\xce\xa6" => "\xcf\x86", 888 | "\xce\xa7" => "\xcf\x87", 889 | "\xce\xa8" => "\xcf\x88", 890 | "\xce\xa9" => "\xcf\x89", 891 | "\xce\xaa" => "\xcf\x8a", 892 | "\xce\xab" => "\xcf\x8b", 893 | "\xcf\x98" => "\xcf\x99", 894 | "\xcf\x9a" => "\xcf\x9b", 895 | "\xcf\x9c" => "\xcf\x9d", 896 | "\xcf\x9e" => "\xcf\x9f", 897 | "\xcf\xa0" => "\xcf\xa1", 898 | "\xcf\xa2" => "\xcf\xa3", 899 | "\xcf\xa4" => "\xcf\xa5", 900 | "\xcf\xa6" => "\xcf\xa7", 901 | "\xcf\xa8" => "\xcf\xa9", 902 | "\xcf\xaa" => "\xcf\xab", 903 | "\xcf\xac" => "\xcf\xad", 904 | "\xcf\xae" => "\xcf\xaf", 905 | "\xd0\x80" => "\xd1\x90", 906 | "\xd0\x81" => "\xd1\x91", 907 | "\xd0\x82" => "\xd1\x92", 908 | "\xd0\x83" => "\xd1\x93", 909 | "\xd0\x84" => "\xd1\x94", 910 | "\xd0\x85" => "\xd1\x95", 911 | "\xd0\x86" => "\xd1\x96", 912 | "\xd0\x87" => "\xd1\x97", 913 | "\xd0\x88" => "\xd1\x98", 914 | "\xd0\x89" => "\xd1\x99", 915 | "\xd0\x8a" => "\xd1\x9a", 916 | "\xd0\x8b" => "\xd1\x9b", 917 | "\xd0\x8c" => "\xd1\x9c", 918 | "\xd0\x8d" => "\xd1\x9d", 919 | "\xd0\x8e" => "\xd1\x9e", 920 | "\xd0\x8f" => "\xd1\x9f", 921 | "\xd0\x90" => "\xd0\xb0", 922 | "\xd0\x91" => "\xd0\xb1", 923 | "\xd0\x92" => "\xd0\xb2", 924 | "\xd0\x93" => "\xd0\xb3", 925 | "\xd0\x94" => "\xd0\xb4", 926 | "\xd0\x95" => "\xd0\xb5", 927 | "\xd0\x96" => "\xd0\xb6", 928 | "\xd0\x97" => "\xd0\xb7", 929 | "\xd0\x98" => "\xd0\xb8", 930 | "\xd0\x99" => "\xd0\xb9", 931 | "\xd0\x9a" => "\xd0\xba", 932 | "\xd0\x9b" => "\xd0\xbb", 933 | "\xd0\x9c" => "\xd0\xbc", 934 | "\xd0\x9d" => "\xd0\xbd", 935 | "\xd0\x9e" => "\xd0\xbe", 936 | "\xd0\x9f" => "\xd0\xbf", 937 | "\xd0\xa0" => "\xd1\x80", 938 | "\xd0\xa1" => "\xd1\x81", 939 | "\xd0\xa2" => "\xd1\x82", 940 | "\xd0\xa3" => "\xd1\x83", 941 | "\xd0\xa4" => "\xd1\x84", 942 | "\xd0\xa5" => "\xd1\x85", 943 | "\xd0\xa6" => "\xd1\x86", 944 | "\xd0\xa7" => "\xd1\x87", 945 | "\xd0\xa8" => "\xd1\x88", 946 | "\xd0\xa9" => "\xd1\x89", 947 | "\xd0\xaa" => "\xd1\x8a", 948 | "\xd0\xab" => 
"\xd1\x8b", 949 | "\xd0\xac" => "\xd1\x8c", 950 | "\xd0\xad" => "\xd1\x8d", 951 | "\xd0\xae" => "\xd1\x8e", 952 | "\xd0\xaf" => "\xd1\x8f", 953 | "\xd1\xa0" => "\xd1\xa1", 954 | "\xd1\xa2" => "\xd1\xa3", 955 | "\xd1\xa4" => "\xd1\xa5", 956 | "\xd1\xa6" => "\xd1\xa7", 957 | "\xd1\xa8" => "\xd1\xa9", 958 | "\xd1\xaa" => "\xd1\xab", 959 | "\xd1\xac" => "\xd1\xad", 960 | "\xd1\xae" => "\xd1\xaf", 961 | "\xd1\xb0" => "\xd1\xb1", 962 | "\xd1\xb2" => "\xd1\xb3", 963 | "\xd1\xb4" => "\xd1\xb5", 964 | "\xd1\xb6" => "\xd1\xb7", 965 | "\xd1\xb8" => "\xd1\xb9", 966 | "\xd1\xba" => "\xd1\xbb", 967 | "\xd1\xbc" => "\xd1\xbd", 968 | "\xd1\xbe" => "\xd1\xbf", 969 | "\xd2\x80" => "\xd2\x81", 970 | "\xd2\x8a" => "\xd2\x8b", 971 | "\xd2\x8c" => "\xd2\x8d", 972 | "\xd2\x8e" => "\xd2\x8f", 973 | "\xd2\x90" => "\xd2\x91", 974 | "\xd2\x92" => "\xd2\x93", 975 | "\xd2\x94" => "\xd2\x95", 976 | "\xd2\x96" => "\xd2\x97", 977 | "\xd2\x98" => "\xd2\x99", 978 | "\xd2\x9a" => "\xd2\x9b", 979 | "\xd2\x9c" => "\xd2\x9d", 980 | "\xd2\x9e" => "\xd2\x9f", 981 | "\xd2\xa0" => "\xd2\xa1", 982 | "\xd2\xa2" => "\xd2\xa3", 983 | "\xd2\xa4" => "\xd2\xa5", 984 | "\xd2\xa6" => "\xd2\xa7", 985 | "\xd2\xa8" => "\xd2\xa9", 986 | "\xd2\xaa" => "\xd2\xab", 987 | "\xd2\xac" => "\xd2\xad", 988 | "\xd2\xae" => "\xd2\xaf", 989 | "\xd2\xb0" => "\xd2\xb1", 990 | "\xd2\xb2" => "\xd2\xb3", 991 | "\xd2\xb4" => "\xd2\xb5", 992 | "\xd2\xb6" => "\xd2\xb7", 993 | "\xd2\xb8" => "\xd2\xb9", 994 | "\xd2\xba" => "\xd2\xbb", 995 | "\xd2\xbc" => "\xd2\xbd", 996 | "\xd2\xbe" => "\xd2\xbf", 997 | "\xd3\x81" => "\xd3\x82", 998 | "\xd3\x83" => "\xd3\x84", 999 | "\xd3\x85" => "\xd3\x86", 1000 | "\xd3\x87" => "\xd3\x88", 1001 | "\xd3\x89" => "\xd3\x8a", 1002 | "\xd3\x8b" => "\xd3\x8c", 1003 | "\xd3\x8d" => "\xd3\x8e", 1004 | "\xd3\x90" => "\xd3\x91", 1005 | "\xd3\x92" => "\xd3\x93", 1006 | "\xd3\x94" => "\xd3\x95", 1007 | "\xd3\x96" => "\xd3\x97", 1008 | "\xd3\x98" => "\xd3\x99", 1009 | "\xd3\x9a" => "\xd3\x9b", 1010 | "\xd3\x9c" => 
"\xd3\x9d", 1011 | "\xd3\x9e" => "\xd3\x9f", 1012 | "\xd3\xa0" => "\xd3\xa1", 1013 | "\xd3\xa2" => "\xd3\xa3", 1014 | "\xd3\xa4" => "\xd3\xa5", 1015 | "\xd3\xa6" => "\xd3\xa7", 1016 | "\xd3\xa8" => "\xd3\xa9", 1017 | "\xd3\xaa" => "\xd3\xab", 1018 | "\xd3\xac" => "\xd3\xad", 1019 | "\xd3\xae" => "\xd3\xaf", 1020 | "\xd3\xb0" => "\xd3\xb1", 1021 | "\xd3\xb2" => "\xd3\xb3", 1022 | "\xd3\xb4" => "\xd3\xb5", 1023 | "\xd3\xb8" => "\xd3\xb9", 1024 | "\xd4\x80" => "\xd4\x81", 1025 | "\xd4\x82" => "\xd4\x83", 1026 | "\xd4\x84" => "\xd4\x85", 1027 | "\xd4\x86" => "\xd4\x87", 1028 | "\xd4\x88" => "\xd4\x89", 1029 | "\xd4\x8a" => "\xd4\x8b", 1030 | "\xd4\x8c" => "\xd4\x8d", 1031 | "\xd4\x8e" => "\xd4\x8f", 1032 | "\xd4\xb1" => "\xd5\xa1", 1033 | "\xd4\xb2" => "\xd5\xa2", 1034 | "\xd4\xb3" => "\xd5\xa3", 1035 | "\xd4\xb4" => "\xd5\xa4", 1036 | "\xd4\xb5" => "\xd5\xa5", 1037 | "\xd4\xb6" => "\xd5\xa6", 1038 | "\xd4\xb7" => "\xd5\xa7", 1039 | "\xd4\xb8" => "\xd5\xa8", 1040 | "\xd4\xb9" => "\xd5\xa9", 1041 | "\xd4\xba" => "\xd5\xaa", 1042 | "\xd4\xbb" => "\xd5\xab", 1043 | "\xd4\xbc" => "\xd5\xac", 1044 | "\xd4\xbd" => "\xd5\xad", 1045 | "\xd4\xbe" => "\xd5\xae", 1046 | "\xd4\xbf" => "\xd5\xaf", 1047 | "\xd5\x80" => "\xd5\xb0", 1048 | "\xd5\x81" => "\xd5\xb1", 1049 | "\xd5\x82" => "\xd5\xb2", 1050 | "\xd5\x83" => "\xd5\xb3", 1051 | "\xd5\x84" => "\xd5\xb4", 1052 | "\xd5\x85" => "\xd5\xb5", 1053 | "\xd5\x86" => "\xd5\xb6", 1054 | "\xd5\x87" => "\xd5\xb7", 1055 | "\xd5\x88" => "\xd5\xb8", 1056 | "\xd5\x89" => "\xd5\xb9", 1057 | "\xd5\x8a" => "\xd5\xba", 1058 | "\xd5\x8b" => "\xd5\xbb", 1059 | "\xd5\x8c" => "\xd5\xbc", 1060 | "\xd5\x8d" => "\xd5\xbd", 1061 | "\xd5\x8e" => "\xd5\xbe", 1062 | "\xd5\x8f" => "\xd5\xbf", 1063 | "\xd5\x90" => "\xd6\x80", 1064 | "\xd5\x91" => "\xd6\x81", 1065 | "\xd5\x92" => "\xd6\x82", 1066 | "\xd5\x93" => "\xd6\x83", 1067 | "\xd5\x94" => "\xd6\x84", 1068 | "\xd5\x95" => "\xd6\x85", 1069 | "\xd5\x96" => "\xd6\x86", 1070 | "\xe1\xb8\x80" => "\xe1\xb8\x81", 
1071 | "\xe1\xb8\x82" => "\xe1\xb8\x83", 1072 | "\xe1\xb8\x84" => "\xe1\xb8\x85", 1073 | "\xe1\xb8\x86" => "\xe1\xb8\x87", 1074 | "\xe1\xb8\x88" => "\xe1\xb8\x89", 1075 | "\xe1\xb8\x8a" => "\xe1\xb8\x8b", 1076 | "\xe1\xb8\x8c" => "\xe1\xb8\x8d", 1077 | "\xe1\xb8\x8e" => "\xe1\xb8\x8f", 1078 | "\xe1\xb8\x90" => "\xe1\xb8\x91", 1079 | "\xe1\xb8\x92" => "\xe1\xb8\x93", 1080 | "\xe1\xb8\x94" => "\xe1\xb8\x95", 1081 | "\xe1\xb8\x96" => "\xe1\xb8\x97", 1082 | "\xe1\xb8\x98" => "\xe1\xb8\x99", 1083 | "\xe1\xb8\x9a" => "\xe1\xb8\x9b", 1084 | "\xe1\xb8\x9c" => "\xe1\xb8\x9d", 1085 | "\xe1\xb8\x9e" => "\xe1\xb8\x9f", 1086 | "\xe1\xb8\xa0" => "\xe1\xb8\xa1", 1087 | "\xe1\xb8\xa2" => "\xe1\xb8\xa3", 1088 | "\xe1\xb8\xa4" => "\xe1\xb8\xa5", 1089 | "\xe1\xb8\xa6" => "\xe1\xb8\xa7", 1090 | "\xe1\xb8\xa8" => "\xe1\xb8\xa9", 1091 | "\xe1\xb8\xaa" => "\xe1\xb8\xab", 1092 | "\xe1\xb8\xac" => "\xe1\xb8\xad", 1093 | "\xe1\xb8\xae" => "\xe1\xb8\xaf", 1094 | "\xe1\xb8\xb0" => "\xe1\xb8\xb1", 1095 | "\xe1\xb8\xb2" => "\xe1\xb8\xb3", 1096 | "\xe1\xb8\xb4" => "\xe1\xb8\xb5", 1097 | "\xe1\xb8\xb6" => "\xe1\xb8\xb7", 1098 | "\xe1\xb8\xb8" => "\xe1\xb8\xb9", 1099 | "\xe1\xb8\xba" => "\xe1\xb8\xbb", 1100 | "\xe1\xb8\xbc" => "\xe1\xb8\xbd", 1101 | "\xe1\xb8\xbe" => "\xe1\xb8\xbf", 1102 | "\xe1\xb9\x80" => "\xe1\xb9\x81", 1103 | "\xe1\xb9\x82" => "\xe1\xb9\x83", 1104 | "\xe1\xb9\x84" => "\xe1\xb9\x85", 1105 | "\xe1\xb9\x86" => "\xe1\xb9\x87", 1106 | "\xe1\xb9\x88" => "\xe1\xb9\x89", 1107 | "\xe1\xb9\x8a" => "\xe1\xb9\x8b", 1108 | "\xe1\xb9\x8c" => "\xe1\xb9\x8d", 1109 | "\xe1\xb9\x8e" => "\xe1\xb9\x8f", 1110 | "\xe1\xb9\x90" => "\xe1\xb9\x91", 1111 | "\xe1\xb9\x92" => "\xe1\xb9\x93", 1112 | "\xe1\xb9\x94" => "\xe1\xb9\x95", 1113 | "\xe1\xb9\x96" => "\xe1\xb9\x97", 1114 | "\xe1\xb9\x98" => "\xe1\xb9\x99", 1115 | "\xe1\xb9\x9a" => "\xe1\xb9\x9b", 1116 | "\xe1\xb9\x9c" => "\xe1\xb9\x9d", 1117 | "\xe1\xb9\x9e" => "\xe1\xb9\x9f", 1118 | "\xe1\xb9\xa0" => "\xe1\xb9\xa1", 1119 | "\xe1\xb9\xa2" => 
"\xe1\xb9\xa3", 1120 | "\xe1\xb9\xa4" => "\xe1\xb9\xa5", 1121 | "\xe1\xb9\xa6" => "\xe1\xb9\xa7", 1122 | "\xe1\xb9\xa8" => "\xe1\xb9\xa9", 1123 | "\xe1\xb9\xaa" => "\xe1\xb9\xab", 1124 | "\xe1\xb9\xac" => "\xe1\xb9\xad", 1125 | "\xe1\xb9\xae" => "\xe1\xb9\xaf", 1126 | "\xe1\xb9\xb0" => "\xe1\xb9\xb1", 1127 | "\xe1\xb9\xb2" => "\xe1\xb9\xb3", 1128 | "\xe1\xb9\xb4" => "\xe1\xb9\xb5", 1129 | "\xe1\xb9\xb6" => "\xe1\xb9\xb7", 1130 | "\xe1\xb9\xb8" => "\xe1\xb9\xb9", 1131 | "\xe1\xb9\xba" => "\xe1\xb9\xbb", 1132 | "\xe1\xb9\xbc" => "\xe1\xb9\xbd", 1133 | "\xe1\xb9\xbe" => "\xe1\xb9\xbf", 1134 | "\xe1\xba\x80" => "\xe1\xba\x81", 1135 | "\xe1\xba\x82" => "\xe1\xba\x83", 1136 | "\xe1\xba\x84" => "\xe1\xba\x85", 1137 | "\xe1\xba\x86" => "\xe1\xba\x87", 1138 | "\xe1\xba\x88" => "\xe1\xba\x89", 1139 | "\xe1\xba\x8a" => "\xe1\xba\x8b", 1140 | "\xe1\xba\x8c" => "\xe1\xba\x8d", 1141 | "\xe1\xba\x8e" => "\xe1\xba\x8f", 1142 | "\xe1\xba\x90" => "\xe1\xba\x91", 1143 | "\xe1\xba\x92" => "\xe1\xba\x93", 1144 | "\xe1\xba\x94" => "\xe1\xba\x95", 1145 | "\xe1\xba\xa0" => "\xe1\xba\xa1", 1146 | "\xe1\xba\xa2" => "\xe1\xba\xa3", 1147 | "\xe1\xba\xa4" => "\xe1\xba\xa5", 1148 | "\xe1\xba\xa6" => "\xe1\xba\xa7", 1149 | "\xe1\xba\xa8" => "\xe1\xba\xa9", 1150 | "\xe1\xba\xaa" => "\xe1\xba\xab", 1151 | "\xe1\xba\xac" => "\xe1\xba\xad", 1152 | "\xe1\xba\xae" => "\xe1\xba\xaf", 1153 | "\xe1\xba\xb0" => "\xe1\xba\xb1", 1154 | "\xe1\xba\xb2" => "\xe1\xba\xb3", 1155 | "\xe1\xba\xb4" => "\xe1\xba\xb5", 1156 | "\xe1\xba\xb6" => "\xe1\xba\xb7", 1157 | "\xe1\xba\xb8" => "\xe1\xba\xb9", 1158 | "\xe1\xba\xba" => "\xe1\xba\xbb", 1159 | "\xe1\xba\xbc" => "\xe1\xba\xbd", 1160 | "\xe1\xba\xbe" => "\xe1\xba\xbf", 1161 | "\xe1\xbb\x80" => "\xe1\xbb\x81", 1162 | "\xe1\xbb\x82" => "\xe1\xbb\x83", 1163 | "\xe1\xbb\x84" => "\xe1\xbb\x85", 1164 | "\xe1\xbb\x86" => "\xe1\xbb\x87", 1165 | "\xe1\xbb\x88" => "\xe1\xbb\x89", 1166 | "\xe1\xbb\x8a" => "\xe1\xbb\x8b", 1167 | "\xe1\xbb\x8c" => "\xe1\xbb\x8d", 1168 | 
"\xe1\xbb\x8e" => "\xe1\xbb\x8f", 1169 | "\xe1\xbb\x90" => "\xe1\xbb\x91", 1170 | "\xe1\xbb\x92" => "\xe1\xbb\x93", 1171 | "\xe1\xbb\x94" => "\xe1\xbb\x95", 1172 | "\xe1\xbb\x96" => "\xe1\xbb\x97", 1173 | "\xe1\xbb\x98" => "\xe1\xbb\x99", 1174 | "\xe1\xbb\x9a" => "\xe1\xbb\x9b", 1175 | "\xe1\xbb\x9c" => "\xe1\xbb\x9d", 1176 | "\xe1\xbb\x9e" => "\xe1\xbb\x9f", 1177 | "\xe1\xbb\xa0" => "\xe1\xbb\xa1", 1178 | "\xe1\xbb\xa2" => "\xe1\xbb\xa3", 1179 | "\xe1\xbb\xa4" => "\xe1\xbb\xa5", 1180 | "\xe1\xbb\xa6" => "\xe1\xbb\xa7", 1181 | "\xe1\xbb\xa8" => "\xe1\xbb\xa9", 1182 | "\xe1\xbb\xaa" => "\xe1\xbb\xab", 1183 | "\xe1\xbb\xac" => "\xe1\xbb\xad", 1184 | "\xe1\xbb\xae" => "\xe1\xbb\xaf", 1185 | "\xe1\xbb\xb0" => "\xe1\xbb\xb1", 1186 | "\xe1\xbb\xb2" => "\xe1\xbb\xb3", 1187 | "\xe1\xbb\xb4" => "\xe1\xbb\xb5", 1188 | "\xe1\xbb\xb6" => "\xe1\xbb\xb7", 1189 | "\xe1\xbb\xb8" => "\xe1\xbb\xb9", 1190 | "\xe1\xbc\x88" => "\xe1\xbc\x80", 1191 | "\xe1\xbc\x89" => "\xe1\xbc\x81", 1192 | "\xe1\xbc\x8a" => "\xe1\xbc\x82", 1193 | "\xe1\xbc\x8b" => "\xe1\xbc\x83", 1194 | "\xe1\xbc\x8c" => "\xe1\xbc\x84", 1195 | "\xe1\xbc\x8d" => "\xe1\xbc\x85", 1196 | "\xe1\xbc\x8e" => "\xe1\xbc\x86", 1197 | "\xe1\xbc\x8f" => "\xe1\xbc\x87", 1198 | "\xe1\xbc\x98" => "\xe1\xbc\x90", 1199 | "\xe1\xbc\x99" => "\xe1\xbc\x91", 1200 | "\xe1\xbc\x9a" => "\xe1\xbc\x92", 1201 | "\xe1\xbc\x9b" => "\xe1\xbc\x93", 1202 | "\xe1\xbc\x9c" => "\xe1\xbc\x94", 1203 | "\xe1\xbc\x9d" => "\xe1\xbc\x95", 1204 | "\xe1\xbc\xa9" => "\xe1\xbc\xa1", 1205 | "\xe1\xbc\xaa" => "\xe1\xbc\xa2", 1206 | "\xe1\xbc\xab" => "\xe1\xbc\xa3", 1207 | "\xe1\xbc\xac" => "\xe1\xbc\xa4", 1208 | "\xe1\xbc\xad" => "\xe1\xbc\xa5", 1209 | "\xe1\xbc\xae" => "\xe1\xbc\xa6", 1210 | "\xe1\xbc\xaf" => "\xe1\xbc\xa7", 1211 | "\xe1\xbc\xb8" => "\xe1\xbc\xb0", 1212 | "\xe1\xbc\xb9" => "\xe1\xbc\xb1", 1213 | "\xe1\xbc\xba" => "\xe1\xbc\xb2", 1214 | "\xe1\xbc\xbb" => "\xe1\xbc\xb3", 1215 | "\xe1\xbc\xbc" => "\xe1\xbc\xb4", 1216 | "\xe1\xbc\xbd" => 
"\xe1\xbc\xb5", 1217 | "\xe1\xbc\xbe" => "\xe1\xbc\xb6", 1218 | "\xe1\xbc\xbf" => "\xe1\xbc\xb7", 1219 | "\xe1\xbd\x88" => "\xe1\xbd\x80", 1220 | "\xe1\xbd\x89" => "\xe1\xbd\x81", 1221 | "\xe1\xbd\x8a" => "\xe1\xbd\x82", 1222 | "\xe1\xbd\x8b" => "\xe1\xbd\x83", 1223 | "\xe1\xbd\x8c" => "\xe1\xbd\x84", 1224 | "\xe1\xbd\x8d" => "\xe1\xbd\x85", 1225 | "\xe1\xbd\x99" => "\xe1\xbd\x91", 1226 | "\xe1\xbd\x9b" => "\xe1\xbd\x93", 1227 | "\xe1\xbd\x9d" => "\xe1\xbd\x95", 1228 | "\xe1\xbd\x9f" => "\xe1\xbd\x97", 1229 | "\xe1\xbd\xa9" => "\xe1\xbd\xa1", 1230 | "\xe1\xbd\xaa" => "\xe1\xbd\xa2", 1231 | "\xe1\xbd\xab" => "\xe1\xbd\xa3", 1232 | "\xe1\xbd\xac" => "\xe1\xbd\xa4", 1233 | "\xe1\xbd\xad" => "\xe1\xbd\xa5", 1234 | "\xe1\xbd\xae" => "\xe1\xbd\xa6", 1235 | "\xe1\xbd\xaf" => "\xe1\xbd\xa7", 1236 | "\xe1\xbe\x88" => "\xe1\xbe\x80", 1237 | "\xe1\xbe\x89" => "\xe1\xbe\x81", 1238 | "\xe1\xbe\x8a" => "\xe1\xbe\x82", 1239 | "\xe1\xbe\x8b" => "\xe1\xbe\x83", 1240 | "\xe1\xbe\x8c" => "\xe1\xbe\x84", 1241 | "\xe1\xbe\x8d" => "\xe1\xbe\x85", 1242 | "\xe1\xbe\x8e" => "\xe1\xbe\x86", 1243 | "\xe1\xbe\x8f" => "\xe1\xbe\x87", 1244 | "\xe1\xbe\x98" => "\xe1\xbe\x90", 1245 | "\xe1\xbe\x99" => "\xe1\xbe\x91", 1246 | "\xe1\xbe\x9a" => "\xe1\xbe\x92", 1247 | "\xe1\xbe\x9b" => "\xe1\xbe\x93", 1248 | "\xe1\xbe\x9c" => "\xe1\xbe\x94", 1249 | "\xe1\xbe\x9d" => "\xe1\xbe\x95", 1250 | "\xe1\xbe\x9e" => "\xe1\xbe\x96", 1251 | "\xe1\xbe\x9f" => "\xe1\xbe\x97", 1252 | "\xe1\xbe\xa9" => "\xe1\xbe\xa1", 1253 | "\xe1\xbe\xaa" => "\xe1\xbe\xa2", 1254 | "\xe1\xbe\xab" => "\xe1\xbe\xa3", 1255 | "\xe1\xbe\xac" => "\xe1\xbe\xa4", 1256 | "\xe1\xbe\xad" => "\xe1\xbe\xa5", 1257 | "\xe1\xbe\xae" => "\xe1\xbe\xa6", 1258 | "\xe1\xbe\xaf" => "\xe1\xbe\xa7", 1259 | "\xe1\xbe\xb8" => "\xe1\xbe\xb0", 1260 | "\xe1\xbe\xb9" => "\xe1\xbe\xb1", 1261 | "\xe1\xbe\xba" => "\xe1\xbd\xb0", 1262 | "\xe1\xbe\xbb" => "\xe1\xbd\xb1", 1263 | "\xe1\xbe\xbc" => "\xe1\xbe\xb3", 1264 | "\xe1\xbf\x88" => "\xe1\xbd\xb2", 1265 | 
"\xe1\xbf\x89" => "\xe1\xbd\xb3", 1266 | "\xe1\xbf\x8a" => "\xe1\xbd\xb4", 1267 | "\xe1\xbf\x8b" => "\xe1\xbd\xb5", 1268 | "\xe1\xbf\x8c" => "\xe1\xbf\x83", 1269 | "\xe1\xbf\x98" => "\xe1\xbf\x90", 1270 | "\xe1\xbf\x99" => "\xe1\xbf\x91", 1271 | "\xe1\xbf\x9a" => "\xe1\xbd\xb6", 1272 | "\xe1\xbf\x9b" => "\xe1\xbd\xb7", 1273 | "\xe1\xbf\xa9" => "\xe1\xbf\xa1", 1274 | "\xe1\xbf\xaa" => "\xe1\xbd\xba", 1275 | "\xe1\xbf\xab" => "\xe1\xbd\xbb", 1276 | "\xe1\xbf\xac" => "\xe1\xbf\xa5", 1277 | "\xe1\xbf\xb8" => "\xe1\xbd\xb8", 1278 | "\xe1\xbf\xb9" => "\xe1\xbd\xb9", 1279 | "\xe1\xbf\xba" => "\xe1\xbd\xbc", 1280 | "\xe1\xbf\xbb" => "\xe1\xbd\xbd", 1281 | "\xe1\xbf\xbc" => "\xe1\xbf\xb3", 1282 | "\xef\xbc\xa1" => "\xef\xbd\x81", 1283 | "\xef\xbc\xa2" => "\xef\xbd\x82", 1284 | "\xef\xbc\xa3" => "\xef\xbd\x83", 1285 | "\xef\xbc\xa4" => "\xef\xbd\x84", 1286 | "\xef\xbc\xa5" => "\xef\xbd\x85", 1287 | "\xef\xbc\xa6" => "\xef\xbd\x86", 1288 | "\xef\xbc\xa7" => "\xef\xbd\x87", 1289 | "\xef\xbc\xa8" => "\xef\xbd\x88", 1290 | "\xef\xbc\xa9" => "\xef\xbd\x89", 1291 | "\xef\xbc\xaa" => "\xef\xbd\x8a", 1292 | "\xef\xbc\xab" => "\xef\xbd\x8b", 1293 | "\xef\xbc\xac" => "\xef\xbd\x8c", 1294 | "\xef\xbc\xad" => "\xef\xbd\x8d", 1295 | "\xef\xbc\xae" => "\xef\xbd\x8e", 1296 | "\xef\xbc\xaf" => "\xef\xbd\x8f", 1297 | "\xef\xbc\xb0" => "\xef\xbd\x90", 1298 | "\xef\xbc\xb1" => "\xef\xbd\x91", 1299 | "\xef\xbc\xb2" => "\xef\xbd\x92", 1300 | "\xef\xbc\xb3" => "\xef\xbd\x93", 1301 | "\xef\xbc\xb4" => "\xef\xbd\x94", 1302 | "\xef\xbc\xb5" => "\xef\xbd\x95", 1303 | "\xef\xbc\xb6" => "\xef\xbd\x96", 1304 | "\xef\xbc\xb7" => "\xef\xbd\x97", 1305 | "\xef\xbc\xb8" => "\xef\xbd\x98", 1306 | "\xef\xbc\xb9" => "\xef\xbd\x99", 1307 | "\xef\xbc\xba" => "\xef\xbd\x9a", 1308 | ); 1309 | 1310 | /** 1311 | * Unicode Character Database 6.0.0 (2010-06-04) 1312 | * Autogenerated by unicode_blocks_txt2php() PHP function at 2011-06-04 00:19:39, 209 blocks total 1313 | * 1314 | * @var array 1315 | */ 1316 | public 
static $unicode_blocks = array( 1317 | 'Basic Latin' => array( 1318 | 0 => 0x0000, 1319 | 1 => 0x007F, 1320 | 2 => 0, 1321 | ), 1322 | 'Latin-1 Supplement' => array( 1323 | 0 => 0x0080, 1324 | 1 => 0x00FF, 1325 | 2 => 1, 1326 | ), 1327 | 'Latin Extended-A' => array( 1328 | 0 => 0x0100, 1329 | 1 => 0x017F, 1330 | 2 => 2, 1331 | ), 1332 | 'Latin Extended-B' => array( 1333 | 0 => 0x0180, 1334 | 1 => 0x024F, 1335 | 2 => 3, 1336 | ), 1337 | 'IPA Extensions' => array( 1338 | 0 => 0x0250, 1339 | 1 => 0x02AF, 1340 | 2 => 4, 1341 | ), 1342 | 'Spacing Modifier Letters' => array( 1343 | 0 => 0x02B0, 1344 | 1 => 0x02FF, 1345 | 2 => 5, 1346 | ), 1347 | 'Combining Diacritical Marks' => array( 1348 | 0 => 0x0300, 1349 | 1 => 0x036F, 1350 | 2 => 6, 1351 | ), 1352 | 'Greek and Coptic' => array( 1353 | 0 => 0x0370, 1354 | 1 => 0x03FF, 1355 | 2 => 7, 1356 | ), 1357 | 'Cyrillic' => array( 1358 | 0 => 0x0400, 1359 | 1 => 0x04FF, 1360 | 2 => 8, 1361 | ), 1362 | 'Cyrillic Supplement' => array( 1363 | 0 => 0x0500, 1364 | 1 => 0x052F, 1365 | 2 => 9, 1366 | ), 1367 | 'Armenian' => array( 1368 | 0 => 0x0530, 1369 | 1 => 0x058F, 1370 | 2 => 10, 1371 | ), 1372 | 'Hebrew' => array( 1373 | 0 => 0x0590, 1374 | 1 => 0x05FF, 1375 | 2 => 11, 1376 | ), 1377 | 'Arabic' => array( 1378 | 0 => 0x0600, 1379 | 1 => 0x06FF, 1380 | 2 => 12, 1381 | ), 1382 | 'Syriac' => array( 1383 | 0 => 0x0700, 1384 | 1 => 0x074F, 1385 | 2 => 13, 1386 | ), 1387 | 'Arabic Supplement' => array( 1388 | 0 => 0x0750, 1389 | 1 => 0x077F, 1390 | 2 => 14, 1391 | ), 1392 | 'Thaana' => array( 1393 | 0 => 0x0780, 1394 | 1 => 0x07BF, 1395 | 2 => 15, 1396 | ), 1397 | 'NKo' => array( 1398 | 0 => 0x07C0, 1399 | 1 => 0x07FF, 1400 | 2 => 16, 1401 | ), 1402 | 'Samaritan' => array( 1403 | 0 => 0x0800, 1404 | 1 => 0x083F, 1405 | 2 => 17, 1406 | ), 1407 | 'Mandaic' => array( 1408 | 0 => 0x0840, 1409 | 1 => 0x085F, 1410 | 2 => 18, 1411 | ), 1412 | 'Devanagari' => array( 1413 | 0 => 0x0900, 1414 | 1 => 0x097F, 1415 | 2 => 19, 1416 | ), 1417 | 
'Bengali' => array( 1418 | 0 => 0x0980, 1419 | 1 => 0x09FF, 1420 | 2 => 20, 1421 | ), 1422 | 'Gurmukhi' => array( 1423 | 0 => 0x0A00, 1424 | 1 => 0x0A7F, 1425 | 2 => 21, 1426 | ), 1427 | 'Gujarati' => array( 1428 | 0 => 0x0A80, 1429 | 1 => 0x0AFF, 1430 | 2 => 22, 1431 | ), 1432 | 'Oriya' => array( 1433 | 0 => 0x0B00, 1434 | 1 => 0x0B7F, 1435 | 2 => 23, 1436 | ), 1437 | 'Tamil' => array( 1438 | 0 => 0x0B80, 1439 | 1 => 0x0BFF, 1440 | 2 => 24, 1441 | ), 1442 | 'Telugu' => array( 1443 | 0 => 0x0C00, 1444 | 1 => 0x0C7F, 1445 | 2 => 25, 1446 | ), 1447 | 'Kannada' => array( 1448 | 0 => 0x0C80, 1449 | 1 => 0x0CFF, 1450 | 2 => 26, 1451 | ), 1452 | 'Malayalam' => array( 1453 | 0 => 0x0D00, 1454 | 1 => 0x0D7F, 1455 | 2 => 27, 1456 | ), 1457 | 'Sinhala' => array( 1458 | 0 => 0x0D80, 1459 | 1 => 0x0DFF, 1460 | 2 => 28, 1461 | ), 1462 | 'Thai' => array( 1463 | 0 => 0x0E00, 1464 | 1 => 0x0E7F, 1465 | 2 => 29, 1466 | ), 1467 | 'Lao' => array( 1468 | 0 => 0x0E80, 1469 | 1 => 0x0EFF, 1470 | 2 => 30, 1471 | ), 1472 | 'Tibetan' => array( 1473 | 0 => 0x0F00, 1474 | 1 => 0x0FFF, 1475 | 2 => 31, 1476 | ), 1477 | 'Myanmar' => array( 1478 | 0 => 0x1000, 1479 | 1 => 0x109F, 1480 | 2 => 32, 1481 | ), 1482 | 'Georgian' => array( 1483 | 0 => 0x10A0, 1484 | 1 => 0x10FF, 1485 | 2 => 33, 1486 | ), 1487 | 'Hangul Jamo' => array( 1488 | 0 => 0x1100, 1489 | 1 => 0x11FF, 1490 | 2 => 34, 1491 | ), 1492 | 'Ethiopic' => array( 1493 | 0 => 0x1200, 1494 | 1 => 0x137F, 1495 | 2 => 35, 1496 | ), 1497 | 'Ethiopic Supplement' => array( 1498 | 0 => 0x1380, 1499 | 1 => 0x139F, 1500 | 2 => 36, 1501 | ), 1502 | 'Cherokee' => array( 1503 | 0 => 0x13A0, 1504 | 1 => 0x13FF, 1505 | 2 => 37, 1506 | ), 1507 | 'Unified Canadian Aboriginal Syllabics' => array( 1508 | 0 => 0x1400, 1509 | 1 => 0x167F, 1510 | 2 => 38, 1511 | ), 1512 | 'Ogham' => array( 1513 | 0 => 0x1680, 1514 | 1 => 0x169F, 1515 | 2 => 39, 1516 | ), 1517 | 'Runic' => array( 1518 | 0 => 0x16A0, 1519 | 1 => 0x16FF, 1520 | 2 => 40, 1521 | ), 1522 | 'Tagalog' 
=> array( 1523 | 0 => 0x1700, 1524 | 1 => 0x171F, 1525 | 2 => 41, 1526 | ), 1527 | 'Hanunoo' => array( 1528 | 0 => 0x1720, 1529 | 1 => 0x173F, 1530 | 2 => 42, 1531 | ), 1532 | 'Buhid' => array( 1533 | 0 => 0x1740, 1534 | 1 => 0x175F, 1535 | 2 => 43, 1536 | ), 1537 | 'Tagbanwa' => array( 1538 | 0 => 0x1760, 1539 | 1 => 0x177F, 1540 | 2 => 44, 1541 | ), 1542 | 'Khmer' => array( 1543 | 0 => 0x1780, 1544 | 1 => 0x17FF, 1545 | 2 => 45, 1546 | ), 1547 | 'Mongolian' => array( 1548 | 0 => 0x1800, 1549 | 1 => 0x18AF, 1550 | 2 => 46, 1551 | ), 1552 | 'Unified Canadian Aboriginal Syllabics Extended' => array( 1553 | 0 => 0x18B0, 1554 | 1 => 0x18FF, 1555 | 2 => 47, 1556 | ), 1557 | 'Limbu' => array( 1558 | 0 => 0x1900, 1559 | 1 => 0x194F, 1560 | 2 => 48, 1561 | ), 1562 | 'Tai Le' => array( 1563 | 0 => 0x1950, 1564 | 1 => 0x197F, 1565 | 2 => 49, 1566 | ), 1567 | 'New Tai Lue' => array( 1568 | 0 => 0x1980, 1569 | 1 => 0x19DF, 1570 | 2 => 50, 1571 | ), 1572 | 'Khmer Symbols' => array( 1573 | 0 => 0x19E0, 1574 | 1 => 0x19FF, 1575 | 2 => 51, 1576 | ), 1577 | 'Buginese' => array( 1578 | 0 => 0x1A00, 1579 | 1 => 0x1A1F, 1580 | 2 => 52, 1581 | ), 1582 | 'Tai Tham' => array( 1583 | 0 => 0x1A20, 1584 | 1 => 0x1AAF, 1585 | 2 => 53, 1586 | ), 1587 | 'Balinese' => array( 1588 | 0 => 0x1B00, 1589 | 1 => 0x1B7F, 1590 | 2 => 54, 1591 | ), 1592 | 'Sundanese' => array( 1593 | 0 => 0x1B80, 1594 | 1 => 0x1BBF, 1595 | 2 => 55, 1596 | ), 1597 | 'Batak' => array( 1598 | 0 => 0x1BC0, 1599 | 1 => 0x1BFF, 1600 | 2 => 56, 1601 | ), 1602 | 'Lepcha' => array( 1603 | 0 => 0x1C00, 1604 | 1 => 0x1C4F, 1605 | 2 => 57, 1606 | ), 1607 | 'Ol Chiki' => array( 1608 | 0 => 0x1C50, 1609 | 1 => 0x1C7F, 1610 | 2 => 58, 1611 | ), 1612 | 'Vedic Extensions' => array( 1613 | 0 => 0x1CD0, 1614 | 1 => 0x1CFF, 1615 | 2 => 59, 1616 | ), 1617 | 'Phonetic Extensions' => array( 1618 | 0 => 0x1D00, 1619 | 1 => 0x1D7F, 1620 | 2 => 60, 1621 | ), 1622 | 'Phonetic Extensions Supplement' => array( 1623 | 0 => 0x1D80, 1624 | 1 => 
0x1DBF, 1625 | 2 => 61, 1626 | ), 1627 | 'Combining Diacritical Marks Supplement' => array( 1628 | 0 => 0x1DC0, 1629 | 1 => 0x1DFF, 1630 | 2 => 62, 1631 | ), 1632 | 'Latin Extended Additional' => array( 1633 | 0 => 0x1E00, 1634 | 1 => 0x1EFF, 1635 | 2 => 63, 1636 | ), 1637 | 'Greek Extended' => array( 1638 | 0 => 0x1F00, 1639 | 1 => 0x1FFF, 1640 | 2 => 64, 1641 | ), 1642 | 'General Punctuation' => array( 1643 | 0 => 0x2000, 1644 | 1 => 0x206F, 1645 | 2 => 65, 1646 | ), 1647 | 'Superscripts and Subscripts' => array( 1648 | 0 => 0x2070, 1649 | 1 => 0x209F, 1650 | 2 => 66, 1651 | ), 1652 | 'Currency Symbols' => array( 1653 | 0 => 0x20A0, 1654 | 1 => 0x20CF, 1655 | 2 => 67, 1656 | ), 1657 | 'Combining Diacritical Marks for Symbols' => array( 1658 | 0 => 0x20D0, 1659 | 1 => 0x20FF, 1660 | 2 => 68, 1661 | ), 1662 | 'Letterlike Symbols' => array( 1663 | 0 => 0x2100, 1664 | 1 => 0x214F, 1665 | 2 => 69, 1666 | ), 1667 | 'Number Forms' => array( 1668 | 0 => 0x2150, 1669 | 1 => 0x218F, 1670 | 2 => 70, 1671 | ), 1672 | 'Arrows' => array( 1673 | 0 => 0x2190, 1674 | 1 => 0x21FF, 1675 | 2 => 71, 1676 | ), 1677 | 'Mathematical Operators' => array( 1678 | 0 => 0x2200, 1679 | 1 => 0x22FF, 1680 | 2 => 72, 1681 | ), 1682 | 'Miscellaneous Technical' => array( 1683 | 0 => 0x2300, 1684 | 1 => 0x23FF, 1685 | 2 => 73, 1686 | ), 1687 | 'Control Pictures' => array( 1688 | 0 => 0x2400, 1689 | 1 => 0x243F, 1690 | 2 => 74, 1691 | ), 1692 | 'Optical Character Recognition' => array( 1693 | 0 => 0x2440, 1694 | 1 => 0x245F, 1695 | 2 => 75, 1696 | ), 1697 | 'Enclosed Alphanumerics' => array( 1698 | 0 => 0x2460, 1699 | 1 => 0x24FF, 1700 | 2 => 76, 1701 | ), 1702 | 'Box Drawing' => array( 1703 | 0 => 0x2500, 1704 | 1 => 0x257F, 1705 | 2 => 77, 1706 | ), 1707 | 'Block Elements' => array( 1708 | 0 => 0x2580, 1709 | 1 => 0x259F, 1710 | 2 => 78, 1711 | ), 1712 | 'Geometric Shapes' => array( 1713 | 0 => 0x25A0, 1714 | 1 => 0x25FF, 1715 | 2 => 79, 1716 | ), 1717 | 'Miscellaneous Symbols' => array( 1718 | 0 
=> 0x2600, 1719 | 1 => 0x26FF, 1720 | 2 => 80, 1721 | ), 1722 | 'Dingbats' => array( 1723 | 0 => 0x2700, 1724 | 1 => 0x27BF, 1725 | 2 => 81, 1726 | ), 1727 | 'Miscellaneous Mathematical Symbols-A' => array( 1728 | 0 => 0x27C0, 1729 | 1 => 0x27EF, 1730 | 2 => 82, 1731 | ), 1732 | 'Supplemental Arrows-A' => array( 1733 | 0 => 0x27F0, 1734 | 1 => 0x27FF, 1735 | 2 => 83, 1736 | ), 1737 | 'Braille Patterns' => array( 1738 | 0 => 0x2800, 1739 | 1 => 0x28FF, 1740 | 2 => 84, 1741 | ), 1742 | 'Supplemental Arrows-B' => array( 1743 | 0 => 0x2900, 1744 | 1 => 0x297F, 1745 | 2 => 85, 1746 | ), 1747 | 'Miscellaneous Mathematical Symbols-B' => array( 1748 | 0 => 0x2980, 1749 | 1 => 0x29FF, 1750 | 2 => 86, 1751 | ), 1752 | 'Supplemental Mathematical Operators' => array( 1753 | 0 => 0x2A00, 1754 | 1 => 0x2AFF, 1755 | 2 => 87, 1756 | ), 1757 | 'Miscellaneous Symbols and Arrows' => array( 1758 | 0 => 0x2B00, 1759 | 1 => 0x2BFF, 1760 | 2 => 88, 1761 | ), 1762 | 'Glagolitic' => array( 1763 | 0 => 0x2C00, 1764 | 1 => 0x2C5F, 1765 | 2 => 89, 1766 | ), 1767 | 'Latin Extended-C' => array( 1768 | 0 => 0x2C60, 1769 | 1 => 0x2C7F, 1770 | 2 => 90, 1771 | ), 1772 | 'Coptic' => array( 1773 | 0 => 0x2C80, 1774 | 1 => 0x2CFF, 1775 | 2 => 91, 1776 | ), 1777 | 'Georgian Supplement' => array( 1778 | 0 => 0x2D00, 1779 | 1 => 0x2D2F, 1780 | 2 => 92, 1781 | ), 1782 | 'Tifinagh' => array( 1783 | 0 => 0x2D30, 1784 | 1 => 0x2D7F, 1785 | 2 => 93, 1786 | ), 1787 | 'Ethiopic Extended' => array( 1788 | 0 => 0x2D80, 1789 | 1 => 0x2DDF, 1790 | 2 => 94, 1791 | ), 1792 | 'Cyrillic Extended-A' => array( 1793 | 0 => 0x2DE0, 1794 | 1 => 0x2DFF, 1795 | 2 => 95, 1796 | ), 1797 | 'Supplemental Punctuation' => array( 1798 | 0 => 0x2E00, 1799 | 1 => 0x2E7F, 1800 | 2 => 96, 1801 | ), 1802 | 'CJK Radicals Supplement' => array( 1803 | 0 => 0x2E80, 1804 | 1 => 0x2EFF, 1805 | 2 => 97, 1806 | ), 1807 | 'Kangxi Radicals' => array( 1808 | 0 => 0x2F00, 1809 | 1 => 0x2FDF, 1810 | 2 => 98, 1811 | ), 1812 | 'Ideographic Description 
Characters' => array( 1813 | 0 => 0x2FF0, 1814 | 1 => 0x2FFF, 1815 | 2 => 99, 1816 | ), 1817 | 'CJK Symbols and Punctuation' => array( 1818 | 0 => 0x3000, 1819 | 1 => 0x303F, 1820 | 2 => 100, 1821 | ), 1822 | 'Hiragana' => array( 1823 | 0 => 0x3040, 1824 | 1 => 0x309F, 1825 | 2 => 101, 1826 | ), 1827 | 'Katakana' => array( 1828 | 0 => 0x30A0, 1829 | 1 => 0x30FF, 1830 | 2 => 102, 1831 | ), 1832 | 'Bopomofo' => array( 1833 | 0 => 0x3100, 1834 | 1 => 0x312F, 1835 | 2 => 103, 1836 | ), 1837 | 'Hangul Compatibility Jamo' => array( 1838 | 0 => 0x3130, 1839 | 1 => 0x318F, 1840 | 2 => 104, 1841 | ), 1842 | 'Kanbun' => array( 1843 | 0 => 0x3190, 1844 | 1 => 0x319F, 1845 | 2 => 105, 1846 | ), 1847 | 'Bopomofo Extended' => array( 1848 | 0 => 0x31A0, 1849 | 1 => 0x31BF, 1850 | 2 => 106, 1851 | ), 1852 | 'CJK Strokes' => array( 1853 | 0 => 0x31C0, 1854 | 1 => 0x31EF, 1855 | 2 => 107, 1856 | ), 1857 | 'Katakana Phonetic Extensions' => array( 1858 | 0 => 0x31F0, 1859 | 1 => 0x31FF, 1860 | 2 => 108, 1861 | ), 1862 | 'Enclosed CJK Letters and Months' => array( 1863 | 0 => 0x3200, 1864 | 1 => 0x32FF, 1865 | 2 => 109, 1866 | ), 1867 | 'CJK Compatibility' => array( 1868 | 0 => 0x3300, 1869 | 1 => 0x33FF, 1870 | 2 => 110, 1871 | ), 1872 | 'CJK Unified Ideographs Extension A' => array( 1873 | 0 => 0x3400, 1874 | 1 => 0x4DBF, 1875 | 2 => 111, 1876 | ), 1877 | 'Yijing Hexagram Symbols' => array( 1878 | 0 => 0x4DC0, 1879 | 1 => 0x4DFF, 1880 | 2 => 112, 1881 | ), 1882 | 'CJK Unified Ideographs' => array( 1883 | 0 => 0x4E00, 1884 | 1 => 0x9FFF, 1885 | 2 => 113, 1886 | ), 1887 | 'Yi Syllables' => array( 1888 | 0 => 0xA000, 1889 | 1 => 0xA48F, 1890 | 2 => 114, 1891 | ), 1892 | 'Yi Radicals' => array( 1893 | 0 => 0xA490, 1894 | 1 => 0xA4CF, 1895 | 2 => 115, 1896 | ), 1897 | 'Lisu' => array( 1898 | 0 => 0xA4D0, 1899 | 1 => 0xA4FF, 1900 | 2 => 116, 1901 | ), 1902 | 'Vai' => array( 1903 | 0 => 0xA500, 1904 | 1 => 0xA63F, 1905 | 2 => 117, 1906 | ), 1907 | 'Cyrillic Extended-B' => array( 1908 | 0 => 
0xA640, 1909 | 1 => 0xA69F, 1910 | 2 => 118, 1911 | ), 1912 | 'Bamum' => array( 1913 | 0 => 0xA6A0, 1914 | 1 => 0xA6FF, 1915 | 2 => 119, 1916 | ), 1917 | 'Modifier Tone Letters' => array( 1918 | 0 => 0xA700, 1919 | 1 => 0xA71F, 1920 | 2 => 120, 1921 | ), 1922 | 'Latin Extended-D' => array( 1923 | 0 => 0xA720, 1924 | 1 => 0xA7FF, 1925 | 2 => 121, 1926 | ), 1927 | 'Syloti Nagri' => array( 1928 | 0 => 0xA800, 1929 | 1 => 0xA82F, 1930 | 2 => 122, 1931 | ), 1932 | 'Common Indic Number Forms' => array( 1933 | 0 => 0xA830, 1934 | 1 => 0xA83F, 1935 | 2 => 123, 1936 | ), 1937 | 'Phags-pa' => array( 1938 | 0 => 0xA840, 1939 | 1 => 0xA87F, 1940 | 2 => 124, 1941 | ), 1942 | 'Saurashtra' => array( 1943 | 0 => 0xA880, 1944 | 1 => 0xA8DF, 1945 | 2 => 125, 1946 | ), 1947 | 'Devanagari Extended' => array( 1948 | 0 => 0xA8E0, 1949 | 1 => 0xA8FF, 1950 | 2 => 126, 1951 | ), 1952 | 'Kayah Li' => array( 1953 | 0 => 0xA900, 1954 | 1 => 0xA92F, 1955 | 2 => 127, 1956 | ), 1957 | 'Rejang' => array( 1958 | 0 => 0xA930, 1959 | 1 => 0xA95F, 1960 | 2 => 128, 1961 | ), 1962 | 'Hangul Jamo Extended-A' => array( 1963 | 0 => 0xA960, 1964 | 1 => 0xA97F, 1965 | 2 => 129, 1966 | ), 1967 | 'Javanese' => array( 1968 | 0 => 0xA980, 1969 | 1 => 0xA9DF, 1970 | 2 => 130, 1971 | ), 1972 | 'Cham' => array( 1973 | 0 => 0xAA00, 1974 | 1 => 0xAA5F, 1975 | 2 => 131, 1976 | ), 1977 | 'Myanmar Extended-A' => array( 1978 | 0 => 0xAA60, 1979 | 1 => 0xAA7F, 1980 | 2 => 132, 1981 | ), 1982 | 'Tai Viet' => array( 1983 | 0 => 0xAA80, 1984 | 1 => 0xAADF, 1985 | 2 => 133, 1986 | ), 1987 | 'Ethiopic Extended-A' => array( 1988 | 0 => 0xAB00, 1989 | 1 => 0xAB2F, 1990 | 2 => 134, 1991 | ), 1992 | 'Meetei Mayek' => array( 1993 | 0 => 0xABC0, 1994 | 1 => 0xABFF, 1995 | 2 => 135, 1996 | ), 1997 | 'Hangul Syllables' => array( 1998 | 0 => 0xAC00, 1999 | 1 => 0xD7AF, 2000 | 2 => 136, 2001 | ), 2002 | 'Hangul Jamo Extended-B' => array( 2003 | 0 => 0xD7B0, 2004 | 1 => 0xD7FF, 2005 | 2 => 137, 2006 | ), 2007 | 'High Surrogates' => 
array( 2008 | 0 => 0xD800, 2009 | 1 => 0xDB7F, 2010 | 2 => 138, 2011 | ), 2012 | 'High Private Use Surrogates' => array( 2013 | 0 => 0xDB80, 2014 | 1 => 0xDBFF, 2015 | 2 => 139, 2016 | ), 2017 | 'Low Surrogates' => array( 2018 | 0 => 0xDC00, 2019 | 1 => 0xDFFF, 2020 | 2 => 140, 2021 | ), 2022 | 'Private Use Area' => array( 2023 | 0 => 0xE000, 2024 | 1 => 0xF8FF, 2025 | 2 => 141, 2026 | ), 2027 | 'CJK Compatibility Ideographs' => array( 2028 | 0 => 0xF900, 2029 | 1 => 0xFAFF, 2030 | 2 => 142, 2031 | ), 2032 | 'Alphabetic Presentation Forms' => array( 2033 | 0 => 0xFB00, 2034 | 1 => 0xFB4F, 2035 | 2 => 143, 2036 | ), 2037 | 'Arabic Presentation Forms-A' => array( 2038 | 0 => 0xFB50, 2039 | 1 => 0xFDFF, 2040 | 2 => 144, 2041 | ), 2042 | 'Variation Selectors' => array( 2043 | 0 => 0xFE00, 2044 | 1 => 0xFE0F, 2045 | 2 => 145, 2046 | ), 2047 | 'Vertical Forms' => array( 2048 | 0 => 0xFE10, 2049 | 1 => 0xFE1F, 2050 | 2 => 146, 2051 | ), 2052 | 'Combining Half Marks' => array( 2053 | 0 => 0xFE20, 2054 | 1 => 0xFE2F, 2055 | 2 => 147, 2056 | ), 2057 | 'CJK Compatibility Forms' => array( 2058 | 0 => 0xFE30, 2059 | 1 => 0xFE4F, 2060 | 2 => 148, 2061 | ), 2062 | 'Small Form Variants' => array( 2063 | 0 => 0xFE50, 2064 | 1 => 0xFE6F, 2065 | 2 => 149, 2066 | ), 2067 | 'Arabic Presentation Forms-B' => array( 2068 | 0 => 0xFE70, 2069 | 1 => 0xFEFF, 2070 | 2 => 150, 2071 | ), 2072 | 'Halfwidth and Fullwidth Forms' => array( 2073 | 0 => 0xFF00, 2074 | 1 => 0xFFEF, 2075 | 2 => 151, 2076 | ), 2077 | 'Specials' => array( 2078 | 0 => 0xFFF0, 2079 | 1 => 0xFFFF, 2080 | 2 => 152, 2081 | ), 2082 | 'Linear B Syllabary' => array( 2083 | 0 => 0x10000, 2084 | 1 => 0x1007F, 2085 | 2 => 153, 2086 | ), 2087 | 'Linear B Ideograms' => array( 2088 | 0 => 0x10080, 2089 | 1 => 0x100FF, 2090 | 2 => 154, 2091 | ), 2092 | 'Aegean Numbers' => array( 2093 | 0 => 0x10100, 2094 | 1 => 0x1013F, 2095 | 2 => 155, 2096 | ), 2097 | 'Ancient Greek Numbers' => array( 2098 | 0 => 0x10140, 2099 | 1 => 0x1018F, 2100 | 
2 => 156, 2101 | ), 2102 | 'Ancient Symbols' => array( 2103 | 0 => 0x10190, 2104 | 1 => 0x101CF, 2105 | 2 => 157, 2106 | ), 2107 | 'Phaistos Disc' => array( 2108 | 0 => 0x101D0, 2109 | 1 => 0x101FF, 2110 | 2 => 158, 2111 | ), 2112 | 'Lycian' => array( 2113 | 0 => 0x10280, 2114 | 1 => 0x1029F, 2115 | 2 => 159, 2116 | ), 2117 | 'Carian' => array( 2118 | 0 => 0x102A0, 2119 | 1 => 0x102DF, 2120 | 2 => 160, 2121 | ), 2122 | 'Old Italic' => array( 2123 | 0 => 0x10300, 2124 | 1 => 0x1032F, 2125 | 2 => 161, 2126 | ), 2127 | 'Gothic' => array( 2128 | 0 => 0x10330, 2129 | 1 => 0x1034F, 2130 | 2 => 162, 2131 | ), 2132 | 'Ugaritic' => array( 2133 | 0 => 0x10380, 2134 | 1 => 0x1039F, 2135 | 2 => 163, 2136 | ), 2137 | 'Old Persian' => array( 2138 | 0 => 0x103A0, 2139 | 1 => 0x103DF, 2140 | 2 => 164, 2141 | ), 2142 | 'Deseret' => array( 2143 | 0 => 0x10400, 2144 | 1 => 0x1044F, 2145 | 2 => 165, 2146 | ), 2147 | 'Shavian' => array( 2148 | 0 => 0x10450, 2149 | 1 => 0x1047F, 2150 | 2 => 166, 2151 | ), 2152 | 'Osmanya' => array( 2153 | 0 => 0x10480, 2154 | 1 => 0x104AF, 2155 | 2 => 167, 2156 | ), 2157 | 'Cypriot Syllabary' => array( 2158 | 0 => 0x10800, 2159 | 1 => 0x1083F, 2160 | 2 => 168, 2161 | ), 2162 | 'Imperial Aramaic' => array( 2163 | 0 => 0x10840, 2164 | 1 => 0x1085F, 2165 | 2 => 169, 2166 | ), 2167 | 'Phoenician' => array( 2168 | 0 => 0x10900, 2169 | 1 => 0x1091F, 2170 | 2 => 170, 2171 | ), 2172 | 'Lydian' => array( 2173 | 0 => 0x10920, 2174 | 1 => 0x1093F, 2175 | 2 => 171, 2176 | ), 2177 | 'Kharoshthi' => array( 2178 | 0 => 0x10A00, 2179 | 1 => 0x10A5F, 2180 | 2 => 172, 2181 | ), 2182 | 'Old South Arabian' => array( 2183 | 0 => 0x10A60, 2184 | 1 => 0x10A7F, 2185 | 2 => 173, 2186 | ), 2187 | 'Avestan' => array( 2188 | 0 => 0x10B00, 2189 | 1 => 0x10B3F, 2190 | 2 => 174, 2191 | ), 2192 | 'Inscriptional Parthian' => array( 2193 | 0 => 0x10B40, 2194 | 1 => 0x10B5F, 2195 | 2 => 175, 2196 | ), 2197 | 'Inscriptional Pahlavi' => array( 2198 | 0 => 0x10B60, 2199 | 1 => 0x10B7F, 2200 
| 2 => 176, 2201 | ), 2202 | 'Old Turkic' => array( 2203 | 0 => 0x10C00, 2204 | 1 => 0x10C4F, 2205 | 2 => 177, 2206 | ), 2207 | 'Rumi Numeral Symbols' => array( 2208 | 0 => 0x10E60, 2209 | 1 => 0x10E7F, 2210 | 2 => 178, 2211 | ), 2212 | 'Brahmi' => array( 2213 | 0 => 0x11000, 2214 | 1 => 0x1107F, 2215 | 2 => 179, 2216 | ), 2217 | 'Kaithi' => array( 2218 | 0 => 0x11080, 2219 | 1 => 0x110CF, 2220 | 2 => 180, 2221 | ), 2222 | 'Cuneiform' => array( 2223 | 0 => 0x12000, 2224 | 1 => 0x123FF, 2225 | 2 => 181, 2226 | ), 2227 | 'Cuneiform Numbers and Punctuation' => array( 2228 | 0 => 0x12400, 2229 | 1 => 0x1247F, 2230 | 2 => 182, 2231 | ), 2232 | 'Egyptian Hieroglyphs' => array( 2233 | 0 => 0x13000, 2234 | 1 => 0x1342F, 2235 | 2 => 183, 2236 | ), 2237 | 'Bamum Supplement' => array( 2238 | 0 => 0x16800, 2239 | 1 => 0x16A3F, 2240 | 2 => 184, 2241 | ), 2242 | 'Kana Supplement' => array( 2243 | 0 => 0x1B000, 2244 | 1 => 0x1B0FF, 2245 | 2 => 185, 2246 | ), 2247 | 'Byzantine Musical Symbols' => array( 2248 | 0 => 0x1D000, 2249 | 1 => 0x1D0FF, 2250 | 2 => 186, 2251 | ), 2252 | 'Musical Symbols' => array( 2253 | 0 => 0x1D100, 2254 | 1 => 0x1D1FF, 2255 | 2 => 187, 2256 | ), 2257 | 'Ancient Greek Musical Notation' => array( 2258 | 0 => 0x1D200, 2259 | 1 => 0x1D24F, 2260 | 2 => 188, 2261 | ), 2262 | 'Tai Xuan Jing Symbols' => array( 2263 | 0 => 0x1D300, 2264 | 1 => 0x1D35F, 2265 | 2 => 189, 2266 | ), 2267 | 'Counting Rod Numerals' => array( 2268 | 0 => 0x1D360, 2269 | 1 => 0x1D37F, 2270 | 2 => 190, 2271 | ), 2272 | 'Mathematical Alphanumeric Symbols' => array( 2273 | 0 => 0x1D400, 2274 | 1 => 0x1D7FF, 2275 | 2 => 191, 2276 | ), 2277 | 'Mahjong Tiles' => array( 2278 | 0 => 0x1F000, 2279 | 1 => 0x1F02F, 2280 | 2 => 192, 2281 | ), 2282 | 'Domino Tiles' => array( 2283 | 0 => 0x1F030, 2284 | 1 => 0x1F09F, 2285 | 2 => 193, 2286 | ), 2287 | 'Playing Cards' => array( 2288 | 0 => 0x1F0A0, 2289 | 1 => 0x1F0FF, 2290 | 2 => 194, 2291 | ), 2292 | 'Enclosed Alphanumeric Supplement' => array( 2293 
| 0 => 0x1F100, 2294 | 1 => 0x1F1FF, 2295 | 2 => 195, 2296 | ), 2297 | 'Enclosed Ideographic Supplement' => array( 2298 | 0 => 0x1F200, 2299 | 1 => 0x1F2FF, 2300 | 2 => 196, 2301 | ), 2302 | 'Miscellaneous Symbols And Pictographs' => array( 2303 | 0 => 0x1F300, 2304 | 1 => 0x1F5FF, 2305 | 2 => 197, 2306 | ), 2307 | 'Emoticons' => array( 2308 | 0 => 0x1F600, 2309 | 1 => 0x1F64F, 2310 | 2 => 198, 2311 | ), 2312 | 'Transport And Map Symbols' => array( 2313 | 0 => 0x1F680, 2314 | 1 => 0x1F6FF, 2315 | 2 => 199, 2316 | ), 2317 | 'Alchemical Symbols' => array( 2318 | 0 => 0x1F700, 2319 | 1 => 0x1F77F, 2320 | 2 => 200, 2321 | ), 2322 | 'CJK Unified Ideographs Extension B' => array( 2323 | 0 => 0x20000, 2324 | 1 => 0x2A6DF, 2325 | 2 => 201, 2326 | ), 2327 | 'CJK Unified Ideographs Extension C' => array( 2328 | 0 => 0x2A700, 2329 | 1 => 0x2B73F, 2330 | 2 => 202, 2331 | ), 2332 | 'CJK Unified Ideographs Extension D' => array( 2333 | 0 => 0x2B740, 2334 | 1 => 0x2B81F, 2335 | 2 => 203, 2336 | ), 2337 | 'CJK Compatibility Ideographs Supplement' => array( 2338 | 0 => 0x2F800, 2339 | 1 => 0x2FA1F, 2340 | 2 => 204, 2341 | ), 2342 | 'Tags' => array( 2343 | 0 => 0xE0000, 2344 | 1 => 0xE007F, 2345 | 2 => 205, 2346 | ), 2347 | 'Variation Selectors Supplement' => array( 2348 | 0 => 0xE0100, 2349 | 1 => 0xE01EF, 2350 | 2 => 206, 2351 | ), 2352 | 'Supplementary Private Use Area-A' => array( 2353 | 0 => 0xF0000, 2354 | 1 => 0xFFFFF, 2355 | 2 => 207, 2356 | ), 2357 | 'Supplementary Private Use Area-B' => array( 2358 | 0 => 0x100000, 2359 | 1 => 0x10FFFF, 2360 | 2 => 208, 2361 | ), 2362 | ); 2363 | 2364 | #the methods of this class may only be called statically!
2365 | private function __construct() {} 2366 | 2367 | /** 2368 | * Removes combining diacritical marks. 2369 | * Optionally records the removed marks so that self::diactrical_restore() can put them back. 2370 | * 2371 | * @param string|null $s 2372 | * @param array|null $additional_chars for example: "\xc2\xad" #soft hyphen = discretionary hyphen 2373 | * @param bool $is_can_restored 2374 | * @param array|null &$restore_table 2375 | * @return string|bool|null Returns FALSE if error occurred 2376 | */ 2377 | public static function diactrical_remove($s, $additional_chars = null, $is_can_restored = false, &$restore_table = null) 2378 | { 2379 | if (! ReflectionTypeHint::isValid()) return false; 2380 | if (! is_string($s) || $s === '') return $s; 2381 | 2382 | if ($additional_chars) 2383 | { 2384 | foreach ($additional_chars as $k => &$v) $v = preg_quote($v, '/'); 2385 | $re = '/((?>' . self::DIACTRICAL_RE . '|' . implode('|', $additional_chars) . ')+)/sxSX'; 2386 | } 2387 | else $re = '/((?>' . self::DIACTRICAL_RE . ')+)/sxSX'; 2388 | if (! $is_can_restored) return preg_replace($re, '', $s); 2389 | 2390 | $restore_table = array(); 2391 | $a = preg_split($re, $s, -1, PREG_SPLIT_DELIM_CAPTURE); 2392 | $c = count($a); 2393 | if ($c === 1) return $s; 2394 | $pos = 0; 2395 | $s2 = ''; 2396 | for ($i = 0; $i < $c - 1; $i += 2) 2397 | { 2398 | $s2 .= $a[$i]; 2399 | #remember the character (not byte!) positions 2400 | $pos += self::strlen($a[$i]); 2401 | $restore_table['offsets'][$pos] = $a[$i + 1]; 2402 | } 2403 | $restore_table['length'] = $pos + self::strlen(end($a)); 2404 | return $s2 . end($a); 2405 | } 2406 | 2407 | /** 2408 | * Restores combining diacritical marks removed by self::diactrical_remove(). 2409 | * Note: 2410 | * Restoration works only if the character positions and the number of characters have not changed!
2411 | * 2412 | * @see self::diactrical_remove() 2413 | * @param string|null $s 2414 | * @param array $restore_table 2415 | * @return string|bool|null Returns FALSE if error occurred (broken $restore_table) 2416 | */ 2417 | public static function diactrical_restore($s, array $restore_table) 2418 | { 2419 | if (! ReflectionTypeHint::isValid()) return false; 2420 | if (! is_string($s) || $s === '') return $s; 2421 | 2422 | if (! $restore_table) return $s; 2423 | if (! is_int(@$restore_table['length']) || 2424 | ! is_array(@$restore_table['offsets']) || 2425 | $restore_table['length'] !== self::strlen($s)) return false; 2426 | $a = array(); 2427 | $length = $offset = 0; 2428 | $s2 = ''; 2429 | foreach ($restore_table['offsets'] as $pos => $diactricals) 2430 | { 2431 | $length = $pos - $offset; 2432 | $s2 .= self::substr($s, $offset, $length) . $diactricals; 2433 | $offset = $pos; 2434 | } 2435 | return $s2 . self::substr($s, $offset, strlen($s)); 2436 | } 2437 | 2438 | /** 2439 | * Encodes data from another character encoding to UTF-8. 2440 | * 2441 | * @param array|scalar|null $data 2442 | * @param string $charset 2443 | * @return array|scalar|null Returns FALSE if error occurred 2444 | */ 2445 | public static function convert_from($data, $charset = 'cp1251') 2446 | { 2447 | if (! ReflectionTypeHint::isValid()) return false; 2448 | $charset = strtoupper($charset); 2449 | return self::_convert($data, $charset, 'UTF-8'); 2450 | } 2451 | 2452 | /** 2453 | * Encodes data from UTF-8 to another character encoding. 2454 | * 2455 | * @param array|scalar|null $data 2456 | * @param string $charset 2457 | * @return array|scalar|null Returns FALSE if error occurred 2458 | */ 2459 | public static function convert_to($data, $charset = 'cp1251') 2460 | { 2461 | if (! 
ReflectionTypeHint::isValid()) return false; 2462 | $charset = strtoupper($charset); 2463 | return self::_convert($data, 'UTF-8', $charset); 2464 | } 2465 | 2466 | /** 2467 | * Recoding the data of any structure to/from UTF-8. 2468 | * Arrays traversed recursively, recoded keys and values. 2469 | * 2470 | * @see mb_encoding_aliases() 2471 | * @param array|scalar|null $data 2472 | * @param string $charset_from 2473 | * @param string $charset_to 2474 | * @return array|scalar|null Returns FALSE if error occurred 2475 | */ 2476 | private static function _convert($data, $charset_from, $charset_to) 2477 | { 2478 | if (! ReflectionTypeHint::isValid()) return false; #for recursive calls 2479 | if ($charset_from === $charset_to) return $data; #speed improve 2480 | if (is_array($data)) 2481 | { 2482 | $d = array(); 2483 | foreach ($data as $k => &$v) 2484 | { 2485 | if (is_string($k)) 2486 | { 2487 | $k = self::_convert($k, $charset_from, $charset_to); 2488 | if (! is_string($k)) return false; 2489 | } 2490 | $d[$k] = self::_convert($v, $charset_from, $charset_to); 2491 | if ($d[$k] === false && ! is_bool($v)) return false; 2492 | } 2493 | return $d; 2494 | } 2495 | if (is_string($data)) 2496 | { 2497 | #smart behaviour for errors protected + speed improve 2498 | if ($charset_from === 'UTF-8' && ! self::is_utf8($data)) return $data; 2499 | if ($charset_to === 'UTF-8' && self::is_utf8($data)) return $data; 2500 | 2501 | #since PHP-5.3.x iconv() faster then mb_convert_encoding() 2502 | if (function_exists('iconv')) return iconv($charset_from, $charset_to . 
'//IGNORE//TRANSLIT', $data); 2503 | if (function_exists('mb_convert_encoding')) return mb_convert_encoding($data, $charset_to, $charset_from); 2504 | 2505 | #charset_from 2506 | if ($charset_from === 'ISO-8859-1') return utf8_encode($data); 2507 | if ($charset_from === 'UTF-16' || $charset_from === 'UCS-2') return self::_convert_from_utf16($data); 2508 | if ($charset_from === 'CP1251' || $charset_from === 'CP1259') return strtr($data, self::$cp1259_table); 2509 | if ($charset_from === 'KOI8-R') return strtr(convert_cyr_string($data, 'k', 'w'), self::$cp1259_table); 2510 | if ($charset_from === 'ISO-8859-5') return strtr(convert_cyr_string($data, 'i', 'w'), self::$cp1259_table); 2511 | if ($charset_from === 'CP866') return strtr(convert_cyr_string($data, 'a', 'w'), self::$cp1259_table); 2512 | if ($charset_from === 'MAC-CYRILLIC') return strtr(convert_cyr_string($data, 'm', 'w'), self::$cp1259_table); 2513 | 2514 | #charset_to 2515 | if ($charset_to === 'ISO-8859-1') return utf8_decode($data); 2516 | if ($charset_to === 'CP1251' || $charset_to === 'CP1259') return strtr($data, array_flip(self::$cp1259_table)); 2517 | 2518 | #last trying 2519 | if (function_exists('recode_string')) 2520 | { 2521 | $s = @recode_string($charset_from . '..' . $charset_to, $data); 2522 | if (is_string($s)) return $s; 2523 | } 2524 | 2525 | trigger_error('Convert "' . $charset_from . '" --> "' . $charset_to . '" is not supported native, "iconv" or "mbstring" extension required', E_USER_WARNING); 2526 | return false; 2527 | } 2528 | if (is_scalar($data) || is_null($data)) return $data; #~ null, integer, float, boolean 2529 | return false; #object or resource 2530 | } 2531 | 2532 | /** 2533 | * Convert UTF-16 / UCS-2 encoding string to UTF-8. 2534 | * Surrogates UTF-16 are supported! 2535 | * 2536 | * In Russian: 2537 | * Преобразует строку из кодировки UTF-16 / UCS-2 в UTF-8. 2538 | * Суррогаты UTF-16 поддерживаются! 
2539 | * 2540 | * @param string $s 2541 | * @param string $type 'BE' -- big endian byte order 2542 | * 'LE' -- little endian byte order 2543 | * @param bool $to_array return an array of characters instead of the whole string? 2544 | * @return string|array|bool UTF-8 string, array of characters or FALSE if error occurred 2545 | */ 2546 | private static function _convert_from_utf16($s, $type = 'BE', $to_array = false) 2547 | { 2548 | static $types = array( 2549 | 'BE' => 'n', #unsigned short (always 16 bit, big endian byte order) 2550 | 'LE' => 'v', #unsigned short (always 16 bit, little endian byte order) 2551 | ); 2552 | if (! array_key_exists($type, $types)) 2553 | { 2554 | trigger_error('Unexpected value in 2-nd parameter, "' . $type . '" given!', E_USER_WARNING); 2555 | return false; 2556 | } 2557 | #the fastest way: 2558 | if (function_exists('iconv') || function_exists('mb_convert_encoding')) 2559 | { 2560 | if (function_exists('iconv')) $s = iconv('UTF-16' . $type, 'UTF-8', $s); 2561 | elseif (function_exists('mb_convert_encoding')) $s = mb_convert_encoding($s, 'UTF-8', 'UTF-16' . $type); 2562 | if (! $to_array) return $s; 2563 | return self::str_split($s); 2564 | } 2565 | 2566 | /* 2567 | http://en.wikipedia.org/wiki/UTF-16 2568 | 2569 | The improvement that UTF-16 made over UCS-2 is its ability to encode 2570 | characters in planes 1-16, not just those in plane 0 (BMP). 2571 | 2572 | UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) 2573 | using a pair of 16-bit words, known as a surrogate pair. 2574 | First 0x10000 is subtracted from the code point to give a 20-bit value. 2575 | This is then split into two separate 10-bit values each of which is represented 2576 | as a surrogate with the most significant half placed in the first surrogate.
2577 | To allow safe use of simple word-oriented string processing, separate ranges 2578 | of values are used for the two surrogates: 0xD800-0xDBFF for the first, most 2579 | significant surrogate and 0xDC00-0xDFFF for the second, least significant surrogate. 2580 | 2581 | For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, 2582 | and the character at U+10FFFD, the upper limit of Unicode, becomes the sequence 0xDBFF 0xDFFD. 2583 | Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points 2584 | in the U+D800-U+DFFF range, so an individual code value from a surrogate pair does not ever 2585 | represent a character. 2586 | 2587 | http://www.russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm 2588 | http://www.russellcottrell.com/greek/utilities/UnicodeRanges.htm 2589 | 2590 | Conversion of a Unicode scalar value S to a surrogate pair : 2591 | H = Math.floor((S - 0x10000) / 0x400) + 0xD800; 2592 | L = ((S - 0x10000) % 0x400) + 0xDC00; 2593 | The conversion of a surrogate pair to a scalar value: 2594 | N = ((H - 0xD800) * 0x400) + (L - 0xDC00) + 0x10000; 2595 | */ 2596 | $a = array(); 2597 | $hi = false; 2598 | foreach (unpack($types[$type] . '*', $s) as $codepoint) 2599 | { 2600 | #surrogate process 2601 | if ($hi !== false) 2602 | { 2603 | $lo = $codepoint; 2604 | if ($lo < 0xDC00 || $lo > 0xDFFF) $a[] = "\xEF\xBF\xBD"; #U+FFFD REPLACEMENT CHARACTER (for broken char) 2605 | else 2606 | { 2607 | $codepoint = (($hi - 0xD800) * 0x400) + ($lo - 0xDC00) + 0x10000; 2608 | $a[] = self::chr($codepoint); 2609 | } 2610 | $hi = false; 2611 | } 2612 | elseif ($codepoint < 0xD800 || $codepoint > 0xDBFF) $a[] = self::chr($codepoint); #not surrogate 2613 | else $hi = $codepoint; #surrogate was found 2614 | } 2615 | return $to_array ? $a : implode('', $a); 2616 | } 2617 | 2618 | /** 2619 | * Strips out device control codes in the ASCII range. 
2620 | * 2621 | * @param array|scalar|null Data to clean 2622 | * @return array|scalar|null Returns FALSE if error occurred 2623 | */ 2624 | public static function strict($data) 2625 | { 2626 | if (! ReflectionTypeHint::isValid()) return false; 2627 | if (is_array($data)) 2628 | { 2629 | $d = array(); 2630 | foreach ($data as $k => &$v) 2631 | { 2632 | if (is_string($k)) 2633 | { 2634 | $k = self::strict($k); 2635 | if (! is_string($k)) return false; 2636 | } 2637 | $d[$k] = self::strict($v); 2638 | if ($d[$k] === false && ! is_bool($v)) return false; 2639 | } 2640 | return $d; 2641 | } 2642 | if (is_string($data)) return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]+/sSX', '', $data); 2643 | if (is_scalar($data) || is_null($data)) return $data; #int/float/bool/null 2644 | return false; #object or resource 2645 | } 2646 | 2647 | /** 2648 | * Check the data accessory to the class of control characters in ASCII. 2649 | * For non string always returns FALSE. 2650 | * 2651 | * @param scalar|null $data 2652 | * @param int|null $found_char_offset Returns the offset for the first found binary symbol 2653 | * @return bool 2654 | */ 2655 | public static function has_binary($data, &$found_char_offset = null) 2656 | { 2657 | if (! ReflectionTypeHint::isValid()) return false; 2658 | #[\t\n\r] = [\x09\x0a\x0d] 2659 | #[\x00-\x1f\x7f](? &$v) 2730 | { 2731 | if (! self::is_utf8($k, $is_strict) || ! self::is_utf8($v, $is_strict)) return false; 2732 | } 2733 | return true; 2734 | } 2735 | return false; #object or resource 2736 | } 2737 | 2738 | /** 2739 | * Tries to detect if a string is in Unicode encoding 2740 | * 2741 | * @deprecated Slowly, use self::is_utf8() instead 2742 | * @see self::is_utf8() 2743 | * @param string $s текст 2744 | * @param bool $is_strict строгая проверка диапазона ASCII? 2745 | * @return bool 2746 | */ 2747 | public static function check($s, $is_strict = true) 2748 | { 2749 | if (! 
ReflectionTypeHint::isValid()) return false; 2750 | for ($i = 0, $len = strlen($s); $i < $len; $i++) 2751 | { 2752 | $c = ord($s[$i]); 2753 | if ($c < 0x80) #1 byte 0bbbbbbb 2754 | { 2755 | if ($is_strict === false || ($c > 0x1F && $c < 0x7F) || $c == 0x09 || $c == 0x0A || $c == 0x0D) continue; 2756 | } 2757 | if (($c & 0xE0) == 0xC0) $n = 1; #2 bytes 110bbbbb 10bbbbbb 2758 | elseif (($c & 0xF0) == 0xE0) $n = 2; #3 bytes 1110bbbb 10bbbbbb 10bbbbbb 2759 | elseif (($c & 0xF8) == 0xF0) $n = 3; #4 bytes 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb 2760 | elseif (($c & 0xFC) == 0xF8) $n = 4; #5 bytes 111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 2761 | elseif (($c & 0xFE) == 0xFC) $n = 5; #6 bytes 1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 2762 | else return false; #does not match any model 2763 | #n bytes matching 10bbbbbb follow ? 2764 | for ($j = 0; $j < $n; $j++) 2765 | { 2766 | $i++; 2767 | if ($i == $len || ((ord($s[$i]) & 0xC0) != 0x80) ) return false; 2768 | } 2769 | } 2770 | return true; 2771 | } 2772 | 2773 | /** 2774 | * Check the data in UTF-8 charset on given ranges of the standard UNICODE. 2775 | * The suitable alternative to regular expressions. 2776 | * 2777 | * For null, integer, float, boolean returns TRUE. 2778 | * 2779 | * Arrays traversed recursively (keys and values). 2780 | * At least if one array element value is not passed checking, it returns FALSE. 
2781 | * 2782 | * @example 2783 | * #A simple check against the standard named ranges: 2784 | * UTF8::blocks_check('поисковые системы Google и Yandex', array('Basic Latin', 'Cyrillic')); 2785 | * #You can mix named ranges, explicit ranges and single codepoints: 2786 | * UTF8::blocks_check('поисковые системы Google и Yandex', array(array(0x20, 0x7E), #[\x20-\x7E] 2787 | * array(0x0410, 0x044F), #[А-Яа-я] 2788 | * 0x0401, #russian yo (Ё) 2789 | * 0x0451, #russian yo (ё) 2790 | * 'Arrows', 2791 | * )); 2792 | * 2793 | * @link http://www.unicode.org/charts/ 2794 | * @param array|scalar|null $data 2795 | * @param array|string $blocks 2796 | * @return bool Returns TRUE if all characters of the text belong to the given ranges, 2797 | * and FALSE otherwise or for broken UTF-8. 2798 | */ 2799 | public static function blocks_check($data, $blocks) 2800 | { 2801 | if (! ReflectionTypeHint::isValid()) return false; 2802 | 2803 | if (is_array($data)) 2804 | { 2805 | foreach ($data as $k => &$v) 2806 | { 2807 | if (! self::blocks_check($k, $blocks) || ! self::blocks_check($v, $blocks)) return false; 2808 | } 2809 | return true; 2810 | } 2811 | 2812 | if (is_int($data)) $data = strval($data); 2813 | elseif (is_float($data)) $data = str_replace(',', '.', strval($data)); 2814 | elseif (! is_string($data)) return false; 2815 | 2816 | $chars = self::str_split($data); 2817 | if ($chars === false) return false; #broken UTF-8 2818 | unset($data); #memory free 2819 | $skip = array(); #cache of already checked characters 2820 | foreach ($chars as $i => $char) 2821 | { 2822 | if (array_key_exists($char, $skip)) continue; #speed improve 2823 | $codepoint = self::ord($char); 2824 | if (! is_int($codepoint)) return false; #broken UTF-8? 2825 | $is_valid = false; 2826 | $blocks = (array)$blocks; 2827 | foreach ($blocks as $j => $block) 2828 | { 2829 | if (is_string($block)) 2830 | { 2831 | if (!
array_key_exists($block, self::$unicode_blocks)) 2832 | { 2833 | trigger_error('Unknown block "' . $block . '"!', E_USER_WARNING); 2834 | return false; 2835 | } 2836 | list ($min, $max) = self::$unicode_blocks[$block]; 2837 | } 2838 | elseif (is_array($block)) list ($min, $max) = $block; 2839 | elseif (is_int($block)) $min = $max = $block; 2840 | else trigger_error('A string/array/int type expected for block[' . $j . ']!', E_USER_ERROR); 2841 | if ($codepoint >= $min && $codepoint <= $max) 2842 | { 2843 | $is_valid = true; 2844 | break; 2845 | } 2846 | } 2847 | if (! $is_valid) return false; 2848 | $skip[$char] = null; 2849 | } 2850 | return true; 2851 | } 2852 | 2853 | /** 2854 | * Compares two strings (locale-aware when the intl extension is available). 2855 | * 2856 | * @param string|null $s1 2857 | * @param string|null $s2 2858 | * @param string $locale For example, 'en_CA', 'ru_RU' 2859 | * @return int|bool|null Returns FALSE if error occurred 2860 | * Returns < 0 if $s1 is less than $s2; 2861 | * > 0 if $s1 is greater than $s2; 2862 | * 0 if they are equal. 2863 | */ 2864 | public static function strcmp($s1, $s2, $locale = '') 2865 | { 2866 | if (! ReflectionTypeHint::isValid()) return false; 2867 | if (! is_string($s1) || ! is_string($s2)) return null; 2868 | if (! function_exists('collator_create')) return strcmp($s1, $s2); 2869 | # PHP 5 >= 5.3.0, PECL intl >= 1.0.0 2870 | # If empty string ("") or "root" are passed, UCA rules will be used. 2871 | $c = new Collator($locale); 2872 | if (! $c) 2873 | { 2874 | # Returns an "empty" object on error. You can use intl_get_error_code() and/or intl_get_error_message() to know what happened.
2875 | trigger_error(intl_get_error_message(), E_USER_WARNING); 2876 | return false; 2877 | } 2878 | return $c->compare($s1, $s2); 2879 | } 2880 | 2881 | /** 2882 | * Compares the first N characters of two strings. 2883 | * 2884 | * @param string|null $s1 2885 | * @param string|null $s2 2886 | * @param int $length 2887 | * @return int|bool|null Returns FALSE if error occurred 2888 | * Returns < 0 if $s1 is less than $s2; 2889 | * > 0 if $s1 is greater than $s2; 2890 | * 0 if they are equal. 2891 | */ 2892 | public static function strncmp($s1, $s2, $length) 2893 | { 2894 | if (! ReflectionTypeHint::isValid()) return false; 2895 | if (! is_string($s1) || ! is_string($s2)) return null; 2896 | return self::strcmp(self::substr($s1, 0, $length), self::substr($s2, 0, $length)); 2897 | } 2898 | 2899 | /** 2900 | * Implementation of the strcasecmp() function for UTF-8 encoded strings. 2901 | * 2902 | * @param string|null $s1 2903 | * @param string|null $s2 2904 | * @return int|bool|null Returns FALSE if error occurred 2905 | * Returns < 0 if $s1 is less than $s2; 2906 | * > 0 if $s1 is greater than $s2; 2907 | * 0 if they are equal. 2908 | */ 2909 | public static function strcasecmp($s1, $s2) 2910 | { 2911 | if (! ReflectionTypeHint::isValid()) return false; 2912 | if (! is_string($s1) || ! is_string($s2)) return null; 2913 | return self::strcmp(self::lowercase($s1), self::lowercase($s2)); 2914 | } 2915 | 2916 | /** 2917 | * Converts a UTF-8 string to an array of UNICODE codepoints 2918 | * 2919 | * @param string|null $s UTF-8 string 2920 | * @return array|bool|null Unicode codepoints 2921 | * Returns FALSE if $s is broken (not UTF-8) 2922 | */ 2923 | public static function to_unicode($s) 2924 | { 2925 | if (! ReflectionTypeHint::isValid()) return false; 2926 | if (!
is_string($s) || $s === '') return $s; 2927 | 2928 | $s2 = null; 2929 | #since PHP-5.3.x iconv() little faster then mb_convert_encoding() 2930 | if (function_exists('iconv')) $s2 = @iconv('UTF-8', 'UCS-4BE', $s); 2931 | elseif (function_exists('mb_convert_encoding')) $s2 = @mb_convert_encoding($s, 'UCS-4BE', 'UTF-8'); 2932 | if (is_string($s2)) return array_values(unpack('N*', $s2)); 2933 | if ($s2 !== null) return false; 2934 | 2935 | $a = self::str_split($s); 2936 | if (! is_array($a)) return false; 2937 | return array_map(array(__CLASS__, 'ord'), $a); 2938 | } 2939 | 2940 | /** 2941 | * Converts a UNICODE codepoints to a UTF-8 string 2942 | * 2943 | * @param array|null $a Unicode codepoints 2944 | * @return string|bool|null UTF-8 string 2945 | * Returns FALSE if error occurred 2946 | */ 2947 | public static function from_unicode($a) 2948 | { 2949 | if (! ReflectionTypeHint::isValid()) return false; 2950 | if (! is_array($a)) return $a; 2951 | 2952 | #since PHP-5.3.x iconv() little faster then mb_convert_encoding() 2953 | if (function_exists('iconv')) 2954 | { 2955 | array_walk($a, function(&$cp) { $cp = pack('N', $cp); }); 2956 | $s = @iconv('UCS-4BE', 'UTF-8', implode('', $a)); 2957 | if (! is_string($s)) return false; 2958 | return $s; 2959 | } 2960 | if (function_exists('mb_convert_encoding')) 2961 | { 2962 | array_walk($a, function(&$cp) { $cp = pack('N', $cp); }); 2963 | $s = mb_convert_encoding(implode('', $a), 'UTF-8', 'UCS-4BE'); 2964 | if (! is_string($s)) return false; 2965 | return $s; 2966 | } 2967 | 2968 | return implode('', array_map(array(__CLASS__, 'chr'), $a)); 2969 | } 2970 | 2971 | /** 2972 | * Converts a UTF-8 character to a UNICODE codepoint 2973 | * 2974 | * @param string|null $char UTF-8 character 2975 | * @return int|bool|null Unicode codepoint 2976 | * Returns FALSE if $char broken (not UTF-8) 2977 | */ 2978 | public static function ord($char) 2979 | { 2980 | if (! ReflectionTypeHint::isValid()) return false; 2981 | if (! 
is_string($char)) return $char; 2982 | 2983 | static $cache = array(); 2984 | if (array_key_exists($char, $cache)) return $cache[$char]; #speed improve 2985 | 2986 | #byte offsets use $char[N]: the $char{N} syntax is deprecated in PHP 7.4 and removed in PHP 8 2986 | switch (strlen($char)) 2987 | { 2988 | case 1 : return $cache[$char] = ord($char); 2989 | case 2 : return $cache[$char] = (ord($char[1]) & 63) | 2990 | ((ord($char[0]) & 31) << 6); 2991 | case 3 : return $cache[$char] = (ord($char[2]) & 63) | 2992 | ((ord($char[1]) & 63) << 6) | 2993 | ((ord($char[0]) & 15) << 12); 2994 | case 4 : return $cache[$char] = (ord($char[3]) & 63) | 2995 | ((ord($char[2]) & 63) << 6) | 2996 | ((ord($char[1]) & 63) << 12) | 2997 | ((ord($char[0]) & 7) << 18); 2998 | default : 2999 | trigger_error('Character 0x' . bin2hex($char) . ' is not UTF-8!', E_USER_WARNING); 3000 | return false; 3001 | } 3002 | } 3003 | 3004 | /** 3005 | * Converts a UNICODE codepoint to a UTF-8 character 3006 | * 3007 | * @param int|digit|null $cp Unicode codepoint 3008 | * @return string|bool|null UTF-8 character 3009 | * Returns FALSE if error occurred 3010 | */ 3011 | public static function chr($cp) 3012 | { 3013 | if (! ReflectionTypeHint::isValid()) return false; 3014 | if (! is_int($cp) && ! ctype_digit($cp)) return $cp; 3015 | 3016 | static $cache = array(); 3017 | if (array_key_exists($cp, $cache)) return $cache[$cp]; #speed improve 3018 | 3019 | if ($cp <= 0x7f) return $cache[$cp] = chr($cp); 3020 | if ($cp <= 0x7ff) return $cache[$cp] = chr(0xc0 | ($cp >> 6)) . 3021 | chr(0x80 | ($cp & 0x3f)); 3022 | if ($cp <= 0xffff) return $cache[$cp] = chr(0xe0 | ($cp >> 12)) . 3023 | chr(0x80 | (($cp >> 6) & 0x3f)) . 3024 | chr(0x80 | ($cp & 0x3f)); 3025 | if ($cp <= 0x10ffff) return $cache[$cp] = chr(0xf0 | ($cp >> 18)) . 3026 | chr(0x80 | (($cp >> 12) & 0x3f)) . 3027 | chr(0x80 | (($cp >> 6) & 0x3f)) .
3028 | chr(0x80 | ($cp & 0x3f)); 3029 | #U+FFFD REPLACEMENT CHARACTER 3030 | return $cache[$cp] = "\xEF\xBF\xBD"; 3031 | } 3032 | 3033 | /** 3034 | * Implementation of the chunk_split() function for UTF-8 encoded strings. 3035 | * 3036 | * @param string|null $s 3037 | * @param int|digit|null $length 3038 | * @param string|null $glue 3039 | * @return string|bool|null Returns FALSE if error occurred 3040 | */ 3041 | public static function chunk_split($s, $length = null, $glue = null) 3042 | { 3043 | if (! ReflectionTypeHint::isValid()) return false; 3044 | if (! is_string($s) || $s === '') return $s; 3045 | 3046 | $length = intval($length); 3047 | $glue = strval($glue); 3048 | if ($length < 1) $length = 76; 3049 | if ($glue === '') $glue = "\r\n"; 3050 | $a = self::str_split($s, $length); 3051 | if (! is_array($a)) return false; 3052 | return implode($glue, $a); 3053 | } 3054 | 3055 | /** 3056 | * Changes the case of all keys in an array 3057 | * 3058 | * @param array|null $a 3059 | * @param int $mode {CASE_LOWER|CASE_UPPER} 3060 | * @param bool $is_recursive 3061 | * @return array|bool|null Returns FALSE if error occurred 3062 | */ 3063 | public static function array_change_key_case($a, $mode, $is_recursive = false) 3064 | { 3065 | if (! ReflectionTypeHint::isValid()) return false; 3066 | if (! is_array($a)) return $a; 3067 | 3068 | $a2 = array(); 3069 | foreach ($a as $k => $v) 3070 | { 3071 | if (is_string($k)) 3072 | { 3073 | $k = self::convert_case($k, $mode); 3074 | if ($k === false) return false; 3075 | } 3076 | if ($is_recursive && is_array($v)) #recursive support 3077 | { 3078 | $v = self::array_change_key_case($v, $mode, $is_recursive); 3079 | if (! is_array($v)) return false; 3080 | } 3081 | $a2[$k] = $v; 3082 | } 3083 | return $a2; 3084 | } 3085 | 3086 | /** 3087 | * Converts the letter case of data in UTF-8 encoding. 3088 | * Arrays are traversed recursively; only the values 3089 | * of the array elements are converted, the keys are left unchanged.
3090 | * To convert only the keys, use the self::array_change_key_case() method. 3091 | * 3092 | * @see self::array_change_key_case() 3093 | * @link http://www.unicode.org/charts/PDF/U0400.pdf 3094 | * @link http://ru.wikipedia.org/wiki/ISO_639-1 3095 | * @param array|scalar|null $data Data of arbitrary structure 3096 | * @param int $mode {CASE_LOWER|CASE_UPPER} 3097 | * @param bool $is_ascii_optimization use the faster native functions for pure ASCII strings 3098 | * @return array|scalar|bool|null Returns FALSE if error occurred 3099 | */ 3100 | public static function convert_case($data, $mode, $is_ascii_optimization = true) 3101 | { 3102 | if (! ReflectionTypeHint::isValid()) return false; 3103 | 3104 | if (is_array($data)) #recursive support 3105 | { 3106 | foreach ($data as $k => $v) 3107 | { 3108 | $data[$k] = self::convert_case($v, $mode, $is_ascii_optimization); 3109 | if ($data[$k] === false && ! is_bool($v)) return false; 3110 | } 3111 | return $data; 3112 | } 3113 | if (! is_string($data) || ! $data) return $data; 3114 | 3115 | if ($mode === CASE_UPPER) 3116 | { 3117 | if ($is_ascii_optimization && self::is_ascii($data)) return strtoupper($data); #speed improve! 3118 | #deprecated, since PHP-5.3.x strtr() 2-3 times faster than mb_strtoupper() 3119 | #if (function_exists('mb_strtoupper')) return mb_strtoupper($data, 'utf-8'); 3120 | return strtr($data, array_flip(self::$convert_case_table)); 3121 | } 3122 | if ($mode === CASE_LOWER) 3123 | { 3124 | if ($is_ascii_optimization && self::is_ascii($data)) return strtolower($data); #speed improve!
3125 | #deprecated, since PHP-5.3.x strtr() 2-3 times faster then mb_strtolower() 3126 | #if (function_exists('mb_strtolower')) return mb_strtolower($data, 'utf-8'); 3127 | return strtr($data, self::$convert_case_table); 3128 | } 3129 | trigger_error('Parameter 2 should be a constant of CASE_LOWER or CASE_UPPER!', E_USER_WARNING); 3130 | return $data; 3131 | } 3132 | 3133 | /** 3134 | * Convert a data to lower case 3135 | * 3136 | * @param array|scalar|null $data 3137 | * @return scalar|bool|null Returns FALSE if error occurred */ 3138 | public static function lowercase($data) 3139 | { 3140 | if (! ReflectionTypeHint::isValid()) return false; 3141 | return self::convert_case($data, CASE_LOWER); 3142 | } 3143 | 3144 | /** 3145 | * Convert a data to upper case 3146 | * 3147 | * @param array|scalar|null $data 3148 | * @return scalar|null Returns FALSE if error occurred 3149 | */ 3150 | public static function uppercase($data) 3151 | { 3152 | if (! ReflectionTypeHint::isValid()) return false; 3153 | return self::convert_case($data, CASE_UPPER); 3154 | } 3155 | 3156 | /** 3157 | * Convert a data to lower case 3158 | * 3159 | * @param array|scalar|null $data 3160 | * @return scalar|bool|null Returns FALSE if error occurred 3161 | */ 3162 | public static function strtolower($data) 3163 | { 3164 | if (! ReflectionTypeHint::isValid()) return false; 3165 | return self::convert_case($data, CASE_LOWER); 3166 | } 3167 | 3168 | /** 3169 | * Convert a data to upper case 3170 | * 3171 | * @param array|scalar|null $data 3172 | * @return scalar|null Returns FALSE if error occurred 3173 | */ 3174 | public static function strtoupper($data) 3175 | { 3176 | if (! 
ReflectionTypeHint::isValid()) return false; 3177 | return self::convert_case($data, CASE_UPPER); 3178 | } 3179 | 3180 | 3181 | /** 3182 | * Convert all HTML entities to native UTF-8 characters. 3183 | * The function decodes many more named entities than the standard html_entity_decode(). 3184 | * All decimal and hexadecimal entities are also converted to UTF-8. 3185 | * 3186 | * Example: '&quot;' or '&#34;' or '&#x22;' will be converted to '"'. 3187 | * 3188 | * @link http://www.htmlhelp.com/reference/html40/entities/ 3189 | * @link http://www.alanwood.net/demos/ent4_frame.html (HTML 4.01 Character Entity References) 3190 | * @link http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset1.asp?frame=true 3191 | * @link http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset2.asp?frame=true 3192 | * @link http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset3.asp?frame=true 3193 | * 3194 | * @param scalar|null $s 3195 | * @param bool $is_special_chars Also decode the special HTML entities? (< > & " ') 3196 | * @return scalar|null Returns FALSE if error occurred 3197 | */ 3198 | public static function html_entity_decode($s, $is_special_chars = false) 3199 | { 3200 | if (! ReflectionTypeHint::isValid()) return false; 3201 | if (!
is_string($s) || $s === '') return $s; 3202 | 3203 | #speed improve 3204 | if (strlen($s) < 4 #the shortest entity is 4 bytes long: &#d; &xx; 3205 | || ($pos = strpos($s, '&')) === false || strpos($s, ';', $pos) === false) return $s; 3206 | 3207 | $table = self::$html_entity_table; 3208 | if ($is_special_chars) 3209 | { 3210 | $table += self::$html_special_chars_table 3211 | + array( 3212 | #&apos; entity is only available in XHTML/HTML5 and not in plain HTML, see http://www.w3.org/TR/xhtml1/#C_16 3213 | '&apos;' => "\x27", #U+0027 ['] &apos; apostrophe 3214 | ); 3215 | } 3216 | #replace named entities 3217 | $s = strtr($s, $table); 3218 | #block below deprecated, since PHP-5.3.x strtr() 1.5 times faster 3219 | if (0 && preg_match_all('/&[a-zA-Z]++\d*+;/sSX', $s, $m, null, $pos)) 3220 | { 3221 | foreach (array_unique($m[0]) as $entity) 3222 | { 3223 | if (array_key_exists($entity, $table)) $s = str_replace($entity, $table[$entity], $s); 3224 | } 3225 | } 3226 | 3227 | #replace numeric (decimal and hex) entities: 3228 | if (strpos($s, '&#') !== false) #speed improve 3229 | { 3230 | $class = __CLASS__; 3231 | $html_special_chars_table_flipped = array_flip(self::$html_special_chars_table); 3232 | $s = preg_replace_callback('/&#((x)[\da-fA-F]{1,6}+|\d{1,7}+);/sSX', 3233 | function (array $m) use ($class, $html_special_chars_table_flipped, $is_special_chars) 3234 | { 3235 | $codepoint = isset($m[2]) && $m[2] === 'x' ? hexdec($m[1]) : $m[1]; 3236 | if (! $is_special_chars) 3237 | { 3238 | $char = pack('C', $codepoint); 3239 | if (array_key_exists($char, $html_special_chars_table_flipped)) return $html_special_chars_table_flipped[$char]; 3240 | } 3241 | return $class::chr($codepoint); 3242 | }, $s); 3243 | } 3244 | return $s; 3245 | } 3246 | 3247 | /** 3248 | * Convert special UTF-8 characters to HTML entities.
3249 | * The function encodes many more named entities than the standard htmlentities(). 3250 | * 3251 | * @link http://www.htmlhelp.com/reference/html40/entities/ 3252 | * @link http://www.alanwood.net/demos/ent4_frame.html (HTML 4.01 Character Entity References) 3253 | * @link http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset1.asp?frame=true 3254 | * @link http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset2.asp?frame=true 3255 | * @link http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset3.asp?frame=true 3256 | * 3257 | * @param scalar|null $s 3258 | * @param bool $is_special_chars_only Encode only the special HTML characters? (< > & ") 3259 | * @return scalar|null Returns FALSE if error occurred 3260 | */ 3261 | public static function html_entity_encode($s, $is_special_chars_only = false) 3262 | { 3263 | if (! ReflectionTypeHint::isValid()) return false; 3264 | if (! is_string($s) || $s === '') return $s; 3265 | 3266 | if ($is_special_chars_only) return strtr($s, array_flip(self::$html_special_chars_table)); #binary support 3267 | #if ($is_special_chars_only) return htmlspecialchars($s); #DEPRECATED, charset dependent 3268 | 3269 | #replace UTF-8 chars to named entities: 3270 | $s = strtr($s, array_flip(self::$html_entity_table)); 3271 | 3272 | #block below deprecated, since PHP-5.3.x strtr() 3 times faster 3273 | if (0 && preg_match_all('~(?> [\xc2\xc3\xc5\xc6\xcb\xce\xcf][\x80-\xbf] #2 bytes 3274 | | \xe2[\x80-\x99][\x82-\xac] #3 bytes 3275 | ) 3276 | ~sxSX', $s, $m)) 3277 | { 3278 | $table = array_flip(self::$html_entity_table); 3279 | foreach (array_unique($m[0]) as $char) 3280 | { 3281 | if (array_key_exists($char, $table)) $s = str_replace($char, $table[$char], $s); 3282 | } 3283 | } 3284 | 3285 | return $s; 3286 | } 3287 | 3288 | /** 3289 | * Make regular expression for case insensitive match 3290 | * Example (only digits): "123" => "123" 3291 | * Example (only ASCII):
"123_test" => "(?i:123_test)" 3292 | * Example (upper ASCII): "123_слово_test" => "123_(с|С)(л|Л)(о|О)(в|В)(о|О)_[tT][eE][sS][tT]" 3293 | * 3294 | * @param string|null $s 3295 | * @param string|null $delimiter If the optional delimiter is specified, it will also be escaped. 3296 | * This is useful for escaping the delimiter that is required by the PCRE functions. 3297 | * The / is the most commonly used delimiter. 3298 | * @return string|bool|null Returns FALSE if error occurred 3299 | */ 3300 | public static function preg_quote_case_insensitive($s, $delimiter = null) 3301 | { 3302 | if (! ReflectionTypeHint::isValid()) return false; 3303 | if (! is_string($s) || $s === '') return $s; 3304 | 3305 | if (ctype_digit($s)) return preg_quote($s, $delimiter); #speed improve 3306 | if (self::is_ascii($s)) return '(?i:' . preg_quote($s, $delimiter) . ')'; #speed improve 3307 | 3308 | $s_lc = self::convert_case($s, CASE_LOWER, false); if ($s_lc === false) return false; 3309 | $s_uc = self::convert_case($s, CASE_UPPER, false); if ($s_uc === false) return false; 3310 | if ($s_lc === $s_uc) return preg_quote($s, $delimiter); #speed improve 3311 | 3312 | $chars_lc = self::str_split($s_lc); if ($chars_lc === false) return false; 3313 | $chars_uc = self::str_split($s_uc); if ($chars_uc === false) return false; 3314 | 3315 | $s_re = ''; 3316 | foreach ($chars_lc as $i => $char) 3317 | { 3318 | if ($chars_lc[$i] === $chars_uc[$i]) 3319 | $s_re .= preg_quote($chars_lc[$i], $delimiter); 3320 | elseif (strlen($chars_lc[$i]) === 1 /*self::is_ascii($chars_lc[$i])*/) 3321 | $s_re .= '[' . self::_preg_quote_class($chars_lc[$i] . $chars_uc[$i], $delimiter) . ']'; 3322 | else 3323 | #для русских и др. букв, т. к. флаг /u и (?i:слово) не помогают :( 3324 | $s_re .= '(' . preg_quote($chars_lc[$i], $delimiter) . '|' 3325 | . preg_quote($chars_uc[$i], $delimiter) . 
')'; 3326 | } 3327 | return $s_re; 3328 | } 3329 | 3330 | /** 3331 | * Call preg_match_all() and convert byte offsets into character offsets for PREG_OFFSET_CAPTURE flag. 3332 | * This works regardless of whether you use the /u modifier. 3333 | * 3334 | * @link http://bolknote.ru/2010/09/08/~2704 3335 | * 3336 | * @param string $pattern 3337 | * @param string|null $subject 3338 | * @param array $matches 3339 | * @param int $flags 3340 | * @param int $char_offset 3341 | * @return int|bool|null Number of full pattern matches; returns FALSE if error occurred 3342 | */ 3343 | public static function preg_match_all($pattern, $subject, &$matches, $flags = PREG_PATTERN_ORDER, $char_offset = 0) 3344 | { 3345 | if (! ReflectionTypeHint::isValid()) return false; 3346 | if (! is_string($subject)) return $subject; 3347 | 3348 | $byte_offset = ($char_offset > 0) ? strlen(self::substr($subject, 0, $char_offset)) : $char_offset; 3349 | 3350 | $return = preg_match_all($pattern, $subject, $matches, $flags, $byte_offset); 3351 | if ($return === false) return false; 3352 | 3353 | if ($flags & PREG_OFFSET_CAPTURE) 3354 | { 3355 | foreach ($matches as &$match) 3356 | { 3357 | foreach ($match as &$a) $a[1] = self::strlen(substr($subject, 0, $a[1])); 3358 | } 3359 | } 3360 | 3361 | return $return; 3362 | } 3363 | 3364 | #alias for self::str_limit() 3365 | public static function truncate($s, $maxlength = null, $continue = "\xe2\x80\xa6", &$is_cutted = null, $tail_min_length = 20) 3366 | { 3367 | return self::str_limit($s, $maxlength, $continue, $is_cutted, $tail_min_length); 3368 | } 3369 | 3370 | /** 3371 | * Truncates UTF-8 encoded text to the given length; 3372 | * the last word is kept whole instead of being cut off in the middle. 3373 | * HTML entities are handled correctly.
3374 | * 3375 | * @param string|null $s Text in UTF-8 encoding 3376 | * @param int|null|digit $maxlength Maximum length of the text 3377 | * @param string $continue Trailing string appended to the text if it gets truncated 3378 | * @param bool|null &$is_cutted Was the text truncated? 3379 | * @param int|digit $tail_min_length If the "tail" left over after truncation is shorter than $tail_min_length, 3380 | * the text is returned unchanged 3381 | * @return string|bool|null Returns FALSE if an error occurred 3382 | */ 3383 | public static function str_limit($s, $maxlength = null, $continue = "\xe2\x80\xa6", &$is_cutted = null, $tail_min_length = 20) #"\xe2\x80\xa6" = "…" 3384 | { 3385 | if (! ReflectionTypeHint::isValid()) return false; 3386 | if (! is_string($s) || $s === '') return $s; 3387 | 3388 | $is_cutted = false; 3389 | if ($continue === null) $continue = "\xe2\x80\xa6"; 3390 | if (! $maxlength) $maxlength = 256; 3391 | 3392 | #speed optimization block 3393 | #{{{ 3394 | if (strlen($s) <= $maxlength) return $s; 3395 | $s2 = str_replace("\r\n", '?', $s); 3396 | $s2 = preg_replace('~' . self::HTML_ENTITY_RE . '~sxSX', '?', $s2); 3397 | if (strlen($s2) <= $maxlength || self::strlen($s2) <= $maxlength) return $s; 3398 | #}}} 3399 | 3400 | $r = preg_match_all('~(?> \r\n # next line 3401 | | ' . self::HTML_ENTITY_RE . ' 3402 | | . 3403 | ) 3404 | ~sxuSX', $s, $m); 3405 | if ($r === false) return false; 3406 | 3407 | #d($m); 3408 | if (count($m[0]) <= $maxlength) return $s; 3409 | 3410 | $left = implode('', array_slice($m[0], 0, $maxlength)); 3411 | #the trim list is the ASCII range minus letters, digits, opening paired symbols [a-zA-Z\d\(\{\[] and some other characters 3412 | #the ";" character must not be trimmed from the end of the string, because it is used in entities like &xxx; 3413 | $left2 = rtrim($left, "\x00..\x28\x2A..\x2F\x3A\x3C..\x3E\x40\x5B\x5C\x5E..\x60\x7B\x7C\x7E\x7F"); 3414 | if (strlen($left) !== strlen($left2)) $return = $left2 . 
$continue; 3415 | else 3416 | { 3417 | #append the rest of the truncated word 3418 | $right = implode('', array_slice($m[0], $maxlength)); 3419 | preg_match('/^(?> 3420 | #digits, closing paired symbols, hyphen for compound words, dates, times, IP addresses, URLs like www.ya.ru:80! 3421 | [\d\)\]\}\-\.:]+ 3422 | #letters 3423 | | \p{L}+ 3424 | #quotation marks 3425 | | [' . implode('', self::$html_quotation_mark_table) . ']+ 3426 | )+ 3427 | /suxSX', $right, $m); 3428 | #d($m); 3429 | $right = isset($m[0]) ? rtrim($m[0], '.-') : ''; 3430 | $return = $left . $right; 3431 | if (strlen($return) !== strlen($s)) $return .= $continue; 3432 | } 3433 | if (self::strlen($s) - self::strlen($return) < $tail_min_length) return $s; 3434 | 3435 | $is_cutted = true; 3436 | return $return; 3437 | } 3438 | 3439 | /** 3440 | * Implementation of the str_split() function for UTF-8 encoded strings. 3441 | * 3442 | * @param string|null $s 3443 | * @param int|null|digit $length 3444 | * @return array|bool|null Returns FALSE if an error occurred 3445 | */ 3446 | public static function str_split($s, $length = null) 3447 | { 3448 | if (! ReflectionTypeHint::isValid()) return false; 3449 | if (! is_string($s)) return $s; 3450 | 3451 | $length = ($length === null) ? 1 : intval($length); 3452 | if ($length < 1) return false; 3453 | #there are limits in regexp for {min,max}! 3454 | if (preg_match_all('~.~suSX', $s, $m) === false) return false; 3455 | if (function_exists('preg_last_error') && preg_last_error() !== PREG_NO_ERROR) return false; 3456 | if ($length === 1) $a = $m[0]; 3457 | else 3458 | { 3459 | $a = array(); 3460 | for ($i = 0, $c = count($m[0]); $i < $c; $i += $length) $a[] = implode('', array_slice($m[0], $i, $length)); 3461 | } 3462 | return $a; 3463 | } 3464 | 3465 | /** 3466 | * Implementation of the strlen() function for UTF-8 encoded strings.
3467 | * 3468 | * @param string|null $s 3469 | * @return int|bool|null Returns FALSE if an error occurred 3470 | */ 3471 | public static function strlen($s) 3472 | { 3473 | if (! ReflectionTypeHint::isValid()) return false; 3474 | if (! is_string($s)) return $s; 3475 | 3476 | //since PHP-5.3.x mb_strlen() is faster than strlen(utf8_decode()) 3477 | if (function_exists('mb_strlen')) return mb_strlen($s, 'utf-8'); 3478 | 3479 | /* 3480 | utf8_decode() converts characters that are not in ISO-8859-1 to '?', which, for the purpose of counting, is quite alright. 3481 | It's much faster than iconv_strlen() 3482 | Note: this function does not count bad UTF-8 bytes in the string - these are simply ignored 3483 | */ 3484 | return strlen(utf8_decode($s)); 3485 | 3486 | /* 3487 | #iconv_strlen() is slower than strlen(utf8_decode()) 3488 | if (function_exists('iconv_strlen')) return iconv_strlen($s, 'utf-8'); 3489 | 3490 | #Do not count UTF-8 continuation bytes 3491 | #return strlen(preg_replace('/[\x80-\xBF]/sSX', '', $s)); 3492 | 3493 | #slower than strlen(utf8_decode()) 3494 | preg_match_all('~.~suSX', $s, $m); 3495 | return count($m[0]); 3496 | 3497 | #slower than preg_match_all() + count() 3498 | $n = 0; 3499 | for ($i = 0, $len = strlen($s); $i < $len; $i++) 3500 | { 3501 | $c = ord(substr($s, $i, 1)); 3502 | if ($c < 0x80) $n++; #single-byte (0xxxxxxx) 3503 | elseif (($c & 0xC0) == 0xC0) $n++; #multi-byte starting byte (11xxxxxx) 3504 | } 3505 | return $n; 3506 | */ 3507 | } 3508 | 3509 | /** 3510 | * Implementation of the strpos() function for UTF-8 encoded strings 3511 | * 3512 | * @param string|null $s The entire string 3513 | * @param string|int $needle The searched substring 3514 | * @param int|null $offset The optional offset parameter specifies the position from which the search should be performed 3515 | * @return int|bool|null Returns the numeric position of the first occurrence of needle in haystack. 3516 | * If needle is not found, will return FALSE.
3517 | */ 3518 | public static function strpos($s, $needle, $offset = null) 3519 | { 3520 | if (! ReflectionTypeHint::isValid()) return false; 3521 | if (! is_string($s)) return $s; 3522 | 3523 | if ($offset === null || $offset < 0) $offset = 0; 3524 | #mb_strpos() is faster than iconv_strpos() 3525 | if (function_exists('mb_strpos')) return mb_strpos($s, $needle, $offset, 'utf-8'); 3526 | #iconv_strpos() is not used, because it is slower than self::strlen(substr()) 3527 | #if (function_exists('iconv_strpos')) return iconv_strpos($s, $needle, $offset, 'utf-8'); 3528 | $byte_pos = $offset; 3529 | do if (($byte_pos = strpos($s, $needle, $byte_pos)) === false) return false; 3530 | while (($char_pos = self::strlen(substr($s, 0, $byte_pos++))) < $offset); 3531 | return $char_pos; 3532 | } 3533 | 3534 | /** 3535 | * Find position of first occurrence of a case-insensitive string. 3536 | * 3537 | * @param string|null $s The entire string 3538 | * @param string|int $needle The searched substring 3539 | * @param int|null $offset The optional offset parameter specifies the position from which the search should be performed 3540 | * @return int|bool|null Returns the numeric position of the first occurrence of needle in haystack. 3541 | * If needle is not found, will return FALSE. 3542 | */ 3543 | public static function stripos($s, $needle, $offset = null) 3544 | { 3545 | if (! ReflectionTypeHint::isValid()) return false; 3546 | if (! 
is_string($s)) return $s; 3547 | 3548 | if ($offset === null || $offset < 0) $offset = 0; 3549 | if (function_exists('mb_stripos')) return mb_stripos($s, $needle, $offset, 'utf-8'); 3550 | 3551 | #optimization block (speed improve) 3552 | #{{{ 3553 | $is_ascii_s = self::is_ascii($s); 3554 | if ($is_ascii_s && ! self::is_ascii($needle)) return false; #an ASCII haystack cannot contain a non-ASCII needle 3555 | if ($is_ascii_s) return stripos($s, $needle, $offset); #an ASCII needle in a non-ASCII haystack still needs the full search below 3556 | #}}} 3557 | 3558 | $s = self::convert_case($s, CASE_LOWER, false); 3559 | if ($s === false) return false; 3560 | $needle = self::convert_case($needle, CASE_LOWER, false); 3561 | if ($needle === false) return false; 3562 | return self::strpos($s, $needle, $offset); 3563 | } 3564 | 3565 | /** 3566 | * Implementation of the strrev() function for UTF-8 encoded strings 3567 | * 3568 | * @param string|null $s 3569 | * @return string|bool|null Returns FALSE if an error occurred 3570 | */ 3571 | public static function strrev($s) 3572 | { 3573 | if (! ReflectionTypeHint::isValid()) return false; 3574 | if (! is_string($s) || $s === '') return $s; 3575 | 3576 | if (0) #TODO test speed 3577 | { 3578 | $s = self::_convert($s, 'UTF-8', 'UTF-32'); 3579 | if (! is_string($s)) return false; 3580 | $s = implode('', array_reverse(str_split($s, 4))); 3581 | return self::_convert($s, 'UTF-32', 'UTF-8'); 3582 | } 3583 | 3584 | if (! is_array($a = self::str_split($s))) return false; 3585 | return implode('', array_reverse($a)); 3586 | } 3587 | 3588 | /** 3589 | * Implementation of the substr() function for UTF-8 encoded strings. 3590 | * 3591 | * @link http://www.w3.org/International/questions/qa-forms-utf-8.html 3592 | * @param string|null $s 3593 | * @param int|digit $offset 3594 | * @param int|null|digit $length 3595 | * @return string|bool|null Returns FALSE if an error occurred 3596 | */ 3597 | public static function substr($s, $offset, $length = null) 3598 | { 3599 | if (! ReflectionTypeHint::isValid()) return false; 3600 | if (! 
is_string($s)) return $s; 3601 | 3602 | #since PHP-5.3.x mb_substr() is faster than iconv_substr() 3603 | if (function_exists('mb_substr')) 3604 | { 3605 | if ($length === null) $length = self::strlen($s); 3606 | return mb_substr($s, $offset, $length, 'utf-8'); 3607 | } 3608 | if (function_exists('iconv_substr')) 3609 | { 3610 | if ($length === null) $length = self::strlen($s); 3611 | return iconv_substr($s, $offset, $length, 'utf-8'); 3612 | } 3613 | 3614 | static $_s = null; 3615 | static $_a = null; 3616 | 3617 | if ($_s !== $s) $_a = self::str_split($_s = $s); 3618 | if (! is_array($_a)) return false; 3619 | if ($length !== null) $a = array_slice($_a, $offset, $length); 3620 | else $a = array_slice($_a, $offset); 3621 | return implode('', $a); 3622 | } 3623 | 3624 | /** 3625 | * Implementation of the substr_replace() function for UTF-8 encoded strings. 3626 | * 3627 | * @param string|null $s 3628 | * @param string|int $replacement 3629 | * @param int|digit $start 3630 | * @param int|null $length 3631 | * @return string|bool|null Returns FALSE if an error occurred 3632 | */ 3633 | public static function substr_replace($s, $replacement, $start, $length = null) 3634 | { 3635 | if (! ReflectionTypeHint::isValid()) return false; 3636 | if (! is_string($s) || $s === '') return $s; 3637 | 3638 | $a = self::str_split($s); 3639 | if (! is_array($a)) return false; 3640 | if ($length === null) $length = count($a); array_splice($a, $start, $length, $replacement); #a NULL length means "replace to the end", as when substr_replace() is called without a length 3641 | return implode('', $a); 3642 | } 3643 | 3644 | /** 3645 | * Implementation of the ucfirst() function for UTF-8 encoded strings. 3646 | * Converts the first character of a UTF-8 string to uppercase. 3647 | * Correctly handles words in quotation marks, for example: «северный поток» --> «Северный поток» 3648 | * 3649 | * @param string|null $s 3650 | * @param bool $is_other_to_lowercase Convert the remaining characters to lowercase?
3651 | * @return string|bool|null Returns FALSE if an error occurred 3652 | */ 3653 | public static function ucfirst($s, $is_other_to_lowercase = true) 3654 | { 3655 | if (! ReflectionTypeHint::isValid()) return false; 3656 | if ($s === '' || ! is_string($s)) return $s; 3657 | 3658 | if (! preg_match('/^([' . implode('', self::$html_quotation_mark_table) . ']{1,2}+) #1 quotation marks 3659 | (\p{L}) #2 first letter 3660 | (.*+) #3 next letters 3661 | $/sxuSX', $s, $m)) return $s; #letters not found 3662 | return $m[1] . self::uppercase($m[2]) . ($is_other_to_lowercase ? self::lowercase($m[3]) : $m[3]); 3663 | } 3664 | 3665 | /** 3666 | * Implementation of the ucwords() function for UTF-8 encoded strings. 3667 | * Converts the first character of each word in a UTF-8 string to uppercase; 3668 | * the remaining characters of each word are converted to lowercase. 3669 | * 3670 | * @param string|null $s 3671 | * @param bool $is_other_to_lowercase Convert the remaining characters to lowercase? 3672 | * @param string $spaces_re 3673 | * @return string|bool|null Returns FALSE if an error occurred 3674 | */ 3675 | public static function ucwords($s, $is_other_to_lowercase = true, $spaces_re = '~([\p{Z}\s]+)~suSX') 3676 | { 3677 | if (! ReflectionTypeHint::isValid()) return false; 3678 | if ($s === '' || ! 
is_string($s)) return $s; 3679 | 3680 | $words = preg_split($spaces_re, $s, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE); 3681 | foreach ($words as $k => $word) 3682 | { 3683 | $words[$k] = self::ucfirst($word, $is_other_to_lowercase); 3684 | if ($words[$k] === false) return false; 3685 | } 3686 | return implode('', $words); 3687 | } 3688 | 3689 | /** 3690 | * Decodes a string to UTF-8 string from some formats (can be mixed) 3691 | * Examples 3692 | * '%D1%82%D0%B5%D1%81%D1%82' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #binary (regular) 3693 | * '0xD182D0B5D181D182' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #binary (compact) 3694 | * '%u0442%u0435%u0441%u0442' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #UCS-2 (U+0 — U+FFFF) 3695 | * '%u{442}%u{435}%u{0441}%u{00442}' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" #UTF-8 (U+0 — U+FFFFFF) 3696 | * 3697 | * It is used to decode the data in the format %uXXXX, encoded deprecated 3698 | * javascript's function encode(). Recommended to use encodeURIComponent(). 3699 | * Obsolete format %uXXXX allows unicode only in the range of UCS-2, ie, U+0 to U+FFFF. 3700 | * 3701 | * @see urldecode() 3702 | * @param array|scalar|null $data 3703 | * @param bool $is_hex2bin Decode the HEX-data? 3704 | * Example: '0xD182D0B5D181D182' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" 3705 | * Hint: parameters in the URL address is sometimes 3706 | * convenient to encode not function rawurlencode($string), 3707 | * and use the following mechanism (encoded data is more compact): 3708 | * '0x' . bin2hex($string) 3709 | * @param bool $is_urldecode 3710 | * @return array|scalar|null Returns FALSE if error occurred 3711 | */ 3712 | public static function unescape($data, $is_hex2bin = false, $is_urldecode = true) 3713 | { 3714 | if (! 
ReflectionTypeHint::isValid()) return false; 3715 | if (is_array($data)) 3716 | { 3717 | $d = array(); 3718 | foreach ($data as $k => &$v) 3719 | { 3720 | if (is_string($k)) 3721 | { 3722 | $k = self::unescape($k, $is_hex2bin, $is_urldecode); 3723 | if (! is_string($k)) return false; 3724 | } 3725 | $d[$k] = self::unescape($v, $is_hex2bin, $is_urldecode); 3726 | if ($d[$k] === false && ! is_bool($v)) return false; 3727 | } 3728 | return $d; 3729 | } 3730 | if (is_string($data)) 3731 | { 3732 | #check with strpos() first to avoid running the regexp needlessly 3733 | if ($is_hex2bin && strpos($data, '0x') !== false) 3734 | { 3735 | $data = preg_replace_callback( 3736 | '~0x((?:[\da-fA-F]{2})+)~sSX', 3737 | function (array $m) 3738 | { 3739 | $s = pack('H' . strlen($m[1]), $m[1]); #hex2bin() 3740 | return rawurlencode($s); 3741 | }, 3742 | $data); 3743 | } 3744 | if (strpos($data, '%u') !== false) 3745 | { 3746 | $class = __CLASS__; 3747 | $data = preg_replace_callback( 3748 | '~%u( [\da-fA-F]{4}+ #%uXXXX only UCS-2 3749 | | \{ [\da-fA-F]{1,6}+ \} #%u{XXXXXX} extended form for all Unicode code points 3750 | ) 3751 | ~sxSX', 3752 | function (array $m) use ($class) 3753 | { 3754 | $codepoint = hexdec(trim($m[1], '{}')); 3755 | $char = $class::chr($codepoint); 3756 | return rawurlencode($char); 3757 | }, 3758 | $data); 3759 | } 3760 | return $is_urldecode ? urldecode($data) : $data; 3761 | } 3762 | if (is_scalar($data) || is_null($data)) return $data; #~ null, integer, float, boolean 3763 | return false; #object or resource 3764 | } 3765 | 3766 | /** 3767 | * 1) Fixes the global arrays $_GET, $_POST, $_COOKIE, $_REQUEST, $_FILES 3768 | * by decoding values from the %XX and extended %uXXXX / %u{XXXXXX} formats, 3769 | * produced, for example, by the outdated JavaScript function escape(). 3770 | * Standard PHP5 cannot do this. 3771 | * 2) Recodes $_GET, $_POST, $_COOKIE, $_REQUEST, $_FILES from the $charset 3772 | * encoding to UTF-8, if necessary.
3773 | * A positive side effect is protection against XSS attacks that use 3774 | * non-printable characters against vulnerable PHP functions. 3775 | * Thus web forms can be sent to the server in either of two encodings: $charset or UTF-8. 3776 | * For example: ?тест[тест]=тест 3777 | * 3) If the HTTP_COOKIE contains parameters with the same name, 3778 | * takes the last value (as in the QUERY_STRING), not the first. 3779 | * 4) Creates the $_POST array for a non-standard Content-Type, for example, 3780 | * "Content-Type: application/octet-stream". Standard PHP5 creates 3781 | * the array only for "Content-Type: application/x-www-form-urlencoded" 3782 | * and "Content-Type: multipart/form-data". 3783 | * 3784 | * Examples 3785 | * '%F2%E5%F1%F2' => 'тест' #CP1251 (regular) 3786 | * '0xF2E5F1F2' => 'тест' #CP1251 (compact) 3787 | * '%D1%82%D0%B5%D1%81%D1%82' => 'тест' #UTF-8 (regular) 3788 | * '0xD182D0B5D181D182' => 'тест' #UTF-8 (compact) 3789 | * '%u0442%u0435%u0441%u0442' => 'тест' #UCS-2 (U+0 to U+FFFF) 3790 | * '%u{442}%u{435}%u{0441}%u{00442}' => 'тест' #UTF-8 (U+0 to U+FFFFFF) 3791 | * 3792 | * Sessions, cookies and independent authorization on subdomains. 3793 | * 3794 | * EXAMPLE 1 3795 | * The production site http://domain.com acquired subdomains. 3796 | * For cross-domain authorization through the session mechanism, the COOKIE host name was changed from "domain.com" to ".domain.com". 3797 | * As a result, authorization stopped working. Solution: change the session name. 3798 | * Clearing the COOKIEs also helps, but forcing that on thousands of user computers is problematic. 3799 | * PHP incorrectly (?) handles the HTTP_COOKIE header when it contains parameters with the same name but different values. 3800 | * Example of an HTTP header sent by the client: "Cookie: sid=chpgs2fiak-330mzqza; sid=cmz5tnp5zz-xlbbgqp" 3801 | * In this case the server takes the first value, not the last. 3802 | * Yet when the same situation occurs in the QUERY_STRING, the last parameter is always taken.
3803 | * Two parameters with the same name can appear in HTTP_COOKIE if the following HTTP headers are sent to the client: 3804 | * "Set-Cookie: sid=chpgs2fiak-330mzqza; expires=Thu, 15 Oct 2009 14:23:42 GMT; path=/; domain=domain.com" (domain.com only) 3805 | * "Set-Cookie: sid=cmz6uqorzv-1bn35110; expires=Thu, 15 Oct 2009 14:23:42 GMT; path=/; domain=.domain.com" (domain.com and all its subdomains) 3806 | * 3807 | * EXAMPLE 2 3808 | * There are production sites: http://domain.com (main), http://admin.domain.com (admin panel), 3809 | * http://sub1.domain.com (subproject 1), http://sub2.domain.com (subproject 2). 3810 | * There is also a development server http://dev.domain.com, which may have its own subdomains. 3811 | * Independent cross-domain authorization is required for http://*.domain.com and http://*.dev.domain.com. 3812 | * To keep the authorization state we use a session whose name and value are written to a COOKIE. 3813 | * Since the domains http://*.dev.domain.com overlap with the domains http://*.domain.com, 3814 | * different session names must be used for independent authorization! 3815 | * Example of HTTP response headers from the server: 3816 | * "Set-Cookie: sid=chpgs2fiak-330mzqza; expires=Thu, 15 Oct 2009 14:23:42 GMT; path=/; domain=.domain.com" (.domain.com and all its subdomains) 3817 | * "Set-Cookie: sid.dev=cmz6uqorzv-1bn35110; expires=Thu, 15 Oct 2009 14:23:42 GMT; path=/; domain=.dev.domain.com" (dev.domain.com and all its subdomains) 3818 | * 3819 | * @link http://tools.ietf.org/html/rfc2965 RFC 2965 - HTTP State Management Mechanism 3820 | * @param bool $is_hex2bin Decode the HEX-data? 3821 | * Example: '0xD182D0B5D181D182' => "\xD1\x82\xD0\xB5\xD1\x81\xD1\x82" 3822 | * Hint: it is sometimes convenient to encode 3823 | * URL parameters not with rawurlencode($string), 3824 | * but with the following mechanism (the encoded data is more compact): 3825 | * '0x' . 
bin2hex($string) 3826 | * @param string $charset 3827 | * @return bool 3828 | */ 3829 | public static function unescape_request($is_hex2bin = false, $charset = 'ISO-8859-1') 3830 | { 3831 | $fixed = false; 3832 | #ATTENTION! HTTP_RAW_POST_DATA is only accessible when Content-Type of POST request is NOT default "application/x-www-form-urlencoded"! 3833 | $HTTP_RAW_POST_DATA = isset($_SERVER['REQUEST_METHOD']) && $_SERVER['REQUEST_METHOD'] === 'POST' ? (isset($GLOBALS['HTTP_RAW_POST_DATA']) ? $GLOBALS['HTTP_RAW_POST_DATA'] : @file_get_contents('php://input')) : null; 3834 | if (ini_get('always_populate_raw_post_data')) $GLOBALS['HTTP_RAW_POST_DATA'] = $HTTP_RAW_POST_DATA; 3835 | foreach (array( '_GET' => isset($_SERVER['QUERY_STRING']) ? $_SERVER['QUERY_STRING'] : null, 3836 | '_POST' => $HTTP_RAW_POST_DATA, 3837 | '_COOKIE' => isset($_SERVER['HTTP_COOKIE']) ? $_SERVER['HTTP_COOKIE'] : null, 3838 | '_FILES' => isset($_FILES) ? $_FILES : null, 3839 | ) as $k => $v) 3840 | { 3841 | if (! is_string($v)) continue; 3842 | 3843 | if ($k === '_COOKIE') 3844 | { 3845 | $v = preg_replace('/; *+/sSX', '&', $v); 3846 | unset($_COOKIE); #parse HTTP_COOKIE ourselves so that it is handled the same way as QUERY_STRING 3847 | } 3848 | 3849 | $v = self::unescape($v, $is_hex2bin, false); 3850 | if ($v === false) return false; 3851 | parse_str($v, $GLOBALS[$k]); 3852 | 3853 | $GLOBALS[$k] = self::convert_from($GLOBALS[$k], $charset); 3854 | if ($GLOBALS[$k] === false) 3855 | { 3856 | trigger_error('Array $' . $k . ' does not have keys/values in UTF-8 charset!', E_USER_WARNING); 3857 | return false; 3858 | } 3859 | 3860 | $fixed = true; 3861 | } 3862 | if ($fixed) 3863 | { 3864 | $_REQUEST = 3865 | (isset($_COOKIE) ? $_COOKIE : array()) + 3866 | (isset($_POST) ? $_POST : array()) + 3867 | (isset($_GET) ? $_GET : array()); 3868 | } 3869 | return true; 3870 | } 3871 | 3872 | /** 3873 | * Calculates the height of the edit text in