├── LICENSE ├── README.md ├── core ├── mbstring.php └── native.php ├── functions ├── ord.php ├── str_ireplace.php ├── str_pad.php ├── str_split.php ├── strcasecmp.php ├── strcspn.php ├── stristr.php ├── strrev.php ├── strspn.php ├── substr_replace.php ├── trim.php ├── ucfirst.php └── wordwrap.php ├── php-utf8.php └── utils ├── ascii.php ├── bad.php ├── patterns.php ├── position.php ├── specials.php ├── unicode.php └── validation.php /LICENSE: -------------------------------------------------------------------------------- 1 | GNU LESSER GENERAL PUBLIC LICENSE 2 | Version 2.1, February 1999 3 | 4 | Copyright (C) 1991, 1999 Free Software Foundation, Inc. 5 | 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA 6 | Everyone is permitted to copy and distribute verbatim copies 7 | of this license document, but changing it is not allowed. 8 | 9 | [This is the first released version of the Lesser GPL. It also counts 10 | as the successor of the GNU Library Public License, version 2, hence 11 | the version number 2.1.] 12 | 13 | Preamble 14 | 15 | The licenses for most software are designed to take away your 16 | freedom to share and change it. By contrast, the GNU General Public 17 | Licenses are intended to guarantee your freedom to share and change 18 | free software--to make sure the software is free for all its users. 19 | 20 | This license, the Lesser General Public License, applies to some 21 | specially designated software packages--typically libraries--of the 22 | Free Software Foundation and other authors who decide to use it. You 23 | can use it too, but we suggest you first think carefully about whether 24 | this license or the ordinary General Public License is the better 25 | strategy to use in any particular case, based on the explanations below. 26 | 27 | When we speak of free software, we are referring to freedom of use, 28 | not price. 
Our General Public Licenses are designed to make sure that 29 | you have the freedom to distribute copies of free software (and charge 30 | for this service if you wish); that you receive source code or can get 31 | it if you want it; that you can change the software and use pieces of 32 | it in new free programs; and that you are informed that you can do 33 | these things. 34 | 35 | To protect your rights, we need to make restrictions that forbid 36 | distributors to deny you these rights or to ask you to surrender these 37 | rights. These restrictions translate to certain responsibilities for 38 | you if you distribute copies of the library or if you modify it. 39 | 40 | For example, if you distribute copies of the library, whether gratis 41 | or for a fee, you must give the recipients all the rights that we gave 42 | you. You must make sure that they, too, receive or can get the source 43 | code. If you link other code with the library, you must provide 44 | complete object files to the recipients, so that they can relink them 45 | with the library after making changes to the library and recompiling 46 | it. And you must show them these terms so they know their rights. 47 | 48 | We protect your rights with a two-step method: (1) we copyright the 49 | library, and (2) we offer you this license, which gives you legal 50 | permission to copy, distribute and/or modify the library. 51 | 52 | To protect each distributor, we want to make it very clear that 53 | there is no warranty for the free library. Also, if the library is 54 | modified by someone else and passed on, the recipients should know 55 | that what they have is not the original version, so that the original 56 | author's reputation will not be affected by problems that might be 57 | introduced by others. 58 | 59 | Finally, software patents pose a constant threat to the existence of 60 | any free program. 
We wish to make sure that a company cannot 61 | effectively restrict the users of a free program by obtaining a 62 | restrictive license from a patent holder. Therefore, we insist that 63 | any patent license obtained for a version of the library must be 64 | consistent with the full freedom of use specified in this license. 65 | 66 | Most GNU software, including some libraries, is covered by the 67 | ordinary GNU General Public License. This license, the GNU Lesser 68 | General Public License, applies to certain designated libraries, and 69 | is quite different from the ordinary General Public License. We use 70 | this license for certain libraries in order to permit linking those 71 | libraries into non-free programs. 72 | 73 | When a program is linked with a library, whether statically or using 74 | a shared library, the combination of the two is legally speaking a 75 | combined work, a derivative of the original library. The ordinary 76 | General Public License therefore permits such linking only if the 77 | entire combination fits its criteria of freedom. The Lesser General 78 | Public License permits more lax criteria for linking other code with 79 | the library. 80 | 81 | We call this license the "Lesser" General Public License because it 82 | does Less to protect the user's freedom than the ordinary General 83 | Public License. It also provides other free software developers Less 84 | of an advantage over competing non-free programs. These disadvantages 85 | are the reason we use the ordinary General Public License for many 86 | libraries. However, the Lesser license provides advantages in certain 87 | special circumstances. 88 | 89 | For example, on rare occasions, there may be a special need to 90 | encourage the widest possible use of a certain library, so that it becomes 91 | a de-facto standard. To achieve this, non-free programs must be 92 | allowed to use the library. 
A more frequent case is that a free 93 | library does the same job as widely used non-free libraries. In this 94 | case, there is little to gain by limiting the free library to free 95 | software only, so we use the Lesser General Public License. 96 | 97 | In other cases, permission to use a particular library in non-free 98 | programs enables a greater number of people to use a large body of 99 | free software. For example, permission to use the GNU C Library in 100 | non-free programs enables many more people to use the whole GNU 101 | operating system, as well as its variant, the GNU/Linux operating 102 | system. 103 | 104 | Although the Lesser General Public License is Less protective of the 105 | users' freedom, it does ensure that the user of a program that is 106 | linked with the Library has the freedom and the wherewithal to run 107 | that program using a modified version of the Library. 108 | 109 | The precise terms and conditions for copying, distribution and 110 | modification follow. Pay close attention to the difference between a 111 | "work based on the library" and a "work that uses the library". The 112 | former contains code derived from the library, whereas the latter must 113 | be combined with the library in order to run. 114 | 115 | GNU LESSER GENERAL PUBLIC LICENSE 116 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 117 | 118 | 0. This License Agreement applies to any software library or other 119 | program which contains a notice placed by the copyright holder or 120 | other authorized party saying it may be distributed under the terms of 121 | this Lesser General Public License (also called "this License"). 122 | Each licensee is addressed as "you". 123 | 124 | A "library" means a collection of software functions and/or data 125 | prepared so as to be conveniently linked with application programs 126 | (which use some of those functions and data) to form executables. 
127 | 128 | The "Library", below, refers to any such software library or work 129 | which has been distributed under these terms. A "work based on the 130 | Library" means either the Library or any derivative work under 131 | copyright law: that is to say, a work containing the Library or a 132 | portion of it, either verbatim or with modifications and/or translated 133 | straightforwardly into another language. (Hereinafter, translation is 134 | included without limitation in the term "modification".) 135 | 136 | "Source code" for a work means the preferred form of the work for 137 | making modifications to it. For a library, complete source code means 138 | all the source code for all modules it contains, plus any associated 139 | interface definition files, plus the scripts used to control compilation 140 | and installation of the library. 141 | 142 | Activities other than copying, distribution and modification are not 143 | covered by this License; they are outside its scope. The act of 144 | running a program using the Library is not restricted, and output from 145 | such a program is covered only if its contents constitute a work based 146 | on the Library (independent of the use of the Library in a tool for 147 | writing it). Whether that is true depends on what the Library does 148 | and what the program that uses the Library does. 149 | 150 | 1. You may copy and distribute verbatim copies of the Library's 151 | complete source code as you receive it, in any medium, provided that 152 | you conspicuously and appropriately publish on each copy an 153 | appropriate copyright notice and disclaimer of warranty; keep intact 154 | all the notices that refer to this License and to the absence of any 155 | warranty; and distribute a copy of this License along with the 156 | Library. 157 | 158 | You may charge a fee for the physical act of transferring a copy, 159 | and you may at your option offer warranty protection in exchange for a 160 | fee. 161 | 162 | 2. 
You may modify your copy or copies of the Library or any portion 163 | of it, thus forming a work based on the Library, and copy and 164 | distribute such modifications or work under the terms of Section 1 165 | above, provided that you also meet all of these conditions: 166 | 167 | a) The modified work must itself be a software library. 168 | 169 | b) You must cause the files modified to carry prominent notices 170 | stating that you changed the files and the date of any change. 171 | 172 | c) You must cause the whole of the work to be licensed at no 173 | charge to all third parties under the terms of this License. 174 | 175 | d) If a facility in the modified Library refers to a function or a 176 | table of data to be supplied by an application program that uses 177 | the facility, other than as an argument passed when the facility 178 | is invoked, then you must make a good faith effort to ensure that, 179 | in the event an application does not supply such function or 180 | table, the facility still operates, and performs whatever part of 181 | its purpose remains meaningful. 182 | 183 | (For example, a function in a library to compute square roots has 184 | a purpose that is entirely well-defined independent of the 185 | application. Therefore, Subsection 2d requires that any 186 | application-supplied function or table used by this function must 187 | be optional: if the application does not supply it, the square 188 | root function must still compute square roots.) 189 | 190 | These requirements apply to the modified work as a whole. If 191 | identifiable sections of that work are not derived from the Library, 192 | and can be reasonably considered independent and separate works in 193 | themselves, then this License, and its terms, do not apply to those 194 | sections when you distribute them as separate works. 
But when you 195 | distribute the same sections as part of a whole which is a work based 196 | on the Library, the distribution of the whole must be on the terms of 197 | this License, whose permissions for other licensees extend to the 198 | entire whole, and thus to each and every part regardless of who wrote 199 | it. 200 | 201 | Thus, it is not the intent of this section to claim rights or contest 202 | your rights to work written entirely by you; rather, the intent is to 203 | exercise the right to control the distribution of derivative or 204 | collective works based on the Library. 205 | 206 | In addition, mere aggregation of another work not based on the Library 207 | with the Library (or with a work based on the Library) on a volume of 208 | a storage or distribution medium does not bring the other work under 209 | the scope of this License. 210 | 211 | 3. You may opt to apply the terms of the ordinary GNU General Public 212 | License instead of this License to a given copy of the Library. To do 213 | this, you must alter all the notices that refer to this License, so 214 | that they refer to the ordinary GNU General Public License, version 2, 215 | instead of to this License. (If a newer version than version 2 of the 216 | ordinary GNU General Public License has appeared, then you can specify 217 | that version instead if you wish.) Do not make any other change in 218 | these notices. 219 | 220 | Once this change is made in a given copy, it is irreversible for 221 | that copy, so the ordinary GNU General Public License applies to all 222 | subsequent copies and derivative works made from that copy. 223 | 224 | This option is useful when you wish to copy part of the code of 225 | the Library into a program that is not a library. 226 | 227 | 4. 
You may copy and distribute the Library (or a portion or 228 | derivative of it, under Section 2) in object code or executable form 229 | under the terms of Sections 1 and 2 above provided that you accompany 230 | it with the complete corresponding machine-readable source code, which 231 | must be distributed under the terms of Sections 1 and 2 above on a 232 | medium customarily used for software interchange. 233 | 234 | If distribution of object code is made by offering access to copy 235 | from a designated place, then offering equivalent access to copy the 236 | source code from the same place satisfies the requirement to 237 | distribute the source code, even though third parties are not 238 | compelled to copy the source along with the object code. 239 | 240 | 5. A program that contains no derivative of any portion of the 241 | Library, but is designed to work with the Library by being compiled or 242 | linked with it, is called a "work that uses the Library". Such a 243 | work, in isolation, is not a derivative work of the Library, and 244 | therefore falls outside the scope of this License. 245 | 246 | However, linking a "work that uses the Library" with the Library 247 | creates an executable that is a derivative of the Library (because it 248 | contains portions of the Library), rather than a "work that uses the 249 | library". The executable is therefore covered by this License. 250 | Section 6 states terms for distribution of such executables. 251 | 252 | When a "work that uses the Library" uses material from a header file 253 | that is part of the Library, the object code for the work may be a 254 | derivative work of the Library even though the source code is not. 255 | Whether this is true is especially significant if the work can be 256 | linked without the Library, or if the work is itself a library. The 257 | threshold for this to be true is not precisely defined by law. 
258 | 259 | If such an object file uses only numerical parameters, data 260 | structure layouts and accessors, and small macros and small inline 261 | functions (ten lines or less in length), then the use of the object 262 | file is unrestricted, regardless of whether it is legally a derivative 263 | work. (Executables containing this object code plus portions of the 264 | Library will still fall under Section 6.) 265 | 266 | Otherwise, if the work is a derivative of the Library, you may 267 | distribute the object code for the work under the terms of Section 6. 268 | Any executables containing that work also fall under Section 6, 269 | whether or not they are linked directly with the Library itself. 270 | 271 | 6. As an exception to the Sections above, you may also combine or 272 | link a "work that uses the Library" with the Library to produce a 273 | work containing portions of the Library, and distribute that work 274 | under terms of your choice, provided that the terms permit 275 | modification of the work for the customer's own use and reverse 276 | engineering for debugging such modifications. 277 | 278 | You must give prominent notice with each copy of the work that the 279 | Library is used in it and that the Library and its use are covered by 280 | this License. You must supply a copy of this License. If the work 281 | during execution displays copyright notices, you must include the 282 | copyright notice for the Library among them, as well as a reference 283 | directing the user to the copy of this License. 
Also, you must do one 284 | of these things: 285 | 286 | a) Accompany the work with the complete corresponding 287 | machine-readable source code for the Library including whatever 288 | changes were used in the work (which must be distributed under 289 | Sections 1 and 2 above); and, if the work is an executable linked 290 | with the Library, with the complete machine-readable "work that 291 | uses the Library", as object code and/or source code, so that the 292 | user can modify the Library and then relink to produce a modified 293 | executable containing the modified Library. (It is understood 294 | that the user who changes the contents of definitions files in the 295 | Library will not necessarily be able to recompile the application 296 | to use the modified definitions.) 297 | 298 | b) Use a suitable shared library mechanism for linking with the 299 | Library. A suitable mechanism is one that (1) uses at run time a 300 | copy of the library already present on the user's computer system, 301 | rather than copying library functions into the executable, and (2) 302 | will operate properly with a modified version of the library, if 303 | the user installs one, as long as the modified version is 304 | interface-compatible with the version that the work was made with. 305 | 306 | c) Accompany the work with a written offer, valid for at 307 | least three years, to give the same user the materials 308 | specified in Subsection 6a, above, for a charge no more 309 | than the cost of performing this distribution. 310 | 311 | d) If distribution of the work is made by offering access to copy 312 | from a designated place, offer equivalent access to copy the above 313 | specified materials from the same place. 314 | 315 | e) Verify that the user has already received a copy of these 316 | materials or that you have already sent this user a copy. 
317 | 318 | For an executable, the required form of the "work that uses the 319 | Library" must include any data and utility programs needed for 320 | reproducing the executable from it. However, as a special exception, 321 | the materials to be distributed need not include anything that is 322 | normally distributed (in either source or binary form) with the major 323 | components (compiler, kernel, and so on) of the operating system on 324 | which the executable runs, unless that component itself accompanies 325 | the executable. 326 | 327 | It may happen that this requirement contradicts the license 328 | restrictions of other proprietary libraries that do not normally 329 | accompany the operating system. Such a contradiction means you cannot 330 | use both them and the Library together in an executable that you 331 | distribute. 332 | 333 | 7. You may place library facilities that are a work based on the 334 | Library side-by-side in a single library together with other library 335 | facilities not covered by this License, and distribute such a combined 336 | library, provided that the separate distribution of the work based on 337 | the Library and of the other library facilities is otherwise 338 | permitted, and provided that you do these two things: 339 | 340 | a) Accompany the combined library with a copy of the same work 341 | based on the Library, uncombined with any other library 342 | facilities. This must be distributed under the terms of the 343 | Sections above. 344 | 345 | b) Give prominent notice with the combined library of the fact 346 | that part of it is a work based on the Library, and explaining 347 | where to find the accompanying uncombined form of the same work. 348 | 349 | 8. You may not copy, modify, sublicense, link with, or distribute 350 | the Library except as expressly provided under this License. 
Any 351 | attempt otherwise to copy, modify, sublicense, link with, or 352 | distribute the Library is void, and will automatically terminate your 353 | rights under this License. However, parties who have received copies, 354 | or rights, from you under this License will not have their licenses 355 | terminated so long as such parties remain in full compliance. 356 | 357 | 9. You are not required to accept this License, since you have not 358 | signed it. However, nothing else grants you permission to modify or 359 | distribute the Library or its derivative works. These actions are 360 | prohibited by law if you do not accept this License. Therefore, by 361 | modifying or distributing the Library (or any work based on the 362 | Library), you indicate your acceptance of this License to do so, and 363 | all its terms and conditions for copying, distributing or modifying 364 | the Library or works based on it. 365 | 366 | 10. Each time you redistribute the Library (or any work based on the 367 | Library), the recipient automatically receives a license from the 368 | original licensor to copy, distribute, link with or modify the Library 369 | subject to these terms and conditions. You may not impose any further 370 | restrictions on the recipients' exercise of the rights granted herein. 371 | You are not responsible for enforcing compliance by third parties with 372 | this License. 373 | 374 | 11. If, as a consequence of a court judgment or allegation of patent 375 | infringement or for any other reason (not limited to patent issues), 376 | conditions are imposed on you (whether by court order, agreement or 377 | otherwise) that contradict the conditions of this License, they do not 378 | excuse you from the conditions of this License. If you cannot 379 | distribute so as to satisfy simultaneously your obligations under this 380 | License and any other pertinent obligations, then as a consequence you 381 | may not distribute the Library at all. 
For example, if a patent 382 | license would not permit royalty-free redistribution of the Library by 383 | all those who receive copies directly or indirectly through you, then 384 | the only way you could satisfy both it and this License would be to 385 | refrain entirely from distribution of the Library. 386 | 387 | If any portion of this section is held invalid or unenforceable under any 388 | particular circumstance, the balance of the section is intended to apply, 389 | and the section as a whole is intended to apply in other circumstances. 390 | 391 | It is not the purpose of this section to induce you to infringe any 392 | patents or other property right claims or to contest validity of any 393 | such claims; this section has the sole purpose of protecting the 394 | integrity of the free software distribution system which is 395 | implemented by public license practices. Many people have made 396 | generous contributions to the wide range of software distributed 397 | through that system in reliance on consistent application of that 398 | system; it is up to the author/donor to decide if he or she is willing 399 | to distribute software through any other system and a licensee cannot 400 | impose that choice. 401 | 402 | This section is intended to make thoroughly clear what is believed to 403 | be a consequence of the rest of this License. 404 | 405 | 12. If the distribution and/or use of the Library is restricted in 406 | certain countries either by patents or by copyrighted interfaces, the 407 | original copyright holder who places the Library under this License may add 408 | an explicit geographical distribution limitation excluding those countries, 409 | so that distribution is permitted only in or among countries not thus 410 | excluded. In such case, this License incorporates the limitation as if 411 | written in the body of this License. 412 | 413 | 13. 
The Free Software Foundation may publish revised and/or new 414 | versions of the Lesser General Public License from time to time. 415 | Such new versions will be similar in spirit to the present version, 416 | but may differ in detail to address new problems or concerns. 417 | 418 | Each version is given a distinguishing version number. If the Library 419 | specifies a version number of this License which applies to it and 420 | "any later version", you have the option of following the terms and 421 | conditions either of that version or of any later version published by 422 | the Free Software Foundation. If the Library does not specify a 423 | license version number, you may choose any version ever published by 424 | the Free Software Foundation. 425 | 426 | 14. If you wish to incorporate parts of the Library into other free 427 | programs whose distribution conditions are incompatible with these, 428 | write to the author to ask for permission. For software which is 429 | copyrighted by the Free Software Foundation, write to the Free 430 | Software Foundation; we sometimes make exceptions for this. Our 431 | decision will be guided by the two goals of preserving the free status 432 | of all derivatives of our free software and of promoting the sharing 433 | and reuse of software generally. 434 | 435 | NO WARRANTY 436 | 437 | 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO 438 | WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 439 | EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR 440 | OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY 441 | KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE 442 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 443 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE 444 | LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME 445 | THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 
446 | 447 | 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN 448 | WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY 449 | AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU 450 | FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR 451 | CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE 452 | LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING 453 | RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A 454 | FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF 455 | SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH 456 | DAMAGES. 457 | 458 | END OF TERMS AND CONDITIONS 459 | 460 | How to Apply These Terms to Your New Libraries 461 | 462 | If you develop a new library, and you want it to be of the greatest 463 | possible use to the public, we recommend making it free software that 464 | everyone can redistribute and change. You can do so by permitting 465 | redistribution under these terms (or, alternatively, under the terms of the 466 | ordinary General Public License). 467 | 468 | To apply these terms, attach the following notices to the library. It is 469 | safest to attach them to the start of each source file to most effectively 470 | convey the exclusion of warranty; and each file should have at least the 471 | "copyright" line and a pointer to where the full notice is found. 472 | 473 | 474 | Copyright (C) 475 | 476 | This library is free software; you can redistribute it and/or 477 | modify it under the terms of the GNU Lesser General Public 478 | License as published by the Free Software Foundation; either 479 | version 2.1 of the License, or (at your option) any later version. 480 | 481 | This library is distributed in the hope that it will be useful, 482 | but WITHOUT ANY WARRANTY; without even the implied warranty of 483 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU 484 | Lesser General Public License for more details. 485 | 486 | You should have received a copy of the GNU Lesser General Public 487 | License along with this library; if not, write to the Free Software 488 | Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA 489 | 490 | Also add information on how to contact you by electronic and paper mail. 491 | 492 | You should also get your employer (if you work as a programmer) or your 493 | school, if any, to sign a "copyright disclaimer" for the library, if 494 | necessary. Here is a sample; alter the names: 495 | 496 | Yoyodyne, Inc., hereby disclaims all copyright interest in the 497 | library `Frob' (a library for tweaking knobs) written by James Random Hacker. 498 | 499 | , 1 April 1990 500 | Ty Coon, President of Vice 501 | 502 | That's all there is to it! 503 | 504 | 505 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | PHP-UTF8 2 | ======== 3 | 4 | Introduction 5 | ------------ 6 | 7 | PHP-UTF8 is a UTF-8 aware library of functions mirroring PHP's own string 8 | functions. It does not require the PHP mbstring extension, though it will use it, 9 | if found, for a (small) performance gain. 10 | 11 | The project was initially hosted on SourceForge, where it died due to lack of development 12 | and support. It has since been forked and moved to GitHub so that 13 | more people can contribute easily. 14 | 15 | Use the [issue tracker][1] on GitHub to report problems and post 16 | feature requests. 17 | 18 | Please feel free to fork the project and send us pull requests for optimizations 19 | and new features. 20 | 21 | Documentation & Usage Information 22 | --------------------------------- 23 | 24 | Using the php-utf8 library is quite easy. 
Just include `php-utf8.php` and 25 | any additional functions that you may need from the `functions` folder. 26 | 27 | Sample Code: 28 | 29 | // get the core functions included ... 30 | require('php-utf8_path/php-utf8.php'); 31 | 32 | // ... and any other functions/*.php or utils/*.php files you may need. 33 | require('php-utf8_path/functions/trim.php'); 34 | 35 | Make sure that you are confident about using the library by reading 36 | [Character Sets / Character Encoding Issues][2] and [Handling UTF-8 with PHP][3]. 37 | 38 | Use these functions **only** if you really need them and you understand **why** 39 | you need to use them. 40 | 41 | In particular, do not blindly replace all uses of PHP's string functions with 42 | the functions found here. Most of the time you will not need to, and you will be 43 | introducing a significant performance overhead to your application. 44 | 45 | Most of the functions here do not operate *defensively*, mainly for performance 46 | reasons. For example, there is no extensive parameter checking, and it is assumed 47 | that they are fed well-formed UTF-8. This is particularly relevant when it 48 | comes to catching badly formed UTF-8. You should screen input on the *outer perimeter* 49 | with help from functions in the `utils/validation.php` and `utils/bad.php` files. 50 | 51 | Throughout the library **all** ASCII characters (*control characters included*) 52 | are treated as valid. Make sure you take the appropriate 53 | measures before outputting into XML, since it can become ill-formed with some 54 | control characters. [more info][5] 55 | 56 | Licensing 57 | --------- 58 | The initial code of PHP-UTF8 is published under LGPL. Please find a copy of the 59 | license in the LICENSE file. 60 | 61 | Parts of the code in this library come from other places, under different licenses. 62 | The authors involved have been contacted (see below). 
63 | Attribution for which code came from elsewhere can be found in the source code itself. 64 | 65 | - Andreas Gohr / Chris Smith of Dokuwiki. *There is a fair degree of 66 | collaboration/exchange of ideas and code between [Dokuwiki's UTF-8 library][6] 67 | and phputf8. Although Dokuwiki is released under GPL, its UTF-8 library is 68 | released under LGPL, hence no conflict with phputf8.* 69 | - Henri Sivonen ([site][7]) *has also given permission for his code to be released 70 | under the terms of the LGPL. He ported a Unicode / UTF-8 converter from the 71 | Mozilla codebase to PHP, which is re-used in php-utf8.* 72 | 73 | [1]: http://github.com/FSX/php-utf8/issues 74 | [2]: http://www.phpwact.org/php/i18n/charsets 75 | [3]: http://www.phpwact.org/php/i18n/utf-8 76 | [4]: http://www.phpwact.org/php/i18n/utf-8 77 | [5]: http://hsivonen.iki.fi/producing-xml/#controlchar 78 | [6]: http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php 79 | [7]: http://hsivonen.iki.fi/php-utf8/ 80 | -------------------------------------------------------------------------------- /core/mbstring.php: -------------------------------------------------------------------------------- 1 | 25 | * @link http://www.php.net/manual/en/function.strlen.php 26 | * @link http://www.php.net/manual/en/function.utf8-decode.php 27 | * @param string $str UTF-8 string 28 | * @return int number of UTF-8 characters in string 29 | */ 30 | function utf8_strlen($str) 31 | { 32 | return strlen(utf8_decode(utf8_bad_clean($str))); 33 | } 34 | 35 | /** 36 | * UTF-8 aware alternative to strpos. 37 | * 38 | * Find position of first occurrence of a string. 39 | * This will get a lot slower if offset is used.
40 | * 41 | * @see http://www.php.net/strpos 42 | * @see utf8_strlen 43 | * @see utf8_substr 44 | * @param string $str haystack 45 | * @param string $needle needle (you should validate this with utf8_is_valid) 46 | * @param integer $offset (optional) offset in characters (from left) 47 | * @return mixed integer position or FALSE on failure 48 | */ 49 | function utf8_strpos($str, $needle, $offset = false) 50 | { 51 | if ($offset === false) 52 | { 53 | $ar = explode($needle, $str, 2); 54 | 55 | //if (count($ar) > 1) 56 | if (isset($ar[1])) 57 | return utf8_strlen($ar[0]); 58 | 59 | return false; 60 | } 61 | 62 | if (!is_int($offset)) 63 | { 64 | trigger_error('utf8_strpos: Offset must be an integer', E_USER_ERROR); 65 | return false; 66 | } 67 | 68 | $str = utf8_substr($str, $offset); 69 | 70 | if (($pos = utf8_strpos($str, $needle)) !== false) 71 | return $pos + $offset; 72 | 73 | return false; 74 | } 75 | 76 | /** 77 | * UTF-8 aware alternative to strrpos. 78 | * 79 | * Find position of last occurrence of a char in a string.
80 | * This will get a lot slower if offset is used. 81 | * 82 | * @see http://www.php.net/strrpos 83 | * @see utf8_substr 84 | * @see utf8_strlen 85 | * @param string $str haystack 86 | * @param string $needle needle (you should validate this with utf8_is_valid) 87 | * @param integer $offset (optional) offset (from left) 88 | * @return mixed integer position or FALSE on failure 89 | */ 90 | function utf8_strrpos($str, $needle, $offset = false) 91 | { 92 | if ($offset === false) 93 | { 94 | $ar = explode($needle, $str); 95 | 96 | //if (count($ar) > 1) 97 | if (isset($ar[1])) 98 | { 99 | // Pop off the end of the string where the last match was made 100 | array_pop($ar); 101 | $str = implode($needle, $ar); 102 | 103 | return utf8_strlen($str); 104 | } 105 | 106 | return false; 107 | } 108 | 109 | if (!is_int($offset)) 110 | { 111 | trigger_error('utf8_strrpos expects parameter 3 to be long', E_USER_WARNING); 112 | return false; 113 | } 114 | 115 | $str = utf8_substr($str, $offset); 116 | 117 | if (($pos = utf8_strrpos($str, $needle)) !== false) 118 | return $pos + $offset; 119 | 120 | return false; 121 | } 122 | 123 | /** 124 | * UTF-8 aware alternative to substr. 125 | * 126 | * Return part of a string given a character offset (and optionally a length). 127 | * 128 | * Compared to substr, if offset or length are not integers, this version will 129 | * not complain but rather massage them into integers. 130 | * 131 | * Note on returned values: substr documentation states false can be returned 132 | * in some cases (e.g. offset > string length). mb_substr never returns false; 133 | * it will return an empty string instead. This adopts the mb_substr approach. 134 | * 135 | * Note on implementation: PCRE only supports repetitions of less than 65536; 136 | * in order to accept up to MAXINT values for offset and length, we'll repeat 137 | * a group of 65535 characters when needed.
138 | * 139 | * Note on implementation: calculating the number of characters in the string 140 | * is a relatively expensive operation, so we only carry it out when necessary. 141 | * It isn't necessary for positive offsets when no length is specified. 142 | * 143 | * @author Chris Smith 144 | * @param string $str 145 | * @param integer $offset number of UTF-8 characters offset (from left) 146 | * @param integer $length (optional) length in UTF-8 characters from offset 147 | * @return string the selected part, or an empty string on failure 148 | */ 149 | function utf8_substr($str, $offset, $length = false) 150 | { 151 | // Generates E_NOTICE for PHP4 objects, but not PHP5 objects 152 | $str = (string) $str; 153 | $offset = (int) $offset; 154 | 155 | if ($length) 156 | $length = (int) $length; 157 | 158 | // Handle trivial cases 159 | if ($length === 0) 160 | return ''; 161 | if ($offset < 0 && $length < 0 && $length < $offset) 162 | return ''; 163 | 164 | // Normalise negative offsets (we could use tail- 165 | // anchored patterns, but they are horribly slow!)
166 | if ($offset < 0) 167 | { 168 | // See notes 169 | $strlen = utf8_strlen($str); 170 | $offset = $strlen + $offset; 171 | 172 | if($offset < 0) 173 | $offset = 0; 174 | } 175 | 176 | $offset_pattern = ''; 177 | $length_pattern = ''; 178 | 179 | // Establish a pattern for offset, a 180 | // non-captured group equal in length to offset 181 | if ($offset > 0) 182 | { 183 | $ox = (int) ($offset / 65535); 184 | $oy = $offset % 65535; 185 | 186 | if ($ox) 187 | $offset_pattern = '(?:.{65535}){'.$ox.'}'; 188 | 189 | $offset_pattern = '^(?:'.$offset_pattern.'.{'.$oy.'})'; 190 | } 191 | else 192 | $offset_pattern = '^'; 193 | 194 | 195 | // Establish a pattern for length 196 | if (!$length) 197 | $length_pattern = '(.*)$'; // The rest of the string 198 | else 199 | { 200 | // See notes 201 | if (!isset($strlen)) 202 | $strlen = utf8_strlen($str); 203 | 204 | // Another trivial case 205 | if ($offset > $strlen) 206 | return ''; 207 | 208 | if ($length > 0) 209 | { 210 | // Reduce any length that would go past the end of the string 211 | $length = min($strlen - $offset, $length); 212 | 213 | $lx = (int) ($length / 65535); 214 | $ly = $length % 65535; 215 | 216 | // Positive length requires a captured group of length characters 217 | if ($lx) 218 | $length_pattern = '(?:.{65535}){'.$lx.'}'; 219 | 220 | $length_pattern = '('.$length_pattern.'.{'.$ly.'})'; 221 | } 222 | elseif ($length < 0) 223 | { 224 | if ($length < ($offset - $strlen)) 225 | return ''; 226 | 227 | $lx = (int) ((-$length) / 65535); 228 | $ly = (-$length) % 65535; 229 | 230 | // Negative length requires ...
capture everything except a group of 231 | // -length characters anchored at the tail-end of the string 232 | if ($lx) 233 | $length_pattern = '(?:.{65535}){'.$lx.'}'; 234 | 235 | $length_pattern = '(.*)(?:'.$length_pattern.'.{'.$ly.'})$'; 236 | } 237 | } 238 | 239 | if(!preg_match('#'.$offset_pattern.$length_pattern.'#us', $str, $match)) 240 | return ''; 241 | 242 | return $match[1]; 243 | } 244 | 245 | /** 246 | * UTF-8 aware alternative to strtolower. 247 | * 248 | * Make a string lowercase. 249 | * 250 | * The concept of a character's "case" only exists in some alphabets such as 251 | * Latin, Greek, Cyrillic, Armenian and archaic Georgian - it does not exist in 252 | * the Chinese alphabet, for example. See Unicode Standard Annex #21: Case Mappings. 253 | * 254 | * @author Andreas Gohr 255 | * @see http://www.php.net/strtolower 256 | * @see utf8_to_unicode 257 | * @see utf8_from_unicode 258 | * @see http://www.unicode.org/reports/tr21/tr21-5.html 259 | * @see http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php 260 | * @param string $string 261 | * @return mixed either string in lowercase or FALSE if the UTF-8 is invalid 262 | */ 263 | function utf8_strtolower($string) 264 | { 265 | static $UTF8_UPPER_TO_LOWER; 266 | 267 | $uni = utf8_to_unicode($string); 268 | if (!$uni) 269 | return false; 270 | 271 | if (!$UTF8_UPPER_TO_LOWER) 272 | { 273 | $UTF8_UPPER_TO_LOWER = array( 274 | 0x0041 => 0x0061, 0x03A6 => 0x03C6, 0x0162 => 0x0163, 0x00C5 => 0x00E5, 0x0042 => 0x0062, 275 | 0x0139 => 0x013A, 0x00C1 => 0x00E1, 0x0141 => 0x0142, 0x038E => 0x03CD, 0x0100 => 0x0101, 276 | 0x0490 => 0x0491, 0x0394 => 0x03B4, 0x015A => 0x015B, 0x0044 => 0x0064, 0x0393 => 0x03B3, 277 | 0x00D4 => 0x00F4, 0x042A => 0x044A, 0x0419 => 0x0439, 0x0112 => 0x0113, 0x041C => 0x043C, 278 | 0x015E => 0x015F, 0x0143 => 0x0144, 0x00CE => 0x00EE, 0x040E => 0x045E, 0x042F => 0x044F, 279 | 0x039A => 0x03BA, 0x0154 => 0x0155, 0x0049 => 0x0069, 0x0053 => 0x0073, 0x1E1E => 0x1E1F, 280 | 0x0134 => 0x0135,
0x0427 => 0x0447, 0x03A0 => 0x03C0, 0x0418 => 0x0438, 0x00D3 => 0x00F3, 281 | 0x0420 => 0x0440, 0x0404 => 0x0454, 0x0415 => 0x0435, 0x0429 => 0x0449, 0x014A => 0x014B, 282 | 0x0411 => 0x0431, 0x0409 => 0x0459, 0x1E02 => 0x1E03, 0x00D6 => 0x00F6, 0x00D9 => 0x00F9, 283 | 0x004E => 0x006E, 0x0401 => 0x0451, 0x03A4 => 0x03C4, 0x0423 => 0x0443, 0x015C => 0x015D, 284 | 0x0403 => 0x0453, 0x03A8 => 0x03C8, 0x0158 => 0x0159, 0x0047 => 0x0067, 0x00C4 => 0x00E4, 285 | 0x0386 => 0x03AC, 0x0389 => 0x03AE, 0x0166 => 0x0167, 0x039E => 0x03BE, 0x0164 => 0x0165, 286 | 0x0116 => 0x0117, 0x0108 => 0x0109, 0x0056 => 0x0076, 0x00DE => 0x00FE, 0x0156 => 0x0157, 287 | 0x00DA => 0x00FA, 0x1E60 => 0x1E61, 0x1E82 => 0x1E83, 0x00C2 => 0x00E2, 0x0118 => 0x0119, 288 | 0x0145 => 0x0146, 0x0050 => 0x0070, 0x0150 => 0x0151, 0x042E => 0x044E, 0x0128 => 0x0129, 289 | 0x03A7 => 0x03C7, 0x013D => 0x013E, 0x0422 => 0x0442, 0x005A => 0x007A, 0x0428 => 0x0448, 290 | 0x03A1 => 0x03C1, 0x1E80 => 0x1E81, 0x016C => 0x016D, 0x00D5 => 0x00F5, 0x0055 => 0x0075, 291 | 0x0176 => 0x0177, 0x00DC => 0x00FC, 0x1E56 => 0x1E57, 0x03A3 => 0x03C3, 0x041A => 0x043A, 292 | 0x004D => 0x006D, 0x016A => 0x016B, 0x0170 => 0x0171, 0x0424 => 0x0444, 0x00CC => 0x00EC, 293 | 0x0168 => 0x0169, 0x039F => 0x03BF, 0x004B => 0x006B, 0x00D2 => 0x00F2, 0x00C0 => 0x00E0, 294 | 0x0414 => 0x0434, 0x03A9 => 0x03C9, 0x1E6A => 0x1E6B, 0x00C3 => 0x00E3, 0x042D => 0x044D, 295 | 0x0416 => 0x0436, 0x01A0 => 0x01A1, 0x010C => 0x010D, 0x011C => 0x011D, 0x00D0 => 0x00F0, 296 | 0x013B => 0x013C, 0x040F => 0x045F, 0x040A => 0x045A, 0x00C8 => 0x00E8, 0x03A5 => 0x03C5, 297 | 0x0046 => 0x0066, 0x00DD => 0x00FD, 0x0043 => 0x0063, 0x021A => 0x021B, 0x00CA => 0x00EA, 298 | 0x0399 => 0x03B9, 0x0179 => 0x017A, 0x00CF => 0x00EF, 0x01AF => 0x01B0, 0x0045 => 0x0065, 299 | 0x039B => 0x03BB, 0x0398 => 0x03B8, 0x039C => 0x03BC, 0x040C => 0x045C, 0x041F => 0x043F, 300 | 0x042C => 0x044C, 0x00DE => 0x00FE, 0x00D0 => 0x00F0, 0x1EF2 => 0x1EF3, 0x0048 => 0x0068, 301 | 
0x00CB => 0x00EB, 0x0110 => 0x0111, 0x0413 => 0x0433, 0x012E => 0x012F, 0x00C6 => 0x00E6, 302 | 0x0058 => 0x0078, 0x0160 => 0x0161, 0x016E => 0x016F, 0x0391 => 0x03B1, 0x0407 => 0x0457, 303 | 0x0172 => 0x0173, 0x0178 => 0x00FF, 0x004F => 0x006F, 0x041B => 0x043B, 0x0395 => 0x03B5, 304 | 0x0425 => 0x0445, 0x0120 => 0x0121, 0x017D => 0x017E, 0x017B => 0x017C, 0x0396 => 0x03B6, 305 | 0x0392 => 0x03B2, 0x0388 => 0x03AD, 0x1E84 => 0x1E85, 0x0174 => 0x0175, 0x0051 => 0x0071, 306 | 0x0417 => 0x0437, 0x1E0A => 0x1E0B, 0x0147 => 0x0148, 0x0104 => 0x0105, 0x0408 => 0x0458, 307 | 0x014C => 0x014D, 0x00CD => 0x00ED, 0x0059 => 0x0079, 0x010A => 0x010B, 0x038F => 0x03CE, 308 | 0x0052 => 0x0072, 0x0410 => 0x0430, 0x0405 => 0x0455, 0x0402 => 0x0452, 0x0126 => 0x0127, 309 | 0x0136 => 0x0137, 0x012A => 0x012B, 0x038A => 0x03AF, 0x042B => 0x044B, 0x004C => 0x006C, 310 | 0x0397 => 0x03B7, 0x0124 => 0x0125, 0x0218 => 0x0219, 0x00DB => 0x00FB, 0x011E => 0x011F, 311 | 0x041E => 0x043E, 0x1E40 => 0x1E41, 0x039D => 0x03BD, 0x0106 => 0x0107, 0x03AB => 0x03CB, 312 | 0x0426 => 0x0446, 0x00DE => 0x00FE, 0x00C7 => 0x00E7, 0x03AA => 0x03CA, 0x0421 => 0x0441, 313 | 0x0412 => 0x0432, 0x010E => 0x010F, 0x00D8 => 0x00F8, 0x0057 => 0x0077, 0x011A => 0x011B, 314 | 0x0054 => 0x0074, 0x004A => 0x006A, 0x040B => 0x045B, 0x0406 => 0x0456, 0x0102 => 0x0103, 315 | 0x039B => 0x03BB, 0x00D1 => 0x00F1, 0x041D => 0x043D, 0x038C => 0x03CC, 0x00C9 => 0x00E9, 316 | 0x00D0 => 0x00F0, 0x0407 => 0x0457, 0x0122 => 0x0123 317 | ); 318 | } 319 | 320 | $cnt = count($uni); 321 | for ($i = 0; $i < $cnt; $i++) 322 | { 323 | if (isset($UTF8_UPPER_TO_LOWER[$uni[$i]])) 324 | $uni[$i] = $UTF8_UPPER_TO_LOWER[$uni[$i]]; 325 | } 326 | 327 | return utf8_from_unicode($uni); 328 | } 329 | 330 | /** 331 | * UTF-8 aware alternative to strtoupper. 332 | * 333 | * Make a string uppercase. 
334 | * 335 | * The concept of a character's "case" only exists in some alphabets such as 336 | * Latin, Greek, Cyrillic, Armenian and archaic Georgian - it does not exist in 337 | * the Chinese alphabet, for example. See Unicode Standard Annex #21: Case Mappings. 338 | * 339 | * @author Andreas Gohr 340 | * @see http://www.php.net/strtoupper 341 | * @see utf8_to_unicode 342 | * @see utf8_from_unicode 343 | * @see http://www.unicode.org/reports/tr21/tr21-5.html 344 | * @see http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php 345 | * @param string $string 346 | * @return mixed either string in uppercase or FALSE if the UTF-8 is invalid 347 | */ 348 | function utf8_strtoupper($string) 349 | { 350 | static $UTF8_LOWER_TO_UPPER; 351 | 352 | $uni = utf8_to_unicode($string); 353 | if (!$uni) 354 | return false; 355 | 356 | if (!$UTF8_LOWER_TO_UPPER) 357 | { 358 | $UTF8_LOWER_TO_UPPER = array( 359 | 0x0061 => 0x0041, 0x03C6 => 0x03A6, 0x0163 => 0x0162, 0x00E5 => 0x00C5, 0x0062 => 0x0042, 360 | 0x013A => 0x0139, 0x00E1 => 0x00C1, 0x0142 => 0x0141, 0x03CD => 0x038E, 0x0101 => 0x0100, 361 | 0x0491 => 0x0490, 0x03B4 => 0x0394, 0x015B => 0x015A, 0x0064 => 0x0044, 0x03B3 => 0x0393, 362 | 0x00F4 => 0x00D4, 0x044A => 0x042A, 0x0439 => 0x0419, 0x0113 => 0x0112, 0x043C => 0x041C, 363 | 0x015F => 0x015E, 0x0144 => 0x0143, 0x00EE => 0x00CE, 0x045E => 0x040E, 0x044F => 0x042F, 364 | 0x03BA => 0x039A, 0x0155 => 0x0154, 0x0069 => 0x0049, 0x0073 => 0x0053, 0x1E1F => 0x1E1E, 365 | 0x0135 => 0x0134, 0x0447 => 0x0427, 0x03C0 => 0x03A0, 0x0438 => 0x0418, 0x00F3 => 0x00D3, 366 | 0x0440 => 0x0420, 0x0454 => 0x0404, 0x0435 => 0x0415, 0x0449 => 0x0429, 0x014B => 0x014A, 367 | 0x0431 => 0x0411, 0x0459 => 0x0409, 0x1E03 => 0x1E02, 0x00F6 => 0x00D6, 0x00F9 => 0x00D9, 368 | 0x006E => 0x004E, 0x0451 => 0x0401, 0x03C4 => 0x03A4, 0x0443 => 0x0423, 0x015D => 0x015C, 369 | 0x0453 => 0x0403, 0x03C8 => 0x03A8, 0x0159 => 0x0158, 0x0067 => 0x0047, 0x00E4 => 0x00C4, 370 | 0x03AC => 0x0386, 0x03AE => 0x0389,
0x0167 => 0x0166, 0x03BE => 0x039E, 0x0165 => 0x0164, 371 | 0x0117 => 0x0116, 0x0109 => 0x0108, 0x0076 => 0x0056, 0x00FE => 0x00DE, 0x0157 => 0x0156, 372 | 0x00FA => 0x00DA, 0x1E61 => 0x1E60, 0x1E83 => 0x1E82, 0x00E2 => 0x00C2, 0x0119 => 0x0118, 373 | 0x0146 => 0x0145, 0x0070 => 0x0050, 0x0151 => 0x0150, 0x044E => 0x042E, 0x0129 => 0x0128, 374 | 0x03C7 => 0x03A7, 0x013E => 0x013D, 0x0442 => 0x0422, 0x007A => 0x005A, 0x0448 => 0x0428, 375 | 0x03C1 => 0x03A1, 0x1E81 => 0x1E80, 0x016D => 0x016C, 0x00F5 => 0x00D5, 0x0075 => 0x0055, 376 | 0x0177 => 0x0176, 0x00FC => 0x00DC, 0x1E57 => 0x1E56, 0x03C3 => 0x03A3, 0x043A => 0x041A, 377 | 0x006D => 0x004D, 0x016B => 0x016A, 0x0171 => 0x0170, 0x0444 => 0x0424, 0x00EC => 0x00CC, 378 | 0x0169 => 0x0168, 0x03BF => 0x039F, 0x006B => 0x004B, 0x00F2 => 0x00D2, 0x00E0 => 0x00C0, 379 | 0x0434 => 0x0414, 0x03C9 => 0x03A9, 0x1E6B => 0x1E6A, 0x00E3 => 0x00C3, 0x044D => 0x042D, 380 | 0x0436 => 0x0416, 0x01A1 => 0x01A0, 0x010D => 0x010C, 0x011D => 0x011C, 0x00F0 => 0x00D0, 381 | 0x013C => 0x013B, 0x045F => 0x040F, 0x045A => 0x040A, 0x00E8 => 0x00C8, 0x03C5 => 0x03A5, 382 | 0x0066 => 0x0046, 0x00FD => 0x00DD, 0x0063 => 0x0043, 0x021B => 0x021A, 0x00EA => 0x00CA, 383 | 0x03B9 => 0x0399, 0x017A => 0x0179, 0x00EF => 0x00CF, 0x01B0 => 0x01AF, 0x0065 => 0x0045, 384 | 0x03BB => 0x039B, 0x03B8 => 0x0398, 0x03BC => 0x039C, 0x045C => 0x040C, 0x043F => 0x041F, 385 | 0x044C => 0x042C, 0x00FE => 0x00DE, 0x00F0 => 0x00D0, 0x1EF3 => 0x1EF2, 0x0068 => 0x0048, 386 | 0x00EB => 0x00CB, 0x0111 => 0x0110, 0x0433 => 0x0413, 0x012F => 0x012E, 0x00E6 => 0x00C6, 387 | 0x0078 => 0x0058, 0x0161 => 0x0160, 0x016F => 0x016E, 0x03B1 => 0x0391, 0x0457 => 0x0407, 388 | 0x0173 => 0x0172, 0x00FF => 0x0178, 0x006F => 0x004F, 0x043B => 0x041B, 0x03B5 => 0x0395, 389 | 0x0445 => 0x0425, 0x0121 => 0x0120, 0x017E => 0x017D, 0x017C => 0x017B, 0x03B6 => 0x0396, 390 | 0x03B2 => 0x0392, 0x03AD => 0x0388, 0x1E85 => 0x1E84, 0x0175 => 0x0174, 0x0071 => 0x0051, 391 | 0x0437 => 0x0417, 
0x1E0B => 0x1E0A, 0x0148 => 0x0147, 0x0105 => 0x0104, 0x0458 => 0x0408, 392 | 0x014D => 0x014C, 0x00ED => 0x00CD, 0x0079 => 0x0059, 0x010B => 0x010A, 0x03CE => 0x038F, 393 | 0x0072 => 0x0052, 0x0430 => 0x0410, 0x0455 => 0x0405, 0x0452 => 0x0402, 0x0127 => 0x0126, 394 | 0x0137 => 0x0136, 0x012B => 0x012A, 0x03AF => 0x038A, 0x044B => 0x042B, 0x006C => 0x004C, 395 | 0x03B7 => 0x0397, 0x0125 => 0x0124, 0x0219 => 0x0218, 0x00FB => 0x00DB, 0x011F => 0x011E, 396 | 0x043E => 0x041E, 0x1E41 => 0x1E40, 0x03BD => 0x039D, 0x0107 => 0x0106, 0x03CB => 0x03AB, 397 | 0x0446 => 0x0426, 0x00FE => 0x00DE, 0x00E7 => 0x00C7, 0x03CA => 0x03AA, 0x0441 => 0x0421, 398 | 0x0432 => 0x0412, 0x010F => 0x010E, 0x00F8 => 0x00D8, 0x0077 => 0x0057, 0x011B => 0x011A, 399 | 0x0074 => 0x0054, 0x006A => 0x004A, 0x045B => 0x040B, 0x0456 => 0x0406, 0x0103 => 0x0102, 400 | 0x03BB => 0x039B, 0x00F1 => 0x00D1, 0x043D => 0x041D, 0x03CC => 0x038C, 0x00E9 => 0x00C9, 401 | 0x00F0 => 0x00D0, 0x0457 => 0x0407, 0x0123 => 0x0122 402 | ); 403 | } 404 | 405 | $cnt = count($uni); 406 | for ($i = 0; $i < $cnt; $i++) 407 | { 408 | if (isset($UTF8_LOWER_TO_UPPER[$uni[$i]])) 409 | $uni[$i] = $UTF8_LOWER_TO_UPPER[$uni[$i]]; 410 | } 411 | 412 | return utf8_from_unicode($uni); 413 | } 414 | 415 | // Is needed by utf8_ucwords_callback() 416 | require_once UTF8.'/functions/substr_replace.php'; 417 | 418 | /** 419 | * UTF-8 aware alternative to ucwords. 
420 | * 421 | * Uppercase the first character of each word in a string 422 | * 423 | * @see http://php.net/manual/en/function.ucwords.php 424 | * @uses utf8_substr_replace 425 | * @uses utf8_strtoupper 426 | * @param string 427 | * @return string with first char of each word uppercase 428 | */ 429 | function utf8_ucwords($str) 430 | { 431 | // Note: [\x0c\x09\x0b\x0a\x0d\x20] matches; 432 | // Form feeds, horizontal tabs, vertical tabs, linefeeds and carriage returns 433 | // This corresponds to the definition of a "word" defined at http://www.php.net/ucwords 434 | $pattern = '/(^|([\x0c\x09\x0b\x0a\x0d\x20]+))([^\x0c\x09\x0b\x0a\x0d\x20]{1})[^\x0c\x09\x0b\x0a\x0d\x20]*/u'; 435 | return preg_replace_callback($pattern, '_utf8_ucwords_callback', $str); 436 | } 437 | 438 | /** 439 | * Callback function for preg_replace_callback call in utf8_ucwords. 440 | * You don't need to call this yourself. 441 | * 442 | * @access private 443 | * @uses utf8_ucwords 444 | * @uses utf8_strtoupper 445 | * @param array of matches corresponding to a single word 446 | * @return string with first char of the word in uppercase 447 | */ 448 | function _utf8_ucwords_callback($matches) 449 | { 450 | $leadingws = $matches[2]; 451 | $ucfirst = utf8_strtoupper($matches[3]); 452 | $ucword = utf8_substr_replace(ltrim($matches[0]), $ucfirst, 0, 1); 453 | 454 | return $leadingws.$ucword; 455 | } 456 | -------------------------------------------------------------------------------- /functions/ord.php: -------------------------------------------------------------------------------- 1 | = 0 && $ord0 <= 127) 19 | return $ord0; 20 | 21 | if (!isset($chr[1])) 22 | { 23 | trigger_error('Short sequence - at least 2 bytes expected, only 1 seen'); 24 | return false; 25 | } 26 | 27 | $ord1 = ord($chr[1]); 28 | if ($ord0 >= 192 && $ord0 <= 223) 29 | return ($ord0 - 192) * 64 + ($ord1 - 128); 30 | 31 | if (!isset($chr[2])) 32 | { 33 | trigger_error('Short sequence - at least 3 bytes expected, only 2 seen'); 34 
| return false; 35 | } 36 | 37 | $ord2 = ord($chr[2]); 38 | if ($ord0 >= 224 && $ord0 <= 239) 39 | return ($ord0 - 224) * 4096 + ($ord1 - 128) * 64 + ($ord2 - 128); 40 | 41 | if (!isset($chr[3])) 42 | { 43 | trigger_error('Short sequence - at least 4 bytes expected, only 3 seen'); 44 | return false; 45 | } 46 | 47 | $ord3 = ord($chr[3]); 48 | if ($ord0 >= 240 && $ord0 <= 247) 49 | return ($ord0 - 240) * 262144 + ($ord1 - 128) * 4096 + ($ord2 - 128) * 64 + ($ord3 - 128); 50 | 51 | if (!isset($chr[4])) 52 | { 53 | trigger_error('Short sequence - at least 5 bytes expected, only 4 seen'); 54 | return false; 55 | } 56 | 57 | $ord4 = ord($chr[4]); 58 | if ($ord0 >= 248 && $ord0 <= 251) 59 | return ($ord0 - 248) * 16777216 + ($ord1 - 128) * 262144 + ($ord2 - 128) * 4096 + ($ord3 - 128) * 64 + ($ord4 - 128); 60 | 61 | if (!isset($chr[5])) 62 | { 63 | trigger_error('Short sequence - at least 6 bytes expected, only 5 seen'); 64 | return false; 65 | } 66 | 67 | if ($ord0 >= 252 && $ord0 <= 253) 68 | return ($ord0 - 252) * 1073741824 + ($ord1 - 128) * 16777216 + ($ord2 - 128) * 262144 + ($ord3 - 128) * 4096 + ($ord4 - 128) * 64 + (ord($chr[5]) - 128); 69 | 70 | if ($ord0 >= 254 && $ord0 <= 255) 71 | { 72 | trigger_error('Invalid UTF-8 with surrogate ordinal '.$ord0); 73 | return false; 74 | } 75 | } 76 | -------------------------------------------------------------------------------- /functions/str_ireplace.php: -------------------------------------------------------------------------------- 1 | 9 | * @package php-utf8 10 | * @subpackage functions 11 | * @see http://www.php.net/str_pad 12 | * @uses utf8_substr 13 | * @param string $input 14 | * @param int $length 15 | * @param string $pad_str 16 | * @param int $type ( same constants as str_pad ) 17 | * @return string 18 | */ 19 | function utf8_str_pad($input, $length, $pad_str=' ', $type = STR_PAD_RIGHT) 20 | { 21 | $input_len = utf8_strlen($input); 22 | if ($length <= $input_len) 23 | return $input; 24 | 25 | $pad_str_len = 
utf8_strlen($pad_str); 26 | $pad_len = $length - $input_len; 27 | 28 | if ($type == STR_PAD_RIGHT) 29 | { 30 | $repeat_times = ceil($pad_len / $pad_str_len); 31 | return utf8_substr($input.str_repeat($pad_str, $repeat_times), 0, $length); 32 | } 33 | 34 | if ($type == STR_PAD_LEFT) 35 | { 36 | $repeat_times = ceil($pad_len / $pad_str_len); 37 | return utf8_substr(str_repeat($pad_str, $repeat_times), 0, floor($pad_len)).$input; 38 | } 39 | 40 | if ($type == STR_PAD_BOTH) 41 | { 42 | $pad_len /= 2; 43 | $pad_amount_left = floor($pad_len); 44 | $pad_amount_right = ceil($pad_len); 45 | $repeat_times_left = ceil($pad_amount_left / $pad_str_len); 46 | $repeat_times_right = ceil($pad_amount_right / $pad_str_len); 47 | 48 | $padding_left = utf8_substr(str_repeat($pad_str, $repeat_times_left), 0, $pad_amount_left); 49 | $padding_right = utf8_substr(str_repeat($pad_str, $repeat_times_right), 0, $pad_amount_right); 50 | 51 | return $padding_left.$input.$padding_right; 52 | } 53 | 54 | trigger_error('utf8_str_pad: Unknown padding type ('.$type.')', E_USER_ERROR); 55 | } 56 | -------------------------------------------------------------------------------- /functions/str_split.php: -------------------------------------------------------------------------------- 1 | 16 | * @see http://www.php.net/ltrim 17 | * @see http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php 18 | * @param string $str 19 | * @param string $charlist 20 | * @return string 21 | */ 22 | function utf8_ltrim($str, $charlist = '') 23 | { 24 | if(empty($charlist)) 25 | return ltrim($str); 26 | 27 | // Quote charlist for use in a characterclass 28 | $charlist = preg_replace('!([\\\\\\-\\]\\[/^])!', '\\\${1}', $charlist); 29 | 30 | return preg_replace('/^['.$charlist.']+/u', '', $str); 31 | } 32 | 33 | /** 34 | * UTF-8 aware replacement for rtrim(). 
35 | * 36 | * @author Andreas Gohr 37 | * @see http://www.php.net/rtrim 38 | * @see http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php 39 | * @param string $str 40 | * @param string $charlist 41 | * @return string 42 | */ 43 | function utf8_rtrim($str, $charlist= '') 44 | { 45 | if(empty($charlist)) 46 | return rtrim($str); 47 | 48 | // Quote charlist for use in a character class 49 | $charlist = preg_replace('!([\\\\\\-\\]\\[/^])!', '\\\${1}', $charlist); 50 | 51 | return preg_replace('/['.$charlist.']+$/u', '', $str); 52 | } 53 | 54 | /** 55 | * UTF-8 aware replacement for trim(). 56 | * 57 | * @author Andreas Gohr 58 | * @see http://www.php.net/trim 59 | * @see http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php 60 | * @param string $str 61 | * @param string $charlist 62 | * @return string 63 | */ 64 | function utf8_trim($str, $charlist= '') 65 | { 66 | if(empty($charlist)) 67 | return trim($str); 68 | 69 | return utf8_ltrim(utf8_rtrim($str, $charlist), $charlist); 70 | } 71 | -------------------------------------------------------------------------------- /functions/ucfirst.php: -------------------------------------------------------------------------------- 1 | 18 | * if ( utf8_is_ascii($someString) ) { 19 | * // It's just ASCII - use the native PHP version 20 | * $someString = strtolower($someString); 21 | * } else { 22 | * $someString = utf8_strtolower($someString); 23 | * } 24 | * 25 | * 26 | * Optionally the check can be performed with the control characters included.
27 | * 28 | * @param string $str 29 | * @return boolean TRUE if it's all ASCII 30 | * @see utf8_is_ascii_ctrl 31 | */ 32 | function utf8_is_ascii($str, $check_ctrl_chars_too = false) 33 | { 34 | if (empty($str)) 35 | return true; 36 | 37 | $pattern = '/(?:[^\x00-\x7F])/'; 38 | 39 | if ($check_ctrl_chars_too) 40 | $pattern = '/[^\x09\x0A\x0D\x20-\x7E]/'; 41 | 42 | return (preg_match($pattern, $str) !== 1); 43 | } 44 | 45 | /** 46 | * Strip out all non-7bit ASCII bytes and, optionally, ASCII device control codes. 47 | * 48 | * If you need to transmit a string to a system which you know can only support 7bit ASCII, you could use this function. 49 | * 50 | * Optionally strip out device control codes in the ASCII range which are not permitted in XML. 51 | * 52 | * Note that the 'ctrl_chars' mode leaves multi-byte characters untouched - it only removes device control codes. 53 | * 54 | * @see http://hsivonen.iki.fi/producing-xml/#controlchar 55 | * @param string $str 56 | * @param string $mode 'non_ascii' (default), 'ctrl_chars' or 'both' 57 | * @return string control codes removed 58 | */ 59 | function utf8_to_ascii($str, $mode = 'non_ascii') 60 | { 61 | static $modes; 62 | 63 | if (!$modes) 64 | { 65 | $modes = array( 66 | 'non_ascii' => '^([\x00-\x7F]+)|([^\x00-\x7F]+)', 67 | 'ctrl_chars' => '^([^\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+)|([\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+)', 68 | 'both' => '^([\x09\x0A\x0D\x20-\x7E]+)|([^\x09\x0A\x0D\x20-\x7E]+)', 69 | ); 70 | } 71 | 72 | ob_start(); 73 | 74 | while (preg_match( '/'.$modes[$mode].'/S', $str, $matches)) 75 | { 76 | if (!isset($matches[2])) 77 | echo $matches[0]; 78 | 79 | $str = substr($str, strlen($matches[0])); 80 | } 81 | 82 | return ob_get_clean(); 83 | } 84 | 85 | /** 86 | * Replace accented UTF-8 characters by unaccented ASCII-7 "equivalents". 87 | * 88 | * The purpose of this function is to replace characters commonly found in Latin 89 | * alphabets with something more or less equivalent from the ASCII range.
90 | * This can be useful for converting a UTF-8 string to something ready for a filename, 91 | * for example. 92 | * Following the use of this function, you would probably also pass the string 93 | * through utf8_to_ascii() (in its default 'non_ascii' mode) to clean out any other non-ASCII chars. 94 | * 95 | * Use the optional $mode parameter to deaccent only lower ($mode = 'lower') or 96 | * upper ($mode = 'upper') case letters. The default is to deaccent both cases ($mode = 'both'). 97 | * 98 | * For a more complete implementation of transliteration, see the utf8_to_ascii 99 | * package available from the phputf8 project downloads: http://prdownloads.sourceforge.net/phputf8 100 | * 101 | * @param string $str UTF-8 string 102 | * @param string $mode (optional) 'lower' for lowercase only, 'upper' for uppercase only, 103 | * or 'both' (the default) for both cases 104 | * @return string accented chars replaced with ascii equivalents 105 | * @author Andreas Gohr 106 | */ 107 | function utf8_accents_to_ascii($str, $mode = 'both') 108 | { 109 | static $accents; 110 | 111 | if (empty($str)) 112 | return ''; 113 | 114 | if (!$accents) 115 | { 116 | $accents = array( 117 | 'lower' => array( 118 | 'à' => 'a', 'ô' => 'o', 'ď' => 'd', 'ḟ' => 'f', 'ë' => 'e', 'š' => 's', 'ơ' => 'o', 119 | 'ß' => 'ss', 'ă' => 'a', 'ř' => 'r', 'ț' => 't', 'ň' => 'n', 'ā' => 'a', 'ķ' => 'k', 120 | 'ŝ' => 's', 'ỳ' => 'y', 'ņ' => 'n', 'ĺ' => 'l', 'ħ' => 'h', 'ṗ' => 'p', 'ó' => 'o', 121 | 'ú' => 'u', 'ě' => 'e', 'é' => 'e', 'ç' => 'c', 'ẁ' => 'w', 'ċ' => 'c', 'õ' => 'o', 122 | 'ṡ' => 's', 'ø' => 'o', 'ģ' => 'g', 'ŧ' => 't', 'ș' => 's', 'ė' => 'e', 'ĉ' => 'c', 123 | 'ś' => 's', 'î' => 'i', 'ű' => 'u', 'ć' => 'c', 'ę' => 'e', 'ŵ' => 'w', 'ṫ' => 't', 124 | 'ū' => 'u', 'č' => 'c', 'ö' => 'oe', 'è' => 'e', 'ŷ' => 'y', 'ą' => 'a', 'ł' => 'l', 125 | 'ų' => 'u', 'ů' => 'u', 'ş' => 's', 'ğ' => 'g', 'ļ' => 'l', 'ƒ' => 'f', 'ž' => 'z', 126 | 'ẃ' => 'w', 'ḃ' => 'b', 'å' => 'a', 'ì' => 'i', 'ï' => 'i', 'ḋ' => 'd', 'ť' => 't', 127 | 'ŗ' => 'r', 'ä' => 'ae',
'í' => 'i', 'ŕ' => 'r', 'ê' => 'e', 'ü' => 'ue', 'ò' => 'o', 128 | 'ē' => 'e', 'ñ' => 'n', 'ń' => 'n', 'ĥ' => 'h', 'ĝ' => 'g', 'đ' => 'd', 'ĵ' => 'j', 129 | 'ÿ' => 'y', 'ũ' => 'u', 'ŭ' => 'u', 'ư' => 'u', 'ţ' => 't', 'ý' => 'y', 'ő' => 'o', 130 | 'â' => 'a', 'ľ' => 'l', 'ẅ' => 'w', 'ż' => 'z', 'ī' => 'i', 'ã' => 'a', 'ġ' => 'g', 131 | 'ṁ' => 'm', 'ō' => 'o', 'ĩ' => 'i', 'ù' => 'u', 'į' => 'i', 'ź' => 'z', 'á' => 'a', 132 | 'û' => 'u', 'þ' => 'th', 'ð' => 'dh', 'æ' => 'ae', 'µ' => 'u', 'ĕ' => 'e', 133 | ), 134 | 'upper' => array( 135 | 'À' => 'A', 'Ô' => 'O', 'Ď' => 'D', 'Ḟ' => 'F', 'Ë' => 'E', 'Š' => 'S', 'Ơ' => 'O', 136 | 'Ă' => 'A', 'Ř' => 'R', 'Ț' => 'T', 'Ň' => 'N', 'Ā' => 'A', 'Ķ' => 'K', 137 | 'Ŝ' => 'S', 'Ỳ' => 'Y', 'Ņ' => 'N', 'Ĺ' => 'L', 'Ħ' => 'H', 'Ṗ' => 'P', 'Ó' => 'O', 138 | 'Ú' => 'U', 'Ě' => 'E', 'É' => 'E', 'Ç' => 'C', 'Ẁ' => 'W', 'Ċ' => 'C', 'Õ' => 'O', 139 | 'Ṡ' => 'S', 'Ø' => 'O', 'Ģ' => 'G', 'Ŧ' => 'T', 'Ș' => 'S', 'Ė' => 'E', 'Ĉ' => 'C', 140 | 'Ś' => 'S', 'Î' => 'I', 'Ű' => 'U', 'Ć' => 'C', 'Ę' => 'E', 'Ŵ' => 'W', 'Ṫ' => 'T', 141 | 'Ū' => 'U', 'Č' => 'C', 'Ö' => 'Oe', 'È' => 'E', 'Ŷ' => 'Y', 'Ą' => 'A', 'Ł' => 'L', 142 | 'Ų' => 'U', 'Ů' => 'U', 'Ş' => 'S', 'Ğ' => 'G', 'Ļ' => 'L', 'Ƒ' => 'F', 'Ž' => 'Z', 143 | 'Ẃ' => 'W', 'Ḃ' => 'B', 'Å' => 'A', 'Ì' => 'I', 'Ï' => 'I', 'Ḋ' => 'D', 'Ť' => 'T', 144 | 'Ŗ' => 'R', 'Ä' => 'Ae', 'Í' => 'I', 'Ŕ' => 'R', 'Ê' => 'E', 'Ü' => 'Ue', 'Ò' => 'O', 145 | 'Ē' => 'E', 'Ñ' => 'N', 'Ń' => 'N', 'Ĥ' => 'H', 'Ĝ' => 'G', 'Đ' => 'D', 'Ĵ' => 'J', 146 | 'Ÿ' => 'Y', 'Ũ' => 'U', 'Ŭ' => 'U', 'Ư' => 'U', 'Ţ' => 'T', 'Ý' => 'Y', 'Ő' => 'O', 147 | 'Â' => 'A', 'Ľ' => 'L', 'Ẅ' => 'W', 'Ż' => 'Z', 'Ī' => 'I', 'Ã' => 'A', 'Ġ' => 'G', 148 | 'Ṁ' => 'M', 'Ō' => 'O', 'Ĩ' => 'I', 'Ù' => 'U', 'Į' => 'I', 'Ź' => 'Z', 'Á' => 'A', 149 | 'Û' => 'U', 'Þ' => 'Th', 'Ð' => 'Dh', 'Æ' => 'Ae', 'Ĕ' => 'E', 150 | ) 151 | ); 152 | } 153 | 154 | return strtr($str, ($mode == 'both') ? 
array_merge($accents['lower'], $accents['upper']) : $accents[$mode]); 155 | } 156 | -------------------------------------------------------------------------------- /utils/bad.php: -------------------------------------------------------------------------------- 1 | 160 | * @param string $str UTF-8 encoded string 161 | * @return mixed integer constant describing the problem or FALSE if valid UTF-8 162 | * @see utf8_bad_explain 163 | * @see http://hsivonen.iki.fi/php-utf8/ 164 | */ 165 | function utf8_bad_identify($str, &$i) 166 | { 167 | $mState = 0; // Cached expected number of octets after the current octet 168 | // until the beginning of the next UTF8 character sequence 169 | $mUcs4 = 0; // Cached Unicode character 170 | $mBytes = 1; // Cached expected number of octets in the current sequence 171 | 172 | $len = strlen($str); 173 | 174 | for ($i = 0; $i < $len; $i++) 175 | { 176 | $in = ord($str[$i]); 177 | 178 | if ($mState == 0) 179 | { 180 | // When mState is zero we expect either a US-ASCII character or a multi-octet sequence. 181 | if (0 == (0x80 & ($in))) 182 | { 183 | // US-ASCII, pass straight through. 184 | $mBytes = 1; 185 | } 186 | else if (0xC0 == (0xE0 & ($in))) 187 | { 188 | // First octet of 2 octet sequence 189 | $mUcs4 = ($in); 190 | $mUcs4 = ($mUcs4 & 0x1F) << 6; 191 | $mState = 1; 192 | $mBytes = 2; 193 | } 194 | else if (0xE0 == (0xF0 & ($in))) 195 | { 196 | // First octet of 3 octet sequence 197 | $mUcs4 = ($in); 198 | $mUcs4 = ($mUcs4 & 0x0F) << 12; 199 | $mState = 2; 200 | $mBytes = 3; 201 | } 202 | else if (0xF0 == (0xF8 & ($in))) 203 | { 204 | // First octet of 4 octet sequence 205 | $mUcs4 = ($in); 206 | $mUcs4 = ($mUcs4 & 0x07) << 18; 207 | $mState = 3; 208 | $mBytes = 4; 209 | } 210 | else if (0xF8 == (0xFC & ($in))) 211 | { 212 | /* First octet of 5 octet sequence. 213 | * 214 | * This is illegal because the encoded codepoint must be either 215 | * (a) not the shortest form or 216 | * (b) outside the Unicode range of 0-0x10FFFF.
217 | */ 218 | return PHP_UTF8_BAD_5OCTET; 219 | } 220 | else if (0xFC == (0xFE & ($in))) 221 | { 222 | // First octet of 6 octet sequence, see comments for 5 octet sequence. 223 | return PHP_UTF8_BAD_6OCTET; 224 | } 225 | else 226 | { 227 | // Current octet is neither in the US-ASCII range nor a legal first 228 | // octet of a multi-octet sequence. 229 | return PHP_UTF8_BAD_SEQID; 230 | } 231 | } 232 | else 233 | { 234 | // When mState is non-zero, we expect a continuation of the multi-octet sequence 235 | if (0x80 == (0xC0 & ($in))) 236 | { 237 | // Legal continuation. 238 | $shift = ($mState - 1) * 6; 239 | $tmp = $in; 240 | $tmp = ($tmp & 0x0000003F) << $shift; 241 | $mUcs4 |= $tmp; 242 | 243 | /** 244 | * End of the multi-octet sequence. mUcs4 now contains the final 245 | * Unicode codepoint to be output 246 | */ 247 | if (0 == --$mState) 248 | { 249 | // From Unicode 3.1, non-shortest form is illegal 250 | if (((2 == $mBytes) && ($mUcs4 < 0x0080)) || ((3 == $mBytes) && ($mUcs4 < 0x0800)) || ((4 == $mBytes) && ($mUcs4 < 0x10000)) ) 251 | return PHP_UTF8_BAD_NONSHORT; 252 | elseif (($mUcs4 & 0xFFFFF800) == 0xD800 ) // From Unicode 3.2, surrogate characters are illegal 253 | return PHP_UTF8_BAD_SURROGATE; 254 | elseif ($mUcs4 > 0x10FFFF ) // Codepoints outside the Unicode range are illegal 255 | return PHP_UTF8_BAD_UNIOUTRANGE; 256 | 257 | // Initialize UTF8 cache 258 | $mState = 0; 259 | $mUcs4 = 0; 260 | $mBytes = 1; 261 | } 262 | } 263 | else 264 | { 265 | // ((0xC0 & (*in) != 0x80) && (mState != 0)) 266 | // Incomplete multi-octet sequence. 
267 | $i--; 268 | return PHP_UTF8_BAD_SEQINCOMPLETE; 269 | } 270 | } 271 | } 272 | 273 | // Incomplete multi-octet sequence 274 | if ($mState != 0) 275 | { 276 | $i--; 277 | return PHP_UTF8_BAD_SEQINCOMPLETE; 278 | } 279 | 280 | // No bad octets found 281 | $i = null; 282 | return false; 283 | } 284 | 285 | /** 286 | * Takes a return code from utf8_bad_identify() and returns a message (in English) 287 | * explaining what the problem is. 288 | * 289 | * @param int $code return code from utf8_bad_identify 290 | * @return mixed string message or FALSE if return code unknown 291 | * @see utf8_bad_identify 292 | */ 293 | function utf8_bad_explain($code) 294 | { 295 | static $errors; 296 | 297 | if (!$errors) 298 | { 299 | $errors = array( 300 | PHP_UTF8_BAD_5OCTET => 'Five octet sequences are valid UTF-8 but are not supported by Unicode', 301 | PHP_UTF8_BAD_6OCTET => 'Six octet sequences are valid UTF-8 but are not supported by Unicode', 302 | PHP_UTF8_BAD_SEQID => 'Invalid octet for use as start of multi-byte UTF-8 sequence', 303 | PHP_UTF8_BAD_NONSHORT => 'From Unicode 3.1, non-shortest form is illegal', 304 | PHP_UTF8_BAD_SURROGATE => 'From Unicode 3.2, surrogate characters are illegal', 305 | PHP_UTF8_BAD_UNIOUTRANGE => 'Codepoints outside the Unicode range are illegal', 306 | PHP_UTF8_BAD_SEQINCOMPLETE => 'Incomplete multi-octet sequence' 307 | ); 308 | } 309 | 310 | if (isset($errors[$code])) 311 | return $errors[$code]; // Known code: return its message 312 | trigger_error('Unknown error code: '.$code, E_USER_WARNING); 313 | return false; 314 | } 315 | -------------------------------------------------------------------------------- /utils/patterns.php: -------------------------------------------------------------------------------- 1 | 22 | * @param string string to locate index in 23 | * @param int (n times) 24 | * @return mixed - int if only one input int, array if more, boolean TRUE if it's all ASCII 25 | */ 26 | function utf8_byte_position() 27 | { 28 | $args = func_get_args(); 29 | $str = 
array_shift($args); 30 | 31 | if (!is_string($str)) 32 | return false; 33 | 34 | $result = array(); 35 | $prev = array(0, 0); // Trivial byte index, character offset pair 36 | $i = utf8_locate_next_chr($str, 300); // Use a short piece of str to estimate bytes per character. $i (& $j) -> byte indexes into $str 37 | $c = strlen(utf8_decode(substr($str, 0, $i))); // $c -> character offset into $str 38 | 39 | // Deal with arguments from lowest to highest 40 | sort($args); 41 | 42 | foreach ($args as $offset) 43 | { 44 | // Sanity checks FIXME 45 | // 0 is an easy check 46 | if ($offset == 0) 47 | { 48 | $result[] = 0; 49 | continue; 50 | } 51 | 52 | // Ensure no endless looping 53 | $safety_valve = 50; 54 | 55 | do 56 | { 57 | if (($c - $prev[1]) == 0) 58 | { 59 | // Hack: gone past end of string 60 | $error = 0; 61 | $i = strlen($str); 62 | break; 63 | } 64 | 65 | $j = $i + (int) (($offset - $c) * ($i - $prev[0]) / ($c - $prev[1])); 66 | $j = utf8_locate_next_chr($str, $j); // Correct to utf8 character boundary 67 | $prev = array($i, $c); // Save the index, offset for use next iteration 68 | 69 | if ($j > $i) 70 | $c += strlen(utf8_decode(substr($str, $i, $j - $i))); // Determine new character offset 71 | else 72 | $c -= strlen(utf8_decode(substr($str, $j, $i - $j))); // Ditto 73 | 74 | $error = abs($c - $offset); 75 | $i = $j; // Ready for next time around 76 | } 77 | while (($error > 7) && --$safety_valve); // From 7 it is faster to iterate over the string 78 | 79 | if ($error && $error <= 7) 80 | { 81 | if ($c < $offset) 82 | { 83 | // Move up 84 | while ($error--) 85 | $i = utf8_locate_next_chr($str, ++$i); 86 | } 87 | else 88 | { 89 | // Move down 90 | while($error--) 91 | $i = utf8_locate_current_chr($str, --$i); 92 | } 93 | 94 | // Ready for next arg 95 | $c = $offset; 96 | } 97 | 98 | $result[] = $i; 99 | } 100 | 101 | if (count($result) == 1) 102 | return $result[0]; 103 | 104 | return $result; 105 | } 106 | 107 | /** 108 | * Given a string and any byte 
index, returns the byte index of the start of the 109 | * current UTF-8 character, relative to supplied position. 110 | * 111 | * If the current character begins at the same place as the supplied byte index, 112 | * that byte index will be returned. Otherwise this function will step backwards, 113 | * looking for the index where the current UTF-8 character begins. 114 | * 115 | * @author Chris Smith 116 | * @param string $str 117 | * @param int $idx byte index in the string 118 | * @return int byte index of start of current UTF-8 character 119 | */ 120 | function utf8_locate_current_chr(&$str, $idx) 121 | { 122 | if ($idx <= 0) 123 | return 0; 124 | 125 | $limit = strlen($str); 126 | if ($idx >= $limit) 127 | return $limit; 128 | 129 | // Binary value for any byte after the first in a multi-byte UTF-8 character 130 | // will be like 10xxxxxx so & 0xC0 can be used to detect this kind 131 | // of byte - assuming well formed UTF-8 132 | while($idx && ((ord($str[$idx]) & 0xC0) == 0x80)) 133 | $idx--; 134 | 135 | return $idx; 136 | } 137 | 138 | /** 139 | * Given a string and any byte index, returns the byte index of the start of the 140 | * next UTF-8 character, relative to supplied position. 141 | * 142 | * If the next character begins at the same place as the supplied byte index, 143 | * that byte index will be returned. 
144 | * 145 | * @author Chris Smith 146 | * @param string $str 147 | * @param int $idx byte index in the string 148 | * @return int byte index of start of next UTF-8 character 149 | */ 150 | function utf8_locate_next_chr(&$str, $idx) 151 | { 152 | if ($idx <= 0) 153 | return 0; 154 | 155 | $limit = strlen($str); 156 | if ($idx >= $limit) 157 | return $limit; 158 | 159 | // Binary value for any byte after the first in a multi-byte UTF-8 character 160 | // will be like 10xxxxxx so & 0xC0 can be used to detect this kind 161 | // of byte - assuming well formed UTF-8 162 | while (($idx < $limit) && ((ord($str[$idx]) & 0xC0) == 0x80)) 163 | $idx++; 164 | 165 | return $idx; 166 | } 167 | -------------------------------------------------------------------------------- /utils/specials.php: -------------------------------------------------------------------------------- 1 | 125 | * @param string $string The UTF8 string to strip of special chars 126 | * @param string $repl (optional) Replace special with this string 127 | * @return string with common non-alphanumeric characters removed 128 | */ 129 | function utf8_strip_specials($string, $repl='') 130 | { 131 | return preg_replace(utf8_specials_pattern(), $repl, $string); 132 | } 133 | -------------------------------------------------------------------------------- /utils/unicode.php: -------------------------------------------------------------------------------- 1 | 0xFFFF. 22 | * Occurrences of the BOM are ignored. Surrogates are not allowed. 
23 | * Returns false if the input string isn't a valid UTF-8 octet sequence and raises a PHP error at level E_USER_WARNING 24 | * Note: this function has been modified slightly in this library to trigger errors on encountering bad bytes 25 | * 26 | * @author 27 | * @param string $str UTF-8 encoded string 28 | * @return mixed array of unicode code points or FALSE if UTF-8 invalid 29 | * @see utf8_from_unicode 30 | * @see http://hsivonen.iki.fi/php-utf8/ 31 | */ 32 | function utf8_to_unicode($str) 33 | { 34 | $mState = 0; // Cached expected number of octets after the current octet 35 | // until the beginning of the next UTF8 character sequence 36 | $mUcs4 = 0; // Cached Unicode character 37 | $mBytes = 1; // Cached expected number of octets in the current sequence 38 | 39 | $out = array(); 40 | $len = strlen($str); 41 | 42 | for ($i = 0; $i < $len; $i++) 43 | { 44 | $in = ord($str[$i]); 45 | 46 | if ($mState == 0) 47 | { 48 | // When mState is zero we expect either a US-ASCII character or a multi-octet sequence. 49 | if (0 == (0x80 & ($in))) 50 | { 51 | // US-ASCII, pass straight through. 52 | $out[] = $in; 53 | $mBytes = 1; 54 | } 55 | elseif (0xC0 == (0xE0 & ($in))) 56 | { 57 | // First octet of 2 octet sequence 58 | $mUcs4 = ($in); 59 | $mUcs4 = ($mUcs4 & 0x1F) << 6; 60 | $mState = 1; 61 | $mBytes = 2; 62 | } 63 | elseif (0xE0 == (0xF0 & ($in))) 64 | { 65 | // First octet of 3 octet sequence 66 | $mUcs4 = ($in); 67 | $mUcs4 = ($mUcs4 & 0x0F) << 12; 68 | $mState = 2; 69 | $mBytes = 3; 70 | } 71 | elseif (0xF0 == (0xF8 & ($in))) 72 | { 73 | // First octet of 4 octet sequence 74 | $mUcs4 = ($in); 75 | $mUcs4 = ($mUcs4 & 0x07) << 18; 76 | $mState = 3; 77 | $mBytes = 4; 78 | } 79 | elseif (0xF8 == (0xFC & ($in))) 80 | { 81 | /* First octet of 5 octet sequence. 82 | * 83 | * This is illegal because the encoded codepoint must be either 84 | * (a) not the shortest form or 85 | * (b) outside the Unicode range of 0-0x10FFFF. 
86 | * Rather than trying to resynchronize, we will carry on until the end 87 | * of the sequence and let the later error handling code catch it. 88 | */ 89 | $mUcs4 = ($in); 90 | $mUcs4 = ($mUcs4 & 0x03) << 24; 91 | $mState = 4; 92 | $mBytes = 5; 93 | } 94 | elseif (0xFC == (0xFE & ($in))) 95 | { 96 | // First octet of 6 octet sequence, see comments for 5 octet sequence. 97 | $mUcs4 = ($in); 98 | $mUcs4 = ($mUcs4 & 1) << 30; 99 | $mState = 5; 100 | $mBytes = 6; 101 | } 102 | else 103 | { 104 | // Current octet is neither in the US-ASCII range nor a legal first octet of a multi-octet sequence 105 | trigger_error('utf8_to_unicode: Illegal sequence identifier in UTF-8 at byte '.$i, E_USER_WARNING); 106 | return false; 107 | } 108 | } 109 | else 110 | { 111 | // When mState is non-zero, we expect a continuation of the multi-octet sequence 112 | if (0x80 == (0xC0 & ($in))) 113 | { 114 | // Legal continuation. 115 | $shift = ($mState - 1) * 6; 116 | $tmp = $in; 117 | $tmp = ($tmp & 0x0000003F) << $shift; 118 | $mUcs4 |= $tmp; 119 | 120 | /** 121 | * End of the multi-octet sequence. mUcs4 now contains the final 122 | * Unicode codepoint to be output 123 | */ 124 | if (0 == --$mState) 125 | { 126 | /* 127 | * Check for illegal sequences and codepoints. 
128 | */ 129 | // From Unicode 3.1, non-shortest form is illegal 130 | if (((2 == $mBytes) && ($mUcs4 < 0x0080)) || ((3 == $mBytes) && ($mUcs4 < 0x0800)) || 131 | ((4 == $mBytes) && ($mUcs4 < 0x10000)) || (4 < $mBytes) || 132 | // From Unicode 3.2, surrogate characters are illegal 133 | (($mUcs4 & 0xFFFFF800) == 0xD800) || 134 | // Codepoints outside the Unicode range are illegal 135 | ($mUcs4 > 0x10FFFF)) 136 | { 137 | trigger_error('utf8_to_unicode: Illegal sequence or codepoint in UTF-8 at byte '.$i, E_USER_WARNING); 138 | return false; 139 | } 140 | 141 | // BOM is legal but we don't want to output it 142 | if (0xFEFF != $mUcs4) 143 | $out[] = $mUcs4; 144 | 145 | // Initialize UTF8 cache 146 | $mState = 0; 147 | $mUcs4 = 0; 148 | $mBytes = 1; 149 | } 150 | } 151 | else 152 | { 153 | /* ((0xC0 & (*in) != 0x80) && (mState != 0)) 154 | Incomplete multi-octet sequence. */ 155 | trigger_error('utf8_to_unicode: Incomplete multi-octet sequence in UTF-8 at byte '.$i, E_USER_WARNING); 156 | return false; 157 | } 158 | } 159 | } 160 | 161 | return $out; 162 | } 163 | 164 | /** 165 | * Takes an array of ints representing the Unicode characters and returns a UTF-8 string. 166 | * Astral planes are supported, i.e. the ints in the input can be > 0xFFFF. 167 | * Occurrences of the BOM are ignored. Surrogates are not allowed. 
168 | * Returns false if the input array contains ints that represent surrogates or are outside the Unicode range and raises a PHP error at level E_USER_WARNING 169 | * Note: this function has been modified slightly in this library to use output buffering to concatenate the UTF-8 string (faster) as well as reference the array by its keys 170 | * 171 | * @see utf8_to_unicode 172 | * @see http://hsivonen.iki.fi/php-utf8/ 173 | * @param array $arr Array of unicode code points representing a string 174 | * @return mixed UTF-8 string or FALSE if array contains invalid code points 175 | * @author 176 | */ 177 | function utf8_from_unicode($arr) 178 | { 179 | ob_start(); 180 | 181 | foreach(array_keys($arr) as $k) 182 | { 183 | if(($arr[$k] >= 0) && ($arr[$k] <= 0x007f)) // ASCII range (including control chars) 184 | echo chr($arr[$k]); 185 | elseif ($arr[$k] <= 0x07ff) // 2 byte sequence 186 | { 187 | echo chr(0xc0 | ($arr[$k] >> 6)); 188 | echo chr(0x80 | ($arr[$k] & 0x003f)); 189 | } 190 | elseif ($arr[$k] == 0xFEFF) // Byte order mark (skip) 191 | { 192 | // Nop -- zap the BOM 193 | } 194 | elseif ($arr[$k] >= 0xD800 && $arr[$k] <= 0xDFFF) // Test for illegal surrogates 195 | { 196 | ob_end_clean(); // Found a surrogate: discard the buffered output before bailing out 197 | trigger_error('utf8_from_unicode: Illegal surrogate at index: '.$k.', value: '.$arr[$k], E_USER_WARNING); 198 | return false; 199 | } 200 | elseif ($arr[$k] <= 0xffff) // 3 byte sequence 201 | { 202 | echo chr(0xe0 | ($arr[$k] >> 12)); 203 | echo chr(0x80 | (($arr[$k] >> 6) & 0x003f)); 204 | echo chr(0x80 | ($arr[$k] & 0x003f)); 205 | } 206 | elseif ($arr[$k] <= 0x10ffff) // 4 byte sequence 207 | { 208 | echo chr(0xf0 | ($arr[$k] >> 18)); 209 | echo chr(0x80 | (($arr[$k] >> 12) & 0x3f)); 210 | echo chr(0x80 | (($arr[$k] >> 6) & 0x3f)); 211 | echo chr(0x80 | ($arr[$k] & 0x3f)); 212 | } 213 | else 214 | { 215 | ob_end_clean(); // Out of range: discard the buffered output before bailing out 216 | trigger_error('utf8_from_unicode: Codepoint out of Unicode range at index: '.$k.', value: '.$arr[$k], E_USER_WARNING); 217 | return 
false; 218 | } 219 | } 220 | 221 | $result = ob_get_contents(); 222 | ob_end_clean(); 223 | 224 | return $result; 225 | } 226 | -------------------------------------------------------------------------------- /utils/validation.php: -------------------------------------------------------------------------------- 1 | 22 | * @see http://hsivonen.iki.fi/php-utf8/ 23 | * @see utf8_compliant 24 | * @param string $str UTF-8 encoded string 25 | * @return boolean TRUE if valid 26 | */ 27 | function utf8_is_valid($str) 28 | { 29 | $mState = 0; // Cached expected number of octets after the current octet 30 | // until the beginning of the next UTF8 character sequence 31 | $mUcs4 = 0; // Cached Unicode character 32 | $mBytes = 1; // Cached expected number of octets in the current sequence 33 | 34 | $len = strlen($str); 35 | 36 | for ($i = 0; $i < $len; $i++) 37 | { 38 | $in = ord($str[$i]); 39 | 40 | if ($mState == 0) 41 | { 42 | // When mState is zero we expect either a US-ASCII character or a multi-octet sequence. 43 | if (0 == (0x80 & ($in))) 44 | $mBytes = 1; // US-ASCII, pass straight through 45 | elseif (0xC0 == (0xE0 & ($in))) 46 | { 47 | // First octet of 2 octet sequence 48 | $mUcs4 = ($in); 49 | $mUcs4 = ($mUcs4 & 0x1F) << 6; 50 | $mState = 1; 51 | $mBytes = 2; 52 | } 53 | elseif (0xE0 == (0xF0 & ($in))) 54 | { 55 | // First octet of 3 octet sequence 56 | $mUcs4 = ($in); 57 | $mUcs4 = ($mUcs4 & 0x0F) << 12; 58 | $mState = 2; 59 | $mBytes = 3; 60 | } 61 | elseif (0xF0 == (0xF8 & ($in))) 62 | { 63 | // First octet of 4 octet sequence 64 | $mUcs4 = ($in); 65 | $mUcs4 = ($mUcs4 & 0x07) << 18; 66 | $mState = 3; 67 | $mBytes = 4; 68 | } 69 | elseif (0xF8 == (0xFC & ($in))) 70 | { 71 | /* First octet of 5 octet sequence. 72 | * 73 | * This is illegal because the encoded codepoint must be either 74 | * (a) not the shortest form or 75 | * (b) outside the Unicode range of 0-0x10FFFF. 
76 | * Rather than trying to resynchronize, we will carry on until the end 77 | * of the sequence and let the later error handling code catch it. 78 | */ 79 | $mUcs4 = ($in); 80 | $mUcs4 = ($mUcs4 & 0x03) << 24; 81 | $mState = 4; 82 | $mBytes = 5; 83 | } 84 | elseif (0xFC == (0xFE & ($in))) 85 | { 86 | // First octet of 6 octet sequence, see comments for 5 octet sequence. 87 | $mUcs4 = ($in); 88 | $mUcs4 = ($mUcs4 & 1) << 30; 89 | $mState = 5; 90 | $mBytes = 6; 91 | } 92 | else 93 | { 94 | // Current octet is neither in the US-ASCII range nor a legal first octet of a multi-octet sequence. 95 | return false; 96 | } 97 | } 98 | else 99 | { 100 | // When mState is non-zero, we expect a continuation of the multi-octet sequence 101 | if (0x80 == (0xC0 & ($in))) 102 | { 103 | // Legal continuation. 104 | $shift = ($mState - 1) * 6; 105 | $tmp = $in; 106 | $tmp = ($tmp & 0x0000003F) << $shift; 107 | $mUcs4 |= $tmp; 108 | 109 | /** 110 | * End of the multi-octet sequence. mUcs4 now contains the final 111 | * Unicode codepoint to be output 112 | */ 113 | if (0 == --$mState) 114 | { 115 | /* 116 | * Check for illegal sequences and codepoints. 117 | */ 118 | // From Unicode 3.1, non-shortest form is illegal 119 | if (((2 == $mBytes) && ($mUcs4 < 0x0080)) || ((3 == $mBytes) && ($mUcs4 < 0x0800)) || 120 | ((4 == $mBytes) && ($mUcs4 < 0x10000)) || (4 < $mBytes) || 121 | // From Unicode 3.2, surrogate characters are illegal 122 | (($mUcs4 & 0xFFFFF800) == 0xD800) || 123 | // Codepoints outside the Unicode range are illegal 124 | ($mUcs4 > 0x10FFFF)) 125 | { 126 | return false; 127 | } 128 | 129 | // Initialize UTF8 cache 130 | $mState = 0; 131 | $mUcs4 = 0; 132 | $mBytes = 1; 133 | } 134 | } 135 | else 136 | { 137 | /** 138 | * ((0xC0 & (*in) != 0x80) && (mState != 0)) 139 | * Incomplete multi-octet sequence. 140 | */ 141 | return false; 142 | } 143 | } 144 | } 145 | 146 | return ($mState == 0); // A string that ends in the middle of a multi-octet sequence is not valid 147 | } 148 | 149 | /** 150 | * Tests whether a string complies as UTF-8. 
This will be much faster than 151 | * utf8_is_valid, but will pass five and six octet UTF-8 sequences, 152 | * which are not supported by Unicode and so cannot be displayed correctly 153 | * in a browser. 154 | * In other words it is not as strict as utf8_is_valid but it's faster. 155 | * If you use it to validate user input, you run the risk 156 | * that attackers will be able to inject 5 and 6 byte sequences (which may 157 | * or may not be a significant risk, depending on what you are doing). 158 | * Note: Does not pass five and six octet UTF-8 sequences anymore in the unit tests. 159 | * 160 | * @see utf8_is_valid 161 | * @see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805 162 | * @param string $str UTF-8 string to check 163 | * @return boolean TRUE if string is valid UTF-8 164 | */ 165 | function utf8_compliant($str) 166 | { 167 | if(empty($str)) 168 | return true; 169 | 170 | // If even just the first character can be matched, when the /u 171 | // modifier is used, then it's valid UTF-8. If the UTF-8 is somehow 172 | // invalid, nothing at all will match, even if the string contains 173 | // some valid sequences 174 | return preg_match('/^.{1}/us', $str, $ar) == 1; 175 | } 176 | --------------------------------------------------------------------------------
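The PCRE-based check used by utf8_compliant() can be tried in isolation against the kinds of malformed input that utf8_is_valid() rejects by hand. The following is a minimal sketch, not part of the library: `sketch_is_utf8()` and the sample labels are hypothetical names introduced for illustration, and it assumes a PHP build whose bundled PCRE rejects non-shortest forms, surrogates and code points above 0x10FFFF, as current releases do.

```php
<?php
// Hypothetical helper (not part of this library): it relies on PCRE refusing
// to process a subject that is not valid UTF-8 when the /u modifier is set.
function sketch_is_utf8($str)
{
    // preg_match() returns 1 when the (empty) pattern matches and
    // false when the subject string is rejected as malformed UTF-8.
    return preg_match('//u', $str) === 1;
}

// Byte strings covering the error classes named in utils/bad.php.
$samples = array(
    'ascii'        => "hello",
    'two-octet'    => "\xC3\xA9",         // U+00E9
    'non-shortest' => "\xC0\xAF",         // overlong encoding of "/"
    'surrogate'    => "\xED\xA0\x80",     // U+D800
    'out-of-range' => "\xF4\x90\x80\x80", // U+110000
    'incomplete'   => "\xE2\x82",         // truncated 3-octet sequence
);

foreach ($samples as $label => $bytes) {
    printf("%-13s %s\n", $label, sketch_is_utf8($bytes) ? 'valid' : 'invalid');
}
```

Unlike utf8_compliant(), this sketch uses an empty pattern, so an empty string or the string "0" needs no special case. Whether five and six octet sequences are rejected depends on the PCRE version in use, which is the caveat the docblock above describes.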