├── .gitignore ├── config.php.template ├── LICENSE ├── README.md └── mysql-convert-latin1-to-utf8.php /.gitignore: -------------------------------------------------------------------------------- 1 | config.php 2 | -------------------------------------------------------------------------------- /config.php.template: -------------------------------------------------------------------------------- 1 | 'utf8_bin', 21 | 'latin1_general_ci' => 'utf8_unicode_ci', 22 | 'latin1_swedish_ci' => 'utf8_unicode_ci', 23 | 'utf8_general_ci' => 'utf8_unicode_ci', 24 | ); 25 | 26 | // TODO: Database information 27 | $dbHost = 'localhost'; 28 | $dbName = ''; 29 | $dbUser = ''; 30 | $dbPass = ''; 31 | 32 | if ($dbPass == '') { 33 | echo "DB password:"; 34 | $dbPass = trim(fgets(STDIN)); 35 | }; 36 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2012 Nic Jansma, http://nicj.net 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining 4 | a copy of this software and associated documentation files (the 5 | "Software"), to deal in the Software without restriction, including 6 | without limitation the rights to use, copy, modify, merge, publish, 7 | distribute, sublicense, and/or sell copies of the Software, and to 8 | permit persons to whom the Software is furnished to do so, subject to 9 | the following conditions: 10 | 11 | The above copyright notice and this permission notice shall be 12 | included in all copies or substantial portions of the Software. 13 | 14 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 15 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 16 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 17 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE 18 | LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION 19 | OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 20 | WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Copyright (c) 2013 Nic Jansma 2 | [http://nicj.net](http://nicj.net) 3 | 4 | See [nicj.net](http://nicj.net/2011/04/17/mysql-converting-an-incorrect-latin1-column-to-utf8) for a description of the problem 5 | and how this script aids in correcting the issue. 6 | 7 | # Introduction 8 | 9 | This script automates the conversion of any UTF-8 data stored in MySQL latin1 columns to proper UTF-8 columns. 10 | 11 | I've modified [fabio's script](http://www.varesano.net/blog/fabio/latin1%20encoded%20tables%20or%20databases%20utf8%20data%20stored%20convert%20them%20native%20mysql%20utf8%20tables) to automate the conversion for all of the latin1 columns for whatever database you 12 | configure it to look at. It converts the columns first to the proper BINARY cousin, then to utf8\_general\_ci, while 13 | retaining the column lengths, defaults and NULL attributes. 14 | 15 | Warning: This script assumes you know you have UTF-8 characters in a latin1 column. Please test your changes before blindly running the script! 16 | 17 | Here are the steps you should take to use the script: 18 | 19 | # Determine Which Columns Need Updating 20 | 21 | If you're like me, you may have a mixture of latin1 and UTF-8 columns in your databases. Not all of the columns in my 22 | database needed to be updated from latin1 to UTF-8. For example, some of the tables belonged to other PHP apps on the 23 | server, and I only wanted to update the columns that I knew had to be fixed. The script will currently convert all of 24 | the tables for the specified database - you could modify the script to change specific tables or columns if you need. 25 | 26 | Additionally, the script will only update appropriate text-based columns. Character sets are only appropriate for some 27 | types of data: CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT. Other column types such as numeric (INT) and 28 | BLOBs do not have a "character set". 29 | 30 | ENUM and SET column types can be converted **only** if all of the enum possibilities only use characters in the 0-127 ASCII 31 | character set. If you have ENUMs or SETs that satisfy this criteria, look for the relevant `TODO:` in the script. 32 | 33 | You can see what character sets your columns are using via the MySQL Administration tool, phpMyAdmin, or even using a 34 | SQL query against the information\_schema: 35 | 36 | mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable" AND COLLATION_NAME LIKE "latin1%"; 37 | ... 38 | 115 rows in set (0.03 sec) 39 | 40 | # Test Convert the Columns 41 | 42 | You should test all of the changes before committing them to your database. 43 | 44 | The first thing to test is that the SQL generated from the conversion script is correct. To do this, you can dump the structure of your database: 45 | 46 | server> mysqldump --no-data -h localhost -u dbuser -p mydatabase > structure.sql 47 | 48 | And import this structure to another test MySQL database: 49 | 50 | server> mysql -u dbuser -p mydatabase_test < structure.sql 51 | 52 | Next, run the conversion script (below) against your temporary database: 53 | 54 | server> php -f mysql-convert-latin1-to-utf8.php 55 | 56 | The script will spit out "!!! ERROR" statements if a change fails. If you encounter ERRORs, modifications may be needed based on your requirements. Some of the common problems are listed in Step 3. 57 | 58 | After you run the script against your temporary database, check the information\_schema tables to ensure the conversion was successful: 59 | 60 | mysql> SELECT * FROM COLUMNS WHERE TABLE_SCHEMA = "MyTable"; 61 | 62 | As long as you see all of your columns in UTF8, you should be all set! 63 | 64 | # Problems You May Encounter 65 | 66 | Some of the issues you may encounter: 67 | 68 | ## FULLTEXT indexes 69 | 70 | I have several columns with FULLTEXT indexes on them. The ALTER TABLE to BINARY command for a column that has a FULLTEXT index will cause an error: 71 | 72 | mysql> ALTER TABLE MyTable MODIFY MyColumn BLOB; 73 | ERROR 1283 (HY000): Column 'MyColumn' cannot be part of FULLTEXT index 74 | 75 | The simple solution I came up with was to modify the script to drop the index prior to the conversion, and restore it afterward: 76 | 77 | ALTER TABLE MyTable DROP INDEX `mycolumn_fulltext` 78 | 79 | ... (convert all columns) ... 80 | 81 | ALTER TABLE MyTable ADD FULLTEXT KEY `mycolumn_fulltext` (`MyColumn`) 82 | 83 | There are TODOs listed in the script where you should make these changes. 84 | 85 | ## Invalid UTF-8 data 86 | 87 | Since my database was over 5 years old, it had acquired some cruft over time. I'm not sure exactly how this happened, but some of the columns had data that are not valid UTF-8 encodings, though they were valid latin1 characters. I believe this occurred before I hardened my PHP application to reject non-UTF-8 data, but I'm not sure. I found this out when initially trying to do the conversion: 88 | 89 | mysql> ALTER TABLE MyTable MODIFY MyColumn VARBINARY(3000) NOT NULL DEFAULT ''; 90 | Query OK, 21171 rows affected (0.66 sec) 91 | 92 | mysql> ALTER TABLE MyTable MODIFY MyColumn varchar(3000) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT ''; 93 | ERROR 1366 (HY000): Incorrect string value: '\xE2\x80? fl...' for column 'MyColumn' at row 128 94 | 95 | What's going on? 96 | 97 | At some point, a character sequence that contained invalid UTF-8 characters was entered into the database, and now MySQL refuses to call the column VARCHAR (as UTF-8) because it has these invalid character sequences. 98 | 99 | I checked the HTML representation of this column in my PHP website, and sure enough, the garbage shows up there too: 100 | 101 | ?? flown 102 | 103 | The ? is the actual character that your browser shows. Not the best user experience, and definitely not the correct character. 104 | 105 | I fixed that single row (via phpMyAdmin), and ran the ALTER TABLE MODIFY command again, and hit the same issue, another row. Looks like there is more than a single corrupt row. 106 | 107 | I found a good way of rooting out all of the columns that will cause the conversion to fail. If you SELECT CONVERT (MyColumn USING utf8) as a new column, any NULL columns returned are columns that would cause the ALTER TABLE to fail. 108 | 109 | For example: 110 | 111 | mysql> SELECT MyID, MyColumn, CONVERT(MyColumn USING utf8) 112 | FROM MyTable 113 | WHERE CONVERT(MyColumn USING utf8) IS NULL 114 | ... 115 | 5 rows in set, 10 warnings (0.05 sec) 116 | 117 | This showed me the specific rows that contained invalid UTF-8, so I hand-edited to fix them. You could manually NULL them out using an UPDATE if you're not afraid of losing data. I had to do this for 6 columns out of the 115 columns that were converted. Only 30 rows in total were corrupt. 118 | 119 | You may also want to [use `utf8mb4` instead of `utf8`](https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434) as your collation. This means you would set `$defaultCollation='utf8mb4_unicode_ci';`. 120 | 121 | # Usage 122 | 123 | First, read over the script and make sure you understand what it does. If you don't understand what it's doing, you 124 | probably shouldn't run it. 125 | 126 | Next, check all of the `TODO:`s in the script. You will need to make some changes to get it to work. 127 | 128 | At this point, it may take some guts for you to hit the go button on your live database. 129 | 130 | php -f mysql-convert-latin1-to-utf8.php 131 | 132 | Personally, I ran the script against a test (empty) database, then a copy of my live data, then a staging server before finally executing it on the live data. 133 | 134 | Warning: Please be careful when using the script and test, test, test before committing to it! 135 | 136 | # Version History 137 | 138 | * v1.0 - 2011-04-17: Initial release 139 | * v1.1 - 2013-01-25: Added possible ENUM support via [patrick-mcdougle](https://github.com/patrick-mcdougle) 140 | * v1.2 - 2013-03-26: Added SET support and the ability to convert from multiple collations, as well as bulk-doing conversion in one statement for quicker changes via [Synchro](https://github.com/Synchro) 141 | * v1.3 - 2017-05-06: Allows for `config.php` separate from script via [bderubinat](https://github.com/bderubinat) 142 | 143 | # Credits 144 | 145 | Initially based on [fabio's script](http://www.varesano.net/blog/fabio/latin1%20encoded%20tables%20or%20databases%20utf8%20data%20stored%20convert%20them%20native%20mysql%20utf8%20tables). 146 | 147 | Modified by Nic Jansma 148 | 149 | Contributions by: 150 | * [patrick-mcdougle](https://github.com/patrick-mcdougle) 151 | * [Synchro](https://github.com/Synchro) 152 | -------------------------------------------------------------------------------- /mysql-convert-latin1-to-utf8.php: -------------------------------------------------------------------------------- 1 | 19 | * @copyright 2013 Nic Jansma 20 | * @link http://www.nicj.net 21 | */ 22 | 23 | // TODO: Pretend-mode -- if set to true, no SQL queries will be executed. Instead, they will only be echo'd 24 | // to the console. 25 | $pretend = true; 26 | 27 | // TODO: Should SET and ENUM columns be processed? 28 | $processEnums = false; 29 | 30 | // TODO: The collation you want to convert the overall database to 31 | $defaultCollation = 'utf8_unicode_ci'; 32 | 33 | // TODO Convert column collations and table defaults using this mapping 34 | // latin1_swedish_ci is included since that's the MySQL default 35 | $collationMap = 36 | array( 37 | 'latin1_bin' => 'utf8_bin', 38 | 'latin1_general_ci' => 'utf8_unicode_ci', 39 | 'latin1_swedish_ci' => 'utf8_unicode_ci', 40 | ); 41 | 42 | // TODO: Database information 43 | $dbHost = 'localhost'; 44 | $dbName = ''; 45 | $dbUser = ''; 46 | $dbPass = ''; 47 | 48 | if (file_exists('config.php')) { 49 | require_once('config.php'); 50 | } 51 | 52 | if ($dbPass == '') { 53 | echo 'DB password:'; 54 | $dbPass = trim(fgets(STDIN)); 55 | }; 56 | 57 | $mapstring = ''; 58 | foreach ($collationMap as $s => $t) { 59 | $mapstring .= "'$s',"; 60 | } 61 | 62 | // Strip trailing comma 63 | $mapstring = substr($mapstring, 0, -1); 64 | echo $mapstring; 65 | 66 | // Open a connection to the information_schema database 67 | $infoDB = new mysqli($dbHost, $dbUser, $dbPass); 68 | $infoDB->select_db('information_schema'); 69 | 70 | // Open a second connection to the target (to be converted) database 71 | $targetDB = new mysqli($dbHost, $dbUser, $dbPass); 72 | $targetDB->select_db($dbName); 73 | 74 | // 75 | // TODO: FULLTEXT Indexes 76 | // 77 | // You may need to drop FULLTEXT indexes before the conversion -- execute the drop here. 78 | // eg. 79 | // sqlExec($targetDB, "ALTER TABLE MyTable DROP INDEX `my_index_name`", $pretend); 80 | // 81 | // If so, you should restore the FULLTEXT index after the conversion -- search for 'TODO' 82 | // later in this script. 83 | // 84 | 85 | // Get all tables in the specified database 86 | $tables = sqlObjs($infoDB, 87 | "SELECT TABLE_NAME, TABLE_COLLATION 88 | FROM TABLES 89 | WHERE TABLE_SCHEMA = '$dbName'"); 90 | 91 | foreach ($tables as $table) { 92 | $tableName = $table->TABLE_NAME; 93 | $tableCollation = $table->TABLE_COLLATION; 94 | 95 | // Find all columns whose collation is of one of $mapstring's source types 96 | $cols = sqlObjs($infoDB, 97 | "SELECT * 98 | FROM COLUMNS 99 | WHERE TABLE_SCHEMA = '$dbName' 100 | AND TABLE_Name = '$tableName' 101 | AND COLLATION_NAME IN($mapstring) 102 | AND COLLATION_NAME IS NOT NULL"); 103 | 104 | $intermediateChanges = array(); 105 | $finalChanges = array(); 106 | 107 | foreach ($cols as $col) { 108 | // If this column doesn't use one of the collations we want to handle, skip it 109 | if (!array_key_exists($col->COLLATION_NAME, $collationMap)) { 110 | continue; 111 | } else { 112 | $targetCollation = $collationMap[$col->COLLATION_NAME]; 113 | } 114 | 115 | // Save current column settings 116 | $colName = $col->COLUMN_NAME; 117 | $colCollation = $col->COLLATION_NAME; 118 | $colType = $col->COLUMN_TYPE; 119 | $colDataType = $col->DATA_TYPE; 120 | $colLength = $col->CHARACTER_OCTET_LENGTH; 121 | $colNull = ($col->IS_NULLABLE === 'NO') ? 'NOT NULL' : ''; 122 | 123 | $colDefault = ''; 124 | if ($col->COLUMN_DEFAULT !== null) { 125 | $colDefault = "DEFAULT '{$col->COLUMN_DEFAULT}'"; 126 | } 127 | 128 | // Determine the target temporary BINARY type 129 | $tmpDataType = ''; 130 | switch (strtoupper($colDataType)) { 131 | case 'CHAR': 132 | $tmpDataType = 'BINARY'; 133 | break; 134 | 135 | case 'VARCHAR': 136 | $tmpDataType = 'VARBINARY'; 137 | break; 138 | 139 | case 'TINYTEXT': 140 | $tmpDataType = 'TINYBLOB'; 141 | break; 142 | 143 | case 'TEXT': 144 | $tmpDataType = 'BLOB'; 145 | break; 146 | 147 | case 'MEDIUMTEXT': 148 | $tmpDataType = 'MEDIUMBLOB'; 149 | break; 150 | 151 | case 'LONGTEXT': 152 | $tmpDataType = 'LONGBLOB'; 153 | break; 154 | 155 | // 156 | // TODO: If your database uses the enum type it is safe to uncomment this block if and only if 157 | // all of the enum possibilities only use characters in the 0-127 ASCII character set. 158 | // 159 | case 'SET': 160 | case 'ENUM': 161 | $tmpDataType = 'SKIP'; 162 | if ($processEnums) { 163 | // ENUM data-type isn't using a temporary BINARY type -- just convert its column type directly 164 | $finalChanges[] = "MODIFY `$colName` $colType COLLATE $defaultCollation $colNull $colDefault"; 165 | } 166 | break; 167 | 168 | default: 169 | $tmpDataType = ''; 170 | break; 171 | } 172 | 173 | // any data types marked as SKIP were already handled 174 | if ($tmpDataType === 'SKIP') { 175 | continue; 176 | } 177 | 178 | if ($tmpDataType === '') { 179 | print "Unknown type! $colDataType\n"; 180 | exit; 181 | } 182 | 183 | // Change the column definition to the new type 184 | $tempColType = str_ireplace($colDataType, $tmpDataType, $colType); 185 | 186 | // Convert the column to the temporary BINARY cousin 187 | $intermediateChanges[] = "MODIFY `$colName` $tempColType $colNull"; 188 | 189 | // Convert it back to the original type with the correct collation 190 | $finalChanges[] = "MODIFY `$colName` $colType COLLATE $targetCollation $colNull $colDefault"; 191 | } 192 | 193 | if (array_key_exists($tableCollation, $collationMap)) { 194 | $finalChanges[] = 'DEFAULT COLLATE ' . $collationMap[$tableCollation]; 195 | } 196 | 197 | // Now run the conversions 198 | if (count($intermediateChanges) > 0) { 199 | sqlExec($targetDB, "ALTER TABLE `$dbName`.`$tableName`\n". implode(",\n", $intermediateChanges), $pretend); 200 | } 201 | 202 | if (count($finalChanges) > 0) { 203 | sqlExec($targetDB, "ALTER TABLE `$dbName`.`$tableName`\n". implode(",\n", $finalChanges), $pretend); 204 | } 205 | } 206 | 207 | // 208 | // TODO: Restore FULLTEXT indexes here 209 | // eg. 210 | // sqlExec($targetDB, "ALTER TABLE MyTable ADD FULLTEXT KEY `my_index_name` (`mycol1`)", $pretend); 211 | // 212 | 213 | // Set the default collation 214 | sqlExec($infoDB, "ALTER DATABASE `$dbName` COLLATE $defaultCollation", $pretend); 215 | 216 | // Done! 217 | 218 | // 219 | // Functions 220 | // 221 | /** 222 | * Executes the specified SQL 223 | * 224 | * @param object $db Target SQL connection 225 | * @param string $sql SQL to execute 226 | * @param boolean $pretend Pretend mode -- if set to true, don't execute query 227 | * 228 | * @return SQL result 229 | */ 230 | function sqlExec($db, $sql, $pretend = false) 231 | { 232 | echo "$sql;\n"; 233 | if ($pretend === false) { 234 | $res = $db->query($sql); 235 | if ($res === false) { 236 | $error = $db->error_list[0]['error']; 237 | print "!!! ERROR: $error\n"; 238 | } 239 | } 240 | return $res; 241 | } 242 | 243 | /** 244 | * Gets the SQL back as objects 245 | * 246 | * @param object $db Target SQL connection 247 | * @param string $sql SQL to execute 248 | * 249 | * @return SQL objects 250 | */ 251 | function sqlObjs($db, $sql) 252 | { 253 | $res = sqlExec($db, $sql); 254 | 255 | $a = array(); 256 | 257 | if ($res !== false) { 258 | while ($obj = $res->fetch_object()) { 259 | $a[] = $obj; 260 | } 261 | } 262 | 263 | return $a; 264 | } 265 | 266 | ?> 267 | --------------------------------------------------------------------------------