├── LICENSE
├── README.md
└── mysqldump_to_csv.py


/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2014 James Mishra

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# MySQL dump to CSV
## Introduction
This Python script converts MySQL dump files into CSV format. It is optimized to handle extraordinarily large dumps, such as those from Wikipedia.

MySQL dumps contain a series of INSERT statements and can be difficult to import or manipulate, often requiring significant hardware upgrades. This script provides an easy way to convert these dump files into the universal CSV format.

The script takes advantage of the structural similarities between MySQL INSERT statements and CSV files. It uses Python's CSV parser to convert the MySQL syntax into CSV, enabling the data to be read and used more easily.

## Usage
Run `python mysqldump_to_csv.py` followed by the filename of an SQL file. You can specify multiple SQL files, and they will all be concatenated into one CSV stream, which is written to standard output. The script can also read SQL from standard input, which is useful for turning a gzipped MySQL dump into a CSV file without first uncompressing it.

`zcat dumpfile.sql.gz | python mysqldump_to_csv.py`

## How It Works
The following SQL:

    INSERT INTO `page` VALUES (1,0,'April','',1,0,0,0.778582929065,'20140312223924','20140312223929',4657771,20236,0),
    (2,0,'August','',0,0,0,0.123830928525,'20140312221818','20140312221822',4360163,11466,0);

is turned into the following CSV:

    1,0,April,,1,0,0,0.778582929065,20140312223924,20140312223929,4657771,20236,0
    2,0,August,,0,0,0,0.123830928525,20140312221818,20140312221822,4360163,11466,0

Empty strings and `NULL` values are written as a single NUL character (`chr(0)`), which most terminals do not display; that is why the fourth column appears empty.

## License
The code is assembled from other public repositories. The license is believed to be the standard MIT License (see `LICENSE`).

--------------------------------------------------------------------------------
/mysqldump_to_csv.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
import fileinput
import csv
import sys

# This prevents prematurely closed pipes from raising
# an exception in Python
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE, SIG_DFL)

# allow large content in the dump
csv.field_size_limit(sys.maxsize)

def is_insert(line):
    """
    Returns true if the line begins a SQL insert statement.
    """
    return line.startswith('INSERT INTO')


def get_values(line):
    """
    Returns the portion of an INSERT statement containing values
    """
    return line.partition(' VALUES ')[2]


def values_sanity_check(values):
    """
    Ensures that values from the INSERT statement meet basic checks.
    Returns False if the values are empty or do not start with '('.
    """
    # Return a boolean instead of asserting, so the caller's error
    # handling is reachable and the check survives `python -O`.
    return bool(values) and values[0] == '('


def parse_values(values, outfile):
    """
    Given a file handle and the raw values from a MySQL INSERT
    statement, write the equivalent CSV to the file
    """
    latest_row = []

    # MySQL quotes strings with single quotes and escapes with a
    # backslash, so configure the CSV reader to match.
    reader = csv.reader([values], delimiter=',',
                        doublequote=False,
                        escapechar='\\',
                        quotechar="'",
                        strict=True
    )

    writer = csv.writer(outfile, quoting=csv.QUOTE_MINIMAL)
    for reader_row in reader:
        for column in reader_row:
            # If our current string is empty or NULL, store a
            # NUL character as a placeholder.
            if len(column) == 0 or column == 'NULL':
                latest_row.append(chr(0))
                continue
            # If our string starts with an open paren
            if column[0] == "(":
                # If we've been filling out a row
                if len(latest_row) > 0:
                    # Check if the previous entry ended in
                    # a close paren. If so, the row we've
                    # been filling out has been COMPLETED
                    # as:
                    #    1) the previous entry ended in a )
                    #    2) the current entry starts with a (
                    if latest_row[-1][-1] == ")":
                        # Remove the close paren.
                        latest_row[-1] = latest_row[-1][:-1]
                        writer.writerow(latest_row)
                        latest_row = []
                # If we're beginning a new row, eliminate the
                # opening parentheses.
                if len(latest_row) == 0:
                    column = column[1:]
            # Add our column to the row we're working on.
            latest_row.append(column)
        # At the end of an INSERT statement, we'll
        # have the semicolon.
        # Make sure to remove the semicolon and
        # the close paren.
        if latest_row[-1][-2:] == ");":
            latest_row[-1] = latest_row[-1][:-2]
            writer.writerow(latest_row)


def main():
    """
    Parse arguments and start the program
    """
    # Iterate over all lines in all files
    # listed in sys.argv[1:]
    # or stdin if no args given.
    found_insert = False
    try:
        for line in fileinput.input():
            # Skip lines that are not INSERT statements, such as the
            # comments and CREATE TABLE statements found in a dump.
            if not is_insert(line):
                continue
            found_insert = True
            values = get_values(line)
            if not values_sanity_check(values):
                raise Exception("Getting substring of SQL INSERT statement after ' VALUES ' failed!")
            parse_values(values, sys.stdout)
    except KeyboardInterrupt:
        sys.exit(0)
    # Report an error only if the input held no INSERT statement at all.
    if not found_insert:
        raise Exception("SQL INSERT statement could not be found!")

if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
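The README's claim that INSERT statements resemble CSV can be checked in isolation. The following is a minimal sketch, not part of the repository: it feeds a made-up `values` string (a hypothetical sample, shorter than the README example) through `csv.reader` with the same dialect options that `parse_values` uses, to show what the reader hands back before any parenthesis handling.

```python
import csv

# Hypothetical sample: the text that follows ' VALUES ' in an INSERT
# statement, with an empty string, an escaped quote, and a NULL.
values = "(1,0,'April','',1),(2,0,'It\\'s',NULL,0);"

# Same dialect options as parse_values: single-quoted strings,
# backslash escapes, and no doubled-quote escaping.
reader = csv.reader([values], delimiter=',',
                    doublequote=False,
                    escapechar='\\',
                    quotechar="'",
                    strict=True)

# The whole statement comes back as one flat list of columns.
columns = next(reader)
print(columns)
```

The reader strips the quotes and resolves the backslash escape, but it knows nothing about row boundaries: the columns `(1`, `1)`, `(2` and `0);` still carry their parentheses and the trailing semicolon. That appears to be why `parse_values` walks the flat list and uses a `)` followed by a `(` to decide where one row ends and the next begins.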