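├── README.md
└── mypumpkin.py


/README.md:
--------------------------------------------------------------------------------
# The magic that makes mysqldump export and import concurrently

## 1. Introduction
mypumpkin is a Python wrapper that lets mysqldump export databases and tables with multiple threads, and then lets the mysql command import them into a new database with multiple threads. It speeds up exports, and especially imports, several times over. All you need to do is put `mypumpkin.py` in front of your mysqldump or mysql command, which is why it is called magic.

The program grew out of a need to move several hundred GB of data from a single production database to a new one while doing some special processing along the way (such as character set conversion); the import speed of a plain mysqldump dump was intolerable. Some may ask why not use mydumper. I did try it but gave up, for these reasons:

1. The character set cannot be set
  mydumper forces a binary connection so that it does not have to care about character sets during backup and restore, whereas my scenario specifically requires exporting with one character set and importing with another. While writing this program I happened to see an article pushed by NetEase on its WeChat official account ([解密网易MySQL实例迁移高效完成背后的黑科技](http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650756926&idx=1&sn=b8081a8ae9456a6051d1ba519febee54&chksm=f3f9e2abc48e6bbd5912edb4e6207ff6ec5bf7123fedbf10b5c65a43146af22845dbf0787b39&scene=0#wechat_redirect)) saying that their improved mydumper supports setting the character set, but I still could not find that in the patch for version 0.9.1.
2. No filtering options as flexible as mysqldump's (which tables to export, which tables to ignore)
  The data volume is huge and nearly 70% of it is history tables that no longer change; these tables can be exported and converted ahead of time. There are also a few single tables larger than 50 GB, which are best exported and converted separately. mydumper does not offer the flexibility that mysqldump does.
3. Support for other things, such as skipping GTID information or triggers on export
  Exporting from Alibaba Cloud RDS 5.6 requires set-gtid-purged=OFF

Others may mention mysqlpump, which is what I think mysqldump should have looked like: compatible syntax and table-level concurrent export. But it is only supported with MySQL server 5.7.9 and above; such is the gap between ideal and reality.

## 2. Implementation
First of all, mysqldump's export speed is not slow: in testing it reached about 50 MB/s, so 10 GB of data takes roughly 3 minutes. The bottleneck there is network and disk I/O, and no export tool can go much faster. The import, however, took 60 minutes while using only about 20% of the disk and network; the bottleneck is the write speed of the target database (ordinary sequential writes do not reach the IOPS limit). That is why mypumpkin was born: to combine the import speed of myloader with the export flexibility of mysqldump.

A single queue is built in Python and all tables to export are put into it at once. N Python threads are started at the same time; each takes table names from this Queue and uses subprocess to call the operating system's mysqldump command, exporting the data to a file named dbname.tablename.sql. Load in works like dump out: based on the specified database or table names, all matching sql files are found in the dump_dir directory and pushed into the queue, and N threads concurrently build new mysql commands that emulate the `<` operation.

Option parsing was changed from hand-written parsing to the argparse module, which amounted to an almost complete refactoring.
When `--tables` is not specified, the program queries the database for all table names, filters them, and puts the result into the queue.

When loading into the target database, the options are as rich as for dump out: you can specify which dbs and which tables to import, and which tables to ignore.

The key point is staying compatible with the original mysqldump. The table-related options (`-B`, `-A`, `--tables`, `--ignore-table=`) have to be analyzed and recombined into new commands, and a great many corner cases have to be considered.

## 3. Limitations
1. **Important**: the exported data is not guaranteed to be consistent at the database level
    1. History tables that never change are not affected
    2. Consistency within a single table can be guaranteed; that depends on which options mysqldump itself is run with
    3. Different tables are exported by different mysqldump commands, so no transaction can span them.

  In my case, fellow developers helped with a binlog parsing program that replays all changes after the transfer is done, which guarantees eventual consistency.
  Also, in many cases the exported data does not need to be complete or consistent: it is only used for offline analysis or an ad hoc export, and what matters is getting the data to developers quickly.
2. Recognition of unusual option forms
  The program tries hard to stay compatible with the mysqldump command line: add mypumpkin.py, specify dump-dir, and the concurrency magic is done. Some argument forms are awkward to parse, though, and are not supported yet:
  ```
  db1 table1 table2
  db2 db3
  ```
  With the forms above, it is impossible to tell from the command line whether db1 and table1 are database names or table names. Just remember that at least one of the three groups "[-A|-B], [--tables], [--ignore-table]" must appear: change `db1 table1 table2` to `db1 --tables table1 table2`, and `db2 db3` to `-B db2 db3`.
3. The password can currently only be given explicitly on the command line

## 4. Usage
Developed on Python 2.7; other versions are untested. The MySQLdb library must be installed.

### 4.1 help
```
./mypumpkin.py --help
Only mysqldump or mysql allowed after mypumpkin.py

usage: mypumpkin.py {mysqldump|mysql} [--help]

This's a program that wrap mysqldump/mysql to make them dump-out/load-in
concurrently. Attention: it can not keep consistent for whole database(s).

optional arguments:
  --help                show this help message and exit
  -B db1 [db1 ...], --databases db1 [db1 ...]
                        Dump one or more databases
  -A, --all-databases   Dump all databases
  --tables t1 [t1 ...]  Specify tables to dump. Override --databases (-B)
  --ignore-table db1.table1 [db1.table1 ...]
                        Do not dump the specified table. (format like
                        --ignore-table=dbname.tablename). Use the directive
                        multiple times for more than one table to ignore.
  --threads =N          Threads to dump out [2], or load in [CPUs*2].
  --dump-dir DUMP_DIR   Required. Directory to dump out (create if not exist),
                        Or Where to load in sqlfile

At least one of these 3 group options given: [-A,-B] [--tables] [--ignore-table]
```

- `--dump-dir` is required; the shell redirections `>` and `<` used with the original commands are not allowed. If the directory given by dump-dir does not exist, the program tries to create it.
- `--threads=N`: N is the number of concurrent export or import threads. Dump out defaults to 2 threads; load in defaults to the number of CPUs * 2.
  Note: more threads is not always better. The main limits are network bandwidth, disk I/O and the IOPS of the target database; it is best to watch them with dstat.
- `-B`, `--tables` and `--ignore-table` are used the same way as in mysqldump, for example:
  1. In mysqldump, `--tables` overrides the `--databases/-B` option
  2. In mysqldump, `--tables` and `--ignore-table` cannot be used together
  3. In mysqldump, if `-B` is not given, `--tables` or `--ignore-table` must come right after the db name
- All other options are kept untouched by mypumpkin and passed to the shell for execution. If one of them is wrong, the check is therefore left to the native mysqldump, and a thread exits as soon as one of its commands fails.

### 4.2 example
Export:
```
## Export all dbs of the source instance to the visit_dumpdir2 directory (excluding information_schema and performance_schema)
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 --single-transaction --opt -A --dump-dir visit_dumpdir2

## Export db1 and db2 from the source; all table names are queried from the source database and then filtered
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 --single-transaction --opt -B db1 db2 --dump-dir visit_dumpdir2

## Export only tables t1 and t2 of db1; a message is shown if a specified table does not exist
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 --single-transaction --opt -B db1 --tables t1 t2 --dump-dir visit_dumpdir2

## Export db1 and db2 but ignore tables db1.t1, db2.t2 and db2.t3 (use -A instead of -B to mean all dbs)
## mysqldump only supports the form --ignore-table=db1.t1, repeated once per table. A compatible extension is made here
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword --single-transaction \
 --opt -B db1 db2 --ignore-table=db1.t1 --ignore-table db2.t2 db2.t3 --dump-dir visit_dumpdir2

## Without -A/-B
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 --single-transaction --opt db1 --ignore-table=db1.t1 --dump-dir=visit_dumpdir2

## Other options are passed through untouched
$ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 --single-transaction --set-gtid-purged=OFF --no-set-names --skip-add-locks -e -q -t -n --skip-triggers \
 --max-allowed-packet=134217728 --net-buffer-length=1638400 --default-character-set=latin1 \
 --insert-ignore --hex-blob --no-autocommit \
 db1 --tables t1 --dump-dir visit_dumpdir2
```

Import:
`-A`, `-B`, `--tables`, `--ignore-table`, `--threads` and `--dump-dir` are used exactly as above and have the same effect. Some examples:

```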
## Import all tables found in the dump-dir directory
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 -A \
 --dump-dir=visit_dumpdir2

## Import db1 (all tables)
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 -B db1 \
 --dump-dir=visit_dumpdir2

## Import only table db1.t1
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 \
 --default-character-set=utf8mb4 --max-allowed-packet=134217728 --net-buffer-length=1638400 \
 -B db1 --tables t1 --dump-dir=visit_dumpdir2

## Import db1 and db2 but ignore table db1.t1 (the dump-dir directory is checked for matching tables of db1 and db2; the target database is not checked)
$ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 \
 -B db1 db2 --ignore-table=db1.t1 --dump-dir=visit_dumpdir2
```

## 5. Speed comparison

--------------------------------------------------------------------------------
/mypumpkin.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
#coding:utf-8
"""
Author: seanlook
Contact: seanlook7@gmail  http://seanlook.com
Date: 2016-11-02 released
"""
import sys, os
from Queue import Queue
import time
import subprocess
import MySQLdb
from threading import Thread
from collections import defaultdict
# import argparse
from argparse import ArgumentParser
from multiprocessing import cpu_count

MYCMD_NEW = []  # handled mysqldump/load
MYQUEUE = Queue()


"""
Parent class for option parsing
    MYCMD_NEW is what is left after option handling; it is later combined with the table names
    mycmd is the complete mysqldump or mysql command line
"""
class NewOptions(object):
    def __init__(self, mycmd):
        global MYCMD_NEW
        self.mycmd = mycmd

        # Check that the command right after mypumpkin.py is valid
        work_mode = ''
        try:
            if mycmd[1] == 'mysqldump':
                work_mode = 'DUMP'
            elif mycmd[1] == 'mysql':
                work_mode = 'LOAD'
            else:
                print "Only mysqldump or mysql allowed after mypumpkin.py\n"
                # myparser will do the next
        except IndexError:
            pass
            #help_parser = self.parse_myopt()
            #ArgumentParser.error(help_parser, help_parser.print_help())

        myparser = self.parse_myopt(work_mode)

        self.myopts, MYCMD_NEW = myparser.parse_known_args(mycmd)
        print "myparse options handling tables&dbs: ", self.myopts

        self.threads = self.myopts.threads[0]
        self.dumpdir = self.get_dumpdir(work_mode)

    # Get the directory given for dump out or load in
    def get_dumpdir(self, work_mode):
        dump_dir = self.myopts.dump_dir[0]
        if dump_dir == "":
            print "You must specify --dump-dir=xxx. (not support '>')"
            sys.exit(-1)
        elif not os.path.exists(dump_dir):
            if work_mode == 'DUMP':
                print "The specified dump-dir %s does not exist, the program will try to create it for you." % dump_dir
                try:
                    os.makedirs(dump_dir)
                except OSError:
                    print "Failed to create directory %s" % dump_dir
                    sys.exit(-1)
            elif work_mode == 'LOAD':
                # the original message was missing its format argument
                print "The specified dump-dir %s does not exist" % dump_dir
                sys.exit(-1)
        return dump_dir

    # Build and return the argument parser; __init__ uses it to parse the command line
    def parse_myopt(self, work_mode=''):
        parser = ArgumentParser(description="This's a program that wrap mysqldump/mysql to make them dump-out/load-in concurrently.\n"
                                            "Attention: it can not keep consistent for whole database(s).",
                                add_help=False,
                                usage='%(prog)s {mysqldump|mysql} [--help]',
                                epilog="At least one of these 3 group options given: [-A,-B] [--tables] [--ignore-table]")  # , allow_abbrev=False)
        group1 = parser.add_mutually_exclusive_group()
        group2 = parser.add_mutually_exclusive_group()

        # Load defaults to twice the number of CPU cores; dump defaults to 2 threads
        num_threads = cpu_count() * 2
        if work_mode == 'DUMP':
            num_threads = 2
        # parser.add_argument('mysql_cmd', choices=['mysqldump', 'mysql'])
        parser.add_argument('--help', action='help', help='show this help message and exit')

        group1.add_argument('-B', '--databases',
                            nargs='+', metavar='db1', help='Dump one or more databases')
        group1.add_argument('-A', '--all-databases', action='store_true', help='Dump all databases')
        group2.add_argument('--tables', nargs='+', metavar='t1',
                            help='Specify tables to dump. Override --databases (-B)')
        group2.add_argument('--ignore-table', nargs='+', metavar='db1.table1', action='append',
                            help='Do not dump the specified table. (format like --ignore-table=dbname.tablename). '
                                 'Use the directive multiple times for more than one table to ignore.')
        parser.add_argument('--threads', nargs=1, metavar='=N', default=[num_threads], type=int, help='Threads to dump out [2], or load in [CPUs*2].')
        parser.add_argument('--dump-dir', nargs=1, required=True, action='store', help='Required. Directory to dump out (create if not exist), Or Where to load in sqlfile')

        # print parser.parse_args(mydump_cmd[2:])
        return parser  # .parse_args()

    """
    Analyze -A, -B, --tables, --ignore-table
    Returns, as parsed from the command line: the dbs to handle ([] means neither -A nor -B was given),
    the tables to handle, and tables_tag (whether the tables are an include list or an ignore list)
    This method is only called from the subclasses
    """
    def get_tables_opt(self):
        global MYCMD_NEW

        print "Start to handle your table relevant options..."
        opt_dbs = self.myopts.databases
        opt_is_alldbs = self.myopts.all_databases
        opt_tables = self.myopts.tables
        opt_ignores = self.myopts.ignore_table

        len_dbs = len(opt_dbs) if opt_dbs is not None else 0
        len_alldbs = 1 if opt_is_alldbs else 0
        len_tables = len(opt_tables) if opt_tables is not None else 0
        len_ignores = len(opt_ignores) if opt_ignores is not None else 0

        """ 5 cases
        1. -B db1 db2  or  -A
        2. -B db1 --table t1 t2
        3. -B db1 db2 --ignore-table db1.t1 db1.t2 --ignore-table db2.t1 db2.t2  or  -A --ignore...
        4. db1 --ignore-table=db1.t1 --ignore-table=db1.t2
        5. db1 --tables t1 t2

        db1 t1 t2  not support
        db1  not support
        at least one of --tables, -B, --ignore-table must appear
        only one of --tables and --ignore-table may appear
        only one of -A and -B may appear
        --tables, --ignore-table must come right after the implicit db name
        """

        if len_tables + len_ignores + len_dbs + len_alldbs == 0:
            print "Error: at least one of [--tables, --ignore-table, -B, -A] is specified!"
            sys.exit(-1)

        tables_handler = []  # --tables, --ignore-table, --B d1 d2 dbname.*
        dbname_list = []
        tables_tag = 'db-include'  # ignore-table databases all-databases

        if (len_alldbs > 0 or len_dbs > 1) and len_tables > 0:
            print "Error: --tables only be specified with one databases"
            sys.exit(-1)
        elif len_dbs + len_alldbs == 0:  # cases 4 and 5: no db given explicitly
            for table_opt in self.mycmd:
                if table_opt.startswith('--tables') or table_opt.startswith('--ignore-table'):
                    # the implicit db name is the argument right before the table option
                    pos_table_opt = self.mycmd.index(table_opt)
                    pos_dbname = pos_table_opt - 1
                    dbname = self.mycmd[pos_dbname]

                    if dbname.startswith('-'):
                        print "Error: Please give the right database name"
                        sys.exit(-1)
                    else:
                        dbname_list = [dbname]
                        MYCMD_NEW.remove(dbname)

                    break
        else:
            # tables_tag = 'include'
            if opt_dbs is not None:
                dbname_list = opt_dbs
            elif opt_is_alldbs:
                dbname_list = []
            else:
                print "no right databases given. this should never be print"
        print "mypumpkin>> This is the databases detected: ", dbname_list

        if opt_tables is not None:  # cases 5 and 2
            for tab in opt_tables:
                tables_handler.append(dbname_list[0] + "." + tab)
            tables_tag = 'include-tab'
        elif opt_ignores is not None:  # cases 4 and 3
            for tabs in opt_ignores:
                for db_tab in tabs:
                    tables_handler.append(db_tab)
            tables_tag = 'db-exclude'
        print "mypumpkin>> This is the tables (%s) detected: %s" %(tables_tag, tables_handler)

        MYCMD_NEW = MYCMD_NEW[1:]  # strip the wrapper (mypumpkin.py itself)
        # print "MYCMD_NEW ready:", MYCMD_NEW
        return dbname_list, tables_handler, tables_tag

# Load in with mysql; inherits from NewOptions
class MyLoad(NewOptions):
    # Call the parent's get_tables_opt and look up the sql files already in dumpdir
    # to get the final dict of tables to import
    def handle_tables_options(self):
        dbname_list, tables_list, tables_tag = self.get_tables_opt()

        # all_tables_os holds every db and table found under dumpdir on the operating system
        all_tables_os = defaultdict(list)
        for dirName, subdirList, fileList in os.walk(self.dumpdir):
            for fname in fileList:
                fname_list = fname.split(".")
                if fname_list[-1] == "sql":
                    schema_name, table_name = fname_list[0], fname_list[1]
                    all_tables_os[schema_name].append(table_name)
        # print "all_tables_os: ", all_tables_os

        if tables_tag == 'include-tab':  # [-B] db1 --table t1
            all_tables = defaultdict(list)
            for st_name in tables_list:
                db_name, tb_name = st_name.split(".")
                if tb_name in all_tables_os[db_name]:
                    all_tables[db_name].append(tb_name)
                else:
                    print "Error: can not find dumped file for table [%s]" % st_name
                    sys.exit(-1)
            all_tables_os = all_tables  # include
        elif tables_tag.startswith('db-'):  # -B db1 db2 (-A)
            # all_tables = self.get_tables_from_db()  # get all tables from the db
            if len(dbname_list) != 0:  # not -A
                set_db_notexist = set(dbname_list) - set(all_tables_os.keys())
                if set_db_notexist:
                    print "Error: Db [%s] do not dumped" % ",".join(set_db_notexist)
                    sys.exit(-1)
                for db_l in all_tables_os.keys():
                    if db_l not in dbname_list:
                        del all_tables_os[db_l]  # drop dbs not named by -B

            if tables_tag == 'db-exclude':  # db1 --ignore-table db1.t1, -B db1 [db2] --ignore-table (-A)
                for st_name in tables_list:
                    db_name, tb_name = st_name.split(".")
                    try:
                        all_tables_os[db_name].remove(tb_name)
                    except ValueError:
                        print "Error: can not get ignored table [%s] from dumped directory [%s] " % (st_name, self.dumpdir)
                        sys.exit(-1)

        return all_tables_os

    # Put the result of handle_tables_options into the global queue
    def queue_myload_tables(self):
        global MYQUEUE

        tables_dict = self.handle_tables_options()
        # print "Tables to load: ", tables_dict

        for db, tabs in tables_dict.items():
            for tab in tabs:
                MYQUEUE.put("{0}.{1}".format(db, tab))

        print "mypumpkin>> tables waiting to load in have queued"

    """
    Take a table name from the queue and start an OS process to load it in
    Each thread calls this method, which loops until the queue is empty
    """
    def do_process(self):
        global MYQUEUE
        while True:
            # NOTE: empty() followed by get(block=False) is not atomic across threads
            if not MYQUEUE.empty():
                in_table = MYQUEUE.get(block=False)
                in_table_list = in_table.split(".")
                schema_name, table_name = in_table_list[0], in_table_list[1]

                load_option = " --database %s < %s/%s.sql" % (schema_name, self.dumpdir, in_table)
                myload_cmd_run = " ".join(MYCMD_NEW) + load_option
                try:
                    print "mypumpkin>> Loading in table [%s]: " % in_table
                    print "  " + myload_cmd_run
                    subprocess.check_output(myload_cmd_run, shell=True)  # , stderr=subprocess.STDOUT)
                    # the process's stderr (warnings and errors) is not captured, so it is printed
                except subprocess.CalledProcessError as e:
                    print "Error shell returncode %d: exit \n" % e.returncode
                    sys.exit(-1)
                time.sleep(0.3)
            else:
                print "mypumpkin>> databases and tables load thread finished"
                break

# Dump out with mysqldump; inherits from NewOptions
class MyDump(NewOptions):

    def handle_tables_options(self):
        dbname_list, tables_list, tables_tag = self.get_tables_opt()

        all_tables = defaultdict(list)
        if tables_tag == 'include-tab':  # [-B] db1 --table t1
            for st_name in tables_list:
                db_name, tb_name = st_name.split(".")
                all_tables[db_name].append(tb_name)
        elif tables_tag.startswith('db-'):  # -B db1 db2 (-A)
            all_tables = self.get_tables_from_db()  # get all tables from the db
            if len(dbname_list) != 0:  # not -A
                set_db_notexist = set(dbname_list) - set(all_tables.keys())
                if set_db_notexist:
                    print "Error: Db [%s] do not exist" % ",".join(set_db_notexist)
                    sys.exit(-1)
                for db_l in all_tables.keys():
                    if db_l not in dbname_list:
                        del all_tables[db_l]

            if tables_tag == 'db-exclude':  # db1 --ignore-table db1.t1, -B db1 [db2] --ignore-table (-A)
                for st_name in tables_list:
                    db_name, tb_name = st_name.split(".")
                    try:
                        all_tables[db_name].remove(tb_name)
                    except ValueError:
                        print "Table %s does not exist (or not in -B databases)." % st_name
                        sys.exit(-1)

        return all_tables

    def queue_mydump_tables(self):
        global MYQUEUE

        tables_dict = self.handle_tables_options()
        # print "table_dict: ", tables_dict

        for db, tabs in tables_dict.items():
            for tab in tabs:
                MYQUEUE.put("{0}.{1}".format(db, tab))

        print "mypumpkin>> tables waiting to dump out have queued"

    # When dbs are given for export, table names are looked up in the source's information_schema
    def get_tables_from_db(self):
        print "Go for target db to get all tables list..."

        dbinfo = self.get_dbinfo_cmd()

        try:
            if dbinfo[4] is not None:  # socket given
                conn = MySQLdb.Connect(host=dbinfo[0], user=dbinfo[1], passwd=dbinfo[2], port=dbinfo[3],
                                       unix_socket=dbinfo[4], connect_timeout=5)
            else:
                conn = MySQLdb.Connect(host=dbinfo[0], user=dbinfo[1], passwd=dbinfo[2], port=dbinfo[3], connect_timeout=5)
            cur = conn.cursor()

            sqlstr = "select table_schema, table_name from information_schema.tables where TABLE_TYPE = 'BASE TABLE' AND " \
                     "TABLE_SCHEMA not in('information_schema', 'performance_schema', 'sys')"
            # print "get tables:", sqlstr
            cur.execute(sqlstr)
        except MySQLdb.Error as e:
            print "Error mysql %d: %s" % (e.args[0], e.args[1])
            sys.exit(-1)

        res = cur.fetchall()
        cur.close()
        conn.close()

        dict_tables_db = defaultdict(list)
        for d, t in res:
            dict_tables_db[d].append(t)

        # print "db all tables: ", dict_tables_db
        return dict_tables_db

    # Called by get_tables_from_db above; parses only the login information
    def get_dbinfo_cmd(self):
        parser = ArgumentParser(description="Process some args", conflict_handler='resolve')

        parser.add_argument('-h', '--host', nargs=1, metavar='host1', help='Host to connect')
        parser.add_argument('-u', '--user', nargs=1, metavar='user1', help='User to connect')
        parser.add_argument('-p', '--password', nargs=1, metavar='yourpassword', help='Password for user1 to connect')
        # default must be a list because nargs=1 values are read back with [0]
        parser.add_argument('-P', '--port', nargs=1, metavar='port', type=int, default=[3306], help='Port for host to connect')
        parser.add_argument('-S', '--socket', nargs=1, metavar='socket', help='Socket address for host to connect')

        dbinfo_opt, _ = parser.parse_known_args(self.mycmd)

        db_host = dbinfo_opt.host[0]
        db_user = dbinfo_opt.user[0]
        db_pass = dbinfo_opt.password[0]
        db_port = dbinfo_opt.port[0]
        # unwrap the one-element list so a plain path is passed as unix_socket
        db_sock = dbinfo_opt.socket[0] if dbinfo_opt.socket is not None else None

        return db_host, db_user, db_pass, db_port, db_sock

    # def dump_out(self):
    def do_process(self):
        global MYQUEUE
        while True:
            # NOTE: empty() followed by get(block=False) is not atomic across threads
            if not MYQUEUE.empty():
                in_table = MYQUEUE.get(block=False)
                in_table_list = in_table.split(".")
                schema_name, table_name = in_table_list[0], in_table_list[1]

                dump_option = " %s --tables %s --result-file=%s/%s.sql" \
                              % (schema_name, table_name, self.dumpdir, in_table)
                mydump_cmd_run = " ".join(MYCMD_NEW) + dump_option

                try:
                    print "mypumpkin>> Dumping out table [%s]: " % in_table
                    print "  " + mydump_cmd_run
                    subprocess.check_output(mydump_cmd_run, shell=True)  # , stderr=subprocess.STDOUT)
                    # the process's stderr (warnings and errors) is not captured, so it is printed
                except subprocess.CalledProcessError as e:
                    print "Error shell returncode %d: exit \n" % e.returncode
                    sys.exit(-1)
                time.sleep(0.3)
            else:
                print "mypumpkin>> databases and tables dump thread finished"
                break


class myThread(Thread):
    def __init__(self, myprocess):
        Thread.__init__(self)
        self.myprocess = myprocess

    def run(self):
        # Consumer threads do not care which table's sql they get from the queue
        self.myprocess.do_process()


if __name__ == '__main__':
    mycmd = sys.argv
    my_process = NewOptions(mycmd)  # just for args check
    my_process = None

    # my_process = None
    if mycmd[1] == 'mysqldump':
        my_process = MyDump(mycmd)
        my_process.queue_mydump_tables()
    elif mycmd[1] == 'mysql':
        my_process = MyLoad(mycmd)
        my_process.queue_myload_tables()
    else:
        print "Only mysqldump or mysql allowed after mypumpkin.py\n"  # should never print
        sys.exit(-1)

    num_threads = my_process.threads

    print "mypumpkin>> number of threads: ", num_threads
    for i in range(num_threads):
        worker = myThread(my_process)
        # worker.setDaemon(True)
        worker.start()
        time.sleep(0.5)

--------------------------------------------------------------------------------
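
The queue-and-worker pattern that the README describes and `mypumpkin.py` implements can be sketched in isolation. This is only an illustrative sketch, not the author's code: it is written for Python 3 (the script itself targets Python 2.7), the names `run_workers` and `dump_command` are hypothetical, and it drains the queue with `get_nowait()` plus `queue.Empty` rather than the `empty()`/`get()` pair used in the script. The per-table arguments mirror the `--tables` and `--result-file` options that `MyDump.do_process` appends.

```python
import queue
import threading


def dump_command(base_cmd, db_table, dump_dir):
    """Build the per-table argument list (hypothetical helper).

    base_cmd is the user's mysqldump command as a list; db_table is 'db.table'.
    """
    db, table = db_table.split(".", 1)
    return base_cmd + [db, "--tables", table,
                       "--result-file=%s/%s.sql" % (dump_dir, db_table)]


def run_workers(items, handle, num_threads=2):
    """Process every item with num_threads threads sharing one queue."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                item = work.get_nowait()  # atomic: no separate empty() check
            except queue.Empty:
                return  # queue drained, this thread is done
            out = handle(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    base = ["mysqldump", "-h", "dbhost_name"]
    cmds = run_workers(["db1.t1", "db1.t2"],
                       lambda name: dump_command(base, name, "visit_dumpdir2"))
    for cmd in sorted(cmds):
        print(" ".join(cmd))
```

In a real run, `handle` would pass the built command to `subprocess`; here it only builds the argument list so the sketch can be tried without a database.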