├── README.md
└── mypumpkin.py


/README.md:
--------------------------------------------------------------------------------
  1 | # The magic that turns mysqldump into a concurrent dump and load tool
  2 | 
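  3 | ## 1. Introduction
  4 | mypumpkin is a Python wrapper that makes mysqldump export databases and tables with multiple threads, and then loads them into a new database with multiple threads running the mysql command. It speeds up the export, and especially the import, several times over. All you need to do is put `mypumpkin.py` in front of the mysqldump or mysql command, hence the "magic".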
  5 | 
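  6 | The program grew out of a need to move several hundred GB of data in a single production database to a new database, with some special processing in between (such as character set conversion), where the import speed of mysqldump was intolerable. Some may ask why not use mydumper. I did try it but gave it up, for these reasons: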
  7 | 
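  8 | 1. The character set cannot be set
  9 | mydumper forces a binary connection to the database so that it does not have to care about character sets during backup and restore. My scenario, however, specifically requires exporting with one character set and importing with another. While writing this program I happened to see an article pushed by NetEase on their WeChat official account ([解密网易MySQL实例迁移高效完成背后的黑科技](http://mp.weixin.qq.com/s?__biz=MzI4NTA1MDEwNg==&mid=2650756926&idx=1&sn=b8081a8ae9456a6051d1ba519febee54&chksm=f3f9e2abc48e6bbd5912edb4e6207ff6ec5bf7123fedbf10b5c65a43146af22845dbf0787b39&scene=0#wechat_redirect)), which mentions that their improvements to mydumper already support setting the character set, but I still could not find this in the patch for version 0.9.1.
 10 | 2. No filtering options as flexible as mysqldump's (which tables to dump, which tables to ignore)
 11 | The data volume is huge, and nearly 70% of it is history tables that never change, so those tables can be exported and converted in advance. There are also a few single tables larger than 50G, which are best exported and converted one database at a time. mydumper does not have the flexibility of mysqldump here.
 12 | 3. Support for other options, such as skipping GTID information and triggers on export
 13 | Exporting from Alibaba Cloud RDS 5.6 requires setting set-gtid-purged=OFF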
 14 | 
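 15 | Some may also mention mysqlpump, which is what I think mysqldump should have looked like: compatible syntax and table-based concurrent export. But it is only supported by MySQL server 5.7.9 and above, and that is the gap between the ideal and reality...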
 16 | 
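 17 | ## 2. Implementation
 18 | First of all, mysqldump's export speed is not slow. In my tests it reached about 50 MB/s, so 10 GB of data took about 3 minutes. The bottleneck there is network and disk I/O, and no export tool could be much faster. The import, however, took 60 minutes while using only about 20% of the disk and network; the bottleneck is the write speed of the target database (and ordinary sequential writes do not reach the IOPS limit). That is why mypumpkin was born: to combine the import speed of myloader with the flexibility of mysqldump's export.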
 19 | 
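 20 | A Python queue is created and all tables to be exported are put into it at once. N Python threads are started at the same time; each takes table names from this Queue and uses subprocess to call the operating system's mysqldump command, exporting the data to a file named dbname.tablename.sql. Load in works like dump out: based on the specified database or table names, it finds all matching sql files in the dump_dir directory and pushes them onto the queue, and N threads concurrently build new mysql commands that emulate the `<` operation.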
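
The queue-and-workers pattern described above can be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: `run_concurrently`, `handle_table` and `record` are hypothetical names, the real work (calling mysqldump or mysql through subprocess) is replaced by a stand-in, and the sketch targets Python 3 (`queue` module), whereas the script itself is written for Python 2.7 (`Queue` module).

```python
import threading
from queue import Queue, Empty


def run_concurrently(table_names, handle_table, num_threads=2):
    """Fan table names out to num_threads workers through a shared queue."""
    work = Queue()
    for name in table_names:
        work.put(name)

    def worker():
        while True:
            try:
                # get_nowait() raises Empty once the queue is drained, so
                # there is no gap between "check" and "fetch" for another
                # thread to slip into
                name = work.get_nowait()
            except Empty:
                return
            handle_table(name)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


done = []
lock = threading.Lock()


def record(name):
    # Stand-in for the real work (mysqldump or mysql via subprocess)
    with lock:
        done.append(name + ".sql")


run_concurrently(["db1.t1", "db1.t2", "db2.t1"], record, num_threads=2)
print(sorted(done))
```

Catching `Empty` instead of testing `empty()` first is a design choice of this sketch: with several consumers, a queue that looked non-empty may already be drained by the time `get` runs.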
 21 | 
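 22 | Argument parsing went from hand-written parsing to the argparse module, which was almost a complete refactoring.
 23 | When `--tables` is not specified, the program queries the database for all table names and then filters them into the queue.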
 24 | 
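 25 | For loading into the target database, the options are as rich as for dump out: you can specify which dbs and which tables to import, and which tables to ignore.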
 26 | 
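 27 | The key point is compatibility with the original mysqldump: the table-related options (`-B`, `-A`, `--tables`, `--ignore-table=`) have to be analyzed and combined into new commands to execute, and there are a great many exceptional cases to consider.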
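
One way to separate the table-related options from everything else is argparse's `parse_known_args`, which returns the recognized options plus the list of arguments it did not recognize. The sketch below is a simplified illustration under that assumption; `split_options` is a hypothetical helper, and the option set is reduced compared with the script's parser.

```python
from argparse import ArgumentParser


def split_options(argv):
    """Return (table-related options, untouched pass-through arguments)."""
    parser = ArgumentParser(add_help=False)
    parser.add_argument('-B', '--databases', nargs='+')
    parser.add_argument('-A', '--all-databases', action='store_true')
    parser.add_argument('--tables', nargs='+')
    parser.add_argument('--ignore-table', nargs='+', action='append')
    parser.add_argument('--dump-dir', required=True)
    # Unknown arguments are not an error; they come back in the second value
    return parser.parse_known_args(argv)


argv = ['mysqldump', '-utest_user', '--single-transaction',
        '-B', 'db1', '--tables', 't1', 't2', '--dump-dir', 'visit_dumpdir2']
opts, passthrough = split_options(argv)
print(opts.databases, opts.tables, opts.dump_dir)
print(passthrough)
```

The pass-through list can then be joined with a per-table suffix to build each worker's command, which is why options unknown to the wrapper never need to be understood by it.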
 28 | 
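 29 | ## 3. Limitations
 30 | 1. **Important**: the exported data is not guaranteed to be consistent at the database level
 31 |   1. History tables that never change are not affected
 32 |   2. Consistency within a single table can be guaranteed; this depends on the options mysqldump itself is run with
 33 |   3. Different tables are exported by different mysqldump commands, so transactional consistency across them cannot be guaranteed.
 34 |   In my case, fellow developers helped by running a binlog parsing program that replays all changes after the load completes, to guarantee eventual consistency.
 35 |   Also, in many cases the data we export does not need to be complete or consistent; it is only for offline analysis or an ad hoc export, and the point is to get the data to developers quickly.
 36 | 2. Recognition of unusual options
 37 | The program tries hard to be compatible with the mysqldump command: just add mypumpkin.py and specify dump-dir to get the concurrency magic. However, some argument forms are hard to parse, and the following format is not supported for now: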
 38 | ```
 39 | db1 table1 table2
 40 | db2 db3
 41 | ```
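 42 | That is, with the forms above the command line alone cannot tell whether db1 and table1 are database names or table names. When using the tool, just remember that at least one of the three groups "[-A|-B], [--tables], [--ignore-table]" must appear: change `db1 table1 table2` to `db1 --tables table1 table2`, and `db2 db3` to `-B db2 db3`.
 43 | 3. For now the password can only be given explicitly on the command line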
 44 | 
 45 | ## 4. Usage
 46 | Developed on Python 2.7; other versions are untested. The MySQLdb library must be installed.
 47 | 
 48 | ### 4.1 help
 49 | ```
 50 | ./mypumpkin.py --help
 51 | Only mysqldump or mysql allowed after mypumpkin.py
 52 | 
 53 | usage: mypumpkin.py {mysqldump|mysql} [--help]
 54 | 
 55 | This's a program that wrap mysqldump/mysql to make them dump-out/load-in
 56 | concurrently. Attention: it can not keep consistent for whole database(s).
 57 | 
 58 | optional arguments:
 59 |   --help                show this help message and exit
 60 |   -B db1 [db1 ...], --databases db1 [db1 ...]
 61 |                         Dump one or more databases
 62 |   -A, --all-databases   Dump all databases
 63 |   --tables t1 [t1 ...]  Specify tables to dump. Override --databases (-B)
 64 |   --ignore-table db1.table1 [db1.table1 ...]
 65 |                         Do not dump the specified table. (format like
 66 |                         --ignore-table=dbname.tablename). Use the directive
 67 |                         multiple times for more than one table to ignore.
 68 |   --threads =N          Threads to dump out [2], or load in [CPUs*2].
 69 |   --dump-dir DUMP_DIR   Required. Directory to dump out (create if not exist),
 70 |                         Or Where to load in sqlfile
 71 | 
 72 | At least one of these 3 group options given: [-A,-B] [--tables] [--ignore-table]
 73 | ```
 74 | 
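 75 | - `--dump-dir`: required. The shell standard input/output redirections `>` and `<` used originally are not allowed. When dumping out, if the directory given by dump-dir does not exist, the program tries to create it automatically.
 76 | - `--threads=N`: N is the number of concurrent export or import threads. Dump out defaults to 2 threads; load in defaults to the number of CPUs * 2.
 77 | 	Note: more threads is not always better. The main metrics to weigh here are network bandwidth, disk I/O and target database IOPS; it is best to watch them with dstat.
 78 | - `-B`, `--tables`, `--ignore-table`: used the same way as in mysqldump, for example:  
 79 |   1. In mysqldump, `--tables` overrides the `--databases/-B` option
 80 |   2. In mysqldump, `--tables` and `--ignore-table` cannot appear together
 81 |   3. In mysqldump, if `-B` is not specified, `--tables` or `--ignore-table` must immediately follow the db name
 82 | - Other options are kept untouched by mypumpkin and passed to the shell for execution. So if any other option is wrong, the check is left to native mysqldump, and a thread exits as soon as one command fails during execution.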
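
The "leave validation to the native tool" behaviour in the last bullet can be illustrated with a small sketch. It is an assumption-laden example, not the script's code: `run_passthrough` is a hypothetical helper, the Python interpreter stands in for mysqldump so the example runs anywhere, and the command is passed as an argument list, whereas the script joins the arguments into one string and runs it with `shell=True`.

```python
import subprocess
import sys


def run_passthrough(cmd_args):
    """Run one command; return (ok, returncode) instead of raising."""
    try:
        subprocess.check_output(cmd_args)
        return True, 0
    except subprocess.CalledProcessError as e:
        # A non-zero exit status from the child is the only error signal
        # the wrapper needs; the child prints its own diagnostics
        return False, e.returncode


# The interpreter stands in for mysqldump/mysql in this sketch
ok_cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
bad_cmd = [sys.executable, "-c", "import sys; sys.exit(2)"]
print(run_passthrough(ok_cmd))
print(run_passthrough(bad_cmd))
```

A worker thread can stop pulling from the queue as soon as `ok` is false, which matches the "exit the thread on the first failure" behaviour described above.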
 83 | 
 84 | ### 4.2 example
 85 | Dump out:
 86 | ```
 87 | ## Dump all dbs of the source database to the visit_dumpdir2 directory (excluding information_schema and performance_schema)
 88 | $ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 89 |  --single-transaction --opt -A --dump-dir visit_dumpdir2
 90 | 
 91 | ## Dump db1 and db2 from the source database; all table names are queried from the source database for filtering
 92 | $ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 93 |  --single-transaction --opt -B db1 db2 --dump-dir visit_dumpdir2
 94 | 
 95 | ## Dump only tables t1 and t2 of db1; a message is shown if a specified table does not exist
 96 | $ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
 97 |  --single-transaction --opt -B db1 --tables t1 t2 --dump-dir visit_dumpdir2
 98 | 
 99 | ## Dump db1 and db2, but ignore tables db1.t1, db2.t2 and db2.t3 (use -A instead of -B for all dbs)
100 | ## mysqldump only supports the --ignore-table=db1.t1 form, repeated once per table. A compatible extension is added here
101 | $ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword --single-transaction \
102 |  --opt -B db1 db2 --ignore-table=db1.t1 --ignore-table db2.t2 db2.t3 --dump-dir visit_dumpdir2
103 | 
104 | ## Without -A/-B
105 | $ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
106 |  --single-transaction --opt db1 --ignore-table=db1.t1 --dump-dir=visit_dumpdir2
107 | 
108 | ## Other options are passed through untouched
109 | $ ./mypumpkin.py mysqldump -h dbhost_name -utest_user -pyourpassword -P3306 \
110 |  --single-transaction --set-gtid-purged=OFF --no-set-names --skip-add-locks -e -q -t -n --skip-triggers \
111 |  --max-allowed-packet=134217728 --net-buffer-length=1638400 --default-character-set=latin1 \
112 |  --insert-ignore --hex-blob --no-autocommit \
113 |  db1 --tables t1 --dump-dir visit_dumpdir2
114 | ```
115 | 
116 | Load in:  
117 | `-A`, `-B`, `--tables`, `--ignore-table`, `--threads` and `--dump-dir` are used exactly as above and have the same effect. A few examples:
118 | 
119 | ```
120 | ## Load all tables found in the dump-dir directory
121 | $ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 -A \
122 |  --dump-dir=visit_dumpdir2
123 | 
124 | ## Load db1 (all tables)
125 | $ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 -B db1 \
126 |  --dump-dir=visit_dumpdir2
127 | 
128 | ## Load only table db1.t1
129 | $ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 \
130 |  --default-character-set=utf8mb4 --max-allowed-packet=134217728 --net-buffer-length=1638400 \
131 |  -B db1 --tables t1 --dump-dir=visit_dumpdir2
132 | 
133 | ## Load db1 and db2, but ignore table db1.t1 (the dump-dir directory is checked for the corresponding db1/db2 tables; the target database is not checked)
134 | $ ./mypumpkin.py mysql -h dbhost_name -utest_user -pyourpassword --port 3307 \
135 |  -B db1 db2 --ignore-table=db1.t1 --dump-dir=visit_dumpdir2
136 | ```
137 | 
138 | ## 5. Speed comparison
139 | 


--------------------------------------------------------------------------------
/mypumpkin.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python
  2 | #coding:utf-8
  3 | """
  4 | Author:     seanlook
  5 | Contact:    seanlook7@gmail http://seanlook.com
  6 | Date:       2016-11-02 released
  7 | """
  8 | import sys, os
  9 | from Queue import Queue
 10 | import time
 11 | import subprocess
 12 | import MySQLdb
 13 | from threading import Thread
 14 | from collections import defaultdict
 15 | # import argparse
 16 | from argparse import ArgumentParser
 17 | from multiprocessing import cpu_count
 18 | 
 19 | MYCMD_NEW = []  # handled mysqldump/load
 20 | MYQUEUE = Queue()
 21 | 
 22 | 
 23 | """
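 24 | Parent class for option parsing
 25 | MYCMD_NEW is what remains after the options are handled; it is later combined with the table names being processed
 26 | mycmd is the complete mysqldump or mysql command line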
 27 | """#
 28 | class NewOptions(object):
 29 |     def __init__(self, mycmd):
 30 |         global MYCMD_NEW
 31 |         self.mycmd = mycmd
 32 | 
 33 |         # Check that the command that follows is valid
 34 |         work_mode = ''
 35 |         try:
 36 |             if mycmd[1] == 'mysqldump':
 37 |                 work_mode = 'DUMP'
 38 |             elif mycmd[1] == 'mysql':
 39 |                 work_mode = 'LOAD'
 40 |             else:
 41 |                 print "Only mysqldump or mysql allowed after mypumpkin.py\n"
 42 |                 # myparser will do the next
 43 |         except IndexError:
 44 |             pass
 45 |             #help_parser = self.parse_myopt()
 46 |             #ArgumentParser.error(help_parser, help_parser.print_help())
 47 | 
 48 |         myparser = self.parse_myopt(work_mode)
 49 | 
 50 |         self.myopts, MYCMD_NEW = myparser.parse_known_args(mycmd)
 51 |         print "myparse options handling tables&dbs: ", self.myopts
 52 | 
 53 |         self.threads = self.myopts.threads[0]
 54 |         self.dumpdir = self.get_dumpdir(work_mode)
 55 | 
 56 |     # Get the directory specified for dump out or load in
 57 |     def get_dumpdir(self, work_mode):
 58 |         dump_dir = self.myopts.dump_dir[0]
 59 |         if dump_dir == "":
 60 |             print "You must specifiy --dump-dir=xxx. (not support '>')"
 61 |             sys.exit(-1)
 62 |         elif not os.path.exists(dump_dir):
 63 |             if work_mode == 'DUMP':
 64 |                 print "The specified dump-dir %s does not exist, the program will try to create it for you." % dump_dir
 65 |                 try:
 66 |                     os.makedirs(dump_dir)
 67 |                 except:
 68 |                     print "创建目录 %s 失败" % dump_dir
 69 |                     sys.exit(-1)
 70 |             elif work_mode == 'LOAD':
 71 |                 print "The specified dump-dir %s does not exist" % dump_dir
 72 |                 sys.exit(-1)
 73 |         return dump_dir
 74 | 
 75 |     # Define and return the argument parser object; the init method uses it to parse the command line
 76 |     def parse_myopt(self, work_mode=''):
 77 |         parser = ArgumentParser(description="This's a program that wrap mysqldump/mysql to make them dump-out/load-in concurrently.\n"
 78 |                                             "Attention: it can not keep consistent for whole database(s).",
 79 |                                 add_help=False,
 80 |                                 usage='%(prog)s {mysqldump|mysql} [--help]',
 81 |                                 epilog="At least one of these 3 group options given: [-A,-B] [--tables] [--ignore-table]")  # , allow_abbrev=False)
 82 |         group1 = parser.add_mutually_exclusive_group()
 83 |         group2 = parser.add_mutually_exclusive_group()
 84 | 
 85 |         # Load defaults to twice the number of CPU cores for concurrent threads; dump defaults to 2
 86 |         num_threads = cpu_count() * 2
 87 |         if work_mode == 'DUMP':
 88 |             num_threads = 2
 89 |         # parser.add_argument('mysql_cmd', choices=['mysqldump', 'mysql'])
 90 |         parser.add_argument('--help', action='help', help='show this help message and exit')
 91 | 
 92 |         group1.add_argument('-B', '--databases', nargs='+', metavar='db1', help='Dump one or more databases')
 93 |         group1.add_argument('-A', '--all-databases', action='store_true', help='Dump all databases')
 94 |         group2.add_argument('--tables', nargs='+', metavar='t1',
 95 |                             help='Specify tables to dump. Override --databases (-B)')
 96 |         group2.add_argument('--ignore-table', nargs='+', metavar='db1.table1', action='append',
 97 |                             help='Do not dump the specified table. (format like --ignore-table=dbname.tablename). '
 98 |                                  'Use the directive multiple times for more than one table to ignore.')
 99 |         parser.add_argument('--threads', nargs=1, metavar='=N', default=[num_threads], type=int, help='Threads to dump out [2], or load in [CPUs*2].')
100 |         parser.add_argument('--dump-dir', nargs=1, required=True, action='store', help='Required. Directory to dump out (create if not exist), Or Where to load in sqlfile')
101 | 
102 |         # print parser.parse_args(mydump_cmd[2:])
103 |         return parser  # .parse_args()
104 | 
105 |     """
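106 |     Analyze -A, -B, --tables, --ignore-table
107 |     Returns, as parsed from the command line, the dbs to process ([] means no database was named, i.e. -A) and the tables to process (tables_tag tells whether they are included or ignored)
108 |     This method is only called from the subclasses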
109 |     """
110 |     def get_tables_opt(self):
111 |         global MYCMD_NEW
112 | 
113 |         print "Start to handle your table relevant options..."
114 |         opt_dbs = self.myopts.databases
115 |         opt_is_alldbs = self.myopts.all_databases
116 |         opt_tables = self.myopts.tables
117 |         opt_ignores = self.myopts.ignore_table
118 | 
119 |         len_dbs = len(opt_dbs) if opt_dbs is not None else 0
120 |         len_alldbs = 1 if opt_is_alldbs else 0
121 |         len_tables = len(opt_tables) if opt_tables is not None else 0
122 |         len_ignores = len(opt_ignores) if opt_ignores is not None else 0
123 | 
124 |         """ 5 cases
125 |         1. -B db1 db2  or -A
126 |         2. -B db1 --tables t1 t2
127 |         3. -B db1 db2 --ignore-table db1.t1 db1.t2 --ignore-table db2.t1 db2.t2  or -A --ignore...
128 |         4. db1 --ignore-table=db1.t1 --ignore-table=db1.t2
129 |         5. db1 --tables t1 t2
130 | 
131 |         db1 t1 t2  not support
132 |         db1 not support
133 |         At least one of --tables, -B, --ignore-table must appear
134 |         Only one of --tables and --ignore-table may appear
135 |         Only one of -A and -B may appear
136 |         --tables, --ignore-table must immediately follow the implicit db
137 |         """
138 | 
139 |         if len_tables + len_ignores + len_dbs + len_alldbs == 0:
140 |             print "Error: at least one of [--tables, --ignore-table, -B, -A] must be specified!"
141 |             sys.exit(-1)
142 | 
143 |         tables_handler = []  # --tables, --ignore-table, --B d1 d2    dbname.*
144 |         dbname_list = []
145 |         tables_tag = 'db-include'  # ignore-table  databases  all-databases
146 | 
147 |         if (len_alldbs > 0 or len_dbs > 1) and len_tables > 0:
148 |             print "Error: --tables can only be specified with one database"
149 |             sys.exit(-1)
150 |         elif len_dbs + len_alldbs == 0:  # cases 4 and 5: no db explicitly specified
151 |             for table_opt in self.mycmd:
152 |                 if table_opt.startswith('--tables') or table_opt.startswith('--ignore-table'):
153 |                     pos_table_opt = self.mycmd.index(table_opt)
154 |                     pos_dbname = pos_table_opt - 1
155 |                     dbname = self.mycmd[pos_dbname]
156 | 
157 |                     if dbname.startswith('-'):
158 |                         print "Error: Please give the right database name"
159 |                         sys.exit(-1)
160 |                     else:
161 |                         dbname_list = [dbname]
162 |                         MYCMD_NEW.remove(dbname)
163 | 
164 |                     break
165 |         else:
166 |             # tables_tag = 'include'
167 |             if opt_dbs is not None:
168 |                 dbname_list = opt_dbs
169 |             elif opt_is_alldbs:
170 |                 dbname_list = []
171 |             else:
172 |                 print "no right databases given. this should never be printed"
173 |         print "mypumpkin>> This is the databases detected: ", dbname_list
174 | 
175 |         if opt_tables is not None:  # cases 5 and 2
176 |             for tab in opt_tables:
177 |                 tables_handler.append(dbname_list[0] + "." + tab)
178 |             tables_tag = 'include-tab'
179 |         elif opt_ignores is not None:  # cases 4 and 3
180 |             for tabs in opt_ignores:
181 |                 for db_tab in tabs:
182 |                     tables_handler.append(db_tab)
183 |             tables_tag = 'db-exclude'
184 |         print "mypumpkin>> This is the tables (%s) detected: %s" %(tables_tag, tables_handler)
185 | 
186 |         MYCMD_NEW = MYCMD_NEW[1:]  # strip the outer wrapper (the script name)
187 |         # print "MYCMD_NEW ready:", MYCMD_NEW
188 |         return dbname_list, tables_handler, tables_tag
189 | 
190 | # Load in with mysql; inherits from NewOptions
191 | class MyLoad(NewOptions):
192 |     # Call the parent's get_tables_opt and look for existing sql files in dumpdir to get the final tables to import (a dict)
193 |     def handle_tables_options(self):
194 |         dbname_list, tables_list, tables_tag = self.get_tables_opt()
195 | 
196 |         # all_tables_os holds all dbs and tables found in dumpdir on the operating system
197 |         all_tables_os = defaultdict(list)
198 |         for dirName, subdirList, fileList in os.walk(self.dumpdir):
199 |             for fname in fileList:
200 |                 fname_list = fname.split(".")
201 |                 if fname_list[-1] == "sql":
202 |                     schema_name, table_name = fname_list[0], fname_list[1]
203 |                     all_tables_os[schema_name].append(table_name)
204 |         # print "all_tables_os: ", all_tables_os
205 | 
206 |         if tables_tag == 'include-tab':  # [-B] db1 --table t1
207 |             all_tables = defaultdict(list)
208 |             for st_name in tables_list:
209 |                 db_name, tb_name = st_name.split(".")
210 |                 if tb_name in all_tables_os[db_name]:
211 |                     all_tables[db_name].append(tb_name)
212 |                 else:
213 |                     print "Error: can not find dumped file for table [%s]" % st_name
214 |                     sys.exit(-1)
215 |             all_tables_os = all_tables  # include
216 |         elif tables_tag.startswith('db-'):  # -B db1 db2 (-A)
217 |             # all_tables = self.get_tables_from_db()  # get all tables from the db
218 |             if len(dbname_list) != 0:  # not -A
219 |                 set_db_notexist = set(dbname_list) - set(all_tables_os.keys())
220 |                 if set_db_notexist:
221 |                     print "Error: Db [%s] was not dumped" % ",".join(set_db_notexist)
222 |                     sys.exit(-1)
223 |                 for db_l in all_tables_os.keys():
224 |                     if db_l not in dbname_list:
225 |                         del all_tables_os[db_l]  # remove dbs not specified by -B
226 | 
227 |             if tables_tag == 'db-exclude':  # db1 --ignore-table db1.t1,  -B db1 [db2] --ignore-table (-A)
228 |                 for st_name in tables_list:
229 |                     db_name, tb_name = st_name.split(".")
230 |                     try:
231 |                         all_tables_os[db_name].remove(tb_name)
232 |                     except ValueError:
233 |                         print "Error: can not get ignored table [%s] from dumped directory [%s] " % (st_name, self.dumpdir)
234 |                         sys.exit(-1)
235 | 
236 |         return all_tables_os
237 | 
238 |     # Put the result of handle_tables_options into the global queue
239 |     def queue_myload_tables(self):
240 |         global MYQUEUE
241 | 
242 |         tables_dict = self.handle_tables_options()
243 |         # print "Tables to load: ", tables_dict
244 | 
245 |         for db, tabs in tables_dict.items():
246 |             for tab in tabs:
247 |                 MYQUEUE.put("{0}.{1}".format(db, tab))
248 | 
249 |         print "mypumpkin>> tables waiting to load in have queued"
250 | 
251 |     """
252 |     Take a table name from the queue and start a process on the OS to load it in
253 |     This method is called in a loop by multiple threads
254 |     """
255 |     def do_process(self):
256 |         global MYQUEUE
257 |         while True:
258 |             if not MYQUEUE.empty():
259 |                 in_table = MYQUEUE.get(block=False)
260 |                 in_table_list = in_table.split(".")
261 |                 schema_name, table_name = in_table_list[0], in_table_list[1]
262 | 
263 |                 load_option = " --database %s < %s/%s.sql" % (schema_name, self.dumpdir, in_table)
264 |                 myload_cmd_run = " ".join(MYCMD_NEW) + load_option
265 |                 try:
266 |                     print "mypumpkin>> Loading in table [%s]: " % in_table
267 |                     print "  " + myload_cmd_run
268 |                     subprocess.check_output(myload_cmd_run, shell=True)  # , stderr=subprocess.STDOUT)
269 |                     # All process output, including warnings and errors, is printed
270 |                 except subprocess.CalledProcessError as e:
271 |                     print "Error shell returncode %d: exit \n" % e.returncode
272 |                     sys.exit(-1)
273 |                 time.sleep(0.3)
274 |             else:
275 |                 print "mypumpkin>> databases and tables load thread finished"
276 |                 break
277 | 
278 | class MyDump(NewOptions):
279 | 
280 |     def handle_tables_options(self):
281 |         dbname_list, tables_list, tables_tag = self.get_tables_opt()
282 | 
283 |         all_tables = defaultdict(list)
284 |         if tables_tag == 'include-tab':  # [-B] db1 --table t1
285 |             for st_name in tables_list:
286 |                 db_name, tb_name = st_name.split(".")
287 |                 all_tables[db_name].append(tb_name)
288 |         elif tables_tag.startswith('db-'):  # -B db1 db2 (-A)
289 |             all_tables = self.get_tables_from_db()  # get all tables from the db
290 |             if len(dbname_list) != 0:  # not -A
291 |                 set_db_notexist = set(dbname_list) - set(all_tables.keys())
292 |                 if set_db_notexist:
293 |                     print "Error: Db [%s] does not exist" % ",".join(set_db_notexist)
294 |                     sys.exit(-1)
295 |                 for db_l in all_tables.keys():
296 |                     if db_l not in dbname_list:
297 |                         del all_tables[db_l]
298 | 
299 |             if tables_tag == 'db-exclude':  # db1 --ignore-table db1.t1,  -B db1 [db2] --ignore-table (-A)
300 |                 for st_name in tables_list:
301 |                     db_name, tb_name = st_name.split(".")
302 |                     try:
303 |                         all_tables[db_name].remove(tb_name)
304 |                     except ValueError:
305 |                         print "Table %s does not exist (or not in -B databases)." % st_name
306 |                         sys.exit(-1)
307 | 
308 |         return all_tables
309 | 
310 |     def queue_mydump_tables(self):
311 |         global MYQUEUE
312 | 
313 |         tables_dict = self.handle_tables_options()
314 |         # print "table_dict: ", tables_dict
315 | 
316 |         for db, tabs in tables_dict.items():
317 |             for tab in tabs:
318 |                 MYQUEUE.put("{0}.{1}".format(db, tab))
319 | 
320 |         print "mypumpkin>> tables waiting to dump out have queued"
321 | 
322 |     # When dumping with DBs specified, table names must be looked up in the source database's information_schema
323 |     def get_tables_from_db(self):
324 |         print "Go for target db to get all tables list..."
325 | 
326 |         dbinfo = self.get_dbinfo_cmd()
327 | 
328 |         try:
329 |             if dbinfo[4] is not None:  # socket given
330 |                 conn = MySQLdb.Connect(host=dbinfo[0], user=dbinfo[1], passwd=dbinfo[2], port=dbinfo[3],
331 |                                        unix_socket=dbinfo[4], connect_timeout=5)
332 |             else:
333 |                 conn = MySQLdb.Connect(host=dbinfo[0], user=dbinfo[1], passwd=dbinfo[2], port=dbinfo[3], connect_timeout=5)
334 |             cur = conn.cursor()
335 | 
336 |             sqlstr = "select table_schema, table_name from information_schema.tables where TABLE_TYPE = 'BASE TABLE' AND " \
337 |                      "TABLE_SCHEMA not in('information_schema', 'performance_schema', 'sys')"
338 |             # print "get tables:", sqlstr
339 |             cur.execute(sqlstr)
340 |         except MySQLdb.Error, e:
341 |             print "Error mysql %d: %s" % (e.args[0], e.args[1])
342 |             sys.exit(-1)
343 | 
344 |         res = cur.fetchall()
345 |         cur.close()
346 |         conn.close()
347 | 
348 |         dict_tables_db = defaultdict(list)
349 |         for d, t in res:
350 |             dict_tables_db[d].append(t)
351 | 
352 |         # print "db all tables: ", dict_tables_db
353 |         return dict_tables_db
354 | 
355 |     # Called by get_tables_from_db above; parses the login information separately
356 |     def get_dbinfo_cmd(self):
357 |         parser = ArgumentParser(description="Process some args", conflict_handler='resolve')
358 | 
359 |         parser.add_argument('-h', '--host', nargs=1, metavar='host1', help='Host to connect')
360 |         parser.add_argument('-u', '--user', nargs=1, metavar='user1', help='User to connect')
361 |         parser.add_argument('-p', '--password', nargs=1, metavar='yourpassword', help='Password for user1 to connect')
362 |         parser.add_argument('-P', '--port', nargs=1, metavar='port', type=int, default=[3306], help='Port for host to connect')
363 |         parser.add_argument('-S', '--socket', nargs=1, metavar='socket', help='Socket address for host to connect')
364 | 
365 |         dbinfo_opt, _ = parser.parse_known_args(self.mycmd)
366 | 
367 |         db_host = dbinfo_opt.host[0]
368 |         db_user = dbinfo_opt.user[0]
369 |         db_pass = dbinfo_opt.password[0]
370 |         db_port = dbinfo_opt.port[0]
371 |         db_sock = dbinfo_opt.socket[0] if dbinfo_opt.socket else None  # nargs=1 yields a list
372 | 
373 |         return db_host, db_user, db_pass, db_port, db_sock
374 | 
375 |     # def dump_out(self):
376 |     def do_process(self):
377 |         global MYQUEUE
378 |         while True:
379 |             if not MYQUEUE.empty():
380 |                 in_table = MYQUEUE.get(block=False)
381 |                 in_table_list = in_table.split(".")
382 |                 schema_name, table_name = in_table_list[0], in_table_list[1]
383 | 
384 |                 dump_option = " %s --tables %s --result-file=%s/%s.sql" \
385 |                               % (schema_name, table_name, self.dumpdir, in_table)
386 |                 mydump_cmd_run = " ".join(MYCMD_NEW) + dump_option
387 | 
388 |                 try:
389 |                     print "mypumpkin>> Dumping out table [%s]: " % in_table
390 |                     print "  " + mydump_cmd_run
391 |                     subprocess.check_output(mydump_cmd_run, shell=True)  # , stderr=subprocess.STDOUT)
392 |                     # All process output, including warnings and errors, is printed
393 |                 except subprocess.CalledProcessError as e:
394 |                     print "Error shell returncode %d: exit \n" % e.returncode
395 |                     sys.exit(-1)
396 |                 time.sleep(0.3)
397 |             else:
398 |                 print "mypumpkin>> databases and tables dump thread finished"
399 |                 break
400 | 
401 | 
402 | class myThread(Thread):
403 |     def __init__(self, myprocess):
404 |         Thread.__init__(self)
405 |         self.myprocess = myprocess
406 | 
407 |     def run(self):
408 |         # Consumer threads do not care which table's sql is in the queue
409 |         self.myprocess.do_process()
410 | 
411 | 
412 | if __name__ == '__main__':
413 |     mycmd = sys.argv
414 |     my_process = NewOptions(mycmd)  # just for args check
415 |     my_process = None
416 | 
417 |     # my_process = None
418 |     if mycmd[1] == 'mysqldump':
419 |         my_process = MyDump(mycmd)
420 |         my_process.queue_mydump_tables()
421 |     elif mycmd[1] == 'mysql':
422 |         my_process = MyLoad(mycmd)
423 |         my_process.queue_myload_tables()
424 |     else:
425 |         print "Only mysqldump or mysql allowed after mypumpkin.py\n"  # should never print
426 |         sys.exit(-1)
427 | 
428 |     num_threads = my_process.threads
429 | 
430 |     print "mypumpkin>> number of threads: ", num_threads
431 |     for i in range(num_threads):
432 |         worker = myThread(my_process)
433 |         # worker.setDaemon(True)
434 |         worker.start()
435 |         time.sleep(0.5)
436 | 


--------------------------------------------------------------------------------