├── .gitignore
├── Contributing.md
├── README.md
├── add_spaces.py
└── sample.md


/.gitignore:
--------------------------------------------------------------------------------
 1 | ##Python.gitignore##
 2 | 
 3 | # Byte-compiled / optimized / DLL files
 4 | __pycache__/
 5 | *.py[cod]
 6 | # C extensions
 7 | *.so
 8 | # Distribution / packaging
 9 | .Python
10 | env/
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | *.egg-info/
23 | .installed.cfg
24 | *.egg
25 | 
26 | # PyInstaller
27 | # Usually these files are written by a python script from a template
28 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
29 | *.manifest
30 | *.spec
31 | # Installer logs
32 | pip-log.txt
33 | pip-delete-this-directory.txt
34 | # Unit test / coverage reports
35 | htmlcov/
36 | .tox/
37 | .coverage
38 | .coverage.*
39 | .cache
40 | nosetests.xml
41 | coverage.xml
42 | *,cover
43 | # Translations
44 | *.mo
45 | *.pot
46 | # Django stuff:
47 | *.log
48 | # Sphinx documentation
49 | docs/_build/
50 | 
51 | # PyBuilder
52 | target/
53 | 


--------------------------------------------------------------------------------
/Contributing.md:
--------------------------------------------------------------------------------
 1 | Contributing
 2 | ================================================================================
 3 | 
 4 | If you would like to contribute to [add-spaces](https://github.com/robot527/add-spaces), please follow these steps:
 5 | 
 6 | 1. Click the [fork](https://github.com/login?return_to=%2Frobot527%2Fadd-spaces) button.
 7 | 
 8 | 2. `clone` the repository from your GitHub account with the following command:
 9 | 
10 |     ```
11 |     git clone git@github.com:your_github_username/add-spaces.git
12 |     ```
13 | 
14 | 3. Set the upstream repository (`upstream`) with the following command:
15 | 
16 |     ```
17 |     git remote add upstream https://github.com/robot527/add-spaces.git
18 |     ```
19 | 
20 | 4. Make your changes in the local repository.
21 | 
22 | 5. Back up your local changes, and **before every commit** sync with the upstream repository using the commands below:
23 | 
24 |     ```
25 |     git fetch upstream
26 |     git checkout master
27 |     git rebase upstream/master
28 |     ```
29 | 
30 | 6. Add your changes and commit them, then `push` to the remote repository.
31 | 
32 |     ```
33 |     git add your_changed_files
34 |     git commit -m "change log"
35 |     git push -u origin
36 |     ```
37 | 
38 | 7. Click the `New pull request` button to submit the changes on the master branch of your_github_username's `add-spaces` repository to robot527's `add-spaces` repository.
39 | 
40 | Thank you!
41 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
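 1 | Add sensible spaces between Chinese and English in text files
 2 | ==================================================
 3 | 
 4 | ## Typesetting requirements
 5 | 
 6 |   - [Add a space between Chinese and English](#add-a-space-between-chinese-and-english)
 7 |   - [Add a space between Chinese and numbers](#add-a-space-between-chinese-and-numbers)
 8 |   - [Add a space between numbers and units](#add-a-space-between-numbers-and-units)
 9 |   - [No space between full-width punctuation and other characters](#no-space-between-full-width-punctuation-and-other-characters)
10 | 
11 | ## Usage
12 | 
13 | ```
14 |     python add_spaces.py /path/to/file code  # code is the file encoding, e.g. gbk, utf8
15 |     # or let the script guess the encoding of the text file
16 |     python add_spaces.py /path/to/file
17 | ```
18 | 
19 | ## Changelog
20 | ### Updated: 2016-08-28
21 |   - Handle Chinese sentences that contain bold or italic English words
22 |   - Handle Chinese sentences that contain bold or italic Chinese words
23 | 
24 | ## Spacing
25 | 
26 | "Studies show that people who dislike putting a space between Chinese and English when they type tend to have a hard time in love: seventy percent of them will marry someone they do not love at the age of 34, and the remaining thirty percent end up leaving their inheritance to their cat. After all, both love and writing need white space in the right places.
27 | 
28 | Let this be an encouragement to us all." (from [vinta/paranoid-auto-spacing](https://github.com/vinta/pangu.js))
29 | 
30 | ### Add a space between Chinese and English
31 | 
32 | Correct:
33 | 
34 | > 在 LeanCloud 上，数据存储是围绕 `AVObject` 进行的。
35 | 
36 | Incorrect:
37 | 
38 | > 在LeanCloud上，数据存储是围绕`AVObject`进行的。
39 | 
40 | > 在 LeanCloud上，数据存储是围绕`AVObject` 进行的。
41 | 
42 | A complete correct example:
43 | 
44 | > 在 LeanCloud 上，数据存储是围绕 `AVObject` 进行的。每个 `AVObject` 都包含了与 JSON 兼容的 key-value 对应的数据。数据是 schema-free 的，你不需要在每个 `AVObject` 上提前指定存在哪些键，只要直接设定对应的 key-value 即可。
45 | 
46 | Exception: product names such as 「豆瓣FM」 are written in the form their owners officially define.
47 | 
48 | ### Add a space between Chinese and numbers
49 | 
50 | Correct:
51 | 
52 | > 今天出去买菜花了 5000 元。
53 | 
54 | Incorrect:
55 | 
56 | > 今天出去买菜花了 5000元。
57 | 
58 | > 今天出去买菜花了5000元。
59 | 
60 | ### Add a space between numbers and units
61 | 
62 | Correct:
63 | 
64 | > 我家的光纤入户宽带有 10 Gbps，SSD 一共有 20 TB。
65 | 
66 | Incorrect:
67 | 
68 | > 我家的光纤入户宽带有 10Gbps，SSD 一共有 20TB。
69 | 
70 | Exception: no space is added between a number and a degree sign or a percent sign:
71 | 
72 | Correct:
73 | 
74 | > 今天是 233° 的高温。
75 | 
76 | > 新 MacBook Pro 有 15% 的 CPU 性能提升。
77 | 
78 | Incorrect:
79 | 
80 | > 今天是 233 ° 的高温。
81 | 
82 | > 新 MacBook Pro 有 15 % 的 CPU 性能提升。
83 | 
84 | ### No space between full-width punctuation and other characters
85 | 
86 | Correct:
87 | 
88 | > 刚刚买了一部 iPhone，好开心！
89 | 
90 | Incorrect:
91 | 
92 | > 刚刚买了一部 iPhone ，好开心！
93 | 
94 | ## [Contributing](./Contributing.md)
95 | 
96 | ## References
97 | 
98 | - [Chinese Copywriting Guidelines (中文文案排版指北)](https://github.com/LCTT/TranslateProject/blob/master/%E4%B8%AD%E6%96%87%E6%8E%92%E7%89%88%E6%8C%87%E5%8C%97.md#%E4%B8%AD%E8%8B%B1%E6%96%87%E4%B9%8B%E9%97%B4%E9%9C%80%E8%A6%81%E5%A2%9E%E5%8A%A0%E7%A9%BA%E6%A0%BC)
99 | 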

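The Chinese/English and Chinese/number rules from README.md can be approximated with two regular expressions. The following is a minimal Python 3 sketch, not the repository's implementation (add_spaces.py targets Python 2 and walks the text character by character). The names `space_cjk_and_ascii`, `CJK_THEN_ASCII` and `ASCII_THEN_CJK` are hypothetical, and the sketch deliberately ignores backticks, brackets and other symbols.

```python
import re

# CJK Unified Ideographs in the same range that add_spaces.py checks
# (U+4E00 to U+9FA5). Full-width punctuation lies outside this range,
# so no space is ever added next to it.
CJK = r"\u4e00-\u9fa5"

# A CJK character directly followed by an ASCII letter or digit, and the reverse.
CJK_THEN_ASCII = re.compile(rf"([{CJK}])([A-Za-z0-9])")
ASCII_THEN_CJK = re.compile(rf"([A-Za-z0-9])([{CJK}])")


def space_cjk_and_ascii(text: str) -> str:
    """Insert one space at each boundary between CJK and ASCII letters or digits."""
    text = CJK_THEN_ASCII.sub(r"\1 \2", text)
    return ASCII_THEN_CJK.sub(r"\1 \2", text)


print(space_cjk_and_ascii("今天出去买菜花了5000元。"))
print(space_cjk_and_ascii("刚买了一部iPhone，好开心！"))
```

Because the patterns only match when the two characters are adjacent, running the function on text that is already spaced correctly leaves it unchanged.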

--------------------------------------------------------------------------------
/add_spaces.py:
--------------------------------------------------------------------------------
  1 | #! /usr/bin/python
  2 | # -*- coding: UTF-8 -*-
  3 | # author: robot527
  4 | # created at 2016-05-30
  5 | 
  6 | """
  7 | Automatically add sensible spaces between Chinese and English text.
  8 | """
  9 | 
 10 | def is_chinese(uni_ch):
 11 |     """Return True if the Unicode character is a Chinese character."""
 12 |     if uni_ch >= u'\u4e00' and uni_ch <= u'\u9fa5':
 13 |         return True
 14 |     else:
 15 |         return False
 16 | 
 17 | 
 18 | def isdigit(uni_ch):
 19 |     """Return True if the Unicode character is a decimal digit."""
 20 |     if uni_ch >= u'\u0030' and uni_ch <= u'\u0039':
 21 |         return True
 22 |     else:
 23 |         return False
 24 | 
 25 | def isalpha(uni_ch):
 26 |     """Return True if the Unicode character is an ASCII letter."""
 27 |     if (uni_ch >= u'\u0041' and uni_ch <= u'\u005a') \
 28 |         or (uni_ch >= u'\u0061' and uni_ch <= u'\u007a'):
 29 |         return True
 30 |     else:
 31 |         return False
 32 | 
 33 | 
 34 | def is_en_symbol(uni_ch):
 35 |     """Return True if the Unicode character is an English (half-width) symbol."""
 36 |     if uni_ch in [u':', u';', u'%', u'!', u'?', u'`', u'°', u'*', u'_',\
 37 |             u'<', u'=', u'>', u'"', u'$', u'&', u'\'', u',', u'.', u'~',\
 38 |             u'/', u'@', u'\\', u'^', u'|']:
 39 |         return True
 40 |     else:
 41 |         return False
 42 | 
 43 | 
 44 | def is_en_l_bracket(uni_ch):
 45 |     """Return True if the Unicode character is an English left bracket."""
 46 |     if uni_ch == u'(' or uni_ch == u'[':
 47 |         return True
 48 |     else:
 49 |         return False
 50 | 
 51 | 
 52 | def is_en_r_bracket(uni_ch):
 53 |     """Return True if the Unicode character is an English right bracket."""
 54 |     if uni_ch == u')' or uni_ch == u']':
 55 |         return True
 56 |     else:
 57 |         return False
 58 | 
 59 | 
 60 | def is_zh_l_bracket(uni_ch):
 61 |     """Return True if the Unicode character is a Chinese (full-width) left bracket."""
 62 |     if uni_ch == u'\uff08':
 63 |         return True
 64 |     else:
 65 |         return False
 66 | 
 67 | 
 68 | def is_zh_r_bracket(uni_ch):
 69 |     """Return True if the Unicode character is a Chinese (full-width) right bracket."""
 70 |     if uni_ch == u'\uff09':
 71 |         return True
 72 |     else:
 73 |         return False
 74 | 
 75 | 
 76 | def add_spaces_to_string(string, code):
 77 |     """Add sensible spaces to a string."""
 78 |     from re import sub
 79 |     newustr = ""
 80 |     flag = 0
 81 |     ustr = string.decode(code)
 82 |     ch_lst = list(ustr)
 83 |     length = len(ch_lst)
 84 |     for i in range(0, length):
 85 |         if i < length - 1:
 86 |             # Add a space between Chinese (or a bracket) and English (or a bracket)
 87 |             if (is_chinese(ch_lst[i]) and isalpha(ch_lst[i + 1])) \
 88 |                 or (isalpha(ch_lst[i]) and is_chinese(ch_lst[i + 1])):
 89 |                 ch_lst[i] += u" "
 90 |             elif (isalpha(ch_lst[i]) and is_zh_l_bracket(ch_lst[i + 1])) \
 91 |                 or (is_zh_r_bracket(ch_lst[i]) and isalpha(ch_lst[i + 1])):
 92 |                 ch_lst[i] += u" "
 93 |             elif (is_chinese(ch_lst[i]) and is_en_l_bracket(ch_lst[i + 1])) \
 94 |                 or (is_en_r_bracket(ch_lst[i]) and is_chinese(ch_lst[i + 1])):
 95 |                 ch_lst[i] += u" "
 96 |             # Add a space between Chinese and an English symbol
 97 |             elif (is_chinese(ch_lst[i]) and is_en_symbol(ch_lst[i + 1])) \
 98 |                 or (is_en_symbol(ch_lst[i]) and is_chinese(ch_lst[i + 1])):
 99 |                 ch_lst[i] += u" "
100 |                 flag = 1
101 |             # Add a space between Chinese (or a Chinese bracket) and a digit
102 |             elif (is_chinese(ch_lst[i]) and isdigit(ch_lst[i + 1]))\
103 |                 or (isdigit(ch_lst[i]) and is_chinese(ch_lst[i + 1])):
104 |                 ch_lst[i] += u" "
105 |             elif (isdigit(ch_lst[i]) and is_zh_l_bracket(ch_lst[i + 1]))\
106 |                 or (is_zh_r_bracket(ch_lst[i]) and isdigit(ch_lst[i + 1])):
107 |                 ch_lst[i] += u" "
108 | 
109 |         newustr += ch_lst[i]
110 |     if flag == 1:
111 |         # Restore the bold and italic markers used inside Chinese text
112 |         newustr = sub(u' \\* ', u'*', newustr)
113 |         newustr = sub(u' \\*\\* ', u'**', newustr)
114 |         newustr = sub(u' _ ', u'_', newustr)
115 |         newustr = sub(u' __ ', u'__', newustr)
116 |     # Apply the regex passes to the decoded text and encode last, so the
117 |     # patterns cannot match bytes of an encoded character (e.g. with hz).
118 |     return add_space_betw_digit_and_unit(newustr).encode(code)
119 | 
120 | 
121 | def add_space_betw_digit_and_unit(string):
122 |     """Add a space between a number and the unit that follows it."""
123 |     from re import sub
124 |     # Common units; the list is not exhaustive
125 |     units = ['bps', 'Kbps', 'Mbps', 'Gbps',
126 |             'B', 'KB', 'MB', 'GB', 'TB', 'PB',
127 |             'g', 'Kg', 't',
128 |             'h', 'm', 's']
129 |     for unit in units:
130 |         pattern = r'(?<=\d)' + unit  # positive lookbehind assertion: matches only
131 |                                      # if preceded by what follows '?<=' (a digit)
132 |         repl = ' ' + unit
133 |         string = sub(pattern, repl, string)
134 |     return string
135 | 
136 | 
137 | def add_spaces_to_file(file_name, code="gbk"):
138 |     """Add sensible spaces to the content of a text file and write the result to a new file."""
139 |     import os.path
140 |     dir_name = os.path.dirname(file_name)
141 |     base_name = os.path.basename(file_name)
142 |     if dir_name == '':
143 |         new_file = code + "-" + base_name
144 |     else:
145 |         new_file = dir_name + "/" + code + "-" + base_name
146 |     try:
147 |         with open(file_name) as text:
148 |             line_list = [add_spaces_to_string(line, code) \
149 |                             for line in text]
150 |     except UnicodeDecodeError as err:
151 |         return str(err)
152 |     except IOError as err:
153 |         return str(err)
154 |     try:
155 |         with open(new_file, "w") as nfile:
156 |             nfile.writelines(line_list)
157 |             print 'Finished adding spaces, generated new file: %s' % new_file
158 |             return 'Success.'
159 |     except IOError as err:
160 |         return str(err)
161 | 
162 | 
163 | if __name__ == '__main__':
164 |     import sys
165 |     argc = len(sys.argv)
166 |     codeset = ['gb2312', 'gbk', 'utf8', 'gb18030', 'hz',\
167 |                 'iso2022_jp_2', 'big5', 'big5hkscs']
168 |     if argc == 1:
169 |         print 'Usage: python add_spaces.py /path/to/file code(e.g. gbk, utf8)'
170 |         print '    or python add_spaces.py /path/to/file'
171 |     elif argc == 2:
172 |         # No encoding given: try each supported code until one succeeds.
173 |         for item in codeset:
174 |             if add_spaces_to_file(sys.argv[1], item) == 'Success.':
175 |                 print 'Processing completed.'
176 |                 break
177 |         else:
178 |             # The loop ended without a break, so every code failed.
179 |             print 'Failed to process %s with any supported code.' % sys.argv[1]
180 |     elif argc == 3:
181 |         if sys.argv[2] in codeset:
182 |             print add_spaces_to_file(sys.argv[1], sys.argv[2])
183 |         else:
184 |             print 'Parameter code (%s) error!' % sys.argv[2]
185 |             print 'Supported codes are ' + ', '.join(codeset)
186 |     else:
187 |         print 'Usage: python add_spaces.py /path/to/file code'
188 | 

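`add_space_betw_digit_and_unit` applies one lookbehind pattern per unit, so a single-letter unit such as `s` or `t` also matches inside ordinals like `1st`. A stricter variant might look like the Python 3 sketch below. It is an illustration rather than the script's behavior: `space_digit_and_unit` and `DIGIT_UNIT` are hypothetical names, and the trailing lookahead is an added assumption that also stops the pattern from splitting strings such as `100ms`, which the original happens to split.

```python
import re

# Unit list copied from add_space_betw_digit_and_unit; it is not exhaustive.
UNITS = ["bps", "Kbps", "Mbps", "Gbps",
         "B", "KB", "MB", "GB", "TB", "PB",
         "g", "Kg", "t",
         "h", "m", "s"]

# Longest units first is a general precaution, so a short unit can never
# shadow a longer one that shares its prefix.
_alternatives = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# (?<=\d)      the unit must directly follow a digit
# (?![A-Za-z]) the unit must not continue into a longer ASCII word
DIGIT_UNIT = re.compile(rf"(?<=\d)({_alternatives})(?![A-Za-z])")


def space_digit_and_unit(text: str) -> str:
    """Insert a space between a number and a known unit."""
    return DIGIT_UNIT.sub(r" \1", text)


print(space_digit_and_unit("我家的光纤入户宽带有10Gbps，SSD一共有20TB。"))
```

Degree and percent signs are not in the unit list, so `233°` and `15%` stay unspaced, which matches the exception described in README.md.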

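When no encoding is passed, the script tries each entry of `codeset` in order and keeps the first one that works. The same idea can be sketched in Python 3 on raw bytes. `guess_decode` and `CANDIDATE_CODECS` are hypothetical names, and the candidate order is an assumption: UTF-8 is tried first because it is strict and rarely accepts text written in another encoding, whereas a permissive codec placed early can "succeed" on the wrong data.

```python
# Hypothetical candidate order; adjust it to the files you expect.
CANDIDATE_CODECS = ["utf8", "gb18030", "big5hkscs"]


def guess_decode(raw: bytes, codecs=CANDIDATE_CODECS):
    """Return (text, codec) for the first codec that decodes raw without error."""
    for codec in codecs:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes; try the next one
    raise ValueError("none of the candidate codecs could decode the input")


text, codec = guess_decode("中文 text".encode("gbk"))
print(codec, text)
```

A successful decode only shows that the bytes are valid in that codec, not that the guess is right, so passing the encoding explicitly remains the safer option.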
--------------------------------------------------------------------------------
/sample.md:
--------------------------------------------------------------------------------
 1 | 这是一个被测试的**样例**文档
 2 | ===============================
 3 | 
 4 | 本文档使用**Markdown**语法编写，用作*add_spaces.py*脚本的测试样例。
 5 | 
 6 | 这一行里有__粗体__字，也有_斜体_字。  
 7 | 这一行里有_Italic_字，也有__Bold__字。
 8 | 
 9 | ## 样例代码：  
10 | 测试乘法表达式是否与*斜体*字被混淆处理。
11 | ```
12 | #! /usr/bin/python
13 | 
14 | b, c, d = 2, 3, 4
15 | a = b * c * 5 + d
16 | print a
17 | ```
18 | 
19 | ## 中英文测试样例：  
20 | 在LeanCloud上，数据存储是围绕`AVObject`进行的。每个`AVObject`都包含了与JSON兼容的key-value对应的数据。数据是schema-free的，你不需要在每个`AVObject`上提前指定存在哪些键，只要直接设定对应的key-value即可。  
21 | 今天出去买菜花了500元。  
22 | 我家的光纤入户宽带有10Gbps，SSD一共有10TB。  
23 | 今天是33°C的高温。  
24 | 刚买了一部iPhone，好开心！  
25 | 新MacBook Pro有15%的CPU性能提升。
26 | 
27 | 单词"you"由字母'y' 'o' 'u'组成。
28 | 
29 | C中定义了一些字母前加反斜杠"\"来表示常见的那些不能显示的ASCII字符，如\0,\t,\n等，就称为转义字符，因为后面的字符，都不是它本来的ASCII字符意思了。
30 | 双目运算符：&按位与，|按位或，^按位异或。
31 | 
32 | 

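sample.md exercises Markdown emphasis inside Chinese text. A naive spacing pass puts a space between a Chinese character and a neighboring `*` or `_`, which turns `**样例**` into ` ** 样例 ** `; add_spaces.py then removes those spaces again. A minimal Python 3 sketch of that cleanup step follows, with the hypothetical names `collapse_emphasis_spaces` and `SPACED_MARKER`.

```python
import re

# A space, an emphasis marker, and another space. The two-character markers
# are listed first so they are tried before the one-character ones.
SPACED_MARKER = re.compile(r" (\*\*|__|\*|_) ")


def collapse_emphasis_spaces(text: str) -> str:
    """Remove the spaces on both sides of an isolated emphasis marker."""
    return SPACED_MARKER.sub(r"\1", text)


print(collapse_emphasis_spaces("这是一个被测试的 ** 样例 ** 文档"))
```

The cleanup cannot tell emphasis from arithmetic: applied to `a = b * c * 5 + d` it would also remove the spaces around `*`. The script therefore runs it only on lines where it inserted a space between Chinese text and an English symbol, which leaves the code block in sample.md alone.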

--------------------------------------------------------------------------------