├── README.md
└── cnenlinter
    ├── MANIFEST.in
    ├── README.md
    ├── cnenlinter.py
    ├── rules.yml
    └── setup.py


/README.md:
--------------------------------------------------------------------------------
  1 | # Key Points for Mixing Simplified Chinese and Western Text in Markdown
  2 | 
  3 | **Version: 0.3**
  4 | 
  5 | 李笑来 (Li Xiaolai), 2019/04
  6 | 
  7 | ---
  8 | 
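  9 | The title of this document deliberately says "**key points**" rather than "specification", because many of these points are still widely disputed.
 10 | 
 11 | Even so, to keep editing consistent and to make reading easier, every document in the 《成长时代》 (selfteaching.com) repository should follow the points below as closely as possible.
 12 | 
 13 | The points below were compiled for writing in Markdown. Markdown files are plain text, and they eventually get converted to HTML or to another format such as PDF, that is, to a formatted document rendered for easier reading.
 14 | 
 15 | This document gives no writing-style advice; it covers only **formatting and typesetting requirements**. For example, "when describing how much a value has changed, do not write ‘降低了 n 倍’ (decreased n-fold), because ‘降低 1 倍’ means the value was `100` and is now `0`; state the decrease as a percentage instead" is a matter of writing style. By contrast, "no space between a number and its unit, or between a currency symbol and the number, as in `75kg`, `$85`, `25%`" is a formatting requirement.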
 16 | 
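 17 | ## Common Punctuation Marks
 18 | 
 19 | In mixed Chinese-English text, use full-width punctuation everywhere except inside complete English sentences or paragraphs.
 20 | 
 21 | The common full-width Chinese punctuation marks are: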
 22 | 
 23 | |       Name        |         Mark         |                               Notes                               |
 24 | | ----------------- | -------------------- | ----------------------------------------------------------------- |
 25 | | Full stop         | `。`                 |                                                                   |
 26 | | Comma             | `，`                 |                                                                   |
 27 | | Enumeration comma | `、`                 |                                                                   |
 28 | | Question mark     | `？`                 |                                                                   |
 29 | | Exclamation mark  | `！`                 |                                                                   |
 30 | | Quotation marks   | ` “” ` &nbsp; ` ‘’ ` | Curly quotes                                                      |
 31 | | Colon             | `：`                 |                                                                   |
 32 | | Semicolon         | `；`                 |                                                                   |
 33 | | Ellipsis          | `……`                 | 6 dots in total, occupying two full-width character positions     |
 34 | | Dash              | `——`                 | Two `—` characters, occupying two full-width character positions  |
 35 | | Parentheses       | `（）`               |                                                                   |
 36 | | Title marks       | `《》`               |                                                                   |
 37 | | Separator dot     | `・`                 | [Katakana Middle Dot](<https://en.wikipedia.org/wiki/Interpunct>) |
 38 | 
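 39 | **Notes**
 40 | 
 41 | 1. For the separator dot, always use the [Katakana Middle Dot](<https://en.wikipedia.org/wiki/Interpunct>), `&#12539;`, that is, `・`, which occupies one full-width position. Do not use the `&sdot;` (`⋅`) that can be typed directly on the keyboard, because that dot is a half-width character.
 42 | 2. In mixed Chinese-English text, when a single English word needs quotation marks (single or double), always use full-width quotation marks. Quotation marks inside English sentences are always half-width (single or double).
 43 | 3. When a sentence ends with a parenthetical note, the full stop goes outside the parentheses, e.g. `……（参见第三章）。`
 44 | 4. Separate parallel items within a sentence with the enumeration comma (`、`), even when the items are English, e.g. `经常使用的等宽字体包括 Menlo、Monaco、Courier New、monospace 等等`. In a purely English sentence, separate parallel items with the half-width comma (`,`).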
 45 | 
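 46 | ## Spaces
 47 | 
 48 | The space used in mixed Chinese-English text is the half-width space (` `), which also matches the habits of most Chinese input methods.
 49 | 
 50 | 1. Put one half-width space between Chinese and English, and between Chinese and digits, e.g. `这是 1 个 variable 的例子`
 51 | 2. Put no space between English letters or digits and full-width punctuation, e.g. `这是一个 variable，这是数字 100。`; another example: `变量 a 的值是：8；a 的值大于变量 b。`
 52 | 3. Put a space outside full-width quotation marks (single or double), e.g. `所谓的 “过早引用” 就是这样令人迷惑的。`
 53 | 4. In a sentence that mixes Chinese and English, an English word that needs to be parenthesized must use full-width parentheses, e.g. `这就是所谓的 “过早引用”（Forward References）`. Note that there is no space between the quotation mark and the parenthesis.
 54 | 5. Put one half-width space before and after a dash (`——`).
 55 | 6. Put one half-width space after an ellipsis (`……`), e.g. `他们总是这么说…… 可实际上呢？`
 56 | 7. Full-width punctuation other than quotation marks, dashes, and ellipses must have no space before or after it. For example, `…… 就是这个元素（“decorators”）—— 即，所谓的装饰器。` Note that there are no spaces between `”`, `）`, and `——`.
 57 | 8. Put a space before and after inline code, e.g. ```表达式 `a += 1` 的意思是说……```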
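
Rules 1 and 2 above can be sketched in Python. This is only a minimal illustration, not the linter shipped in this repository: the function name `add_spaces` and the simplified character class `[\u4E00-\u9FFF]` (CJK Unified Ideographs only) are assumptions made for the example.

```python
import re

# Simplified assumption: only CJK Unified Ideographs count as "Chinese".
CJK = r"[\u4E00-\u9FFF]"
ASCII_WORD = r"[A-Za-z0-9]"


def add_spaces(text: str) -> str:
    """Insert one half-width space between Chinese and ASCII letters/digits."""
    # Chinese followed by ASCII: 是1 -> 是 1
    text = re.sub(f"({CJK})({ASCII_WORD})", r"\1 \2", text)
    # ASCII followed by Chinese: variable的 -> variable 的
    text = re.sub(f"({ASCII_WORD})({CJK})", r"\1 \2", text)
    return text


print(add_spaces("这是1个variable的例子"))  # 这是 1 个 variable 的例子
```

Full-width punctuation lies outside the character class, so a sentence such as `这是一个 variable，这是数字 100。` passes through unchanged, which is what rule 2 asks for.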
 58 | 
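 59 | ## Italics
 60 | 
 61 | 1. Markdown marks italics with asterisks or underscores, e.g. `*强调*` or `_强调_`. Chinese characters set in italics, however, look very poor on the page.
 62 | 2. When rendered, both `*强调*` and `_强调_` become `<em>强调</em>`. The CSS should therefore set `<em>` to `font-style: normal;`, and may give it a different font color to mark the emphasis.
 63 | 3. When an English word must be emphasized in italics, use the HTML tag `<i>` in Markdown, e.g. `<i>emphasis</i>`. If necessary, set its font color separately in CSS.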
 64 | 
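 65 | ## Headings
 66 | 
 67 | 1. Always mark headings with the `#` symbol.
 68 | 
 69 | 2. Because GFM (GitHub Flavored Markdown) automatically renders `/#[0-9]+/` as a link to an issue, a `#` that marks a heading must be followed by exactly one space, for example:
 70 | 
 71 |    ```markdown
 72 |    # Level-1 heading
 73 |    ## Level-2 heading
 74 |    ### Level-3 heading
 75 |    ```
 76 | 
 77 | 3. A Markdown file has exactly one level-1 heading.
 78 | 
 79 | 4. A Markdown file uses at most three heading levels. Deeper nesting suggests that the text should be split across several files.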
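
The heading rules above lend themselves to a mechanical check. The sketch below is an illustration only; the function name `check_headings` and the wording of its messages are hypothetical, and it assumes that any line starting with `#` outside a fenced code block is meant to be a heading.

```python
import re
from typing import List

FENCE = "`" * 3  # marker that opens or closes a fenced code block


def check_headings(markdown: str) -> List[str]:
    """Return a list of heading problems found in a Markdown text."""
    problems = []
    h1_count = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        # Inside a fenced code block, `#` is a comment, not a heading.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence or not line.startswith("#"):
            continue
        hashes, spaces, _title = re.match(r"(#+)( *)(.*)", line).groups()
        if len(spaces) != 1:
            problems.append(f"line {number}: '#' must be followed by exactly one space")
        if len(hashes) == 1:
            h1_count += 1
        elif len(hashes) > 3:
            problems.append(f"line {number}: heading deeper than level 3")
    if h1_count != 1:
        problems.append(f"expected exactly one level-1 heading, found {h1_count}")
    return problems
```

For the input `"# Title\n\n##Section\n\n#### Deep\n"` this reports two problems: the missing space on line 3 and the level-4 heading on line 5.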
 80 | 
 81 | ## Paragraphs
 82 | 
 83 | 1. Do not indent the first line of a paragraph.
 84 | 2. Separate paragraphs with one blank line.
 85 | 
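 86 | ## Numbers
 87 | 
 88 | 1. Arabic numerals always use half-width characters.
 89 | 2. Use the half-width comma as the thousands separator. The comma is optional for values of 4 to 6 digits, but values of 7 or more digits must have it, e.g. `2000`, `21,000,000`. For long decimal fractions, commas may be added every three digits, counting from left to right after the decimal point, e.g. `3.141,59`.
 90 | 3. To express a range of values, use ` ~ ` (a `~` with one half-width space on each side), e.g. `25 ~ 29`.
 91 | 4. When the values carry a unit or a percent sign, both values must carry it, e.g. `25% ~ 29%`, `72kg ~ 75kg`; not `25 ~ 29%` or `72 ~ 75kg`.
 92 | 5. No space between a number and its unit, or between a currency symbol and the number, e.g. `75kg`, `$85`, `25%`.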
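
Rule 2 can be made concrete with a short Python sketch. The function name `format_number` is hypothetical, and the sketch assumes its input is a non-negative decimal string without leading zeros. It leaves 4 to 6 digit integers untouched, since the comma is optional there.

```python
def format_number(value: str) -> str:
    """Add thousands separators to a decimal string, following rule 2."""
    integer, dot, fraction = value.partition(".")
    # Integer part: the comma is required from 7 digits on.
    if len(integer) >= 7:
        integer = f"{int(integer):,}"
    # Fractional part: group every three digits, left to right.
    groups = [fraction[i:i + 3] for i in range(0, len(fraction), 3)]
    return integer + dot + ",".join(groups)


print(format_number("2000"))      # 2000
print(format_number("21000000"))  # 21,000,000
print(format_number("3.14159"))   # 3.141,59
```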
 93 | 
 94 | ## Common Chinese Numerals
 95 | 
 96 | * 壹、贰、叁、肆、伍、陆、柒、捌、玖、拾
 97 | * 零、〇
 98 | * 廿（niàn）、卅（sà）
 99 | 
100 | ## Common Special Characters
101 | 
102 | | HTML Entity   | Displayed |
103 | | ------------- | --------- |
104 | | `&amp;`       | &amp;     |
105 | | `&lt;`        | &lt;      |
106 | | `&gt;`        | &gt;      |
107 | | `&lbrack;`    | &lbrack;  |
108 | | `&rbrack;`    | &rbrack;  |
109 | | `&grave;`     | &grave;   |
110 | | `&vert;`      | &vert;    |
111 | | `&bsol;`      | &bsol;    |
112 | | `&permil;`    | &permil;  |
113 | | `&pertenk;`   | &pertenk; |
114 | | `&trade;`     | &trade;   |
115 | | `&copy;`      | &copy;    |
116 | | `&reg;`       | &reg;     |
117 | 
118 | For more, see: <https://www.toptal.com/designers/htmlarrows/>
119 | 
120 | ## Choice of License
121 | 
122 | The preferred license for all articles on selfteaching.com is [CC-BY-NC-ND](<https://creativecommons.org/licenses/by-nc-nd/3.0/deed.zh>), that is:
123 | 
124 | > Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
125 | 
126 | ## Required Reading
127 | 
128 | 1. GitHub's Markdown tutorial: [Github: Mastering Markdown](https://guides.github.com/features/mastering-markdown/)
129 | 2. Microsoft's writing style guidance: [Microsoft Writing Style Guide](https://docs.microsoft.com/en-us/contribute/how-to-write-use-markdown)
130 | 3. A Markdown format checker: [MarkdownLint](https://github.com/DavidAnson/markdownlint). It is a lint tool, but its documentation also contains many recommendations for better Markdown formatting.
131 | 
132 | ## Recommended Markdown Editors
133 | 
134 | - [VSCode](<https://code.visualstudio.com/>) + [Docs Authoring Pack](https://marketplace.visualstudio.com/items?itemName=docsmsft.docs-authoring-pack)
135 | - [Typora](https://typora.io/)
136 | 
137 | ## Further References
138 | 
139 | > * https://golem.ph.utexas.edu/~distler/maruku/markdown_syntax.html
140 | > * http://www.pinyin.info/tools/converter/chars2uninumbers.html
141 | > * https://www.w3.org/html/ig/zh/wiki/Css4-text
142 | > * https://www.toptal.com/designers/htmlarrows/
143 | > * https://www.key-shortcut.com/en/writing-systems/%E6%96%87%E5%AD%97-chinese-cjk/cjk-characters-1/
144 | 


--------------------------------------------------------------------------------
/cnenlinter/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include rules.yml


--------------------------------------------------------------------------------
/cnenlinter/README.md:
--------------------------------------------------------------------------------
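  1 | # cnenlinter: a command-line cleanup tool for mixed Chinese-English formatting
  2 | 
  3 | This is a rough Python program I wrote for my own use. It checks text files (Markdown files, for example) for mixed Chinese-English sentences that break the formatting rules. In practice it is little more than a batch-processing tool.
  4 | 
  5 | In principle it can check any plain-text document (Markdown, HTML, JSON, ipynb, and so on); you only need to change the regular expressions in `rules.yml`. The regular expression used to match Chinese characters is: `[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]`.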
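
As a quick illustration of what that character class covers, the sketch below compiles it with Python's `re` module. The helper name `contains_cjk` is hypothetical and is not part of cnenlinter.

```python
import re

# The character class quoted above, compiled as a Python regular expression.
CJK_CHAR = re.compile(
    r"[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]"
)


def contains_cjk(line: str) -> bool:
    """Return True if the line contains at least one character in the class."""
    return CJK_CHAR.search(line) is not None


print(contains_cjk("这是 variable"))  # True
print(contains_cjk("plain ascii"))    # False
print(contains_cjk("，。"))            # False
```

The last call returns `False` because full-width punctuation such as `，` and `。` falls outside these ranges, which is why the rules list punctuation separately.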
  6 | 
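  7 | The program is simple:
  8 | 
  9 | > 1. It processes, one at a time, each line of a text file that is **not made up purely of ASCII characters**;
 10 | > 1. It reads the rules (regular expressions) from `rules.yml` and applies them to the line one after another;
 11 | > 1. During the check, verbose mode lets you decide whether each fix is applied.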
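
The three steps above can be sketched as follows. This is a simplified illustration rather than the actual `cnenlinter.py`: the two entries in `RULES` are hard-coded stand-ins for what would be loaded from `rules.yml`, and the names `lint_line`, `lint_text`, and `confirm` are hypothetical.

```python
import re

# Two stand-in rules as (pattern, replacement) pairs, mirroring rules.yml.
RULES = [
    (re.compile(r"([\u4E00-\u9FCC])\,"), r"\1，"),
    (re.compile(r"([\u4E00-\u9FCC])\."), r"\1。"),
]


def lint_line(line: str) -> str:
    """Apply every rule, in order, to a line that contains non-ASCII text."""
    if line.isascii():
        return line  # pure-ASCII lines are left untouched
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


def lint_text(text: str, confirm=lambda old, new: True) -> str:
    """Lint each line; `confirm` plays the role of the verbose y/n prompt."""
    linted = []
    for line in text.splitlines():
        new = lint_line(line)
        if new != line and not confirm(line, new):
            new = line  # the fix was rejected, keep the original line
        linted.append(new)
    return "\n".join(linted)


print(lint_text("你好,世界.\nhello, world."))
```

Passing `confirm=lambda old, new: False` rejects every proposed fix, which corresponds to answering `n` at each prompt.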
 12 | 
 13 | ## Installation
 14 | 
 15 | ```bash
 16 | git clone https://github.com/selfteaching/markdown-writing-with-mixed-cn-en
 17 | cd markdown-writing-with-mixed-cn-en/cnenlinter
 18 | pip install virtualenv
 19 | virtualenv venv
 20 | . venv/bin/activate
 21 | 
 22 | pip install -e .
 23 | ```
 24 | 
 25 | ## Usage
 26 | 
 27 | ```bash
 28 | cnenlinter --help
 29 | 
 30 | Usage: cnenlinter [OPTIONS] [FILES]...
 31 | 
 32 | Options:
 33 |   -c, --config-path PATH      Specify directory that contains rules file.
 34 |   -l, --log-file TEXT         Specify file name for log, default: "log.txt".
 35 |   -f, --fix-directly BOOLEAN  Fix file(s) directly, rather than save to
 36 |                               "/linted" directory. Default: True.
 37 |   -r, --rules-file-name TEXT  Specify rules file name. Default: rules.yml
 38 |   -v, --verbose BOOLEAN       Ask permission before fix. Default: True.
 39 |   --help                      Show this message and exit.
 40 | ```
 41 | 
 42 | ## Basic Commands
 43 | 
 44 | ```bash
 45 | cnenlinter *.md
 46 | ```
 47 | 
 48 | You can also turn off the `verbose` and `fix-directly` modes with options:
 49 | 
 50 | ```bash
 51 | cnenlinter -v False *.md
 52 | cnenlinter -f False *.md
 53 | ```
 54 | 
 55 | You can also specify the rules file and the directory that holds it:
 56 | 
 57 | ```bash
 58 | cnenlinter -c <path-of-rules-file> -r <rules-file-name> -v False -f False *.md
 59 | # afterwards, open log.txt to review the recorded changes
 60 | ```
 61 | 
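 62 | When processing a single file I usually run `cnenlinter <file>`, because even if I make a careless mistake along the way (for example, absent-mindedly typing `y` where I should not have), I can use `log.txt` to find out where things went wrong.
 63 | 
 64 | When processing several files, however, I run `cnenlinter -f False *.md`, so that the modified files are saved separately in the `linted` directory.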
 65 | 
 66 | ## About Rules and rules.yml
 67 | 
 68 | `rules.yml` stores the regular expressions used for searching (`pattern`) and replacing (`expected`).
 69 | 
 70 | Each rule starts with `---`, followed by one `expected` and one `pattern`, for example:
 71 | 
 72 | ```yaml
 73 | ---
 74 | # convert half-width punctuation in Chinese sentences to full-width punctuation.
 75 | # (half-width punctuation next to a Chinese character is replaced by its full-width form)
 76 | 'expected': /\1，/
 77 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\,/
 78 | ---
 79 | 'expected': /\1。/
 80 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\./
 81 | ---
 82 | 'expected': /\1：/
 83 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\:/
 84 | ---
 85 | 'expected': /\1；/
 86 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\;/
 87 | ---
 88 | 'expected': /\1？/
 89 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\?/
 90 | ---
 91 | 'expected': /\1！/
 92 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\!/
 93 | ---
 94 | 'expected': /\1）/
 95 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\s*[\)）]/
 96 | ---
 97 | 'expected': /（\1/
 98 | 'pattern': /[（\(]\s*([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD‘“])/
 99 | ```
100 | 
101 | Regular expressions are delimited by `/` at both ends.
102 | 
103 | `pattern` accepts three `flag` values, `a`, `i`, and `l`, which correspond to `re.A`, `re.I`, and `re.L` respectively:
104 | 
105 | ```
106 | /<regex>/<flag>
107 | ```
108 | 
109 | In `expected`, use `\1`, `\2`, ... to refer to the groups captured by `pattern`.
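
How a `/<regex>/<flag>` string might be turned into a compiled pattern is sketched below. This is an assumption-laden illustration, not the code in `cnenlinter.py`: the names `FLAGS` and `compile_rule` are hypothetical, the input is assumed to start with `/`, and the `l` flag is deliberately left out because Python 3 raises `ValueError` when `re.L` is combined with a `str` pattern.

```python
import re

# Flag letters mapped to the real `re` constants (not to their names).
FLAGS = {"a": re.A, "i": re.I}


def compile_rule(pattern_text: str):
    """Compile a `/<regex>/<flag>` string into a regular expression object."""
    # Split at the last `/` so that a `/` inside the regex is preserved.
    body, _, flag = pattern_text.rpartition("/")
    regex = body[1:]  # drop the leading `/`
    return re.compile(regex, FLAGS.get(flag, 0))


pattern = compile_rule(r"/(hello)\s+world/i")
print(pattern.sub(r"\1", "Hello   World"))  # Hello
```

The replacement string uses `\1` exactly as an `expected` entry would, so the captured group is written back into the output.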
110 | 
111 | ## Note
112 | 
113 | This program has only been tested on Mac OS X.
114 | 


--------------------------------------------------------------------------------
/cnenlinter/cnenlinter.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | import os
  3 | import re
  4 | import yaml
  5 | import click
  6 | 
  7 | # creating linted directory
  8 | linted_path = os.path.join(os.getcwd(), 'linted')
  9 | 
 10 | this_dir, this_filename = os.path.split(__file__)
 11 | 
 12 | try:
 13 |     os.mkdir(linted_path)
 14 | except FileExistsError:
 15 |     pass 
 16 | except OSError:
 17 |     print(f"Creation of linted directory {linted_path} failed.")
 18 | 
 19 | re_dict = {
 20 |     'a': re.A,
 21 |     'i': re.I,
 22 |     'l': re.L
 23 | }
 24 | 
 25 | @click.command()
 26 | @click.option(
 27 |     '-c',
 28 |     '--config-path',
 29 |     'config_path',
 30 |     default=this_dir,
 31 |     type=click.Path(),
 32 |     help='Specify directory that contains rules file.'
 33 |     )    
 34 | @click.option(
 35 |     '-l',
 36 |     '--log-file',
 37 |     'log_file',
 38 |     default=os.path.join(os.getcwd(), 'log.txt'),
 39 |     help='Specify file name for log, default: "log.txt".'
 40 |     )    
 41 | @click.option(
 42 |     '-f',
 43 |     '--fix-directly',
 44 |     'fix_directly',
 45 |     default=True,
 46 |     type=click.BOOL,
 47 |     help='Fix file(s) directly, rather than save to "/linted" directory. Default: True.'
 48 |     )
 49 | @click.option(
 50 |     '-r',
 51 |     '--rules-file-name',
 52 |     'rules_file_name',
 53 |     default='rules.yml',
 54 |     help='Specify rules file name. Default: rules.yml'
 55 |     )       
 56 | @click.option(
 57 |     '-v',
 58 |     '--verbose',
 59 |     'verbose',
 60 |     default=True,
 61 |     type=click.BOOL,
 62 |     help='Ask permission before fix. Default: True.'
 63 |     )        
 64 | @click.argument(
 65 |     'files',
 66 |     nargs=-1,
 67 |     type=click.Path()
 68 |     )
 69 | 
 70 | def cnenlinter(config_path, log_file, fix_directly, rules_file_name, verbose, files):
 71 | 
 72 |     rules_file = os.path.join(config_path, rules_file_name)
 73 | 
 74 |     with open(rules_file, 'r', encoding='utf-8') as rf:
 75 |         rules = list(yaml.safe_load_all(rf.read()))
 76 | 
 77 |     log = ''
 78 |     logfile = open(log_file, 'w', encoding='utf-8')
 79 | 
 80 |     for filename in files:
 81 | 
 82 |         with open(filename, 'r', encoding='utf-8') as f:
 83 |             text = f.read()
 84 |             # collapse runs of consecutive blank lines into a single one
 85 |             pattern = re.compile(r"(\s*\n){3,}")
 86 |             text = pattern.sub('\n\n', text)
 87 |             lines = text.splitlines()
 88 | 
 89 | 
 90 |         lines_linted = []
 91 | 
 92 |         for line_no, line in enumerate(lines, start=1):
 93 | 
 94 |             linted = line.rstrip()
 95 |             temp = linted
 96 | 
 97 |             # only lint lines that contain non-ASCII chars, plus headings
 98 |             if not linted.isascii() or linted.startswith('#'):
 99 | 
100 |                 for rule in rules:
101 |                     # rules are written as /<regex>/<flag>; accepted flags: a, i, l
102 |                     # (note: Python 3 rejects re.L for str patterns)
103 |                     pattern_text = rule['pattern'].split('/')
104 |                     if pattern_text[2] in ['a', 'i', 'l']:
105 |                         flag = re_dict[pattern_text[2]]
106 |                         pattern = re.compile(pattern_text[1], flag)
107 |                     else:
108 |                         pattern = re.compile(pattern_text[1])
109 | 
110 |                     if pattern.findall(linted):
111 |                         linted = pattern.sub(rule['expected'].strip('/'), linted)
112 | 
113 |                 if temp != linted.rstrip():
114 |                     log = f'\n\n{filename} (line {line_no}):\n{line}\n=>\n{linted}'
115 |                     print(log)
116 | 
117 |                     if verbose:
118 |                         valid_permission = True
119 |                         while valid_permission:
120 |                             permission = input('fix this one? "y" or "n"? ')
121 |                             if permission == 'y' or permission == 'n':
122 |                                 if permission == 'y':
123 |                                     logfile.writelines(log + '\n**ACCEPTED!**\n')
124 |                                 elif permission == 'n':
125 |                                     linted = temp
126 |                                     logfile.writelines(log + '\n**REJECTED!**\n')
127 |                                 valid_permission = False
128 |                             else:
129 |                                 valid_permission = True
130 |                     else:
131 |                         logfile.writelines(log + '\n**ACCEPTED!**\n')
132 |             
133 |             lines_linted.append(linted.rstrip())
134 | 
135 |         if fix_directly:
136 |             file_to_save = filename
137 |         else:
138 |             file_to_save = os.path.join(linted_path, os.path.basename(filename))
139 | 
140 |         with open(file_to_save, 'w', encoding='utf-8') as r:
141 |             for line_linted in lines_linted:
142 |                 r.write(line_linted + '\n')
143 | 
144 |     # close the log only after every file has been processed
145 |     logfile.close()
146 | 
147 | if __name__ == '__main__':
148 |     cnenlinter()


--------------------------------------------------------------------------------
/cnenlinter/rules.yml:
--------------------------------------------------------------------------------
  1 | ---
  2 | # convert half-width punctuation in Chinese sentences to full-width punctuation.
  3 | # (half-width punctuation next to a Chinese character is replaced by its full-width form)
  4 | 'expected': /\1，/
  5 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\,/
  6 | ---
  7 | 'expected': /\1。/
  8 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\./
  9 | ---
 10 | 'expected': /\1：/
 11 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\:/
 12 | ---
 13 | 'expected': /\1；/
 14 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\;/
 15 | ---
 16 | 'expected': /\1？/
 17 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\?/
 18 | ---
 19 | 'expected': /\1！/
 20 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\!/
 21 | ---
 22 | 'expected': /\1）/
 23 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD”’])\s*[\)）]/
 24 | ---
 25 | 'expected': /（\1/
 26 | 'pattern': /[（\(]\s*([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD‘“])/
 27 | ---
 28 | # use a straight single quote between ascii chars
 29 | 'expected': /\1'\2/
 30 | 'pattern': /([a-zA-Z0-9]+)’([a-zA-Z]+)/
 31 | ---
 32 | # add space after curly quotes
 33 | 'expected': /\1 /
 34 | 'pattern': /([’”][_\*]{0,2}) *(?![！？，。：；、）])/
 35 | ---
 36 | # add space before curly quotes
 37 | 'expected': / \1/
 38 | 'pattern': /(?<!^) *([_\*]{0,2}[“‘])/
 39 | ---
 40 | # fixing title to fit GFM
 41 | # On GitHub, /#+[0-9]+/ is automatically converted into an issue link,
 42 | # so the `#` in a heading must be followed by one space
 43 | 'expected': /\1 \2/
 44 | 'pattern': /^(#+)\s*(.*)/
 45 | ---
 46 | # space around inline code
 47 | # the inline-code marker ` should have one space before and after it
 48 | 'expected': / \1\2\1 /
 49 | 'pattern': /\s*(`)(.*?)\1\s*![。，、？！：；（）《》・]/
 50 | ---
 51 | # spaces around ascii among Chinese characters
 52 | # digits and English words between Chinese characters /[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]/ need a space before and after them
 53 | # the regex includes `_\*` so that it also matches the markdown markers `**` and `__`
 54 | 'expected': /\1 \2/
 55 | 'pattern': /([a-zA-Z0-9%‰‱]+[_\*]{1,2})\s*([_\*]{1,2}[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?)/        
 56 | ---
 57 | # same as above, with the order reversed
 58 | 'expected': /\1 \2/
 59 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?[_\*]{1,2})\s*([_\*]{1,2}[a-zA-Z0-9+\-$¥]+)/        
 60 | ---
 61 | # add space after `……` 
 62 | # disputed: visually, a space before an ellipsis does not look good
 63 | # the rule below: no space before the ellipsis, one space after it.
 64 | 'expected': /\1…… /
 65 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?)\s*([…]{1,2}|[\.。]{2,3})\s*/
 66 | ---
 67 | # add space around `——` 
 68 | 'expected': / —— /
 69 | 'pattern': /\s*([—]{1,2})\s*/
 70 | ---
 71 | # no leading space when `——` and `……` ahead of the line
 72 | 'expected': /\1/
 73 | 'pattern': /^\s*([—…]{1,2})/
 74 | ---  
 75 | # no space between `……` and curly quotes
 76 | 'expected': /…\1/
 77 | 'pattern': /…\s+([’”])/
 78 | ---  
 79 | # no space between `……` and curly quotes
 80 | 'expected': /\1…/
 81 | 'pattern': /([‘“’”])\s+…/
 82 | ---  
 83 | # use Katakana Middle Dot
 84 | 'expected': /・/
 85 | 'pattern': /·/
 86 | ---
 87 | # remove space between plus/minus sign and digits
 88 | # no space between a plus/minus sign or currency symbol and the digits
 89 | 'expected': /\1\2/
 90 | 'pattern': /[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD].*?([+\-¥$])\s+([0-9]+)/
 91 | ---
 92 | # remove space between digits and common units
 93 | # no space between digits and common units of measurement
 94 | 'expected': /\1\2/
 95 | 'pattern': /([0-9]+)\s+([bBgGmMkKtT])/
 96 | ---
 97 | # add thousands separators to numbers
 98 | # regex from https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch06s12.html
 99 | # \g<0> explanation from: https://stackoverflow.com/questions/21028917/why-g0-behaves-differently-than-0-in-re-sub
100 | # (adds comma thousands separators to multi-digit numbers)
101 | # 'expected': /\g<0>,/
102 | # 'pattern': /[0-9](?=(?:[0-9]{3})+(?![0-9]))/
103 | # use with caution: it also adds comma separators to long digit strings in URLs.
104 | # ---
105 | # remove spaces among Chinese characters
106 | # (deletes whitespace between adjacent Chinese characters)
107 | # (?<!^#) negative lookbehind, doesn't include /^#/
108 | # ... to allow space between Chinese characters in Title
109 | # 'expected': /\1\2/
110 | # 'pattern': /(?<!^#{1,6}.*?)([_\*]{0,2}[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?[_\*]{0,2})\s+([_\*]{0,2}[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?[_\*]{0,2})/
111 | # ---
112 | 'expected': /\1\2\3/
113 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?)\s*([_\*]{0,2})\s*([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?)/
114 | ---
115 | # repeat above rule again, but with /\s+?/
116 | 'expected': /\1\2\3/
117 | 'pattern': /([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?)\s+?([_\*]{0,2})\s+?([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]+?)/
118 | ---
119 | # add space around links
120 | # links should have spaces around them; this makes sense in pure English text, but does not look good in Chinese or mixed text
121 | # 'expected': / [\1](\2) /
122 | # 'pattern': /\s*\[(.*?)\]\((.*?)\)\s*/
123 | # ---
124 | # spaces around `~`
125 | 'expected': / ~ /
126 | 'pattern': /\s*[~～]![\\]\s*/
127 | ---
128 | # remove space around Chinese full-width punctuation
129 | 'expected': /\1/
130 | 'pattern': /\s*([。，、？！：；（）《》・])\s*/
131 | 
132 | 


--------------------------------------------------------------------------------
/cnenlinter/setup.py:
--------------------------------------------------------------------------------
 1 | from setuptools import setup
 2 | 
 3 | setup(
 4 |     name='cnenlinter',
 5 |     version='0.1',
 6 |     description='Use Regex rules to fix common lint problems in Chinese-English docs',
 7 |     author='xiaolai',
 8 |     author_email='lixiaolai@gmail.com',
 9 |     license='MIT',
10 |     py_modules=['cnenlinter'],
11 |     install_requires=[
12 |         'Click',
13 |         'PyYAML',
14 |     ],
15 |     entry_points='''
16 |         [console_scripts]
17 |         cnenlinter=cnenlinter:cnenlinter
18 |     ''',
19 |     include_package_data=True,
20 | )


--------------------------------------------------------------------------------