├── .gitignore ├── LICENSE ├── README.md ├── exec-git-find-unicode-control.sh ├── find_unicode_control.py ├── pyproject.toml ├── tests │ ├── examples │ │ ├── empty │ │ └── symlink │ └── test_simple.py └── tox.ini /.gitignore: -------------------------------------------------------------------------------- 1 | .tox 2 | __pycache__ 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2021, Red Hat 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | Neither the name of HPCProject, Serge Guelton nor the names of its 15 | contributors may be used to endorse or promote products derived from this 16 | software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # find-unicode-control 2 | 3 | These scripts look for non-printable unicode characters in all text files in a 4 | source tree. `find_unicode_control.py` should work with python2 as well as 5 | python3. It uses `python-magic` if available to determine file type, or simply 6 | spawns the `file --mime-type` command. The two methods should be functionally 7 | equivalent, so the `python-magic` dependency could eventually be dropped. 8 | 9 | ``` 10 | usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-d] [-t] [-c CONFIG] path [path ...] 11 | 12 | Look for Unicode control characters 13 | 14 | positional arguments: 15 | path Sources to analyze 16 | 17 | optional arguments: 18 | -h, --help show this help message and exit 19 | -p {all,bidi}, --nonprint {all,bidi} 20 | Look for either all non-printable unicode characters or bidirectional control characters. 21 | -v, --verbose Verbose mode. 22 | -d, --detailed Print line numbers where characters occur. 23 | -t, --notests Exclude tests (basically test.* as a component of path). 24 | -c CONFIG, --config CONFIG 25 | Configuration file to read settings from.
26 | ``` 27 | 28 | If unicode BIDI control characters or non-printable characters are found in a 29 | file, it will print output as follows: 30 | 31 | ``` 32 | $ python3 find_unicode_control.py -p bidi *.c 33 | commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'} 34 | early-return.c: bidirectional control characters: {'\u2067'} 35 | stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'} 36 | ``` 37 | 38 | Using the `-d` flag, the output is more detailed, showing line numbers in 39 | files, but this mode is also slower: 40 | 41 | ``` 42 | $ python3 find_unicode_control.py -p bidi -d . 43 | ./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066'] 44 | ./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066'] 45 | ./early-return.c:4 bidirectional control characters: ['\u2067'] 46 | ./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066'] 47 | ``` 48 | 49 | The optimal workflow is to do a quick scan through a source tree and, if 50 | any issues are found, do a detailed scan on only those files. 51 | 52 | ## Configuration file 53 | 54 | If files need to be excluded from the scan, make a configuration file and 55 | set a `scan_exclude` variable to a list of regular expressions matching 56 | the files or paths to exclude. Alternatively, add a `scan_exclude_mime` list 57 | of mime types to ignore; these entries may also be regular expressions. 58 | Here is an example configuration that glibc uses: 59 | 60 | ``` 61 | scan_exclude = [ 62 | # Iconv test data 63 | r'/iconvdata/testdata/', 64 | # Test case data 65 | r'libio/tst-widetext.input$', 66 | # Test script. This is to silence the warning: 67 | # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte 68 | # since the script tests mixed encoding characters.
69 | r'localedata/tst-langinfo.sh$'] 70 | ``` 71 | 72 | ## Notes 73 | 74 | This script was quickly hacked together to scan repositories with mostly LTR 75 | unicode content. If you have RTL content (in comments, literals or even 76 | identifiers in code), it will give false warnings that you need to weed out. 77 | For now, you need to exclude such RTL code using `scan_exclude`, but a 78 | long-term wish-list item (if this remains relevant; hopefully more sophisticated 79 | RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently. 80 | -------------------------------------------------------------------------------- /exec-git-find-unicode-control.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | 4 | ### 5 | # 6 | # Batch script to scan all files in branches and tags of 7 | # a git repository. 8 | # 9 | # Both this script and the find_unicode_control.py script 10 | # should be accessible from the PATH.
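# (For example, assuming a hypothetical checkout at ~/src/find-unicode-control:
#   export PATH="$HOME/src/find-unicode-control:$PATH")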
11 | # 12 | # Change directory into the local git directory to be scanned 13 | # then execute this script with either: 14 | # -a: scan all branches and tags 15 | # -b: scan all branches 16 | # -t: scan all tags 17 | # -p PATTERN: scan all branches/tags that match PATTERN 18 | # 19 | ### 20 | 21 | get-branches() { 22 | # Ignore dependabot 23 | # Ignore pull requests 24 | local branches=$(git branch --remotes --format='%(refname:short)' | \ 25 | grep -v dependabot | \ 26 | grep -v 'pr/' | \ 27 | grep -v 'pull/') 28 | if [ -z "${branches}" ]; then 29 | echo >&2 "ERROR: cannot find any branches" 30 | exit 1 31 | fi 32 | 33 | echo ${branches} 34 | } 35 | 36 | get-tags() { 37 | remotes=$(git remote | awk '{print $1}') 38 | local tags="" 39 | for remote in ${remotes} 40 | do 41 | # Ignore dependabot 42 | # Ignore pull requests 43 | local rtags=$(git ls-remote --tags "${remote}" | \ 44 | awk '{print $2}' | \ 45 | grep -v "{}" | \ 46 | grep -v 'pr/' | \ 47 | grep -v 'pull/' | \ 48 | sed 's/refs\///') 49 | tags="${tags} ${rtags}" 50 | done 51 | 52 | if [ -z "${tags}" ]; then 53 | echo >&2 "ERROR: cannot find any tags" 54 | exit 1 55 | fi 56 | 57 | echo ${tags} 58 | } 59 | 60 | get-branches-and-tags() { 61 | local branches=$(get-branches) 62 | local tags=$(get-tags) 63 | 64 | echo "${branches} ${tags}" 65 | } 66 | 67 | get-pattern() { 68 | local pattern="${1}" 69 | if [ -z "${pattern}" ]; then 70 | echo >&2 "ERROR: pattern not specified" 71 | exit 1 72 | fi 73 | 74 | local commits=$(git for-each-ref --format='%(refname)' | \ 75 | grep -v dependabot | \ 76 | grep -v 'pr/' | \ 77 | grep -v 'pull/' | \ 78 | grep "${pattern}" | \ 79 | sed -E -e 's/refs\///' | sed -E -e 's/remotes\///') 80 | if [ -z "${commits}" ]; then 81 | echo >&2 "ERROR: cannot find any branches or tags" 82 | exit 1 83 | fi 84 | 85 | echo ${commits} 86 | } 87 | 88 | 89 | # 90 | # Check find_unicode_control is available 91 | # 92 | if !
command -v find_unicode_control.py &> /dev/null; then 93 | echo "ERROR: Please install the find_unicode_control.py script so it is accessible from any directory" 94 | exit 1 95 | fi 96 | 97 | # 98 | # Fetch all tags from all remote repositories defined locally. 99 | # Note: if tags are duplicated across repositories then this will 100 | # overwrite them. In this event you should checkout the overwritten 101 | # tag manually and scan it. 102 | # 103 | git fetch --tags --all -f 104 | if [ $? != 0 ]; then 105 | echo "Warning: Problems occurred while fetching from all repositories" 106 | fi 107 | 108 | # 109 | # Check which branches and tags are required to be scanned 110 | # 111 | while getopts ":atbp:" OPT; do 112 | case $OPT in 113 | a) 114 | ends=$(get-branches-and-tags) 115 | ;; 116 | b) 117 | ends=$(get-branches) 118 | ;; 119 | t) 120 | ends=$(get-tags) 121 | ;; 122 | p) 123 | ends=$(get-pattern "${OPTARG}") 124 | ;; 125 | \?) 126 | echo "Invalid option: -$OPTARG" >&2 127 | ;; 128 | esac 129 | done 130 | shift $((OPTIND-1)) 131 | 132 | if [ -z "${ends}" ]; then 133 | echo "ERROR: No branches or tags specified" 134 | exit 1 135 | fi 136 | 137 | # 138 | # Loop through the commits listed, check them out and scan the contents 139 | # 140 | for commit in ${ends} 141 | do 142 | echo "Checking: $commit" 143 | 144 | status=$(git submodule deinit -f --all &> /dev/null && git checkout -f "${commit}" &> /dev/null && git submodule init &> /dev/null) 145 | if [ $? != 0 ]; then 146 | echo "ERROR occurred ... checking out ${commit}" 147 | exit 1 148 | fi 149 | 150 | find_unicode_control.py -p bidi .
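    # Note: the scan's exit status is not checked, so the loop moves on to
    # the next branch/tag even when findings are reported. A hypothetical
    # tweak to stop at the first branch/tag with findings would be:
    #   find_unicode_control.py -p bidi . || exit 1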
151 | 152 | done 153 | -------------------------------------------------------------------------------- /find_unicode_control.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """Find unicode control characters in source files 4 | 5 | By default the script takes one or more files or directories and looks for 6 | unicode control characters in all text files. To narrow down the files, provide 7 | a config file with the -c command-line option, defining a scan_exclude list, which 8 | should be a list of regular expressions matching paths to exclude from the scan. 9 | 10 | There is a second mode enabled with -p which, when set to 'all', prints all 11 | control characters and, when set to 'bidi', prints only the 9 bidirectional 12 | control characters. 13 | """ 14 | from __future__ import print_function 15 | 16 | import sys, os, argparse, re, unicodedata, subprocess 17 | import importlib 18 | from stat import * 19 | 20 | try: 21 | import magic 22 | except ImportError: 23 | magic = None 24 | 25 | def _unicode(line, encoding): 26 | if isinstance(line, str): 27 | return line 28 | return line.decode(encoding) 29 | 30 | import platform 31 | if platform.python_version()[0] == '2': 32 | _chr = unichr 33 | do_unicode = unicode 34 | else: 35 | _chr = chr 36 | do_unicode = _unicode 37 | 38 | if sys.version_info < (3, 4): 39 | import imp 40 | 41 | # .git and .hg will get excluded when they're part of a directory tree being 42 | # scanned but not when they're explicitly specified as targets. That is: 43 | # 44 | # find_unicode_control.py foo 45 | # 46 | # will skip foo/.git but 47 | # 48 | # cd foo && find_unicode_control.py .git 49 | # 50 | # will read into .git. I've left it like this on purpose in case someone ever 51 | # needs to scan inside these directories even though I don't see any reason to 52 | # do that at the moment.
Also, 53 | # 54 | # find_unicode_control.py foo/.git 55 | # 56 | # will skip .git and I don't really care much about fixing it myself since you 57 | # could just pushd/popd your way out of that. Send a PR if you wish to see that 58 | # work :) 59 | scan_exclude = [r'/\.git/', r'/\.hg/', r'\.desktop$', r'ChangeLog$', r'NEWS$', 60 | r'\.ppd$', r'\.txt$', r'\.directory$'] 61 | scan_exclude_mime = [r'text/x-po$', r'text/x-tex$', r'text/x-troff$', 62 | r'text/html$'] 63 | verbose_mode = False 64 | 65 | # Print to stderr in verbose mode. 66 | def eprint(*args, **kwargs): 67 | if verbose_mode: 68 | print(*args, file=sys.stderr, **kwargs) 69 | 70 | # Decode a single line as utf-8. 71 | def decodeline(inf): 72 | return do_unicode(inf, 'utf-8') 73 | 74 | # Make a text string from a file, attempting to decode from latin1 if necessary. 75 | # Other non-utf-8 locales are not supported at the moment. 76 | def getfiletext(filename): 77 | text = None 78 | with open(filename) as infile: 79 | try: 80 | if detailed_mode: 81 | return [decodeline(inf) for inf in infile] 82 | except Exception as e: 83 | eprint('%s: %s' % (filename, e)) 84 | return None 85 | 86 | try: 87 | text = decodeline(''.join(infile)) 88 | except UnicodeDecodeError: 89 | eprint('%s: Retrying with latin1' % filename) 90 | try: 91 | text = open(filename, encoding='latin1').read()  # reopen; the failed read above consumed the stream 92 | except Exception as e: 93 | eprint('%s: %s' % (filename, e)) 94 | if text: 95 | return set(text) 96 | else: 97 | return None 98 | 99 | def analyze_text_detailed(filename, text, disallowed, msg): 100 | line = 0 101 | warned = False 102 | for t in text: 103 | line = line + 1 104 | subset = [c for c in t if _chr(ord(c)) in disallowed] 105 | if subset: 106 | print('%s:%d %s: %s' % (filename, line, msg, subset)) 107 | warned = True 108 | if not warned: 109 | eprint('%s: OK' % filename) 110 | 111 | return warned 112 | 113 | # Look for disallowed characters in the text. We reduce all characters into a 114 | # set to speed up analysis.
The detailed (-d) mode walks line by line to report where 115 | # the disallowed chars occur. 116 | def analyze_text(filename, text, disallowed, msg): 117 | if detailed_mode: 118 | return analyze_text_detailed(filename, text, disallowed, msg) 119 | 120 | if not text.isdisjoint(disallowed): 121 | print('%s: %s: %s' % (filename, msg, text & disallowed)) 122 | return True 123 | else: 124 | eprint('%s: OK' % filename) 125 | 126 | return False 127 | 128 | def get_mime(f): 129 | if magic: 130 | return magic.detect_from_filename(f).mime_type 131 | args = ['file', '--mime-type', f] 132 | proc = subprocess.Popen(args, stdout=subprocess.PIPE) 133 | m = [decodeline(x[:-1]) for x in proc.stdout][0].rsplit(':', 1)[1].strip() 134 | return m 135 | 136 | def should_read(f): 137 | # Fast check, just the file name. 138 | if [e for e in scan_exclude if re.search(e, f)]: 139 | return False 140 | 141 | # Slower check, mime type. 142 | m = get_mime(f) 143 | if 'text/' not in m \ 144 | or [e for e in scan_exclude_mime if re.search(e, m)]: 145 | return False 146 | return True 147 | 148 | # Get file text and feed into analyze_text. 149 | def analyze_file(f, disallowed, msg): 150 | eprint('%s: Reading file' % f) 151 | if should_read(f): 152 | text = getfiletext(f) 153 | if text: 154 | return analyze_text(f, text, disallowed, msg) 155 | else: 156 | eprint('%s: SKIPPED' % f) 157 | 158 | return False 159 | 160 | # Actual implementation of the recursive descent into directories.
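# Directories are tracked by inode in dirs_seen, so a directory reached twice
# (e.g. through a symlink such as tests/examples/symlink, which points back to
# '.') is only analyzed once; os.stat() follows the link to the real inode.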
161 | def analyze_any(p, disallowed, msg, dirs_seen): 162 | try: 163 | stat_res = os.stat(p) 164 | except Exception as e: 165 | eprint('%s: %s' % (p, e)) 166 | return False 167 | 168 | mode = stat_res.st_mode 169 | if S_ISDIR(mode): 170 | # avoid analyzing the same dir twice 171 | inode = stat_res.st_ino 172 | if inode: 173 | if inode in dirs_seen: 174 | return False 175 | dirs_seen.add(inode) 176 | 177 | return analyze_dir(p, disallowed, msg, dirs_seen) 178 | elif S_ISREG(mode): 179 | return analyze_file(p, disallowed, msg) 180 | else: 181 | eprint('%s: UNREADABLE' % p) 182 | 183 | return False 184 | 185 | # Recursively analyze files in the directory. 186 | def analyze_dir(d, disallowed, msg, dirs_seen): 187 | warned = False 188 | 189 | for f in os.listdir(d): 190 | w = analyze_any(os.path.join(d, f), disallowed, msg, dirs_seen) 191 | warned = warned or w 192 | 193 | return warned 194 | 195 | def analyze_paths(paths, disallowed, msg, dirs_seen): 196 | warned = False 197 | 198 | for p in paths: 199 | w = analyze_any(p, disallowed, msg, dirs_seen) 200 | warned = warned or w 201 | 202 | return warned 203 | 204 | # All control characters. We omit the ascii control characters. 
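# In other words, anything in a Unicode 'C' (Other) category except 'Cc', i.e.
# 'Cf' (format), 'Cs' (surrogate), 'Co' (private use) and 'Cn' (unassigned).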
205 | def nonprint_unicode(c): 206 | cat = unicodedata.category(c) 207 | if cat.startswith('C') and cat != 'Cc': 208 | return True 209 | return False 210 | 211 | if __name__ == '__main__': 212 | parser = argparse.ArgumentParser(description="Look for Unicode control characters") 213 | parser.add_argument('path', metavar='path', nargs='+', 214 | help='Sources to analyze') 215 | parser.add_argument('-p', '--nonprint', required=False, 216 | type=str, choices=['all', 'bidi'], 217 | help='Look for either all non-printable unicode characters or bidirectional control characters.') 218 | parser.add_argument('-v', '--verbose', required=False, action='store_true', 219 | help='Verbose mode.') 220 | parser.add_argument('-d', '--detailed', required=False, action='store_true', 221 | help='Print line numbers where characters occur.') 222 | parser.add_argument('-t', '--notests', required=False, 223 | action='store_true', help='Exclude tests (basically test.* as a component of path).') 224 | parser.add_argument('-c', '--config', required=False, type=str, 225 | help='Configuration file to read settings from.') 226 | 227 | args = parser.parse_args() 228 | verbose_mode = args.verbose 229 | detailed_mode = args.detailed 230 | 231 | if not args.nonprint: 232 | # Formatting control characters in the unicode space. This includes the 233 | # bidi control characters. 234 | disallowed = set(_chr(c) for c in range(sys.maxunicode) if \ 235 | unicodedata.category(_chr(c)) == 'Cf') 236 | 237 | msg = 'unicode control characters' 238 | elif args.nonprint == 'all': 239 | # All control characters. 240 | disallowed = set(_chr(c) for c in range(sys.maxunicode) if \ 241 | nonprint_unicode(_chr(c))) 242 | 243 | msg = 'disallowed characters' 244 | else: 245 | # Only bidi control characters. 
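# U+202A..U+202E are LRE, RLE, PDF, LRO and RLO; U+2066..U+2069 are
# LRI, RLI, FSI and PDI.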
246 | disallowed = set([ 247 | _chr(0x202a), _chr(0x202b), _chr(0x202c), _chr(0x202d), _chr(0x202e), 248 | _chr(0x2066), _chr(0x2067), _chr(0x2068), _chr(0x2069)]) 249 | msg = 'bidirectional control characters' 250 | 251 | if args.config: 252 | if sys.version_info >= (3, 4): 253 | # load settings file, method for Python >= 3.4 254 | import importlib.util 255 | spec = importlib.util.spec_from_file_location("settings", args.config) 256 | settings = importlib.util.module_from_spec(spec) 257 | spec.loader.exec_module(settings) 258 | else: 259 | # load settings file, method for Python < 3.4 260 | settings = imp.load_source("settings", args.config) 261 | 262 | if hasattr(settings, 'scan_exclude'): 263 | scan_exclude = scan_exclude + settings.scan_exclude 264 | if hasattr(settings, 'scan_exclude_mime'): 265 | scan_exclude_mime = scan_exclude_mime + settings.scan_exclude_mime 266 | 267 | if args.notests: 268 | scan_exclude = scan_exclude + [r'/test[^/]+/'] 269 | 270 | dirs_seen = set() 271 | 272 | warned = analyze_paths(args.path, disallowed, msg, dirs_seen) 273 | 274 | if warned: 275 | sys.exit(1) 276 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | # Project configuration 2 | [project] 3 | name = "find-unicode-control" 4 | version = "0.1" 5 | description = "Script to find Unicode control characters" 6 | requires-python = ">=2.7" 7 | keywords = [ 8 | "unicode", 9 | "script", 10 | ] 11 | authors = [ 12 | {name="Siddhesh Poyarekar"}, 13 | ] 14 | readme = "README.md" 15 | 16 | [project.urls] 17 | "Homepage" = "https://github.com/siddhesh/find-unicode-control" 18 | -------------------------------------------------------------------------------- /tests/examples/empty: --------------------------------------------------------------------------------
-------------------------------------------------------------------------------- /tests/examples/symlink: -------------------------------------------------------------------------------- 1 | . -------------------------------------------------------------------------------- /tests/test_simple.py: -------------------------------------------------------------------------------- 1 | from subprocess import check_output 2 | 3 | def test_simple(): 4 | check_output('python find_unicode_control.py *', shell=True) 5 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | skipsdist=True 3 | envlist = py27,py3 4 | 5 | [testenv] 6 | deps = pytest 7 | commands = pytest {posargs} 8 | --------------------------------------------------------------------------------