├── .gitignore ├── LICENSE ├── README.md ├── exec-git-find-unicode-control.sh ├── find_unicode_control.py ├── pyproject.toml ├── tests │ ├── examples │ │ ├── empty │ │ └── symlink │ └── test_simple.py └── tox.ini /.gitignore: -------------------------------------------------------------------------------- 1 | .tox 2 | __pycache__ 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2021, Red Hat 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | Neither the name of HPCProject, Serge Guelton nor the names of its 15 | contributors may be used to endorse or promote products derived from this 16 | software without specific prior written permission. 17 | 18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 19 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 20 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 21 | DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 22 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 23 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 24 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 25 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 26 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 27 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 28 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # find-unicode-control 2 | 3 | These scripts look for non-printable unicode characters in all text files in a 4 | source tree. `find_unicode_control.py` should work with python2 as well as 5 | python3. It uses `python-magic` if available to determine file type, or simply 6 | spawns the `file --mime-type` command. The two methods should be functionally 7 | equivalent, so the `python-magic` dependency could eventually be dropped. 8 | 9 | ``` 10 | usage: find_unicode_control.py [-h] [-p {all,bidi}] [-v] [-d] [-t] [-c CONFIG] path [path ...] 11 | 12 | Look for Unicode control characters 13 | 14 | positional arguments: 15 | path Sources to analyze 16 | 17 | optional arguments: 18 | -h, --help show this help message and exit 19 | -p {all,bidi}, --nonprint {all,bidi} 20 | Look for either all non-printable unicode characters or bidirectional control characters. 21 | -v, --verbose Verbose mode. 22 | -d, --detailed Print line numbers where characters occur. 23 | -t, --notests Exclude tests (basically test.* as a component of path). 24 | -c CONFIG, --config CONFIG 25 | Configuration file to read settings from.
26 | ``` 27 | 28 | If unicode BIDI control characters or non-printable characters are found in a 29 | file, it will print output as follows: 30 | 31 | ``` 32 | $ python3 find_unicode_control.py -p bidi *.c 33 | commenting-out.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'} 34 | early-return.c: bidirectional control characters: {'\u2067'} 35 | stretched-string.c: bidirectional control characters: {'\u202e', '\u2066', '\u2069'} 36 | ``` 37 | 38 | Using the `-d` flag, the output is more detailed, showing line numbers in 39 | files, but this mode is also slower: 40 | 41 | ``` 42 | $ python3 find_unicode_control.py -p bidi -d . 43 | ./commenting-out.c:4 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066'] 44 | ./commenting-out.c:6 bidirectional control characters: ['\u202e', '\u2066'] 45 | ./early-return.c:4 bidirectional control characters: ['\u2067'] 46 | ./stretched-string.c:6 bidirectional control characters: ['\u202e', '\u2066', '\u2069', '\u2066'] 47 | ``` 48 | 49 | The optimal workflow is to do a quick scan through a source tree and, if 50 | any issues are found, do a detailed scan on only those files. 51 | 52 | ## Configuration file 53 | 54 | If files need to be excluded from the scan, make a configuration file and 55 | set a `scan_exclude` variable to a list of regular expressions matching 56 | the files or paths to exclude. Alternatively, add a `scan_exclude_mime` list 57 | of mime types to ignore; these entries may also be regular expressions. 58 | Here is an example configuration that glibc uses: 59 | 60 | ``` 61 | scan_exclude = [ 62 | # Iconv test data 63 | r'/iconvdata/testdata/', 64 | # Test case data 65 | r'libio/tst-widetext.input$', 66 | # Test script. This is to silence the warning: 67 | # 'utf-8' codec can't decode byte 0xe9 in position 2118: invalid continuation byte 68 | # since the script tests mixed encoding characters.
69 | r'localedata/tst-langinfo.sh$'] 70 | ``` 71 | 72 | ## Notes 73 | 74 | This script was quickly hacked together to scan repositories with mostly LTR 75 | unicode content. If you have RTL content (in comments, literals or even 76 | identifiers in code), it will give false warnings that you need to weed out. 77 | For now, you need to exclude such RTL code using `scan_exclude`, but a 78 | long-term wish-list item (if this remains relevant; hopefully more sophisticated 79 | RTL diagnostics will make it obsolete!) is to handle RTL a bit more intelligently. 80 | -------------------------------------------------------------------------------- /exec-git-find-unicode-control.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | 4 | ### 5 | # 6 | # Batch script to scan all files in branches and tags of 7 | # a git repository. 8 | # 9 | # Both this script and the find_unicode_control.py script 10 | # should be accessible from the PATH.
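# (For example, assuming a hypothetical checkout at ~/src/find-unicode-control:
#   export PATH="$HOME/src/find-unicode-control:$PATH")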
11 | # 12 | # Change directory into the local git directory to be scanned 13 | # then execute this script with either: 14 | # -a: scan all branches and tags 15 | # -b: scan all branches 16 | # -t: scan all tags 17 | # -p PATTERN: scan all branches/tags that match PATTERN 18 | # 19 | ### 20 | 21 | get-branches() { 22 | # Ignore dependabot 23 | # Ignore pull requests 24 | local branches=$(git branch --remotes --format='%(refname:short)' | \ 25 | grep -v dependabot | \ 26 | grep -v 'pr/' | \ 27 | grep -v 'pull/') 28 | if [ -z "${branches}" ]; then 29 | echo >&2 "ERROR: cannot find any branches" 30 | exit 1 31 | fi 32 | 33 | echo ${branches} 34 | } 35 | 36 | get-tags() { 37 | remotes=$(git remote | awk '{print $1}') 38 | local tags="" 39 | for remote in ${remotes} 40 | do 41 | # Ignore dependabot 42 | # Ignore pull requests 43 | local rtags=$(git ls-remote --tags "${remote}" | \ 44 | awk '{print $2}' | \ 45 | grep -v "{}" | \ 46 | grep -v 'pr/' | \ 47 | grep -v 'pull/' | \ 48 | sed 's/refs\///') 49 | tags="${tags} ${rtags}" 50 | done 51 | 52 | if [ -z "${tags}" ]; then 53 | echo >&2 "ERROR: cannot find any tags" 54 | exit 1 55 | fi 56 | 57 | echo ${tags} 58 | } 59 | 60 | get-branches-and-tags() { 61 | local branches=$(get-branches) 62 | local tags=$(get-tags) 63 | 64 | echo "${branches} ${tags}" 65 | } 66 | 67 | get-pattern() { 68 | local pattern="${1}" 69 | if [ -z "${pattern}" ]; then 70 | echo >&2 "ERROR: pattern not specified" 71 | exit 1 72 | fi 73 | 74 | local commits=$(git for-each-ref --format='%(refname)' | \ 75 | grep -v dependabot | \ 76 | grep -v 'pr/' | \ 77 | grep -v 'pull/' | \ 78 | grep "${pattern}" | \ 79 | sed -E -e 's/refs\///' | sed -E -e 's/remotes\///') 80 | if [ -z "${commits}" ]; then 81 | echo >&2 "ERROR: cannot find any branches or tags" 82 | exit 1 83 | fi 84 | 85 | echo ${commits} 86 | } 87 | 88 | 89 | # 90 | # Check find_unicode_control is available 91 | # 92 | if !
command -v find_unicode_control.py &> /dev/null; then 93 | echo "ERROR: Please install the find_unicode_control.py script so it is accessible from any directory" 94 | exit 1 95 | fi 96 | 97 | # 98 | # Fetch all tags from all remote repositories defined locally. 99 | # Note: if tags are duplicated across repositories then this will 100 | # overwrite them. In this event you should checkout the overwritten 101 | # tag manually and scan it. 102 | # 103 | git fetch --tags --all -f 104 | if [ $? != 0 ]; then 105 | echo "Warning: Problems occurred while fetching from all repositories" 106 | fi 107 | 108 | # 109 | # Check which branches and tags are required to be scanned 110 | # 111 | while getopts ":atbp:" OPT; do 112 | case $OPT in 113 | a) 114 | ends=$(get-branches-and-tags) 115 | ;; 116 | b) 117 | ends=$(get-branches) 118 | ;; 119 | t) 120 | ends=$(get-tags) 121 | ;; 122 | p) 123 | ends=$(get-pattern "${OPTARG}") 124 | ;; 125 | \?) 126 | echo "Invalid option: -$OPTARG" >&2 127 | ;; 128 | esac 129 | done 130 | shift $((OPTIND-1)) 131 | 132 | if [ -z "${ends}" ]; then 133 | echo "ERROR: No branches or tags specified" 134 | exit 1 135 | fi 136 | 137 | # 138 | # Loop through the commits listed, check them out and scan the contents 139 | # 140 | for commit in ${ends} 141 | do 142 | echo "Checking: $commit" 143 | 144 | status=$(git submodule deinit -f --all &> /dev/null && git checkout -f "${commit}" &> /dev/null && git submodule init &> /dev/null) 145 | if [ $? != 0 ]; then 146 | echo "ERROR occurred ... checking out ${commit}" 147 | exit 1 148 | fi 149 | 150 | find_unicode_control.py -p bidi .
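    # Note: the scan's exit status is not checked, so the loop moves on to
    # the next branch/tag even when findings are reported. A hypothetical
    # tweak to stop at the first branch/tag with findings would be:
    #   find_unicode_control.py -p bidi . || exit 1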
151 | 152 | done 153 | -------------------------------------------------------------------------------- /find_unicode_control.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # SPDX-License-Identifier: BSD-3-Clause 3 | """Find unicode control characters in source files 4 | 5 | By default the script takes one or more files or directories and looks for 6 | unicode control characters in all text files. To narrow down the files, provide 7 | a config file with the -c command-line option, defining a scan_exclude list, which 8 | should be a list of regular expressions matching paths to exclude from the scan. 9 | 10 | There is a second mode enabled with -p which, when set to 'all', prints all 11 | control characters and, when set to 'bidi', prints only the 9 bidirectional 12 | control characters. 13 | """ 14 | from __future__ import print_function 15 | 16 | import sys, os, argparse, re, unicodedata, subprocess 17 | import importlib 18 | from stat import * 19 | 20 | try: 21 | import magic 22 | except ImportError: 23 | magic = None 24 | 25 | def _unicode(line, encoding): 26 | if isinstance(line, str): 27 | return line 28 | return line.decode(encoding) 29 | 30 | import platform 31 | if platform.python_version()[0] == '2': 32 | _chr = unichr 33 | do_unicode = unicode 34 | else: 35 | _chr = chr 36 | do_unicode = _unicode 37 | 38 | if sys.version_info < (3, 4): 39 | import imp 40 | 41 | # .git and .hg will get excluded when they're part of a directory tree being 42 | # scanned but not when they're explicitly specified as targets. That is: 43 | # 44 | # find_unicode_control.py foo 45 | # 46 | # will skip foo/.git but 47 | # 48 | # cd foo && find_unicode_control.py .git 49 | # 50 | # will read into .git. I've left it like this on purpose in case someone ever 51 | # needs to scan inside these directories even though I don't see any reason to 52 | # do that at the moment.
Also, 53 | # 54 | # find_unicode_control.py foo/.git 55 | # 56 | # will skip .git and I don't really care much about fixing it myself since you 57 | # could just pushd/popd your way out of that. Send a PR if you wish to see that 58 | # work :) 59 | scan_exclude = [r'/\.git/', r'/\.hg/', r'\.desktop$', r'ChangeLog$', r'NEWS$', 60 | r'\.ppd$', r'\.txt$', r'\.directory$'] 61 | scan_exclude_mime = [r'text/x-po$', r'text/x-tex$', r'text/x-troff$', 62 | r'text/html$'] 63 | verbose_mode = False 64 | 65 | # Print to stderr in verbose mode. 66 | def eprint(*args, **kwargs): 67 | if verbose_mode: 68 | print(*args, file=sys.stderr, **kwargs) 69 | 70 | # Decode a single line as utf-8. 71 | def decodeline(inf): 72 | return do_unicode(inf, 'utf-8') 73 | 74 | # Make a text string from a file, attempting to decode from latin1 if necessary. 75 | # Other non-utf-8 locales are not supported at the moment. 76 | def getfiletext(filename): 77 | text = None 78 | with open(filename) as infile: 79 | try: 80 | if detailed_mode: 81 | return [decodeline(inf) for inf in infile] 82 | except Exception as e: 83 | eprint('%s: %s' % (filename, e)) 84 | return None 85 | 86 | try: 87 | text = decodeline(''.join(infile)) 88 | except UnicodeDecodeError: 89 | eprint('%s: Retrying with latin1' % filename) 90 | try: 91 | text = open(filename, encoding='latin1').read()  # reopen; the failed read above consumed the stream 92 | except Exception as e: 93 | eprint('%s: %s' % (filename, e)) 94 | if text: 95 | return set(text) 96 | else: 97 | return None 98 | 99 | def analyze_text_detailed(filename, text, disallowed, msg): 100 | line = 0 101 | warned = False 102 | for t in text: 103 | line = line + 1 104 | subset = [c for c in t if _chr(ord(c)) in disallowed] 105 | if subset: 106 | print('%s:%d %s: %s' % (filename, line, msg, subset)) 107 | warned = True 108 | if not warned: 109 | eprint('%s: OK' % filename) 110 | 111 | return warned 112 | 113 | # Look for disallowed characters in the text. We reduce all characters into a 114 | # set to speed up analysis.
The detailed (-d) mode walks line by line to report where 115 | # the disallowed chars occur. 116 | def analyze_text(filename, text, disallowed, msg): 117 | if detailed_mode: 118 | return analyze_text_detailed(filename, text, disallowed, msg) 119 | 120 | if not text.isdisjoint(disallowed): 121 | print('%s: %s: %s' % (filename, msg, text & disallowed)) 122 | return True 123 | else: 124 | eprint('%s: OK' % filename) 125 | 126 | return False 127 | 128 | def get_mime(f): 129 | if magic: 130 | return magic.detect_from_filename(f).mime_type 131 | args = ['file', '--mime-type', f] 132 | proc = subprocess.Popen(args, stdout=subprocess.PIPE) 133 | m = [decodeline(x[:-1]) for x in proc.stdout][0].rsplit(':', 1)[1].strip() 134 | return m 135 | 136 | def should_read(f): 137 | # Fast check, just the file name. 138 | if [e for e in scan_exclude if re.search(e, f)]: 139 | return False 140 | 141 | # Slower check, mime type. 142 | m = get_mime(f) 143 | if 'text/' not in m \ 144 | or [e for e in scan_exclude_mime if re.search(e, m)]: 145 | return False 146 | return True 147 | 148 | # Get file text and feed into analyze_text. 149 | def analyze_file(f, disallowed, msg): 150 | eprint('%s: Reading file' % f) 151 | if should_read(f): 152 | text = getfiletext(f) 153 | if text: 154 | return analyze_text(f, text, disallowed, msg) 155 | else: 156 | eprint('%s: SKIPPED' % f) 157 | 158 | return False 159 | 160 | # Actual implementation of the recursive descent into directories.
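# Directories are tracked by inode in dirs_seen, so a directory reached twice
# (e.g. through a symlink such as tests/examples/symlink, which points back to
# '.') is only analyzed once; os.stat() follows the link to the real inode.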
161 | def analyze_any(p, disallowed, msg, dirs_seen): 162 | try: 163 | stat_res = os.stat(p) 164 | except Exception as e: 165 | eprint('%s: %s' % (p, e)) 166 | return False 167 | 168 | mode = stat_res.st_mode 169 | if S_ISDIR(mode): 170 | # avoid analyzing the same dir twice 171 | inode = stat_res.st_ino 172 | if inode: 173 | if inode in dirs_seen: 174 | return False 175 | dirs_seen.add(inode) 176 | 177 | return analyze_dir(p, disallowed, msg, dirs_seen) 178 | elif S_ISREG(mode): 179 | return analyze_file(p, disallowed, msg) 180 | else: 181 | eprint('%s: UNREADABLE' % p) 182 | 183 | return False 184 | 185 | # Recursively analyze files in the directory. 186 | def analyze_dir(d, disallowed, msg, dirs_seen): 187 | warned = False 188 | 189 | for f in os.listdir(d): 190 | w = analyze_any(os.path.join(d, f), disallowed, msg, dirs_seen) 191 | warned = warned or w 192 | 193 | return warned 194 | 195 | def analyze_paths(paths, disallowed, msg, dirs_seen): 196 | warned = False 197 | 198 | for p in paths: 199 | w = analyze_any(p, disallowed, msg, dirs_seen) 200 | warned = warned or w 201 | 202 | return warned 203 | 204 | # All control characters. We omit the ascii control characters. 
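# In other words, anything in a Unicode 'C' (Other) category except 'Cc', i.e.
# 'Cf' (format), 'Cs' (surrogate), 'Co' (private use) and 'Cn' (unassigned).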
205 | def nonprint_unicode(c): 206 | cat = unicodedata.category(c) 207 | if cat.startswith('C') and cat != 'Cc': 208 | return True 209 | return False 210 | 211 | if __name__ == '__main__': 212 | parser = argparse.ArgumentParser(description="Look for Unicode control characters") 213 | parser.add_argument('path', metavar='path', nargs='+', 214 | help='Sources to analyze') 215 | parser.add_argument('-p', '--nonprint', required=False, 216 | type=str, choices=['all', 'bidi'], 217 | help='Look for either all non-printable unicode characters or bidirectional control characters.') 218 | parser.add_argument('-v', '--verbose', required=False, action='store_true', 219 | help='Verbose mode.') 220 | parser.add_argument('-d', '--detailed', required=False, action='store_true', 221 | help='Print line numbers where characters occur.') 222 | parser.add_argument('-t', '--notests', required=False, 223 | action='store_true', help='Exclude tests (basically test.* as a component of path).') 224 | parser.add_argument('-c', '--config', required=False, type=str, 225 | help='Configuration file to read settings from.') 226 | 227 | args = parser.parse_args() 228 | verbose_mode = args.verbose 229 | detailed_mode = args.detailed 230 | 231 | if not args.nonprint: 232 | # Formatting control characters in the unicode space. This includes the 233 | # bidi control characters. 234 | disallowed = set(_chr(c) for c in range(sys.maxunicode) if \ 235 | unicodedata.category(_chr(c)) == 'Cf') 236 | 237 | msg = 'unicode control characters' 238 | elif args.nonprint == 'all': 239 | # All control characters. 240 | disallowed = set(_chr(c) for c in range(sys.maxunicode) if \ 241 | nonprint_unicode(_chr(c))) 242 | 243 | msg = 'disallowed characters' 244 | else: 245 | # Only bidi control characters. 
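# U+202A..U+202E are LRE, RLE, PDF, LRO and RLO; U+2066..U+2069 are
# LRI, RLI, FSI and PDI.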
246 | disallowed = set([ 247 | _chr(0x202a), _chr(0x202b), _chr(0x202c), _chr(0x202d), _chr(0x202e), 248 | _chr(0x2066), _chr(0x2067), _chr(0x2068), _chr(0x2069)]) 249 | msg = 'bidirectional control characters' 250 | 251 | if args.config: 252 | if sys.version_info >= (3, 4): 253 | # load settings file, method for Python >= 3.4 254 | import importlib.util 255 | spec = importlib.util.spec_from_file_location("settings", args.config) 256 | settings = importlib.util.module_from_spec(spec) 257 | spec.loader.exec_module(settings) 258 | else: 259 | # load settings file, method for Python < 3.4 260 | settings = imp.load_source("settings", args.config) 261 | 262 | if hasattr(settings, 'scan_exclude'): 263 | scan_exclude = scan_exclude + settings.scan_exclude 264 | if hasattr(settings, 'scan_exclude_mime'): 265 | scan_exclude_mime = scan_exclude_mime + settings.scan_exclude_mime 266 | 267 | if args.notests: 268 | scan_exclude = scan_exclude + [r'/test[^/]+/'] 269 | 270 | dirs_seen = set() 271 | 272 | warned = analyze_paths(args.path, disallowed, msg, dirs_seen) 273 | 274 | if warned: 275 | sys.exit(1) 276 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | # Project configuration 2 | [project] 3 | name = "find-unicode-control" 4 | version = "0.1" 5 | description = "Script to find Unicode control characters" 6 | requires-python = ">=2.7" 7 | keywords = [ 8 | "unicode", 9 | "script", 10 | ] 11 | authors = [ 12 | {name="Siddhesh Poyarekar"}, 13 | ] 14 | readme = "README.md" 15 | 16 | [project.urls] 17 | "Homepage" = "https://github.com/siddhesh/find-unicode-control" 18 | -------------------------------------------------------------------------------- /tests/examples/empty: --------------------------------------------------------------------------------
-------------------------------------------------------------------------------- /tests/examples/symlink: -------------------------------------------------------------------------------- 1 | . -------------------------------------------------------------------------------- /tests/test_simple.py: -------------------------------------------------------------------------------- 1 | from subprocess import check_output 2 | 3 | def test_simple(): 4 | check_output('python find_unicode_control.py *', shell=True) 5 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | skipsdist=True 3 | envlist = py27,py3 4 | 5 | [testenv] 6 | deps = pytest 7 | commands = pytest {posargs} 8 | --------------------------------------------------------------------------------