├── .gitignore
├── _includes
    ├── analytics.html
    ├── footer.html
    ├── top.html
    └── header.html
├── img
    ├── icann.png
    ├── uchar-A.png
    ├── spoof-google.png
    ├── spoof-slash.png
    ├── uchar-017F.png
    ├── uchar-180E.png
    ├── uchar-feff.png
    ├── dot-dot-slash.png
    ├── not-two-bytes.png
    ├── spoof-mozilla.png
    ├── spoof-slash-FF0F.png
    ├── content-type-charset.png
    ├── normalization-turkish-i.png
    ├── spoof-win-explorer-file.png
    ├── spoof-win-explorer-folder.png
    └── normalization-nfkc-nfkd-003C.png
├── _config.yml
├── README.md
├── _layouts
    └── default.html
├── js
    └── scale.fix.js
├── params.json
├── index.md
├── css
    ├── pygment_trac.css
    └── styles.css
├── page2.md
├── page1.md
└── page3.md


/.gitignore:
--------------------------------------------------------------------------------
1 | _site/
2 | 


--------------------------------------------------------------------------------
/_includes/analytics.html:
--------------------------------------------------------------------------------
1 | 


--------------------------------------------------------------------------------
/img/icann.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/icann.png


--------------------------------------------------------------------------------
/img/uchar-A.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-A.png


--------------------------------------------------------------------------------
/img/spoof-google.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-google.png


--------------------------------------------------------------------------------
/img/spoof-slash.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-slash.png


--------------------------------------------------------------------------------
/img/uchar-017F.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-017F.png


--------------------------------------------------------------------------------
/img/uchar-180E.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-180E.png


--------------------------------------------------------------------------------
/img/uchar-feff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-feff.png


--------------------------------------------------------------------------------
/img/dot-dot-slash.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/dot-dot-slash.png


--------------------------------------------------------------------------------
/img/not-two-bytes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/not-two-bytes.png


--------------------------------------------------------------------------------
/img/spoof-mozilla.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-mozilla.png


--------------------------------------------------------------------------------
/img/spoof-slash-FF0F.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-slash-FF0F.png


--------------------------------------------------------------------------------
/img/content-type-charset.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/content-type-charset.png


--------------------------------------------------------------------------------
/img/normalization-turkish-i.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/normalization-turkish-i.png


--------------------------------------------------------------------------------
/img/spoof-win-explorer-file.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-win-explorer-file.png


--------------------------------------------------------------------------------
/img/spoof-win-explorer-folder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-win-explorer-folder.png


--------------------------------------------------------------------------------
/img/normalization-nfkc-nfkd-003C.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/normalization-nfkc-nfkd-003C.png


--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | url: http://cweb.github.io/unicode-security-guide
2 | baseurl: /unicode-security-guide
3 | author: "Chris Weber"
4 | markdown: kramdown
5 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | unicode-security-guide
2 | ======================
3 | 
4 | A repository for the [*Unicode Security Guide*](http://cweb.github.io/unicode-security-guide/).
5 | 
6 | Push all commits to the _gh-pages_ branch.
7 | 


--------------------------------------------------------------------------------
/_includes/footer.html:
--------------------------------------------------------------------------------
1 |   <footer>
2 |   <p class="zero"><strong>Written and maintained by:</strong> Chris Weber
3 |   </p>
4 |   <p class="zero">Released under the 
5 |   <a href="http://creativecommons.org/licenses/by/3.0"
6 |       rel="nofollow">CC-3.0-BY</a> license</p>
7 |   </footer>
8 | 


--------------------------------------------------------------------------------
/_layouts/default.html:
--------------------------------------------------------------------------------
 1 | {% include top.html %}
 2 | 
 3 | <body>
 4 |   <div class="wrapper">
 5 |     {% include header.html %}
 6 |     
 7 |     <section>
 8 |     {{ content }}
 9 |     </section>
10 | 
11 |     {% include footer.html %}
12 |     {% include analytics.html %}
13 |   </div>
14 |   <script src="{{ site.url }}/js/scale.fix.js"></script>
15 | </body>
16 | </html>
17 | 


--------------------------------------------------------------------------------
/_includes/top.html:
--------------------------------------------------------------------------------
 1 | <!DOCTYPE HTML>
 2 | <html lang="en-US">
 3 | <head>
 4 |   <meta charset="UTF-8">
 5 |   <title>{{ page.title }}</title>
 6 |   <link rel="stylesheet" href="{{ site.url }}/css/styles.css">
 7 |   <link rel="stylesheet" href="{{ site.url }}/css/pygment_trac.css">
 8 |   <meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
 9 | </head>
10 | 
11 | 


--------------------------------------------------------------------------------
/js/scale.fix.js:
--------------------------------------------------------------------------------
 1 | var metas = document.getElementsByTagName('meta');
 2 | var i;
 3 | if (navigator.userAgent.match(/iPhone/i)) {
 4 |   for (i=0; i<metas.length; i++) {
 5 |     if (metas[i].name == "viewport") {
 6 |       metas[i].content = "width=device-width, minimum-scale=1.0, maximum-scale=1.0";
 7 |     }
 8 |   }
 9 |   document.addEventListener("gesturestart", gestureStart, false);
10 | }
11 | function gestureStart() {
12 |   for (i=0; i<metas.length; i++) {
13 |     if (metas[i].name == "viewport") {
14 |       metas[i].content = "width=device-width, minimum-scale=0.25, maximum-scale=1.6";
15 |     }
16 |   }
17 | }


--------------------------------------------------------------------------------
/_includes/header.html:
--------------------------------------------------------------------------------
 1 |   <header>
 2 |     <h3>Unicode Security Guide</h3>
 3 |     <p class="zero view">1) <a href="{{ site.url }}">Home Page</a></p>
 4 |       <p class="zero">2) <a href="{{ site.url }}/background">Background</a></p>
 5 |       <p class="zero">3) <a href="{{ site.url }}/visual-spoofing">Visual Spoofing</a></p>
 6 |       <p class="zero">4) <a href="{{ site.url }}/character-transformations">Character
 7 |         Transformation</a></p>
 8 |       <p></p>
 9 |       <p class="view">
10 |       <a href="https://github.com/cweb/unicode-security-guide">View the
11 |         Project on GitHub <small>cweb/unicode-security-guide</small></a></p>
12 |     <ul>
13 |       <li><a href="https://github.com/cweb/unicode-security-guide/zipball/master">Download <strong>ZIP File</strong></a></li>
14 |       <li><a href="https://github.com/cweb/unicode-security-guide/tarball/master">Download <strong>TAR Ball</strong></a></li>
15 |       <li><a href="https://github.com/cweb/unicode-security-guide">View On <strong>GitHub</strong></a></li>
16 |     </ul>
17 |   </header>
18 | 


--------------------------------------------------------------------------------
/params.json:
--------------------------------------------------------------------------------
1 | {"name":"Unicode-security-guide","tagline":"Unicode Security Guide","body":"### Welcome to GitHub Pages.\r\nThis automatic page generator is the easiest way to create beautiful pages for all of your projects. Author your page content here using GitHub Flavored Markdown, select a template crafted by a designer, and publish. After your page is generated, you can check out the new branch:\r\n\r\n```\r\n$ cd your_repo_root/repo_name\r\n$ git fetch origin\r\n$ git checkout gh-pages\r\n```\r\n\r\nIf you're using the GitHub for Mac, simply sync your repository and you'll see the new branch.\r\n\r\n### Designer Templates\r\nWe've crafted some handsome templates for you to use. Go ahead and continue to layouts to browse through them. You can easily go back to edit your page before publishing. After publishing your page, you can revisit the page generator and switch to another theme. Your Page content will be preserved if it remained markdown format.\r\n\r\n### Rather Drive Stick?\r\nIf you prefer to not use the automatic generator, push a branch named `gh-pages` to your repository to create a page manually. In addition to supporting regular HTML content, GitHub Pages support Jekyll, a simple, blog aware static site generator written by our own Tom Preston-Werner. Jekyll makes it easy to create site-wide headers and footers without having to copy them across every page. It also offers intelligent blog support and other advanced templating features.\r\n\r\n### Authors and Contributors\r\nYou can @mention a GitHub username to generate a link to their profile. The resulting `<a>` element will link to the contributor's GitHub Profile. For example: In 2007, Chris Wanstrath (@defunkt), PJ Hyett (@pjhyett), and Tom Preston-Werner (@mojombo) founded GitHub.\r\n\r\n### Support or Contact\r\nHaving trouble with Pages? 
Check out the documentation at http://help.github.com/pages or contact support@github.com and we’ll help you sort it out.\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."}


--------------------------------------------------------------------------------
/index.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | layout: default
 3 | title: Unicode Security Guide
 4 | ---
 5 | 
 6 | # Unicode Security Guide
 7 | 
 8 | Welcome to the _Unicode Security Guide_! This guide has been designed to give Web application developers, software engineers, and application security researchers a reference for understanding Unicode-related security issues in operating systems, applications, and the Web.
 9 | 
10 | The dynamics of Unicode, and character encodings in general, are often misunderstood or poorly implemented, and lead to an array of interesting if not catastrophic security vulnerabilities.
11 | 
12 | The content here has been sourced through testing, research, and the following two technical reports from the Unicode Consortium:
13 | 
14 | * [Technical Report #36 : Unicode Security Considerations](https://www.unicode.org/reports/tr36/)
15 | * [Technical Report #39 : Unicode Security Mechanisms](https://www.unicode.org/reports/tr39/)
16 | 
17 | Beyond these two sources, further research has been ongoing around identifying and inventorying software behaviors.  Test cases are being provided in the <a href="https://github.com/cweb/unicode-security-guide">source code repository</a>.
18 | 
19 | ## Contributions and Acknowledgements
20 | Thank you to the following security-minded practitioners for their valuable feedback on this document:
21 | 
22 | * [Bil Corry](https://twitter.com/bilcorry)
23 | * Abraham Kang
24 | 
25 | And the following for their research and documentation into the issues:
26 | 
27 | * <a href="https://www.unicode.org">Unicode Consortium</a>
28 | * <a href="http://www.macchiato.com/">Mark Davis</a>
29 | * [Andy Heninger](https://plus.google.com/117524124943387916918)
30 | * [Richard Ishida](http://rishida.net/)
31 | * [Michael Kaplan](https://twitter.com/michkap)
32 | * [Shawn Steele](http://blogs.msdn.com/b/shawnste/)
33 | * <a href="https://twitter.com/hasegawayosuke">Yosuke HASEGAWA</a>
34 | * <a href="http://eaea.sirdarckcat.net/home.html">Eduardo Vela</a>
35 | * <a href="https://twitter.com/thornmaker">David Lindsay</a>
36 | * [Gareth Heyes](http://www.thespanner.co.uk/)
37 | 
38 | ## Disclaimers
 39 | This guide has been written by application security professionals, and has not been endorsed or reviewed by the Unicode Consortium.  It does draw on material from the Consortium, with references, where applicable.
40 | 
41 | 
42 | 


--------------------------------------------------------------------------------
/css/pygment_trac.css:
--------------------------------------------------------------------------------
 1 | .highlight  { background: #ffffff; }
 2 | .highlight .c { color: #999988; font-style: italic } /* Comment */
 3 | .highlight .err { color: #a61717; background-color: #e3d2d2 } /* Error */
 4 | .highlight .k { font-weight: bold } /* Keyword */
 5 | .highlight .o { font-weight: bold } /* Operator */
 6 | .highlight .cm { color: #999988; font-style: italic } /* Comment.Multiline */
 7 | .highlight .cp { color: #999999; font-weight: bold } /* Comment.Preproc */
 8 | .highlight .c1 { color: #999988; font-style: italic } /* Comment.Single */
 9 | .highlight .cs { color: #999999; font-weight: bold; font-style: italic } /* Comment.Special */
10 | .highlight .gd { color: #000000; background-color: #ffdddd } /* Generic.Deleted */
11 | .highlight .gd .x { color: #000000; background-color: #ffaaaa } /* Generic.Deleted.Specific */
12 | .highlight .ge { font-style: italic } /* Generic.Emph */
13 | .highlight .gr { color: #aa0000 } /* Generic.Error */
14 | .highlight .gh { color: #999999 } /* Generic.Heading */
15 | .highlight .gi { color: #000000; background-color: #ddffdd } /* Generic.Inserted */
16 | .highlight .gi .x { color: #000000; background-color: #aaffaa } /* Generic.Inserted.Specific */
17 | .highlight .go { color: #888888 } /* Generic.Output */
18 | .highlight .gp { color: #555555 } /* Generic.Prompt */
19 | .highlight .gs { font-weight: bold } /* Generic.Strong */
20 | .highlight .gu { color: #800080; font-weight: bold; } /* Generic.Subheading */
21 | .highlight .gt { color: #aa0000 } /* Generic.Traceback */
22 | .highlight .kc { font-weight: bold } /* Keyword.Constant */
23 | .highlight .kd { font-weight: bold } /* Keyword.Declaration */
24 | .highlight .kn { font-weight: bold } /* Keyword.Namespace */
25 | .highlight .kp { font-weight: bold } /* Keyword.Pseudo */
26 | .highlight .kr { font-weight: bold } /* Keyword.Reserved */
27 | .highlight .kt { color: #445588; font-weight: bold } /* Keyword.Type */
28 | .highlight .m { color: #009999 } /* Literal.Number */
29 | .highlight .s { color: #d14 } /* Literal.String */
30 | .highlight .na { color: #008080 } /* Name.Attribute */
31 | .highlight .nb { color: #0086B3 } /* Name.Builtin */
32 | .highlight .nc { color: #445588; font-weight: bold } /* Name.Class */
33 | .highlight .no { color: #008080 } /* Name.Constant */
34 | .highlight .ni { color: #800080 } /* Name.Entity */
35 | .highlight .ne { color: #990000; font-weight: bold } /* Name.Exception */
36 | .highlight .nf { color: #990000; font-weight: bold } /* Name.Function */
37 | .highlight .nn { color: #555555 } /* Name.Namespace */
38 | .highlight .nt { color: #000080 } /* Name.Tag */
39 | .highlight .nv { color: #008080 } /* Name.Variable */
40 | .highlight .ow { font-weight: bold } /* Operator.Word */
41 | .highlight .w { color: #bbbbbb } /* Text.Whitespace */
42 | .highlight .mf { color: #009999 } /* Literal.Number.Float */
43 | .highlight .mh { color: #009999 } /* Literal.Number.Hex */
44 | .highlight .mi { color: #009999 } /* Literal.Number.Integer */
45 | .highlight .mo { color: #009999 } /* Literal.Number.Oct */
46 | .highlight .sb { color: #d14 } /* Literal.String.Backtick */
47 | .highlight .sc { color: #d14 } /* Literal.String.Char */
48 | .highlight .sd { color: #d14 } /* Literal.String.Doc */
49 | .highlight .s2 { color: #d14 } /* Literal.String.Double */
50 | .highlight .se { color: #d14 } /* Literal.String.Escape */
51 | .highlight .sh { color: #d14 } /* Literal.String.Heredoc */
52 | .highlight .si { color: #d14 } /* Literal.String.Interpol */
53 | .highlight .sx { color: #d14 } /* Literal.String.Other */
54 | .highlight .sr { color: #009926 } /* Literal.String.Regex */
55 | .highlight .s1 { color: #d14 } /* Literal.String.Single */
56 | .highlight .ss { color: #990073 } /* Literal.String.Symbol */
57 | .highlight .bp { color: #999999 } /* Name.Builtin.Pseudo */
58 | .highlight .vc { color: #008080 } /* Name.Variable.Class */
59 | .highlight .vg { color: #008080 } /* Name.Variable.Global */
60 | .highlight .vi { color: #008080 } /* Name.Variable.Instance */
61 | .highlight .il { color: #009999 } /* Literal.Number.Integer.Long */
62 | 
63 | .type-csharp .highlight .k { color: #0000FF }
64 | .type-csharp .highlight .kt { color: #0000FF }
65 | .type-csharp .highlight .nf { color: #000000; font-weight: normal }
66 | .type-csharp .highlight .nc { color: #2B91AF }
67 | .type-csharp .highlight .nn { color: #000000 }
68 | .type-csharp .highlight .s { color: #A31515 }
69 | .type-csharp .highlight .sc { color: #A31515 }
70 | 


--------------------------------------------------------------------------------
/css/styles.css:
--------------------------------------------------------------------------------
  1 | body {
  2 |   padding:50px;
  3 |   font:13px/1.5 "Helvetica Neue", Helvetica, Arial, sans-serif;
  4 |   color:#777;
  5 |   font-weight:300;
  6 | }
  7 | 
  8 | p, h1, h2, h3, h4, h5, h6 {
  9 |   color:#222;
 10 |   margin:0 0 20px;
 11 | }
 12 | 
 13 | .indent {
 14 |   font-size: 1.2em;
 15 |   border-width: 0 0 0 5px;
 16 |   border-style: solid;
 17 |   border-color: silver;
 18 |   padding: 1em 0em 1em 1em;
 19 |   margin-left: 1em;
 20 | }
 21 | 
 22 | .superscript {
 23 |   position: relative;
 24 |   top: -0.5em;
 25 |   font-size: 80%;
 26 | }
 27 | 
 28 | .red {
 29 |   color: red;
 30 | }
 31 | 
 32 | .green {
 33 |   color: green;
 34 | }
 35 | 
 36 | ol, table, pre, dl {
 37 |   margin:0 0 20px;
 38 | }
 39 | 
 40 | 
 41 | p.zero {
 42 |   margin:0;
 43 |   line-height:1.0;
 44 | }
 45 | 
 46 | span.uchar {
 47 |   color: firebrick;
 48 |   font-family: Lucida Sans Unicode;
 49 | }
 50 | 
 51 | h1, h2, h3 {
 52 |   line-height:1.1;
 53 | }
 54 | 
 55 | h1 {
 56 |   font-size:28px;
 57 | }
 58 | 
 59 | h2 {
 60 |   color:#393939;
 61 | }
 62 | 
 63 | h3, h4, h5, h6 {
 64 |   color:#494949;
 65 | }
 66 | 
 67 | a {
 68 |   color:#39c;
 69 |   font-weight:400;
 70 |   text-decoration:none;
 71 | }
 72 | 
 73 | a:hover {
 74 |   color:#069;
 75 | }
 76 | 
 77 | a small {
 78 |   font-size:11px;
 79 |   color:#777;
 80 |   margin-top:-0.6em;
 81 |   display:block;
 82 | }
 83 | 
 84 | a:hover small {
 85 |   color:#777;
 86 | }
 87 | 
 88 | .wrapper {
 89 |   width:1060px;
 90 |   margin:0;
 91 | }
 92 | 
 93 | blockquote {
 94 |   border-left:1px solid #e5e5e5;
 95 |   margin:0;
 96 |   padding:0 0 0 20px;
 97 |   font-style:italic;
 98 | }
 99 | 
100 | code, pre {
101 |   font-family:Monaco, Bitstream Vera Sans Mono, Lucida Console, Terminal;
102 |   color:#333;
103 |   font-size:12px;
104 | }
105 | 
106 | pre {
107 |   padding:8px 15px;
108 |   background: #f8f8f8;  
109 |   border-radius:5px;
110 |   border:1px solid #e5e5e5;
111 |   overflow-x: auto;
112 | }
113 | 
114 | table {
115 |   width:100%;
116 |   border-collapse:collapse;
117 | }
118 | 
119 | thead {
120 |   background-color: silver;
121 |   color: black;
122 |   font-weight: 500;
123 | }
124 | 
125 | tr {
126 |   margin: 1px;
127 |   line-height: 1em;
128 | }
129 | 
130 | th, td {
131 |   text-align:left;
132 |   padding:2px 2px;
133 |   border-bottom:1px solid #e5e5e5;
134 | }
135 | 
136 | dt {
137 |   color:#444;
138 |   font-weight:700;
139 | }
140 | 
141 | th {
142 |   color:#444;
143 | }
144 | 
145 | img {
146 |   max-width:100%;
147 | }
148 | 
149 | img.center {
150 |   display: block;
151 |   margin-left: auto;
152 |   margin-right: auto;
153 |   max-width: 600px;
154 | }
155 | 
156 | header {
157 |   width:270px;
158 |   float:left;
159 |   position:fixed;
160 | }
161 | 
162 | header ul {
163 |   list-style:none;
164 |   height:40px;
165 |   
166 |   padding:0;
167 |   
168 |   background: #eee;
169 |   background: -moz-linear-gradient(top, #f8f8f8 0%, #dddddd 100%);
170 |   background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#f8f8f8), color-stop(100%,#dddddd));
171 |   background: -webkit-linear-gradient(top, #f8f8f8 0%,#dddddd 100%);
172 |   background: -o-linear-gradient(top, #f8f8f8 0%,#dddddd 100%);
173 |   background: -ms-linear-gradient(top, #f8f8f8 0%,#dddddd 100%);
174 |   background: linear-gradient(to bottom, #f8f8f8 0%,#dddddd 100%);
175 |   
176 |   border-radius:5px;
177 |   border:1px solid #d2d2d2;
178 |   box-shadow:inset #fff 0 1px 0, inset rgba(0,0,0,0.03) 0 -1px 0;
179 |   width:270px;
180 | }
181 | 
182 | header li {
183 |   width:89px;
184 |   float:left;
185 |   border-right:1px solid #d2d2d2;
186 |   height:40px;
187 | }
188 | 
189 | header li:first-child a {
190 |   border-radius:5px 0 0 5px;
191 | }
192 | 
193 | header li:last-child a {
194 |   border-radius:0 5px 5px 0;
195 | }
196 | 
197 | header ul a {
198 |   line-height:1;
199 |   font-size:11px;
200 |   color:#999;
201 |   display:block;
202 |   text-align:center;
203 |   padding-top:6px;
204 |   height:34px;
205 | }
206 | 
207 | header ul a:hover {
208 |   color:#999;
209 |   background: -moz-linear-gradient(top, #fff 0%, #ddd 100%);
210 |   background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#fff), color-stop(100%,#ddd));
211 |   background: -webkit-linear-gradient(top, #fff 0%,#ddd 100%);
212 |   background: -o-linear-gradient(top, #fff 0%,#ddd 100%);
213 |   background: -ms-linear-gradient(top, #fff 0%,#ddd 100%);
214 |   background: linear-gradient(to bottom, #fff 0%,#ddd 100%);
215 | }
216 | 
217 | header ul a:active {
218 |   -webkit-box-shadow: inset 0px 2px 2px 0px #ddd;
219 |   -moz-box-shadow: inset 0px 2px 2px 0px #ddd;
220 |   box-shadow: inset 0px 2px 2px 0px #ddd;
221 | }
222 | 
223 | strong {
224 |   color:#222;
225 |   font-weight:700;
226 | }
227 | 
228 | header ul li + li {
229 |   width:88px;
230 |   border-left:1px solid #fff;
231 | }
232 | 
233 | header ul li + li + li {
234 |   border-right:none;
235 |   width:89px;
236 | }
237 | 
238 | header ul a strong {
239 |   font-size:14px;
240 |   display:block;
241 |   color:#222;
242 | }
243 | 
244 | section {
245 |   width:700px;
246 |   float:right;
247 |   padding-bottom:50px;
248 | }
249 | 
250 | small {
251 |   font-size:11px;
252 | }
253 | 
254 | hr {
255 |   border:0;
256 |   background:#e5e5e5;
257 |   height:1px;
258 |   margin:0 0 20px;
259 | }
260 | 
261 | footer {
262 |   width:270px;
263 |   float:left;
264 |   position:fixed;
265 |   bottom:50px;
266 | }
267 | 
268 | @media print, screen and (max-width: 960px) {
269 |   
270 |   div.wrapper {
271 |     width:auto;
272 |     margin:0;
273 |   }
274 |   
275 |   header, section, footer {
276 |     float:none;
277 |     position:static;
278 |     width:auto;
279 |   }
280 |   
281 |   header {
282 |     padding-right:320px;
283 |   }
284 |   
285 |   section {
286 |     border:1px solid #e5e5e5;
287 |     border-width:1px 0;
288 |     padding:20px 0;
289 |     margin:0 0 20px;
290 |   }
291 |   
292 |   header a small {
293 |     display:inline;
294 |   }
295 |   
296 |   header ul {
297 |     position:absolute;
298 |     right:50px;
299 |     top:52px;
300 |   }
301 | }
302 | 
303 | @media print, screen and (max-width: 720px) {
304 |   body {
305 |     word-wrap:break-word;
306 |   }
307 |   
308 |   header {
309 |     padding:0;
310 |   }
311 |   
312 |   header ul, header p.view {
313 |     position:static;
314 |   }
315 |   
316 |   pre, code {
317 |     word-wrap:normal;
318 |   }
319 | }
320 | 
321 | @media print, screen and (max-width: 480px) {
322 |   body {
323 |     padding:15px;
324 |   }
325 |   
326 |   header ul {
327 |     display:none;
328 |   }
329 | }
330 | 
331 | @media print {
332 |   body {
333 |     padding:0.4in;
334 |     font-size:12pt;
335 |     color:#444;
336 |   }
337 | }
338 | 
339 | 
340 | 


--------------------------------------------------------------------------------
/page2.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | layout: default
  3 | title: Visual Spoofing
  4 | permalink: visual-spoofing/
  5 | ---
  6 | 
  7 | # Unicode Security Guide
  8 | ## _Visual Spoofing_ 
  9 | 
 10 | While Unicode has provided an incredible framework for storing, transmitting, and presenting information in many of our world’s native languages, it has also enabled attack vectors for phishing and for evading word filters. Commonly referred to as ‘visual spoofing’, these attack vectors leverage characters from one language or script that are visually identical, or nearly so, to letters in another.
 11 | 
 12 | To a human reader, some of the following letters are indistinguishable from one another while others closely resemble one another:
 13 | 
 14 | <span class="indent"> A &#x0391; &#x0410; &#x15C5; &#x15CB; &#x1D00; &#xFF21;</span>
 15 | 
 16 | To a computer system, however, each of these letters is a distinct character. Each has its own code point, and the underlying bits that represent it differ from one letter to the next.
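
As a minimal sketch (Python is used here purely for illustration, and `describe` is a hypothetical helper name rather than part of this guide's test cases), the standard `unicodedata` module can list the code point, character name, and UTF-8 bytes of each letter shown above:

```python
import unicodedata

# The seven lookalike letters from the list above, written as escapes
# so the distinct code points are visible in the source.
lookalikes = ["\u0041", "\u0391", "\u0410", "\u15C5", "\u15CB", "\u1D00", "\uFF21"]

def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch, "<unnamed>"),
        ch.encode("utf-8").hex(),
    )

for ch in lookalikes:
    print(*describe(ch), sep="  ")
```

Even where two glyphs render identically, the tuples differ, which is why a byte-wise or code-point-wise comparison treats the strings as unrelated.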
 17 | 
 18 | ## Table of Contents 
 19 | 
 20 | * [Prior Research](#prior) 
 21 | * [Attack Scenarios](#attack)
 22 | * [Defensive Options](#defense) 
 23 | 
 24 | ## <a id="prior"></a>Prior Research
 25 | One of the best-known attacks to exploit visual spoofing was the PayPal.com IDN spoof of 2005. Set up to demonstrate the power of these attack vectors, [Eric Johanson](http://www.shmoo.com/idn/) and The Shmoo Group successfully used a [www.paypal.com](http://www.paypal.com) lookalike domain name to fool visitors into providing personal information. The advisory references original research from 2002 by [Evgeniy Gabrilovich and Alex Gontmakher](http://www.cs.technion.ac.il/~gabr/papers/homograph.html) at the Technion, Israel Institute of Technology. Their original paper described an attack using Microsoft.com as an example.
 26 | 
 27 | Viktor Krammer, author of the [Quero Toolbar](http://www.quero.at/) for Internet Explorer, also presented additional research on these attack vectors and detection mechanisms in his [2006 presentation](http://www.quero.at/papers/idn_spoofing.pdf).  Additionally, the [Unicode Consortium](https://unicode.org) has been active in raising awareness of these issues in its security papers, and in providing recommended solutions.
 28 | 
 29 | ## <a id="vectors"></a>Summary of Vectors
 30 | The phenomenon of 'visual spoofing' may be malicious and deliberate or benign and accidental.  There have been cases where a choice of font displayed a sequence of characters in an unintended way, just as there have been cases where Unicode characters did not display properly.  The following list attempts to capture the major vectors:
 31 | 
 32 | __Non-Unicode lookalikes__
 33 | 
 34 | Simple characters or character combinations can look like something else.  For example, the letters "r" and "n" together can look like the letter "m", as in <span class="spoof">"rn"</span>.  Likewise, the number <span class="spoof">"0"</span> can look like the letter <span class="spoof">"O"</span>, the number <span class="spoof">"1"</span> can look like the letter <span class="spoof">"l"</span>, and so on.
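
One way to reason about these ASCII lookalikes is to fold each confusable sequence to a single representative before comparing strings. The sketch below is illustrative only: the mapping table and the `ascii_skeleton` name are assumptions made for this example, and the table covers just the three pairs mentioned above.

```python
# Hypothetical, deliberately tiny mapping of ASCII lookalikes
# (confusable sequence -> representative). Not an exhaustive list.
ASCII_LOOKALIKES = [("rn", "m"), ("0", "o"), ("1", "l")]

def ascii_skeleton(s):
    """Fold a string so that ASCII lookalikes compare as equal."""
    s = s.lower()
    for confusable, representative in ASCII_LOOKALIKES:
        s = s.replace(confusable, representative)
    return s

def looks_alike(a, b):
    """True when two different strings collapse to the same skeleton."""
    return a != b and ascii_skeleton(a) == ascii_skeleton(b)

print(looks_alike("rnicrosoft.com", "microsoft.com"))
print(looks_alike("paypa1.com", "paypal.com"))
```

The folded form is only suitable for comparison, never for display or storage, because it also rewrites legitimate text (for example, "corn" folds to "com").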
 35 | 
 36 | __Unicode Confusables__
 37 | 
 38 | The Unicode Confusables are discussed in detail later in this document.  In short, these are the diverse array of non-ASCII Unicode characters which are easily confused with characters across languages.
 39 | 
 40 | __The Invisibles__
 41 | 
 42 | Discussed later in this document, these are characters which have no visual appearance and little or no spacing.  Hence, they are visually nonexistent.
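
A rough way to surface such characters is to look for code points whose Unicode general category is `Cf` (Format). This is a sketch under that assumption, not a complete detector: some invisible characters belong to other categories, and `find_format_chars` is a hypothetical helper name.

```python
import unicodedata

def find_format_chars(s):
    """Return (index, code point) for every character in general category Cf."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(s)
        if unicodedata.category(ch) == "Cf"
    ]

# A zero-width space hidden inside an otherwise ordinary word.
print(find_format_chars("pay\u200bpal"))
# A plain ASCII string yields no findings.
print(find_format_chars("paypal"))
```

Two strings that differ only by such characters render the same but compare as different, which is what makes them useful for slipping past filters.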
 43 | 
 44 | __Problematic font-rendering__
 45 | 
 46 | Fonts are ultimately responsible for the visual display of characters, and can sometimes render glyphs confusingly, or as empty white space.  There are numerous examples of this, just one of which is described below:
 47 | 
 48 | <table>
 49 | <thead>
 50 |  <tr>
 51 |    <td>Character sequence</td>
 52 |    <td>Should appear as</td>
 53 |    <td>Might appear as</td>
 54 |  </tr>
 55 | </thead>
 56 | <tbody>
 57 |  <tr>
 58 |    <td>U+00B7 U+0041 U+0338</td>
 59 |    <td><span class="uchar">A&#x0338;</span></td> 
 60 |    <td><span class="uchar">A/</span></td>
 61 |  </tr>
 62 | </tbody>
 63 | </table>
 64 | 
 65 | __Manipulating combining-marks__
 66 | 
 67 | Combining marks can be stacked or re-ordered in a myriad of ways.  Consider the following table which illustrates just one way that combining marks can be stacked (using one directly after another). The table also shows an example of how combining marks can be re-ordered in a different sequence, but still have the same visual appearance.
 68 | 
 69 | <table>
 70 | <thead>
 71 |  <tr>
 72 |    <td>Character sequence</td>
 73 |    <td>Appears as</td>
 74 |  </tr>
 75 | </thead>
 76 | <tbody>
 77 |  <tr>
 78 |    <td>U+006F U+0304</td>
 79 |    <td><span class="uchar">o&#x0304;</span></td> 
 80 |  </tr>
 81 |  <tr>
 82 |    <td>U+006F U+0304 U+0304</td>
 83 |    <td><span class="uchar">o&#x0304;&#x0304;</span></td> 
 84 |  </tr>
 85 |  <tr>
 86 |    <td>U+006F U+0336 U+0335</td>
 87 |    <td><span class="uchar">o&#x0336;&#x0335;</span></td> 
 88 |  </tr>
 89 |  <tr>
 90 |    <td>U+006F U+0335 U+0336</td>
 91 |    <td><span class="uchar">o&#x0335;&#x0336;</span></td> 
 92 |  </tr>
 93 | </tbody>
 94 | </table>
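
The effect of re-ordering can be checked programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module. The marks U+0301 and U+0323 are chosen here for illustration and do not appear in the table above; they have different combining classes, so normalization puts them into one canonical order.

```python
import unicodedata

# Two marks picked for illustration (an assumption, not from the table above):
# U+0301 COMBINING ACUTE ACCENT and U+0323 COMBINING DOT BELOW.
a = "o\u0301\u0323"
b = "o\u0323\u0301"

# Different code point sequences, so a plain comparison fails...
print(a == b)  # False

# ...but normalization applies canonical ordering, so both collapse to one form.
nfc_a = unicodedata.normalize("NFC", a)
nfc_b = unicodedata.normalize("NFC", b)
print(nfc_a == nfc_b)  # True

# Stacking: the same mark applied twice is still three code points.
stacked = "o\u0304\u0304"
print(len(stacked))  # 3
print([unicodedata.combining(ch) for ch in stacked])  # [0, 230, 230]
```

Marks that share a combining class are not re-ordered by normalization, so comparing normalized strings will not catch every visually identical pair.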
 95 | 
 96 | __Bidi and syntax spoofing__
 97 | Another interesting vector uses the 'bidirectional' properties of certain characters, also known as 'bidi'.  
 98 | 
 99 | ## <a id="attack"></a>Attack Scenarios 
100 | 
101 | A variety of scenarios exist where visual spoofing may be used to attack and exploit people.  This section looks at a few.
102 | 
103 | ### <a id="domains"></a>Spoofing Domain Names
104 | Domain names represent an interesting attack vector because their mere visual appearance inspires trust in a brand.  The following image shows what visually appear to be two identical domain names; however, the second contains the <span class="uchar">U+0261 LATIN SMALL LETTER SCRIPT G</span>.
105 | 
106 | <img class="center" src="{{ site.url }}/img/spoof-google.png" />
107 | 
108 | The tricky part about presenting domain names is that they often get simply a glance, if any look at all.  The following image represents two domain names which might be visually similar 'enough' to fool someone, yet not identical.
109 | 
110 | <img class="center" src="{{ site.url }}/img/spoof-mozilla.png" />
111 | 
112 | Finally, characters that appear to be syntactic elements, such as <span class="uchar">U+FF89 HALFWIDTH KATAKANA LETTER NO</span> which resembles the forward-slash path-separator, can be troublesome.  In the following image, this character is used in the subdomain label of a domain name, but appears to be a path-separator.
113 | 
114 | <img class="center" src="{{ site.url }}/img/spoof-slash.png" />
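
One way to make these spoofs visible is to list every non-ASCII character in a hostname by code point and name. The sketch below assumes Python's standard `unicodedata` module; the helper name `describe_non_ascii` and the hostname `fake_path` are hypothetical and used only for illustration.

```python
import unicodedata

def describe_non_ascii(hostname):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(hostname)
        if ord(ch) > 0x7F
    ]

real = "google.com"
spoof = "\u0261oogle.com"  # first letter is U+0261, not ASCII "g"
# Hypothetical hostname: U+FF89 sits inside a label but looks like "/".
fake_path = "example.com\uFF89login.example.net"

print(describe_non_ascii(real))       # []
print(describe_non_ascii(spoof))      # [(0, 'U+0261', 'LATIN SMALL LETTER SCRIPT G')]
print(describe_non_ascii(fake_path))  # [(11, 'U+FF89', 'HALFWIDTH KATAKANA LETTER NO')]
```

A listing like this does not decide whether a name is malicious. It only surfaces the characters a reviewer or a policy check would want to look at.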
115 | 
116 | ### <a id="vanity"></a>Fraudulent Vanity URLs
117 | 
118 | A social networking service wants to allow vanity URLs to be registered using international characters such as <span class="uchar">www.foo.bar/&#x0444;&#x0443;</span> but perceives too great a risk from the variety of ways that the URL could be subject to visual fraud and confusion. Because Unicode characters are well-supported in the path portion of a browser’s URL display, a well-crafted vanity URL could easily fool victims and be the landing page for a phishing attack.  In fact, it's often unnecessary to use Unicode - in some cases, the number one "1" can appear as the letter "l", and in certain fonts the sequence "rn" can appear as the letter "m".
119 | 
120 | ### <a id="profanity"></a>Bypassing Profanity Filters
121 | 
122 | An email or forum system needs to prevent violent and profane words from being used. It's well-known that there are trivial ways to bypass such filters, including using spacing and punctuation between letters in a word (e.g. c_r_a_p), or slight misspellings which give the same effect (e.g. crrap), to name just a couple.  There’s also the possibility of using confusable characters which have no visual side effect (e.g. crap) written entirely in another script (or a mix of scripts).  
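
The confusable-character bypass can be sketched as follows. The `LOOKALIKES` table is a tiny hand-picked assumption for illustration, not the confusables data published with TR 39, and `naive_filter` and `skeleton` are hypothetical helper names.

```python
# Hand-picked Cyrillic lookalikes, for illustration only (an assumption,
# not the TR 39 confusables data).
LOOKALIKES = {
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}

def naive_filter(text, blocked=("crap",)):
    """Return True if the text contains a blocked word (exact match only)."""
    return any(word in text for word in blocked)

def skeleton(text):
    """Replace known lookalikes with their ASCII counterparts."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)

payload = "\u0441r\u0430\u0440"  # renders like "crap" but mixes Cyrillic and Latin

print(naive_filter(payload))            # False: the filter is bypassed
print(naive_filter(skeleton(payload)))  # True: caught after mapping lookalikes
```

Matching against a mapped form rather than the raw input is the general idea. A production filter would need a far larger mapping than the three entries shown here.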
123 | 
124 | ### <a id="ui"></a>Spoofing User Interface Dialogs
125 | 
126 | Security decisions are often presented to end users in the form of dialog boxes consisting in part of user-supplied input. For example: 
127 | 
128 | * When a user downloads a file through a Web browser, they’re asked to confirm their decision, often with the filename as a part of the dialog's content. 
129 | * When a user tries to launch an untrusted application they may also be presented with a dialog box asking for confirmation. 
130 | * A social networking site may ask its users for confirmation before redirecting them to an off-site URL, often with the URL making up the dialog's content. 
131 | 
132 | In any of these cases, a clever attack may use special BIDI or other characters that reverse the direction of text, or otherwise manipulate the text in a way that may confuse or fool the end users.
133 | 
134 | Consider the following image which shows the Windows Explorer program.  What appears to be a plain text file ending in the ".txt" file extension, is actually an executable file ending in the ".exe" extension.  Fortunately, Windows Explorer recognizes the true file type, which it has listed as "Application".
135 | 
136 | <img class="center" src="{{ site.url }}/img/spoof-win-explorer-file.png" />
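
A filename like the one pictured can be screened for bidirectional control characters before it is shown to the user. This is a rough sketch assuming Python's `unicodedata` module; the filename and the helper name `bidi_controls` are hypothetical.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def bidi_controls(name):
    """Return (index, code point) for each bidi control character in name."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

# Hypothetical filename: U+202E RIGHT-TO-LEFT OVERRIDE reverses the display
# of everything after it, so "txt.exe" is rendered as "exe.txt".
filename = "report\u202Etxt.exe"

print(filename.rsplit(".", 1)[-1])  # exe  (the real extension)
print(bidi_controls(filename))      # [(6, 'U+202E')]
```

Whether to reject, strip, or visibly escape such characters is a design decision for the application. The sketch only detects them.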
137 | 
138 | In another example, the <span class="uchar">U+FEFF ZERO WIDTH NO-BREAK SPACE</span> character, also known as the Byte-Order Mark, or BOM, acts as an 'invisible' character.
139 | 
140 | <img class="center" src="{{ site.url }}/img/spoof-win-explorer-folder.png" />
141 | 
142 | Invisible characters present their own interesting dynamics and applications.  As seen in the image above, Windows Explorer presents what appears to be two folders with identical names, whereas a default command prompt does not properly display the BOM, and so presents it as an empty box.
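
The two-folder case can be reproduced in a few lines. The sketch below assumes Python's `unicodedata` module and uses the general category `Cf` (format characters) as a rough test for invisibility; the folder name and the helper `strip_format_chars` are hypothetical.

```python
import unicodedata

visible = "Documents"
hidden = "Documents\uFEFF"  # same name with a trailing U+FEFF

print(visible == hidden)               # False: two distinct names
print(unicodedata.category("\uFEFF"))  # Cf

def strip_format_chars(text):
    """Drop every character whose general category is Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

print(strip_format_chars(hidden) == visible)  # True
```

Removing all `Cf` characters is a blunt instrument, since some of them are needed for correct rendering in certain scripts. Flagging them for review may be a safer default than silently deleting them.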
143 | 
144 | ### <a id="ads"></a>Malvertisements
145 | 
146 | Advertising networks often need to protect brand name trademarks from being registered or used by anyone other than their owner. This threat might be mitigated through filters, human editorial inspection, or a combination of the two.  An attacker could place a malicious phishing ad that bypasses trademark filters by using confusable characters. For example “Download Microsoft Windows 8 Service Pack 1 here” where the trademarked name 'Microsoft Windows' was crafted using non-English script, or even using invisible characters.
147 | 
148 | ### <a id="email"></a>Forging Internationalized Email
149 | 
150 | Email addresses and the SMTP protocol have long been confined to ASCII; however, standards work through the <a href="http://www.ietf.org">IETF</a> was concluded in 2013 by the <a href="http://datatracker.ietf.org/wg/eai/charter/">Email Address Internationalization Working Group</a>.  The EAI effort delivered documentation for integrating UTF-8 into the core email protocols, as well as advice on EAI deployment in client and server software.  In preparing for the transition, email client engineers and designers will need to anticipate and handle the case of visually identical email addresses, among other issues.  If left unhandled, end users could easily be fooled. Digital certificates would provide a good mechanism for proving authenticity of a message; however, such certificates also support Unicode and are vulnerable to the exact same attacks.
151 | 
152 | ## <a id="defense"></a>Defensive Options
153 | All does not seem lost.  While
154 | 
155 | ## <a id="confusables"></a>The Confusables
156 | Throughout Unicode, the characters that visually resemble one another are referred to as <strong>the confusables</strong>.  The Unicode Consortium has documented this phenomenon in both <a href="https://www.unicode.org/reports/tr36/">Technical Report 36</a> and <a href="https://www.unicode.org/reports/tr39/">TR 39</a>.  
157 | 
158 | It is TR 39 specifically which provides links to the data files comprising the confusables, such as <a href="https://www.unicode.org/Public/security/revision-05/confusables.txt">confusables.txt</a> which provides a mapping for visual confusables.
159 | 
160 | The Unicode Consortium has also provided <a href="https://unicode.org/cldr/utility/confusables.jsp">Unicode Utilities: Confusables</a> which takes an input string and produces visually confusable strings generated using the prior mentioned data files.
161 | 
162 | ### <a id="single"></a>Single-Script Confusables
163 | 
164 | 
165 | ### <a id="mixed"></a>Mixed-Script Confusables
166 | ### <a id="whole"></a>Whole-Script Confusables
167 | 
168 | ### <a id="whole"></a>The Invisibles
169 | 
170 | <img class="center" src="{{ site.url }}/img/uchar-180E.png" />
171 | <img class="center" src="{{ site.url }}/img/uchar-feff.png" />
172 | 
173 | ## <a id="idna"></a>Internationalized Domain Names in Applications (IDNA)
174 | 
175 | 
176 | ### <a id="idna2003"></a>IDNA 2003
177 | ### <a id="idna2008"></a>IDNA 2008
178 | 


--------------------------------------------------------------------------------
/page1.md:
--------------------------------------------------------------------------------
  1 | ---
  2 | layout: default
  3 | title: Background
  4 | permalink: background/
  5 | ---
  6 | 
  7 | # Unicode Security Guide
  8 | ## _Background_ 
  9 | 
 10 | The Unicode Standard provides a unique number for every character, enabling disparate computing systems to exchange and interpret text in the same way.
 11 | 
 12 | ## Table of Contents 
 13 | * [History](#history) 
 14 | * [Introduction](#intro)
 15 |     * [Code Points](#cp)
 16 | * [Character Encoding](#encoding) 
 17 | * [Character Escape Sequences and Entity References](#escape)
 18 | * [Security Testing Focus Areas](#testing)
 19 | 
 20 | ## <a id="history"></a>Brief History of Character Encodings
 21 | Early in computing history, it became widely clear that a standardized way to represent characters would provide many benefits. Around 1963, IBM standardized EBCDIC in its mainframes, about the same time that ASCII was standardized as a 7-bit character set.  EBCDIC used an 8-bit encoding, and the unused eighth bit in ASCII allowed OEMs to apply the extra bit for their proprietary purposes. The following list roughly captures some of the history leading up to Unicode.
 22 | 
 23 | * __1991__ - Unicode
 24 | * __1990__ - ISO 10646 (UCS)
 25 | * __1985__ - ISO-8859-1 (code pages galore!)
 26 | * __1981__ - MBCS (e.g. GB2312)
 27 | * __1964__ - EBCDIC (non-ASCII compatible)
 28 | * __1963__ - ASCII 7-bit (an 8th bit free-for-all soon followed)
 29 | 
 30 | This allowed OEMs to ship computers and later PCs with customized character encodings specific to language or region. So computers could ship to Israel with a tweaked ASCII encoding set that supported Hebrew, for example. The divergence in these customized character sets grew into a problem over time, as data interchange became error-prone if not impossible when computers didn't share the same character set. 
 31 | 
 32 | In response to this growth, the International Organization for Standardization (ISO) began developing the ISO-8859 set of character encoding standards in the early 1980's. The ISO-8859 standards were aimed at providing a reliable system for data-interchange across computing systems. They provided support for many popular languages, but weren't designed for high-quality typography which needed symbols such as ligatures. 
 33 | 
 34 | In the late 1980's Unicode was being designed, around the same time ISO recognized the need for a ubiquitous character encoding framework, what would later come to be called the Universal Character Set (UCS), or ISO 10646. Version 1.0 of the Unicode standard was released in 1991 at almost the same time as UCS was made public. Since that time, Unicode has become the de facto character encoding model, and has worked closely with ISO and UCS to ensure compatibility and similar goals.
 35 | 
 36 | ## <a id="intro"></a>Brief Introduction to Unicode
 37 | Most people are familiar with [ASCII](http://en.wikipedia.org/wiki/ASCII), its usefulness and its limitation to 128 characters.  Unicode and UCS expanded the available array of characters by separating the concepts of __code points__ and binary __encodings__.
 38 | 
 39 | The Unicode framework can presumably represent all of the world's languages and scripts, past, present, and future. That's because the current version 5.1 of the Unicode Standard has space for over 1 million code points. A code point is a unique value within the Unicode code-space. A single code point can represent a letter, a special control character (e.g. carriage return), a symbol, or even some other abstract thing.
 40 | 
 41 | ### <a id="cp"></a>Code Points 
 42 | A code point is a 21-bit scalar value in the current version of Unicode, and is represented using the following type of reference where NNNN would be a hex value: 
 43 | 
 44 | <span class="indent">U+NNNN</span>
 45 | 
 46 | The valid range for Unicode code points is currently U+0000 to U+10FFFF.  This range can be expanded in the future if the Unicode Standard changes. The following image illustrates some of the properties or metadata that accompany a given code point.
 47 | 
 48 | <img class="center" src="{{ site.url }}/img/uchar-A.png" />
 49 | 
 50 | Code point U+0041 represents the Latin Capital Letter A. It's no coincidence that this maps directly to ASCII's value 0x41, as the Unicode Standard has always preserved the lower ASCII range to ensure widespread compatibility. Some interesting things to note here are the properties associated with this code point:
 51 | 
 52 | * Several categories are assigned including a general 'category' and a 'script' family.
 53 | * A 'lower case' mapping is defined.
 54 | * An 'upper case' mapping is defined.
 55 | * A 'normalization' mapping is defined.
 56 | * Binary properties are assigned.
 57 | 
 58 | This short list only represents some of the metadata attached to a code point; there can be much more information. In looking for security issues, however, this short list provides a good starting point.
 59 | 
 60 | ## <a id="encoding"></a>Character Encoding
 61 | A discussion of characters and strings can quickly dissolve into a soup of terminology, where many terms get mixed up and used inaccurately.  This document will aim to avoid using all of the terminology, and may use some terms inaccurately according to the Unicode Consortium, with the goal of simplicity. 
 62 | 
 63 | To put it simply, an encoding is the binary representation of some character.  It’s ‘bits on the wire’ or ‘data at rest’ in some encoding scheme.  The Unicode Standard defines three character encoding forms (UTF-8, UTF-16, and UTF-32); together with UTF-7, which is specified separately in RFC 2152, these make up the four Unicode Transformation Formats (UTF) covered here:
 64 | 
 65 | 1. UTF-7
 66 |    Defined by RFC 2152. It has since been largely deprecated and its use is not recommended. 
 67 | 1. UTF-8
 68 |    A __variable-width encoding__ where each Unicode code point is assigned to an unsigned byte sequence of __1 to 4__ bytes in length.  Older versions of the specification allowed for up to __6 bytes__ in length but that is no longer the case.
 69 | 1. UTF-16
 70 |    A __variable-width encoding__ where each Unicode code point is assigned to an unsigned sequence of __2 or 4__ bytes.  The 4-byte sequences are composed of __surrogate pairs__.
 71 | 1. UTF-32
 72 |    A __fixed-width encoding__ where each Unicode code point is assigned to an unsigned sequence of __4 bytes__.  UTF-32 employs a fixed mapping using the same numeric value as the code point, so no algorithms are needed.
 73 | 
 74 | Of these four, UTF-7 has been deprecated, UTF-8 is the most commonly used on the Web, and both UTF-16 and UTF-32 can be serialized in little or big endian format.
 75 | 
 76 | A character encoding as defined here means the actual bytes used to represent the data, or code point.  So, a given code point <span class="uchar">U+0041 LATIN CAPITAL LETTER A</span> can be encoded using the following bytes in each UTF form:
 77 | 
 78 | <table>
 79 |  <thead><tr>
 80 |   <td>UTF Format</td>
 81 |   <td>Byte sequence</td>
 82 |  </tr>
 83 |  </thead>
 84 |  <tbody>
 85 |  <tr>
 86 |   <td>UTF-8</td>
 87 |   <td>&lt; 41 &gt;</td>
 88 |  </tr>
 89 |  <tr>
 90 |   <td>UTF-16 Little Endian</td>
 91 |   <td>&lt; 41 00 &gt;</td>
 92 |  </tr>
 93 |  <tr>
 94 |   <td>UTF-16 Big Endian</td>
 95 |   <td>&lt; 00 41 &gt;</td>
 96 |  </tr>
 97 |  <tr>
 98 |   <td>UTF-32 Little Endian</td>
 99 |   <td>&lt; 41 00 00 00 &gt;</td>
100 |  </tr>
101 |  <tr>
102 |   <td>UTF-32 Big Endian</td>
103 |   <td>&lt; 00 00 00 41 &gt;</td>
104 |  </tr>
105 | </tbody></table>
106 | 
107 | The lower ASCII character set is preserved by UTF-8 up through U+007F.  The following table gives another example, using <span class="uchar">U+FEFF ZERO WIDTH NO-BREAK SPACE</span>, also known as the Unicode Byte Order Mark.
108 | 
109 | <table>
110 |  <thead><tr>
111 |   <td>UTF Format</td>
112 |   <td>Byte sequence</td>
113 |  </tr>
114 |  </thead>
115 |  <tbody>
116 |  <tr>
117 |   <td>UTF-8</td>
118 |   <td>&lt; EF BB BF &gt;</td>
119 |  </tr>
120 |  <tr>
121 |   <td>UTF-16 Little Endian</td>
122 |   <td>&lt; FF FE &gt;</td>
123 |  </tr>
124 |  <tr>
125 |   <td>UTF-16 Big Endian</td>
126 |   <td>&lt; FE FF &gt;</td>
127 |  </tr>
128 |  <tr>
129 |   <td>UTF-32 Little Endian</td>
130 |   <td>&lt; FF FE 00 00 &gt;</td>
131 |  </tr>
132 |  <tr>
133 |   <td>UTF-32 Big Endian</td>
134 |   <td> &lt; 00 00 FE FF &gt;</td>
135 |  </tr>
136 | </tbody></table>
137 | 
138 | Here UTF-8 uses three bytes to represent the code point.  One may wonder how a code point greater than U+FFFF would be represented in UTF-16.  The answer lies in surrogate pairs, which use two double-byte sequences together.  Consider the code point <span class="uchar">U+10FFFD PRIVATE USE CHARACTER-10FFFD</span> in the following table.
139 | 
140 | <table>
141 |  <thead><tr>
142 |   <td>UTF Format</td>
143 |   <td>Byte sequence</td>
144 |  </tr>
145 |  </thead>
146 |  <tbody>
147 |  <tr>
148 |   <td>UTF-8</td>
149 |   <td>&lt; F4 8F BF BD &gt;</td>
150 |  </tr>
151 |  <tr>
152 |   <td>UTF-16 Little Endian</td>
153 |   <td>&lt; FF DB &gt; &lt; FD DF &gt;</td>
154 |  </tr>
155 |  <tr>
156 |   <td>UTF-16 Big Endian</td>
157 |   <td>&lt; DB FF &gt; &lt; DF FD &gt;</td>
158 |  </tr>
159 |  <tr>
160 |   <td>UTF-32 Little Endian</td>
161 |   <td>&lt; FD FF 10 00 &gt;</td>
162 |  </tr>
163 |  <tr>
164 |   <td>UTF-32 Big Endian</td>
165 |   <td>&lt; 00 10 FF FD &gt;</td>
166 |  </tr>
167 | </tbody></table>
168 | 
169 | A surrogate pair combines two 16-bit code units taken from the reserved code point range U+D800 to U+DFFF, making it possible to represent all of Unicode’s code points in the 16-bit format.  For this reason, UTF-16 is considered a variable-width encoding just as UTF-8 is.  UTF-32, however, is considered a fixed-width encoding.
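
The byte sequences in the tables above can be reproduced with Python's built-in codecs, and the surrogate pair can be derived by hand. This is a minimal sketch; the helper names `surrogate_pair` and `to_hex` are introduced here for illustration.

```python
def surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                  # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def to_hex(data):
    """Format bytes as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in data)

high, low = surrogate_pair(0x10FFFD)
print(f"{high:04X} {low:04X}")  # DBFF DFFD

ch = "\U0010FFFD"
for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(codec, to_hex(ch.encode(codec)))
```

The `-le` and `-be` codec names are used so that Python does not prepend a byte order mark to the output.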
170 | 
171 | ## <a id="escape"></a>Character Escape Sequences and Entity References
172 | 
173 | An alternative to encoding characters is representing them using a symbolic representation rather than a serialization of bytes.  This is common in HTTP with URL-encoded data, and in HTML.   In HTML, numerical character references (NCR) can be used in either a decimal or hexadecimal form that maps to a Unicode code point. 
174 | 
175 | In fact, CSS (Cascading Style Sheets) and even JavaScript use escape sequences, as do most programming languages.  The details of each protocol’s specification are outside the scope of this document, however examples will be used here for reference.
176 | 
177 | The following table lists the common escape sequences for <span class="uchar">U+0041 LATIN CAPITAL LETTER A</span>.
178 | 
179 | <table>
180 |  <thead><tr>
181 |   <td>Context</td>
182 |   <td>Character Reference or Escape Sequence</td>
183 |  </tr>
184 |  </thead>
185 |  <tbody>
186 |  <tr>
187 |   <td>URL</td>
188 |   <td>%41</td>
189 |  </tr>
190 |  <tr>
191 |   <td>NCR (decimal)</td>
192 |   <td>&amp;#65;</td>
193 |  </tr>
194 |  <tr>
195 |   <td>NCR (Hex)</td>
196 |   <td>&amp;#x41;</td>
197 |  </tr>
198 |  <tr>
199 |   <td>CSS</td>
200 |   <td>\41 and \0041</td>
201 |  </tr>
202 |  <tr>
203 |   <td>JavaScript</td>
204 |   <td>\x41 and \u0041</td>
205 |  </tr>
206 |  <tr>
207 |   <td>Other</td>
208 |   <td>\u0041</td>
209 |  </tr>
210 | </tbody></table>
211 | 
212 | The following table gives another example, using <span class="uchar">U+FEFF ZERO WIDTH NO-BREAK SPACE</span>, also known as the Unicode Byte Order Mark. 
213 | 
214 | <table>
215 |  <thead><tr>
216 |   <td>Context</td>
217 |   <td>Character  Reference or Escape Sequence</td>
218 |  </tr>
219 |  </thead>
220 |  <tbody>
221 |  <tr>
222 |   <td>URL</td>
223 |   <td>%EF%BB%BF</td>
224 |  </tr>
225 |  <tr>
226 |   <td>NCR (decimal)</td>
227 |   <td>&amp;#65279;</td>
228 |  </tr>
229 |  <tr>
230 |   <td>NCR (Hex)</td>
231 |   <td>&amp;#xFEFF;</td>
232 |  </tr>
233 |  <tr>
234 |   <td>CSS</td>
235 |   <td>&nbsp;\FEFF</td>
236 |  </tr>
237 |  <tr>
238 |   <td>JavaScript (as bytes)</td>
239 |   <td>\xEF\xBB\xBF</td>
240 |  </tr>
241 |  <tr>
242 |   <td>JavaScript (as reference)</td>
243 |   <td>\uFEFF</td>
244 |  </tr>
245 |  <tr>
246 |   <td>JSON</td>
247 |   <td>\uFEFF</td>
248 |  </tr>
249 |  <tr>
250 |   <td>Microsoft Internet Information Server (IIS)</td>
251 |   <td>%uFEFF</td>
252 |  </tr>
253 | </tbody></table>
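
Several of the representations in these tables can be produced or decoded with Python's standard library. The sketch below covers only the URL, NCR, and JSON rows; the CSS and IIS forms are not handled by these modules.

```python
import html
import json
from urllib.parse import quote, unquote

bom = "\uFEFF"

# URL encoding works on the UTF-8 bytes of the character.
print(quote(bom))      # %EF%BB%BF
print(unquote("%41"))  # A

# Numeric character references map directly to the code point.
print(html.unescape("&#65;"), html.unescape("&#x41;"))  # A A

# JSON escapes use the \uXXXX form; Python emits lower-case hex digits.
print(json.dumps(bom))                # "\ufeff"
print(json.loads('"\\uFEFF"') == bom)  # True
```

The practical point for testing is that one character has many spellings, so a filter needs to decode every form the application will accept before it inspects the input.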
254 | 
255 | ## <a id="testing"></a>Security Testing Focus Areas
256 | 
257 | This guide has been designed with two general areas in mind. One is to aid readers in setting goals for a software security assessment. The other, where possible, is to provide data that assists software engineers in developing more secure software. Information such as how framework APIs behave by default and when overridden is subject to change at any time.
258 | 
259 | Clearly, any protocol and standard can be subject to security vulnerabilities; examples include HTML, HTTP, TCP, and DNS.  Character encodings and the Unicode standard are also exposed to vulnerability. Sometimes vulnerabilities are related to a design-flaw in the standard, but more often they’re related to implementation in practice. Many of the phenomena discussed here are not vulnerabilities in the standard. Instead, the following general categories of vulnerability are most common in applications which are not built to anticipate and prevent the relevant attacks:
260 | 
261 | * Visual Spoofing
262 | * Best-fit mappings
263 | * Charset transcodings and character mappings
264 | * Normalization
265 | * Canonicalization of overlong UTF-8
266 | * Over-consumption
267 | * Character substitution
268 | * Character deletion
269 | * Casing
270 | * Buffer overflows
271 | * Controlling Syntax
272 | * Charset mismatches
273 | 
274 | Consider the following image as an example.  In the case of <span class="uchar">U+017F LATIN SMALL LETTER LONG S</span>, the upper casing and normalization operations transform the character into a completely different value.  Many characters such as this one have explicit mappings defined through the Unicode Standard, indicating what character (or sequences of characters) they should transform to through casing and normalization.  Normalization is a defined process discussed later in this document.  In some situations, this behavior could be exploited to create cross-site scripting or other attack scenarios.
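
The casing and normalization behavior described for U+017F can be observed directly. This is a small sketch assuming Python's built-in string methods and the `unicodedata` module; other runtimes may expose the same mappings through different APIs.

```python
import unicodedata

long_s = "\u017F"  # LATIN SMALL LETTER LONG S

upper = long_s.upper()
nfkc = unicodedata.normalize("NFKC", long_s)
nfc = unicodedata.normalize("NFC", long_s)

print(upper)          # S  (ASCII capital S, U+0053)
print(nfkc)           # s  (ASCII small s, U+0073)
print(nfc == long_s)  # True: canonical normalization leaves it alone
```

Only the compatibility forms (NFKC and NFKD) fold the long s to an ASCII letter, which is why it matters which normalization form an application applies and when.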
275 | 
276 | The rest of this guide intends to explore each of these phenomena in more detail, as each relates to software vulnerability mitigation and testing.
277 | 
278 | 
279 | 


--------------------------------------------------------------------------------
/page3.md:
--------------------------------------------------------------------------------
   1 | ---
   2 | layout: default
   3 | title: Character Transformations
   4 | permalink: character-transformations/
   5 | ---
   6 | 
   7 | # Unicode Security Guide
   8 | ## _Character Transformations_
   9 | 
  10 | This section attempts to explore the various ways in which characters and strings can be transformed by software processes.  Such transformations are not vulnerabilities necessarily, but could be exploited by clever attackers. 
  11 | 
  12 | As an example, consider an attacker trying to inject script (i.e. cross-site scripting, or XSS attack) into a Web-application which utilizes a defensive input filter.  The attacker finds that the application performs a lowercase operation on the input after filtering, and by injecting special characters they can exploit that behavior.   That is, the string "script" is prevented by the filter, but the string "scr&#x0130;pt" is allowed.
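
The filter-then-transform pattern can be sketched in Python as follows. The filter, the blocked string `onkeydown`, and the payload are hypothetical, and U+212A KELVIN SIGN is used because Python lowercases it to a plain ASCII "k". For the U+0130 example above, Python's `lower()` yields "i" followed by U+0307 COMBINING DOT ABOVE, so the exact result of that transformation depends on the runtime.

```python
import string

# Folds only A-Z, as a byte-oriented filter might (an assumption for this sketch).
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def filter_allows(text, blocked="onkeydown"):
    """Hypothetical input filter: ASCII-only case folding, then a substring check."""
    return blocked not in text.translate(_ASCII_LOWER)

payload = "on\u212Aeydown"  # U+212A KELVIN SIGN in place of "k"

print(filter_allows(payload))  # True: the filter sees no match
print(payload.lower())         # onkeydown: the later lowercase step creates it

# The U+0130 case from the text, as handled by Python:
print("scr\u0130pt".lower() == "scri\u0307pt")  # True
```

The ordering is what makes this exploitable: any transformation applied after validation can reintroduce what the validation was meant to remove.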
  13 | 
  14 | ## Table of Contents 
  15 | * [Round Trip](#round-trip) 
  16 | * [Best Fit Mappings](#best-fit)
  17 | * [Charset Transcoding and Character Mappings](#transcoding) 
  18 | * [Normalization](#normalization)
  19 | * [Canonicalization of Non-Shortest Form UTF-8](#canonicalization)
  20 | * [Over-Consumption](#overconsumption)
  21 |   * [Well-formed and Ill-formed Byte Sequences](#formedness)
  22 |   * [Handling Ill-formed Byte Sequences](#handling)
  23 | * [Handling the Unexpected](#unexpected)
  24 |   * [Unexpected Inputs](#unexpected-inputs)
  25 |   * [Character Substitution](#unexpected-substitution)
  26 |   * [Character Deletion](#unexpected-deletion)
  27 | * [Upper and Lower Casing](#casing)
  28 | * [Buffer Overflows](#overflows)
  29 |   * [Upper and Lower Casing](#overflow-casing)
  30 |   * [Normalization](#overflow-normalization)
  31 | * [Controlling Syntax](#syntax)
  32 | * [Charset Mismatch](#charset)
  33 | 
  34 | 
  35 | ## <a id="round-trip"></a>Round-trip Conversions: A Common Pattern
  36 | In practice, globalized software must be capable of handling many different character sets, and converting data between them. The process for supporting this requirement can generally look like the following:
  37 | 
  38 | 1. Accept data from any character set, e.g. Unicode, shift_jis, ISO-8859-1.
  39 | 2. Transform, or convert, data to Unicode for processing and storage.
  40 | 3. Transform data to original or other character set for output and display.
  41 | 
  42 | In this pattern, Unicode is used as the broker. With support for such a large character repertoire, Unicode will often have a character mapping for both sides of this transaction. To illustrate this, consider the following Web application transaction.
  43 | 
  44 | 1. An application end-user inputs their full name using characters encoded from the shift_jis character set.
  45 | 2. Before storing in the database, the application converts the user-input to Unicode's UTF-8 format.
  46 | 3. When visiting the Web page, the user's full name will be returned in UTF-8 format, unless other conditions cause the data to be returned in a different encoding. Such conditions may be based on the Web application's configuration or the user's Web browser language and encoding settings. Under these types of conditions, the Web application will convert the data to the requested encoding.
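
The three steps above can be sketched with Python's built-in codecs. The sample name and the variable names are assumptions made for illustration; the second half shows a character that has no mapping in the destination character set, which is where later sections pick up.

```python
name = "\u65E5\u672C"  # sample input, written with escapes to stay encoding-neutral

# Steps 1 and 2: accept shift_jis bytes, convert to Unicode, store as UTF-8.
sjis_in = name.encode("shift_jis")
stored = sjis_in.decode("shift_jis").encode("utf-8")

# Step 3: convert back to the requested encoding for output.
sjis_out = stored.decode("utf-8").encode("shift_jis")
print(sjis_out == sjis_in)  # True: this value survives the round trip

# A character with no shift_jis mapping does not round-trip.
try:
    "\u0261".encode("shift_jis")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)                                       # False
print("\u0261".encode("shift_jis", errors="replace"))  # b'?'
```

Python's codecs either fail or substitute a replacement character here. Other frameworks may instead pick a similar-looking character, which is the best-fit behavior discussed next.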
  47 | 
  48 | The round-trip conversions illustrated here can lead to numerous issues that will be further discussed. While it serves as a good example, this isn't the only case where such issues can arise.
  49 | 
  50 | ## <a id="best-fit"></a>Best-fit Mappings
  51 | The "best-fit" phenomenon occurs when a character X gets transformed to an entirely different character Y.  This can occur for reasons such as:
  52 | 
  53 | * A framework API transforms input to a different character encoding by default.
  54 | * Data is marshalled from a wide string type (multi-byte character representation) such as UTF-16, to a non-wide string (single-byte character representation) such as US-ASCII. 
  55 | * Character X in the source encoding doesn't exist in the destination encoding, so the software attempts to find a 'best-fit' match.
  56 | 
  57 | In general, best-fit mappings occur when characters are transcoded between Unicode and another encoding.  It's often the case that the source encoding is Unicode and the destination is another charset such as shift_jis, however, it could happen in reverse as well. Best-fit mappings are different than character set transcoding which is discussed in another section of this guide.
  58 | 
  59 | Software vulnerabilities may arise when best-fit mappings occur. To name a few:
  60 | 
  61 | * Best-fit mappings are often not reversible, so data is irrevocably lost.  For example, a common best-fit operation would transform a <span class="uchar">U+FF1C FULLWIDTH LESS-THAN SIGN &#xFF1C;</span> to the  <span class="uchar">U+003C LESS-THAN SIGN</span>, or the ASCII &lt; used in HTML.  Once converted down to the ASCII &lt;, there’s no reliable way to convert back to the FULLWIDTH source.
  62 | * Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
  63 | * Characters can be manipulated to abuse logic in software, such as when the characters can be used to access files on the file system. In this case, a best-fit mapping to characters such as ../ or file:// could be damaging.
  64 | 
  65 | For example, consider a Web-application that’s implemented a filter to prevent XSS (cross-site scripting) attacks.  The filter attempts to block most dangerous characters, and operates at an outermost layer in the application.  The implementation might look like:
  66 | 
  67 | 1. An input validation filter rejects characters such as &lt;, &gt;, ', and " in a Web-application accepting UTF-8 encoded text.
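  68 | 2. An attacker sends in a <span class="uchar">U+FF1C FULLWIDTH LESS-THAN SIGN &#xFF1C;</span> in place of the ASCII &lt;, and a <span class="uchar">U+FF1E FULLWIDTH GREATER-THAN SIGN &#xFF1E;</span> in place of the ASCII &gt;.
  69 | 3. The attacker’s input looks like:  &#xFF1C;script&#xFF1E;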
  70 | 4. After passing through the XSS filter unchanged, the input moves deeper into the application.
  71 | 5. Another API, perhaps at the data access layer, is configured to use a different character set such as windows-1252. 
  72 | 6. On receiving the input, a data access layer converts the multi-byte UTF-8 text to the single-byte windows-1252 code page, forcing a best-fit conversion to the dangerous characters the original XSS filter was trying to block.
  73 | 7. The attacker’s input successfully persists to the database.
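
The flow above can be approximated in Python. Python's own codecs do not perform best-fit conversion, so this sketch uses NFKC normalization as a stand-in for the lossy step; that substitution, the filter function, and the payload are all assumptions made for illustration.

```python
import unicodedata

def xss_filter_allows(text):
    """Hypothetical outer filter: reject ASCII <, >, ' and "."""
    return not any(ch in text for ch in "<>'\"")

payload = "\uFF1Cscript\uFF1E"  # fullwidth less-than and greater-than signs

print(xss_filter_allows(payload))  # True: nothing dangerous is visible yet

# Stand-in for a best-fit conversion deeper in the stack (NFKC, not a real
# windows-1252 best-fit table).
folded = unicodedata.normalize("NFKC", payload)
print(folded)                     # <script>
print(xss_filter_allows(folded))  # False: too late, the filter already ran

# A strict conversion refuses the character instead of guessing.
try:
    payload.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```

Failing closed, as the strict encode does here, is generally the safer behavior. It turns a silent transformation into an error the application can handle.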
  74 | 
  75 | [Shawn Steele](http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx) describes the security issues well on his blog. It's a highly recommended short read for the level of coverage he provides regarding Microsoft's APIs:
  76 | 
  77 | > Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided. Windows and the .Net Framework have the concept of "best-fit" behavior for code pages and encodings. Best fit can be interesting, but often its not a good idea. In WideCharToMultiByte() this behavior is controlled by a WC_NO_BEST_FIT_CHARS flag. In .Net you can use the EncoderFallback to control whether or not to get Best Fit behavior. Unfortunately in both cases best fit is the default behavior. In Microsoft .Net 2.0 best fit is also slower.
  78 | 
  79 | As a software engineer, it's important to understand the APIs being used directly, and in some cases indirectly (by other processing on the stack). The following table of common library APIs lists known behaviors:
  80 | 
  81 | <table>
  82 |  <thead><tr>
  83 |   <td>Library</td>
  84 |   <td>API</td>
  85 |   <td>Best-fit default</td>
  86 |   <td>Can override</td>
  87 |   <td>Guidance</td>
  88 |  </tr>
  89 |  </thead>
  90 |  <tbody>
  91 |  <tr>
  92 |   <td>.NET 2.0</td>
  93 |   <td>System.Text.Encoding</td>
  94 |   <td>Yes</td>
  95 |   <td>Yes</td>
  96 |   <td>Specify EncoderReplacementFallback in the Encoding constructor.</td>
  97 |  </tr>
  98 |  <tr>
  99 |   <td>.NET 3.0</td>
 100 |   <td>System.Text.Encoding</td>
 101 |   <td>Yes</td>
 102 |   <td>Yes</td>
 103 |   <td>Specify
 104 |   EncoderReplacementFallback in the Encoding constructor.</td>
 105 |  </tr>
 106 |  <tr>
 107 |   <td>.NET 3.0</td>
 108 |   <td>DllImport</td>
 109 |   <td>Yes</td>
 110 |   <td>Yes</td>
 111 |   <td>To properly and more safely deal with this, you can use the
 112 |   MarshalAsAttribute class to specify a LPWStr type instead of a LPStr.
 113 |   [MarshalAs(UnmanagedType.LPWStr)]</td>
 114 |  </tr>
 115 |  <tr>
 116 |   <td>Win32</td>
 117 |   <td>WideCharToMultiByte</td>
 118 |   <td>Yes</td>
 119 |   <td>Yes</td>
 120 |   <td>Set the <a href="http://msdn.microsoft.com/en-us/library/dd374130(VS.85).aspx">WC_NO_BEST_FIT_CHARS</a> flag.</td>
 121 |  </tr>
 122 |  <tr>
 123 |   <td>Java</td>
 124 |   <td>TBD</td>
 125 |   <td></td>
 126 |   <td></td>
 127 |   <td>...</td>
 128 |  </tr>
 129 |  <tr>
 130 |   <td>ICU</td>
 131 |   <td>TBD</td>
 132 |   <td></td>
 133 |   <td></td>
 134 |   <td>...</td>
 135 |  </tr>
 136 | </tbody></table>
 137 | 
 138 | Another important point Shawn Steele makes on his blog is
 139 | that <a href="http://blogs.msdn.com/shawnste/archive/2007/09/24/are-we-going-to-update-or-maintain-the-best-fit-or-code-page-mappings.aspx">Microsoft
 140 | does not intend to maintain the best-fit mappings</a>. For these and other
 141 | security reasons it's a good idea to avoid best-fit type of behavior.
 142 | 
 143 | The following table lists test cases to run from a black-box, external perspective. By interpreting the output or rendered data, a tester can determine whether a best-fit conversion may be happening. Note that the mapping tables for best-fit conversions are numerous and large, leading to a nearly insurmountable number of permutations. To top it off, best-fit behavior varies between vendors, making for an inconsistent playing field that does not lend itself well to automation. For this reason, the focus here is on data that is known to either normalize or best-fit. The table below is by no means comprehensive, and is provided with the understanding that something is better than nothing.
 144 | 
 145 | <table>
 146 |  <thead><tr>
 147 |   <td>Target
 148 |   char</td>
 149 |   <td>Target code point</td>
 150 |   <td>Test code point</td>
 151 |   <td>Name</td>
 152 |  </tr>
 153 |  </thead>
 154 |  <tbody>
 155 |  <tr>
 156 |   <td>o</td>
 157 |   <td>\u006F</td>
 158 |   <td>\u2134</td>
 159 |   <td>SCRIPT SMALL O</td>
 160 |  </tr>
 161 |  <tr>
 162 |   <td>o</td>
 163 |   <td>\u006F</td>
 164 |   <td>\u014D</td>
 165 |   <td>LATIN SMALL LETTER O WITH MACRON</td>
 166 |  </tr>
 167 |  <tr>
 168 |   <td>s</td>
 169 |   <td>\u0073</td>
 170 |   <td>\u017F</td>
 171 |   <td>LATIN SMALL LETTER LONG S</td>
 172 |  </tr>
 173 |  <tr>
 174 |   <td>I</td>
 175 |   <td>\u0049</td>
 176 |   <td>\u0131</td>
 177 |   <td>LATIN SMALL
 178 |   LETTER DOTLESS I</td>
 179 |  </tr>
 180 |  <tr>
 181 |   <td>i</td>
 182 |   <td>\u0069</td>
 183 |   <td>\u0129</td>
 184 |   <td>LATIN SMALL LETTER I WITH
 185 |   TILDE</td>
 186 |  </tr>
 187 |  <tr>
 188 |   <td>K</td>
 189 |   <td>\u004B</td>
 190 |   <td>\u212A</td>
 191 |   <td>KELVIN SIGN</td>
 192 |  </tr>
 193 |  <tr>
 194 |   <td>k</td>
 195 |   <td>\u006B</td>
 196 |   <td>\u0137</td>
 197 |   <td>LATIN SMALL LETTER K WITH CEDILLA</td>
 198 |  </tr>
 199 |  <tr>
 200 |   <td>A</td>
 201 |   <td>\u0041</td>
 202 |   <td>\uFF21</td>
 203 |   <td>FULLWIDTH LATIN CAPITAL LETTER A</td>
 204 |  </tr>
 205 |  <tr>
 206 |   <td>a</td>
 207 |   <td>\u0061</td>
 208 |   <td>\u03B1</td>
 209 |   <td>GREEK SMALL LETTER ALPHA</td>
 210 |  </tr>
 211 |  <tr>
 212 |   <td>"</td>
 213 |   <td>\u0022</td>
 214 |   <td>\u02BA</td>
 215 |   <td>MODIFIER
 216 |   LETTER DOUBLE PRIME</td>
 217 |  </tr>
 218 |  <tr>
 219 |   <td>"</td>
 220 |   <td>\u0022</td>
 221 |   <td>\u030E</td>
 222 |   <td>COMBINING DOUBLE VERTICAL LINE
 223 |   ABOVE</td>
 224 |  </tr>
 225 |  <tr>
 226 |   <td>"</td>
 227 |   <td>\u0022</td>
 228 |   <td>\uFF02</td>
 229 |   <td>FULLWIDTH QUOTATION MARK</td>
 230 |  </tr>
 231 |  <tr>
 232 |   <td>'</td>
 233 |   <td>\u0027</td>
 234 |   <td>\u02B9</td>
 235 |   <td>MODIFIER LETTER PRIME</td>
 236 |  </tr>
 237 |  <tr>
 238 |   <td>'</td>
 239 |   <td>\u0027</td>
 240 |   <td>\u030D</td>
 241 |   <td>COMBINING VERTICAL LINE ABOVE</td>
 242 |  </tr>
 243 |  <tr>
 244 |   <td>'</td>
 245 |   <td>\u0027</td>
 246 |   <td>\uFF07</td>
 247 |   <td>FULLWIDTH APOSTROPHE</td>
 248 |  </tr>
 249 |  <tr>
 250 |   <td>&lt;</td>
 251 |   <td>\u003C</td>
 252 |   <td>\uFF1C</td>
 253 |   <td>FULLWIDTH LESS-THAN SIGN</td>
 254 |  </tr>
 255 |  <tr>
 256 |   <td>&lt;</td>
 257 |   <td>\u003C</td>
 258 |   <td>\uFE64</td>
 259 |   <td>SMALL LESS-THAN SIGN</td>
 260 |  </tr>
 261 |  <tr>
 262 |   <td>&lt;</td>
 263 |   <td>\u003C</td>
 264 |   <td>\u2329</td>
 265 |   <td>LEFT-POINTING ANGLE BRACKET</td>
 266 |  </tr>
 267 |  <tr>
 268 |   <td>&lt;</td>
 269 |   <td>\u003C</td>
 270 |   <td>\u3008</td>
 271 |   <td>LEFT ANGLE BRACKET</td>
 272 |  </tr>
 273 |  <tr>
 274 |   <td>&lt;</td>
 275 |   <td>\u003C</td>
 276 |   <td>\u00AB</td>
 277 |   <td>LEFT-POINTING
 278 |   DOUBLE ANGLE QUOTATION MARK</td>
 279 |  </tr>
 280 |  <tr>
 281 |   <td>&gt;</td>
 282 |   <td>\u003E</td>
 283 |   <td>\u00BB</td>
 284 |   <td>RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK</td>
 285 |  </tr>
 286 |  <tr>
 287 |   <td>&gt;</td>
 288 |   <td>\u003E</td>
 289 |   <td>\u3009</td>
 290 |   <td>RIGHT ANGLE BRACKET</td>
 291 |  </tr>
 292 |  <tr>
 293 |   <td>&gt;</td>
 294 |   <td>\u003E</td>
 295 |   <td>\u232A</td>
 296 |   <td>RIGHT-POINTING ANGLE BRACKET</td>
 297 |  </tr>
 298 |  <tr>
 299 |   <td>&gt;</td>
 300 |   <td>\u003E</td>
 301 |   <td>\uFE65</td>
 302 |   <td>SMALL GREATER-THAN SIGN</td>
 303 |  </tr>
 304 |  <tr>
 305 |   <td>&gt;</td>
 306 |   <td>\u003E</td>
 307 |   <td>\uFF1E</td>
 308 |   <td>FULLWIDTH GREATER-THAN SIGN</td>
 309 |  </tr>
 310 |  <tr>
 311 |   <td>:</td>
 312 |   <td>\u003A</td>
 313 |   <td>\u2236</td>
 314 |   <td>RATIO</td>
 315 |  </tr>
 316 |  <tr>
 317 |   <td>:</td>
 318 |   <td>\u003A</td>
 319 |   <td>\u0589</td>
 320 |   <td>ARMENIAN FULL STOP</td>
 321 |  </tr>
 322 |  <tr>
 323 |   <td>:</td>
 324 |   <td>\u003A</td>
 325 |   <td>\uFE13</td>
 326 |   <td>PRESENTATION FORM FOR VERTICAL COLON</td>
 327 |  </tr>
 328 |  <tr>
 329 |   <td>:</td>
 330 |   <td>\u003A</td>
 331 |   <td>\uFE55</td>
 332 |   <td>SMALL COLON</td>
 333 |  </tr>
 334 |  <tr>
 335 |   <td>:</td>
 336 |   <td>\u003A</td>
 337 |   <td>\uFF1A</td>
 338 |   <td>FULLWIDTH
 339 |   COLON</td>
 340 |  </tr>
 341 | </tbody></table>
 342 | 
 343 | These test cases are largely derived from the <a href="https://unicode.org/Public/MAPPINGS/VENDORS/">public best-fit mappings provided by the Unicode Consortium</a>. These mappings are provided to software vendors, but that does not necessarily mean they were implemented as documented. In fact, any
 344 | software vendor, such as Microsoft, IBM, or Oracle, can implement these mappings as it desires.
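
One way to drive the table from a test harness is to send each test code point through the system and check whether its ASCII target comes back. The following is a minimal sketch in Python, not a definitive tool. `BEST_FIT_CASES` holds a few rows copied from the table, while `looks_best_fit`, `run_cases`, and the `echo` callable are hypothetical names standing in for whatever round trip the system under test performs.

```python
# A few (test code point -> ASCII target) rows copied from the table above.
BEST_FIT_CASES = {
    "\u2134": "o",   # SCRIPT SMALL O
    "\u017F": "s",   # LATIN SMALL LETTER LONG S
    "\u212A": "K",   # KELVIN SIGN
    "\uFF1C": "<",   # FULLWIDTH LESS-THAN SIGN
    "\uFF07": "'",   # FULLWIDTH APOSTROPHE
}

def looks_best_fit(sent: str, received: str) -> bool:
    """Return True if `sent` came back as its ASCII look-alike."""
    target = BEST_FIT_CASES.get(sent)
    return target is not None and received == target

def run_cases(echo):
    """Send every test character through `echo` (hypothetical: the system
    under test) and return the ones that were silently mapped to ASCII."""
    return [ch for ch in BEST_FIT_CASES if looks_best_fit(ch, echo(ch))]

# Stand-in for a system that best-fits FULLWIDTH LESS-THAN SIGN only.
def fake_echo(ch: str) -> str:
    return "<" if ch == "\uFF1C" else ch

print([f"U+{ord(ch):04X}" for ch in run_cases(fake_echo)])
```

A character that comes back unchanged, or as a replacement such as `?`, is not reported; only an exact match with the ASCII target is treated as a possible best-fit mapping.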
 345 | 
 346 | ## <a id="transcoding"></a>Charset Transcoding and Character Mappings
 347 | 
 348 | Sometimes characters and strings are transcoded from a source character set into a destination character set. On the surface this phenomenon may seem similar to best-fit mappings, but the process is quite different. In general, when software transcodes data from source charset X to destination charset Y, it follows either a data-driven mapping table or an algorithmic formula.
 349 | 
 350 | For the most part this process is data-driven. Although the mapping tables are generally standardized, implementations may differ between vendors. ICU
 351 | maintains a list of its <a href="http://site.icu-project.org/charts/charset">character set mapping tables</a> online. Also, ICU's <a href="http://demo.icu-project.org/icu-bin/convexp">Converter Explorer</a> tool lets you browse the maintained charset mapping tables.
 352 | 
 353 | Data may be transcoded directly from a source charset to a destination charset; however, it's also common to use Unicode as the broker. In the latter case the software will first transcode the source charset to Unicode, and from there to the destination charset. Some vendors such as Microsoft are known to leverage the Private Use Area (PUA) when transcoding to Unicode, either when a direct mapping cannot be found or when a source byte sequence is invalid or illegal. It's important to be aware of a few pitfalls during the transcoding process.
 354 | 
 355 | * When data is transcoded to the PUA, converting it again from the PUA may have unexpected consequences.
 356 | * Data can change length, particularly if transcoding to/from a single-byte charset leads to a multi-byte character in the other charset. 
 357 | 
 358 | As a software engineer building a mechanism for transcoding data between charsets, it's important to understand these pitfalls and handle these unexpected cases gracefully.
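
The length-change and reversibility pitfalls are easy to reproduce. The sketch below uses Python's built-in codecs with ISO-8859-1 (`latin-1`) and UTF-8 as example charsets; `transcode_strict` is a hypothetical helper name, and other charset pairs or libraries may behave differently.

```python
text = "caf\u00e9"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE at the end

latin1 = text.encode("latin-1")   # single-byte charset: one byte per character
utf8 = text.encode("utf-8")       # U+00E9 needs two bytes in UTF-8

# Pitfall: the same text changes length between charsets.
print(len(latin1), len(utf8))

# Pitfall: transcoding is not always reversible. U+20AC EURO SIGN has no
# mapping in ISO-8859-1, so a lenient conversion substitutes '?'.
lossy = "\u20ac100".encode("latin-1", errors="replace")
round_trip = lossy.decode("latin-1")
print(round_trip)

def transcode_strict(data: bytes, src: str, dst: str) -> bytes:
    """Transcode via Unicode and fail loudly instead of substituting."""
    return data.decode(src, errors="strict").encode(dst, errors="strict")
```

Failing with an error forces the caller to decide what to do with unmappable data, rather than letting a substituted character flow into later string handling.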
 359 | 
 360 | Software vulnerabilities can arise through charset transcodings. To name a few:
 361 | 
 362 | * Transcoding data is not always reversible, so data can be irrevocably lost.
 363 | * Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
 364 | * Characters can be manipulated to abuse logic in software. For example, characters transcoded into ../ or file:// would prove detrimental in file handling operations. 
 365 | 
 366 | ## <a id="normalization"></a>Normalization
 367 | 
 368 | In Unicode, Normalization of characters and strings follows a specification defined in the <a href="https://unicode.org/reports/tr15/">Unicode Standard Annex #15: Unicode Normalization Forms</a>.  The details of Normalization are not for the faint of heart and will not be discussed in this guide. For engineers and testers, it's at least important to understand that there are four
 369 | Normalization forms defined: 
 370 | 
 371 | * NFC - Canonical Decomposition, followed by Canonical Composition
 372 | * NFD - Canonical Decomposition
 373 | * NFKC - Compatibility Decomposition, followed by Canonical Composition
 374 | * NFKD - Compatibility Decomposition
 375 | 
 376 | When testing for security vulnerabilities, we're often most interested in the <strong>compatibility decomposition forms (NFKC, NFKD)</strong>, but occasionally the canonical decomposition forms will produce interesting transformations as well. Cases where characters, and sequences of characters, transform into something different from the original source might be used to bypass filters or produce other exploits.  Consider the following image, which depicts the result of normalizing with either NFKC or NFKD for the character <span class="uchar">U+FE64 SMALL LESS-THAN SIGN</span>.
 377 | 
 378 | <img class="center" style="max-width: 50%;" src="{{ site.url }}/img/normalization-nfkc-nfkd-003C.png" />
 379 | 
 380 | In the above example, the character U+FE64 will transform into U+003C, which might lead to a security vulnerability in HTML applications. Consider the next example, which shows the result of either NFD or NFKD decomposition applied to the "Turkish I" character <span class="uchar">U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE</span>.
 381 | 
 382 | <img class="center" style="max-width: 60%;" src="{{ site.url }}/img/normalization-turkish-i.png" />
 383 | 
 384 | As a software engineer, it becomes evident that Unicode normalization plays an important role, and that it is not always an explicit choice.  Often normalization is applied implicitly by the underlying framework, platform, or Web browser.  It's important to understand the APIs being used directly, and in some cases indirectly (by other processing on the stack).
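
The two examples above can be reproduced with Python's standard `unicodedata` module. This is only an illustrative sketch; the variable names are arbitrary, and other platforms expose normalization through their own APIs.

```python
import unicodedata

small_lt = "\uFE64"    # SMALL LESS-THAN SIGN
turkish_i = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

def code_points(s: str):
    """Format a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    lt = unicodedata.normalize(form, small_lt)
    ti = unicodedata.normalize(form, turkish_i)
    print(form, code_points(lt), code_points(ti))
```

Only the compatibility forms turn U+FE64 into U+003C, while both decomposition forms (NFD and NFKD) split U+0130 into U+0049 followed by U+0307 COMBINING DOT ABOVE.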
 385 | 
 386 | ### <a id="normalization-apis"></a>Normalization Defaults in Common Libraries
 387 | The following table of common library APIs lists known behaviors:
 388 | 
 389 | <table>
 390 |  <thead><tr>
 391 |   <td>Library</td>
 392 |   <td>API</td>
 393 |   <td>Default</td>
 394 |   <td>Can override</td>
 395 |   <td>Notes</td>
 396 |  </tr>
 397 |  </thead>
 398 |  <tbody>
 399 |  <tr>
 400 |   <td>.NET</td>
 401 |   <td>System.Text.Encoding</td>
 402 |   <td>NFC</td>
 403 |   <td>Yes</td>
 404 |   <td></td>
 405 |  </tr>
 406 |  <tr>
 407 |   <td>Win32</td>
 408 |   <td></td>
 409 |   <td></td>
 410 |   <td></td>
 411 |   <td></td>
 412 |  </tr>
 413 |  <tr>
 414 |   <td>Java</td>
 415 |   <td></td>
 416 |   <td></td>
 417 |   <td></td>
 418 |   <td></td>
 419 |  </tr>
 420 |  <tr>
 421 |   <td>ICU</td>
 422 |   <td></td>
 423 |   <td></td>
 424 |   <td></td>
 425 |   <td></td>
 426 |  </tr>
 427 |  <tr>
 428 |   <td>Ruby</td>
 429 |   <td></td>
 430 |   <td></td>
 431 |   <td></td>
 432 |   <td></td>
 433 |  </tr>
 434 |  <tr>
 435 |   <td>Python</td>
 436 |   <td></td>
 437 |   <td></td>
 438 |   <td></td>
 439 |   <td></td>
 440 |  </tr>
 441 |  <tr>
 442 |   <td>PHP</td>
 443 |   <td></td>
 444 |   <td></td>
 445 |   <td></td>
 446 |   <td></td>
 447 |  </tr>
 448 |  <tr>
 449 |   <td>Perl</td>
 450 |   <td></td>
 451 |   <td></td>
 452 |   <td></td>
 453 |   <td></td>
 454 |  </tr>
 455 |  <tr>
 456 |   <td>JavaScript</td>
 457 |   <td>String.prototype.normalize()</td>
 458 |   <td>NFC</td>
 459 |   <td>Yes</td>
 460 |   <td><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize">MDN</a></td>
 461 |  </tr>
 462 | </tbody></table>
 463 | 
 464 | ### <a id="normalization-browsers"></a>Normalization in Web Browser URLs
 465 | The following table captures how Web browsers normalize URLs.  Differences in normalization and character transformations can lead to incompatibility as well as security vulnerabilities.
 466 | <span class="superscript"><a href="http://web.lookout.net/2012/03/unicode-normalization-in-urls.html">source</a></span>
 467 | 
 468 | <table>
 469 |  <thead><tr>
 470 |   <td>Description</td>
 471 |   <td>MSIE 9</td>
 472 |   <td>FF 5.0</td>
 473 |   <td>Chrome 12</td>
 474 |   <td>Safari 5</td>
 475 |   <td>Opera 11.5</td>
 476 |  </tr>
 477 |  </thead>
 478 |  <tbody>
 479 |  <tr>
 480 |   <td>Applies normalization in the path</td>
 481 |   <td class="green">No</td>
 482 |   <td class="green">No</td>
 483 |   <td class="green">No</td>
 484 |   <td class="red">Yes - NFC</td>
 485 |   <td class="green">No</td>
 486 |  </tr>
 487 |  <tr>
 488 |   <td>Applies normalization in the query</td>
 489 |   <td class="green">No</td>
 490 |   <td class="green">No</td>
 491 |   <td class="green">No</td>
 492 |   <td class="red">Yes - NFC</td>
 493 |   <td class="green">No</td>
 494 |  </tr>
 495 |  <tr>
 496 |   <td>Applies normalization in the fragment</td>
 497 |   <td class="green">No</td>
 498 |   <td class="green">No</td>
 499 |   <td class="red">Yes - NFC</td>
 500 |   <td class="red">Yes - NFC</td>
 501 |   <td class="green">No</td>
 502 |  </tr>
 503 | </tbody></table>
 504 | 
 505 | ### <a id="normalization-test"></a>Normalization Test Cases
 506 | The following table lists test cases to run from a black-box, external perspective. By interpreting the output/rendered data, a tester can determine if a normalization transformation may be happening.
 507 | 
 508 | <table>
 509 |  <thead><tr>
 510 |   <td><b>Target char</b></td>
 511 |   <td><b>Target code point</b></td>
 512 |   <td><b>Test code point</b></td>
 513 |   <td><b>Name</b></td>
 514 |  </tr>
 515 |  </thead>
 516 |  <tbody>
 517 |  <tr>
 518 |   <td><b>o</b></td>
 519 |   <td>\u006F</td>
 520 |   <td>\u2134</td>
 521 |   <td>SCRIPT SMALL O</td>
 522 |  </tr>
 523 |  <tr>
 524 |   <td><b>s</b></td>
 525 |   <td>\u0073</td>
 526 |   <td>\u017F</td>
 527 |   <td>LATIN SMALL LETTER LONG S</td>
 528 |  </tr>
 529 |  <tr>
 530 |   <td><b>K</b></td>
 531 |   <td>\u004B</td>
 532 |   <td>\u212A</td>
 533 |   <td>KELVIN SIGN</td>
 534 |  </tr>
 535 |  <tr>
 536 |   <td><b>A</b></td>
 537 |   <td>\u0041</td>
 538 |   <td>\uFF21</td>
 539 |   <td>FULLWIDTH LATIN CAPITAL LETTER A</td>
 540 |  </tr>
 541 |  <tr>
 542 |   <td><b>"</b></td>
 543 |   <td>\u0022</td>
 544 |   <td>\uFF02</td>
 545 |   <td>FULLWIDTH
 546 |   QUOTATION MARK</td>
 547 |  </tr>
 548 |  <tr>
 549 |   <td><b>'</b></td>
 550 |   <td>\u0027</td>
 551 |   <td>\uFF07</td>
 552 |   <td>FULLWIDTH APOSTROPHE</td>
 553 |  </tr>
 554 |  <tr>
 555 |   <td><b>&lt;</b></td>
 556 |   <td>\u003C</td>
 557 |   <td>\uFF1C</td>
 558 |   <td>FULLWIDTH
 559 |   LESS-THAN SIGN</td>
 560 |  </tr>
 561 |  <tr>
 562 |   <td><b>&lt;</b></td>
 563 |   <td>\u003C</td>
 564 |   <td>\uFE64</td>
 565 |   <td>SMALL LESS-THAN SIGN</td>
 566 |  </tr>
 567 |  <tr>
 568 |   <td><b>&gt;</b></td>
 569 |   <td>\u003E</td>
 570 |   <td>\uFE65</td>
 571 |   <td>SMALL
 572 |   GREATER-THAN SIGN</td>
 573 |  </tr>
 574 |  <tr>
 575 |   <td><b>&gt;</b></td>
 576 |   <td>\u003E</td>
 577 |   <td>\uFF1E</td>
 578 |   <td>FULLWIDTH GREATER-THAN SIGN</td>
 579 |  </tr>
 580 |  <tr>
 581 |   <td><b>:</b></td>
 582 |   <td>\u003A</td>
 583 |   <td>\uFE13</td>
 584 |   <td>PRESENTATION FORM FOR VERTICAL COLON</td>
 585 |  </tr>
 586 |  <tr>
 587 |   <td><b>:</b></td>
 588 |   <td>\u003A</td>
 589 |   <td>\uFE55</td>
 590 |   <td>SMALL COLON</td>
 591 |  </tr>
 592 |  <tr>
 593 |   <td><b>:</b></td>
 594 |   <td>\u003A</td>
 595 |   <td>\uFF1A</td>
 596 |   <td>FULLWIDTH COLON</td>
 597 |  </tr>
 598 | </tbody></table>
 599 | 
 600 | TODO If you've determined that input is being normalized but need different characters to exploit the logic, you may use the accompanying test case database.
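
The rows in the table can also be checked locally, which helps when deciding which characters to send. The sketch below uses Python's `unicodedata` module; `CASES` is transcribed from the table, and `folds_to_target` and `is_safe` are hypothetical helpers. `is_safe` illustrates one defensive ordering (normalize first, then validate) and is an assumption for illustration, not a complete filter.

```python
import unicodedata

# (target char, test code point) pairs transcribed from the table above.
CASES = [
    ("o", "\u2134"), ("s", "\u017F"), ("K", "\u212A"), ("A", "\uFF21"),
    ('"', "\uFF02"), ("'", "\uFF07"), ("<", "\uFF1C"), ("<", "\uFE64"),
    (">", "\uFE65"), (">", "\uFF1E"), (":", "\uFE13"), (":", "\uFE55"),
    (":", "\uFF1A"),
]

def folds_to_target(target: str, test_char: str, form: str = "NFKC") -> bool:
    """True if normalizing `test_char` with `form` yields `target`."""
    return unicodedata.normalize(form, test_char) == target

# Every row should fold to its target under NFKC.
results = {test: folds_to_target(target, test) for target, test in CASES}

def is_safe(user_input: str) -> bool:
    """Hypothetical filter: normalize first, then reject markup characters."""
    normalized = unicodedata.normalize("NFKC", user_input)
    return not any(c in normalized for c in "<>\"'")

print(all(results.values()), is_safe("\uFE64script\uFE65"))
```

A filter that validates before normalizing would accept `\uFE64script\uFE65` and only later see it become `<script>`, which is the kind of bypass these test cases are meant to reveal.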
 601 | 
 602 | ## <a id="canonicalization"></a>Canonicalization of Non-Shortest Form UTF-8
 603 | The UTF-8 encoding algorithm allows for a single code point to be represented in multiple ways. That is, while the Latin letter 'A' is normally represented using the byte 0x41 in UTF-8, its non-shortest form, or overlong encoding, would be any of the following:
 604 | 
 605 | * 0xC1 0x81
 606 | * 0xE0 0x81 0x81
 607 | * 0xF0 0x80 0x81 0x81
 608 | * etc...
 609 | 
 610 | Earlier versions of the Unicode Standard applied Postel's law, or the robustness principle of 'be conservative in what you do, be liberal in what you accept from others.' While the 'generation' of non-shortest form UTF-8 was forbidden, the 'interpretation' of it was allowed.  That changed with Unicode Standard version 3.0, when the requirement changed to prohibit both interpretation and generation. In fact, both the 'generation' and 'interpretation' of non-shortest form UTF-8 are currently prohibited by the standard, with one exception - that 'interpretation' only applies to the Basic Multilingual Plane (BMP) code points between U+0000 and U+FFFF. In terms of the common security vulnerabilities discussed in this document, that exception has no bearing, as the ASCII range of characters is not exempt.
 611 | 
 612 | Given the history of security vulnerabilities around overlong UTF-8, many frameworks have defaulted to a more secure position of disallowing these forms to be both generated and interpreted. However, it seems that some software still interprets non-shortest form UTF-8 for BMP characters, including ASCII. A common pattern in software follows:
 613 | 
 614 | > Process A performs security checks, but does not check for non-shortest forms.
 615 | 
 616 | > Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms.
 617 | 
 618 | > The UTF-16 text may then contain characters that should have been filtered out by process A. [source](https://unicode.org/versions/corrigendum1.html)
 619 | 
 620 | The overlong form of UTF-8 byte sequences is currently considered an illegal byte sequence. It's therefore a good test case to attempt in software such as Web applications, browsers, and databases.
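
The Process A / Process B pattern above can be sketched in a few lines. The example contrasts Python's built-in strict UTF-8 decoder, which rejects overlong forms, with a deliberately unsafe decoder written only for illustration; `naive_two_byte_decode` and `decodes_strictly` are hypothetical names.

```python
OVERLONG_SLASH = bytes([0xC0, 0xAF])   # non-shortest form of '/' (0x2F)

def decodes_strictly(data: bytes) -> bool:
    """True if a conformant decoder accepts the bytes as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

def naive_two_byte_decode(data: bytes) -> str:
    """Deliberately unsafe: applies the two-byte bit layout without
    checking that the code point actually needed two bytes."""
    lead, trail = data
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))

# A filter that scans the raw bytes for 0x2F sees nothing suspicious...
print(b"/" in OVERLONG_SLASH)
# ...but the lenient decoder downstream still produces a solidus.
print(naive_two_byte_decode(OVERLONG_SLASH))
# A conformant decoder refuses the sequence outright.
print(decodes_strictly(OVERLONG_SLASH))
```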
 621 | 
 622 | Some notes about canonicalization and UTF-8 encoded data:
 623 | 
 624 | * The ASCII range (0x00 to 0x7F) is preserved in UTF-8.
 625 | * UTF-8 can encode any Unicode character U+000000 through U+10FFFF using any number of bytes, thus leading to the non-shortest form problem.
 626 | * The Unicode standard (3.0 and later) requires that a code point be serialized in UTF-8 using a byte sequence of one to four bytes in length. [Corrigendum #1: UTF-8 Shortest Form](https://unicode.org/versions/corrigendum1.html) introduced this conformance requirement.
 627 | 
 628 | __Non-shortest form UTF-8__ has been the vector for critical vulnerabilities in the past. Examples range from the [Microsoft IIS 4.0 and 5.0 directory traversal vulnerability](http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx) of 2000 to its rediscovery in the product's [WebDAV component in 2009](http://blog.zoller.lu/2009/05/iis-6-webdac-auth-bypass-and-data.html).
 629 | 
 630 | Some of the common security vulnerabilities that use non-shortest form UTF-8 as an attack vector include:
 631 | 
 632 | * Directory/folder traversal.
 633 | * Bypassing folder and file access filters.
 634 | * Bypassing HTML and XSS filters.
 635 | * Bypassing WAF and NIDS type devices.
 636 | 
 637 | As a developer trying to protect against this, it becomes important to understand the APIs being used directly, and in some cases indirectly (by other processing on the stack). The following table of common library APIs lists known behaviors:
 638 | 
 639 | <table>
 640 |  <thead><tr>
 641 |   <td>Library</td>
 642 |   <td>API</td>
 643 |   <td>Allows non-shortest UTF8</td>
 644 |   <td>Can override </td>
 645 |   <td>Notes</td>
 646 |  </tr>
 647 |  </thead>
 648 |  <tbody>
 649 |  <tr>
 650 |   <td>.NET 2.0</td>
 651 |   <td>System.Text.Encoding</td>
 652 |   <td></td>
 653 |   <td></td>
 654 |   <td></td>
 655 |  </tr>
 656 |  <tr>
 657 |   <td>.NET 3.0</td>
 658 |   <td>System.Text.Encoding</td>
 659 |   <td></td>
 660 |   <td></td>
 661 |   <td></td>
 662 |  </tr>
 663 |  <tr>
 664 |   <td>ICU</td>
 665 |   <td>TBD</td>
 666 |   <td></td>
 667 |   <td></td>
 668 |   <td></td>
 669 |  </tr>
 670 | </tbody></table>
 671 | 
 672 | As a tester/bug hunter looking for the vulnerabilities, the following table lists test cases to run from a black-box, external perspective. The data in this table presents the first few non-shortest form (__NSF__) UTF-8 encodings as URL-encoded data (%NN). If you need __raw bytes__ instead, these same hex values apply. All of the target chars in the first column are ASCII.
 673 | 
 674 | <table>
 675 |  <thead><tr>
 676 |   <td>Target </td>
 677 |   <td>NSF 1</td>
 678 |   <td>NSF 2</td>
 679 |   <td>NSF 3</td>
 680 |   <td>Notes</td>
 681 |   <td></td>
 682 |  </tr>
 683 |  </thead>
 684 |  <tbody>
 685 |  <tr>
 686 |   <td>A</td>
 687 |   <td>%C1%81</td>
 688 |   <td>%E0%81%81</td>
 689 |   <td>%F0%80%81%81</td>
 690 |   <td>Latin A useful as a base test case.</td>
 691 |   <td></td>
 692 |  </tr>
 693 |  <tr>
 694 |   <td>"</td>
 695 |   <td>%C0%A2</td>
 696 |   <td>%E0%80%A2</td>
 697 |   <td>%F0%80%80%A2</td>
 698 |   <td>Double quote</td>
 699 |   <td></td>
 700 |  </tr>
 701 |  <tr>
 702 |   <td>'</td>
 703 |   <td>%C0%A7</td>
 704 |   <td>%E0%80%A7</td>
 705 |   <td>%F0%80%80%A7</td>
 706 |   <td>Single quote</td>
 707 |   <td></td>
 708 |  </tr>
 709 |  <tr>
 710 |   <td>&lt;</td>
 711 |   <td>%C0%BC</td>
 712 |   <td>%E0%80%BC</td>
 713 |   <td>%F0%80%80%BC</td>
 714 |   <td>Less-than
 715 |   sign</td>
 716 |   <td></td>
 717 |  </tr>
 718 |  <tr>
 719 |   <td>&gt;</td>
 720 |   <td>%C0%BE</td>
 721 |   <td>%E0%80%BE</td>
 722 |   <td>%F0%80%80%BE</td>
 723 |   <td>Greater-than sign</td>
 724 |   <td></td>
 725 |  </tr>
 726 |  <tr>
 727 |   <td>.</td>
 728 |   <td>%C0%AE</td>
 729 |   <td>%E0%80%AE</td>
 730 |   <td>%F0%80%80%AE</td>
 731 |   <td>Full stop </td>
 732 |   <td></td>
 733 |  </tr>
 734 |  <tr>
 735 |   <td>/</td>
 736 |   <td>%C0%AF</td>
 737 |   <td>%E0%80%AF</td>
 738 |   <td>%F0%80%80%AF</td>
 739 |   <td>Solidus</td>
 740 |   <td></td>
 741 |  </tr>
 742 |  <tr>
 743 |   <td>\</td>
 744 |   <td>%C1%9C</td>
 745 |   <td>%E0%81%9C</td>
 746 |   <td>%F0%80%81%9C</td>
 747 |   <td>Reverse
 748 |   solidus</td>
 749 |   <td></td>
 750 |  </tr>
 751 | </tbody></table>
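
The values in the table follow mechanically from the UTF-8 bit layout, so additional ASCII targets can be generated rather than looked up. The following is a minimal Python sketch; `overlong_forms` and `url_encode` are hypothetical helper names, and the sketch only handles ASCII targets.

```python
def overlong_forms(ch: str):
    """Return the 2-, 3-, and 4-byte non-shortest forms of an ASCII character."""
    cp = ord(ch)
    if cp > 0x7F:
        raise ValueError("this sketch only handles ASCII targets")
    high = cp >> 6        # top bit of the 7-bit value (0 or 1)
    low = cp & 0x3F       # low six bits, carried by the last trailing byte
    two = bytes([0xC0 | high, 0x80 | low])
    three = bytes([0xE0, 0x80 | high, 0x80 | low])
    four = bytes([0xF0, 0x80, 0x80 | high, 0x80 | low])
    return [two, three, four]

def url_encode(raw: bytes) -> str:
    """Render every byte as %NN, matching the notation used in the table."""
    return "".join(f"%{b:02X}" for b in raw)

for target in ("A", "/", "\\"):
    print(target, [url_encode(form) for form in overlong_forms(target)])
```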
 752 | 
 753 | 
 754 | 
 755 | ## <a id="overconsumption"></a>Over-consumption
 756 | 
 757 | The Unicode Transformation Formats (e.g. UTF-8 and UTF-16) serialize code points into legal, or well-formed, byte sequences, also called code units. For example, consider the following code points and their corresponding well-formed code units in UTF-8 format.
 758 | 
 759 | <table>
 760 |  <thead><tr>
 761 |   <td>Code
 762 |   point</td>
 763 |   <td>Description</td>
 764 |   <td>UTF-8
 765 |   byte sequence</td>
 766 |  </tr>
 767 |  </thead>
 768 |  <tbody>
 769 |  <tr>
 770 |   <td>U+0041</td>
 771 |   <td>LATIN CAPITAL LETTER A</td>
 772 |   <td>0x41</td>
 773 |  </tr>
 774 |  <tr>
 775 |   <td>U+FF21</td>
 776 |   <td>FULLWIDTH LATIN CAPITAL LETTER A</td>
 777 |   <td>0xEF 0xBC 0xA1</td>
 778 |  </tr>
 779 |  <tr>
 780 |   <td>U+00C0</td>
 781 |   <td>LATIN CAPITAL LETTER A WITH GRAVE</td>
 782 |   <td>0xC3 0x80</td>
 783 |  </tr>
 784 | </tbody></table>
 785 | 
 786 | The following are the same code points in their corresponding well-formed UTF-16 (big endian) format.
 787 | 
 788 | <table>
 789 |  <thead><tr>
 790 |   <td>Code point</td>
 791 |   <td>Description</td>
 792 |   <td>UTF-16BE byte sequence</td>
 793 |  </tr>
 794 |  </thead>
 795 |  <tbody>
 796 |  <tr>
 797 |   <td>U+0041</td>
 798 |   <td>LATIN CAPITAL LETTER A</td>
 799 |   <td>0x00 0x41</td>
 800 |  </tr>
 801 |  <tr>
 802 |   <td>U+FF21</td>
 803 |   <td>FULLWIDTH LATIN CAPITAL LETTER A</td>
 804 |   <td>0xFF 0x21</td>
 805 |  </tr>
 806 |  <tr>
 807 |   <td>U+00C0</td>
 808 |   <td>LATIN CAPITAL LETTER A WITH GRAVE</td>
 809 |   <td>0x00 0xC0</td>
 810 |  </tr>
 811 | </tbody></table>
 812 | 
 813 | ### <a id="formedness"></a>Well-formed and Ill-formed Byte Sequences
 814 | Consider a UTF-8 decoder consuming a stream of data from a file. It encounters a well-formed byte sequence like:
 815 | 
 816 | &lt;41 C3 80 41&gt;
 817 | 
 818 | This sequence is made up of three well-formed _sub-sequences_.  First is the &lt;41&gt;, second is the &lt;C3 80&gt;, and third is the &lt;41&gt;. The second subsequence &lt;C3 80&gt; is a two-byte sequence. The lead byte C3 indicates a two-byte sequence, and the trailing byte 80 is a valid trailing byte. The table below indicates these relationships. Now consider that the UTF-8 decoder encounters an __ill-formed byte sequence__:
 819 | 
 820 | &lt;41 C2 C3 80 41&gt;
 821 | 
 822 | Taken apart, there are three minimally well-formed subsequences &lt;41&gt;, &lt;C3 80&gt;, and &lt;41&gt;. However, the &lt;C2&gt; is ill-formed because it doesn't have a valid trailing byte, which would be required per the table below. 
 823 | 
 824 | <table>
 825 |  <thead>
 826 |   <tr>
 827 |    <td>Code point</td>
 828 |    <td>First byte</td>
 829 |    <td>Second byte</td>
 830 |    <td>Third byte</td>
 831 |    <td>Fourth byte</td>
 832 |   </tr>
 833 |  </thead>
 834 |  <tbody><tr>
 835 |   <td>U+0000..U+007F</td>
 836 |   <td>00..7F</td>
 837 |   <td></td>
 838 |   <td></td>
 839 |   <td></td>
 840 |  </tr>
 841 |  <tr>
 842 |   <td>U+0080..U+07FF</td>
 843 |   <td>C2..DF</td>
 844 |   <td>80..BF</td>
 845 |   <td></td>
 846 |   <td></td>
 847 |  </tr>
 848 |  <tr>
 849 |   <td>U+0800..U+0FFF</td>
 850 |   <td>E0</td>
 851 |   <td>A0..BF</td>
 852 |   <td>80..BF</td>
 853 |   <td></td>
 854 |  </tr>
 855 |  <tr>
 856 |   <td>U+1000..U+CFFF</td>
 857 |   <td>E1..EC</td>
 858 |   <td>80..BF</td>
 859 |   <td>80..BF</td>
 860 |   <td></td>
 861 |  </tr>
 862 |  <tr>
 863 |   <td>U+D000..U+D7FF</td>
 864 |   <td>ED</td>
 865 |   <td>80..9F</td>
 866 |   <td>80..BF</td>
 867 |   <td></td>
 868 |  </tr>
 869 |  <tr>
 870 |   <td>U+E000..U+FFFF</td>
 871 |   <td>EE..EF</td>
 872 |   <td>80..BF</td>
 873 |   <td>80..BF</td>
 874 |   <td></td>
 875 |  </tr>
 876 |  <tr>
 877 |   <td>U+10000..U+3FFFF</td>
 878 |   <td>F0</td>
 879 |   <td>90..BF</td>
 880 |   <td>80..BF</td>
 881 |   <td>80..BF</td>
 882 |  </tr>
 883 |  <tr>
 884 |   <td>U+40000..U+FFFFF</td>
 885 |   <td>F1..F3</td>
 886 |   <td>80..BF</td>
 887 |   <td>80..BF</td>
 888 |   <td>80..BF</td>
 889 |  </tr>
 890 |  <tr>
 891 |   <td>U+100000..U+10FFFF</td>
 892 |   <td>F4</td>
 893 |   <td>80..8F</td>
 894 |   <td>80..BF</td>
 895 |   <td>80..BF<a href="https://unicode.org/versions/Unicode5.0.0/ch03.pdf"><sup>source</sup></a></td>
 896 |  </tr>
 897 | </tbody></table>
 898 | 
 899 | The table above shows the legal and valid UTF-8 byte sequences, as defined by the Unicode Standard 5.0. The lower ASCII range 00..7F has always been preserved in UTF-8. Multi-byte sequences start at code point U+0080 and are two to four bytes long. For example, code point U+0700 would be encoded in UTF-8 as a two-byte sequence, with the lead byte somewhere in the range of C2..DF.
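
A table like this translates directly into a validator. The sketch below is a minimal Python illustration, not a production decoder; `ROWS` and `is_well_formed` are hypothetical names. The ranges follow the well-formed UTF-8 byte sequence table in the Unicode Standard, in which the lead byte F4 allows only 80..8F as its second byte so that nothing above U+10FFFF can be encoded.

```python
# Each row: (lead byte range, [allowed range for each trailing byte]).
ROWS = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]

def is_well_formed(data: bytes) -> bool:
    """Check a byte string against the well-formed UTF-8 ranges."""
    i = 0
    while i < len(data):
        for (lo, hi), trails in ROWS:
            if lo <= data[i] <= hi:
                break
        else:
            return False      # C0, C1, F5..FF, or a stray trailing byte
        tail = data[i + 1 : i + 1 + len(trails)]
        if len(tail) != len(trails):
            return False      # sequence truncated at end of input
        for byte, (tlo, thi) in zip(tail, trails):
            if not tlo <= byte <= thi:
                return False  # trailing byte outside its allowed range
        i += 1 + len(trails)
    return True

print(is_well_formed(bytes.fromhex("41C38041")))    # the well-formed example
print(is_well_formed(bytes.fromhex("41C2C38041")))  # the ill-formed example
```

Because the lead-byte ranges start at C2 and the second-byte ranges for E0 and F0 are narrowed, the same table also rejects non-shortest forms without any extra logic.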
 900 | 
 901 | 
 902 | ### <a id="handling"></a>Handling Ill-formed Byte Sequences
 903 | Over-consumption of well-formed byte sequences has been the vector for critical vulnerabilities. These generally expose widespread issues when they affect a widely used library. One example can be found in the [International Components for Unicode (ICU)](http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0153) in 2009, which would leave almost any Web-application exposed to cross-site scripting (XSS) threats since software such as Apple's Safari Web browser exposed the flaw. Even Web-applications with strong HTML/XSS filters can be vulnerable when the Web browser is non-conformant.
 904 | 
 905 | The following input illustrates the over-consumption attack vector, where an attacker controls the <span class="uchar">img</span> element's src attribute, followed by a text fragment in the HTML. The [0xC2] represents the attacker's UTF-8 lead byte with an invalid trailing byte, the double quote " which gets consumed in the resultant string. The HTML text portion including the <span class="uchar">onerror</span> text is also attacker-controlled input. The entire payload becomes:
 906 | 
 907 | <span class="indent">&lt;img src="#[0xC2]"&gt; " onerror="alert(1)"&lt;/ br&gt;</span>
 908 | 
 909 | The resultant string after over-consumption:
 910 | 
 911 | <span class="indent">&lt;img src="#&gt; " onerror="alert(1)"&lt;/ br&gt;</span>
 912 | 
 913 | Although the above is a broken fragment of HTML because the <span class="uchar">img</span> element is not properly closed, most browsers will render it as an img element with an <span class="uchar">onerror</span> event handler.
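
The effect can be simulated with a deliberately broken decoder. In the Python sketch below, `overconsuming_decode` is a hypothetical function written only to reproduce the flaw; it handles just ASCII and two-byte lead bytes. It is compared with Python's built-in decoder using `errors="replace"`, which substitutes U+FFFD for the lone lead byte and leaves the following quote intact.

```python
payload = b'<img src="#\xC2"> " onerror="alert(1)"</ br>'

def overconsuming_decode(data: bytes) -> str:
    """Deliberately broken: after a two-byte lead byte it swallows the next
    byte without requiring it to be a valid trailing byte (80..BF)."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if 0xC2 <= b <= 0xDF and i + 1 < len(data):
            nxt = data[i + 1]
            if 0x80 <= nxt <= 0xBF:
                out.append(chr(((b & 0x1F) << 6) | (nxt & 0x3F)))
            # The flaw being illustrated: both bytes are skipped even when
            # `nxt` is not a valid trailing byte.
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)

unsafe = overconsuming_decode(payload)
safe = payload.decode("utf-8", errors="replace")

print(unsafe)
print(safe)
```

In the unsafe result the closing quote of the `src` attribute has disappeared, so the attribute value runs on into the attacker-controlled text; in the safe result the quote survives and the attribute is closed where the application intended.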
 914 | 
 915 | Some of the common security vulnerabilities that exploit an over-consumption flaw as an attack vector include:
 916 | 
 917 | * Bypassing folder and file access filters.
 918 | * Bypassing parser-based filters such as HTML and XSS filters.
 919 | * Bypassing detection signatures in WAF and NIDS type devices.
 920 | 
 921 | As a developer trying to protect against this, it again becomes important to understand how the APIs being used will handle ill-formed byte sequences. The following table of common library APIs lists known behaviors:
 922 | 
 923 | <table>
 924 |  <thead><tr>
 925 |   <td>Library</td>
 926 |   <td>API</td>
 927 |   <td>Allows ill-formed UTF8</td>
 928 |   <td>Can override </td>
 929 |   <td>Notes</td>
 930 |  </tr>
 931 |  </thead>
 932 |  <tbody>
 933 |  <tr>
 934 |   <td>.NET 2.0</td>
 935 |   <td>System.Text.Encoding</td>
 936 |   <td>No</td>
 937 |   <td>No</td>
 938 |   <td></td>
 939 |  </tr>
 940 |  <tr>
 941 |   <td>.NET 3.0</td>
 942 |   <td>UTF8Encoding</td>
 943 |   <td>No</td>
 944 |   <td>No</td>
 945 |   <td></td>
 946 |  </tr>
 947 |  <tr>
 948 |   <td>ICU</td>
 949 |   <td>TBD</td>
 950 |   <td>Yes</td>
 951 |   <td>Yes</td>
 952 |   <td></td>
 953 |  </tr>
 954 | </tbody></table>
 955 | 
 956 | As a tester/bug hunter looking for the vulnerabilities, the following table lists test cases to run from a black-box, external perspective. The data in this table presents byte sequences that could elicit __over-consumption__. You can prepend a % to each byte value __to create a URL-encoded value__ for use in testing. This would be applicable for passing ill-formed byte sequences in a Web-application.
 957 | 
 958 | <table>
 959 |  <thead><tr>
 960 |   <td>Source bytes</td>
 961 |   <td>Expected safe result</td>
 962 |   <td>Desired unsafe result</td>
 963 |   <td>Notes</td>
 964 |  </tr>
 965 |  </thead>
 966 |  <tbody>
 967 |  <tr>
 968 |   <td>C2 22 3C</td>
 969 |   <td>22 3C</td>
 970 |   <td>3C</td>
 971 |   <td>Error handling of C2 overconsumed the trailing 22.</td>
 972 |  </tr>
 979 | </tbody></table>
 980 | 
 981 | Over-consumption typically happens at a layer lower than most developers work at. It's more likely to be in the frameworks, the browsers, the database, etc. If designing a character set or Unicode layer, be sure to include an error condition for cases where valid lead bytes are followed by invalid trailing bytes.
 982 | 
 983 | ## <a id="unexpected"></a>Handling the Unexpected
 984 | Through error handling, filtering, or other cases of input validation, problematic characters or raw bytes might be replaced or deleted. In these cases, it's important that the resultant string or byte sequence does not introduce a vulnerability. This problem is not specific to Unicode by any means, and can occur with any character set. However, as will be discussed, Unicode has a good solution.
 985 | 
 986 | ### <a id="unexpected-input"></a>Unexpected Inputs
 987 | TODO
 988 | #### Unassigned Code Points
 989 | U+2073
 990 | #### Illegal Code Points
 991 | e.g. half of a surrogate pair
 992 | 
 993 | ### <a id="unexpected-substitution"></a>Character Substitution
 994 | The following input illustrates a dangerous character substitution. In this case, the application uses input validation to detect when a string contains characters such as <span class="uchar">&lt;</span> and then sanitizes such characters by replacing them with a period, or full stop, <span class="uchar">.</span>. Internally, the application fetches files from a file share in the form:
 995 | 
 996 | <span class="indent">file://sharename/protected/user-01/files</span>
 997 | 
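 998 | By exploiting the character substitution logic, for example by submitting <span class="uchar">&lt;&lt;</span> so that it is rewritten as <span class="uchar">..</span>, an attacker could perform directory traversal attacks on the application: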
 999 | 
1000 | <span class="indent">file://sharename/protected/user-01/../user-002/files</span>
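
A minimal sketch of this substitution problem follows. The base path, the function names, and the attacker input `<</user-002/files` are assumptions made for illustration; the source does not show the application's actual code.

```python
import posixpath

BASE = "/protected/user-01/"  # hypothetical per-user root

def substitute_with_period(value: str) -> str:
    """Unsafe: the replacement character has meaning in path syntax."""
    for ch in "<>":
        value = value.replace(ch, ".")
    return value

def substitute_with_fffd(value: str) -> str:
    """Safer: U+FFFD has no syntactic meaning in a path."""
    for ch in "<>":
        value = value.replace(ch, "\ufffd")
    return value

def resolve(user_part: str) -> str:
    # normpath collapses ".." segments the way a file server typically would
    return posixpath.normpath(BASE + user_part)

attack = "<</user-002/files"
print(resolve(substitute_with_period(attack)))  # escapes into another user's directory
print(resolve(substitute_with_fffd(attack)))    # stays under user-01
```

The sanitizer itself manufactures the `..` segment, so a traversal check that ran before sanitization would never see it.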
1001 | 
1002 | ### <a id="unexpected-detetion"></a>Character Deletion
1003 | An application may choose to delete characters when invalid, illegal, or unexpected data is encountered. This can also be problematic if not handled carefully. In general, it's safer to replace with Unicode's <span class="uchar">REPLACEMENT CHARACTER U+FFFD</span> than it is to delete.
1004 | 
1005 | Consider a Web browser that deletes certain special characters, such as a mid-stream Unicode BOM, when it encounters them during HTML parsing. An attacker injects the following HTML, which includes the Unicode BOM represented by <span class="uchar">U+FEFF</span>. The presence of this character allows the attacker's input to bypass the Web application's cross-site scripting filter, which rejects any occurrence of <span class="uchar">&lt;script&gt;</span>.
1006 | 
1007 | <span class="indent">&lt;scr[U+FEFF]ipt&gt;</span>
1008 | 
1009 | The Unicode BOM has special meaning in the standard, and in most software.  The following image illustrates some of the special properties associated with this character:
1010 | 
1011 | TODO add image
1012 | 
1013 | The Unicode BOM is a recommended input for most software test cases, and can be especially useful when testing text parsers such as HTML and XML.
1014 | 
1015 | #### Guidance
1016 | 
1017 | Handle error conditions securely by replacing with the Unicode <span class="uchar">REPLACEMENT CHARACTER U+FFFD</span>. If that's impractical for some reason, then choose a safe replacement that doesn't have syntactical meaning in the protocol being used. Some common examples include ? and #.
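
The difference between deleting and replacing can be sketched as follows. The filter and the two downstream handlers are hypothetical stand-ins for a server-side filter and a browser-side parser; they are assumptions for illustration only.

```python
BOM = "\ufeff"

def xss_filter_allows(text: str) -> bool:
    """Hypothetical server-side filter: rejects a literal <script> tag."""
    return "<script>" not in text.lower()

def downstream_delete(text: str) -> str:
    """Unsafe handling: silently drops the mid-stream BOM."""
    return text.replace(BOM, "")

def downstream_replace(text: str) -> str:
    """Safer handling: substitutes U+FFFD REPLACEMENT CHARACTER."""
    return text.replace(BOM, "\ufffd")

payload = "<scr\ufeffipt>"

print(xss_filter_allows(payload))          # the filter does not see a script tag
print(repr(downstream_delete(payload)))    # deletion re-forms the tag after the check
print(repr(downstream_replace(payload)))   # replacement keeps the tag broken
```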
1018 | 
1019 | ## <a id="casing"></a>Upper and Lower Casing
1020 | Strings are transformed through upper and lower casing operations, and sometimes in ways that weren't intended.  This behavior can be exploited if performed at the wrong time.  For example, if a casing operation is performed anywhere in the stack after a security check, then a special character like <span class="uchar">U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE</span> could be used to bypass a cross-site scripting filter.
1021 | 
1022 | <span class="indent">toLower("&#x0130;") == "i"</span>
1023 | 
1024 | As a result, a filter that looks for a literal string before the casing operation runs would not match the following input, even though the string becomes dangerous once it has been lowercased:
1025 | 
1026 | <span class="indent">toLower("scr&#x0130;pt") == "script"</span>
1027 | 
1028 | Another aspect of casing operations is that the length of characters and strings can change, depending on the input.  The following should never be assumed:
1029 | 
1030 | <span class="indent">len(x) == len(toLower(x))</span>
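
Both points can be checked with a short sketch. Note one substitution: Python applies full case mappings, under which U+0130 lowercases to "i" followed by U+0307 COMBINING DOT ABOVE rather than a bare "i", so the bypass below uses U+017F LATIN SMALL LETTER LONG S and an upper-casing step instead. The filter function is a hypothetical example.

```python
def filter_allows(text: str) -> bool:
    """Hypothetical filter: rejects input containing 'script', ignoring case."""
    return "script" not in text.lower()

payload = "<\u017fcript>"  # U+017F LATIN SMALL LETTER LONG S

print(filter_allows(payload))  # the filter finds no "script"
print(payload.upper())         # a later upper-casing step produces the tag

# Casing can also change the number of code points in a string.
for ch in ("\u0130", "\u00df"):
    print(f"U+{ord(ch):04X}", len(ch), len(ch.lower()), len(ch.upper()))
```

Running the security check after the final casing operation, on the exact string that will be used, avoids this class of bypass.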
1031 | 
1032 | Common frameworks handle string comparison in different ways.  The following table captures the behavior of classes intended for case-sensitive and case-insensitive string comparison.
1033 | 
1034 | <table>
1035 |  <thead><tr>
1036 |   <td>Library</td>
1037 |   <td>API</td>
1038 |   <td>Is Dangerous </td>
1039 |   <td>Can override </td>
1040 |   <td>Notes</td>
1041 |  </tr>
1042 |  </thead>
1043 |  <tbody>
1044 |  <tr>
1045 |   <td>.NET 1.0</td>
1046 |   <td>StringComparer</td>
1047 |   <td></td>
1048 |   <td></td>
1049 |   <td></td>
1050 |  </tr>
1051 |  <tr>
1052 |   <td>.NET 2.0</td>
1053 |   <td>StringComparer</td>
1054 |   <td></td>
1055 |   <td></td>
1056 |   <td></td>
1057 |  </tr>
1058 |  <tr>
1059 |   <td>.NET 3.0</td>
1060 |   <td>StringComparer</td>
1061 |   <td></td>
1062 |   <td></td>
1063 |   <td></td>
1064 |  </tr>
1065 |  <tr>
1066 |   <td>Win32</td>
1067 |   <td>CompareStringOrdinal</td>
1068 |   <td></td>
1069 |   <td></td>
1070 |   <td></td>
1071 |  </tr>
1072 |  <tr>
1073 |   <td>Win32</td>
1074 |   <td><a href="http://msdn.microsoft.com/en-us/library/ms647489(VS.85).aspx">lstrcmpi</a></td>
1075 |   <td></td>
1076 |   <td></td>
1077 |   <td></td>
1078 |  </tr>
1079 |  <tr>
1080 |   <td>Win32</td>
1081 |   <td>CompareStringEx</td>
1082 |   <td></td>
1083 |   <td></td>
1084 |   <td></td>
1085 |  </tr>
1086 |  <tr>
1087 |   <td>ICU C</td>
1088 |   <td>ucol_strcoll</td>
1089 |   <td></td>
1090 |   <td></td>
1091 |   <td></td>
1092 |  </tr>
1093 |  <tr>
1094 |   <td>ICU C</td>
1095 |   <td>ucol_strcollIter</td>
1096 |   <td></td>
1097 |   <td></td>
1098 |   <td>Allows for comparing two strings that are supplied as character
1099 |   iterators (UCharIterator). This is useful when you need to compare
1100 |   differently encoded strings using strcoll.</td>
1101 |  </tr>
1102 |  <tr>
1103 |   <td>ICU C++</td>
1104 |   <td>Collator::compare</td>
1105 |   <td></td>
1106 |   <td></td>
1107 |   <td></td>
1108 |  </tr>
1109 |  <tr>
1110 |   <td>ICU C</td>
1111 |   <td>u_strCaseCompare</td>
1112 |   <td></td>
1113 |   <td></td>
1114 |   <td>Compare two strings case-insensitively using full case folding.</td>
1115 |  </tr>
1116 |  <tr>
1117 |   <td>ICU C</td>
1118 |   <td>u_strcasecmp</td>
1119 |   <td></td>
1120 |   <td></td>
1121 |   <td>Compare two strings case-insensitively using full
1122 |   case folding.</td>
1123 |  </tr>
1124 |  <tr>
1125 |   <td>ICU C</td>
1126 |   <td>u_strncasecmp</td>
1127 |   <td></td>
1128 |   <td></td>
1129 |   <td>Compare two strings case-insensitively using full case folding.</td>
1130 |  </tr>
1131 |  <tr>
1132 |   <td>ICU Java</td>
1133 |   <td>caseCompare</td>
1134 |   <td></td>
1135 |   <td></td>
1136 |   <td>Compare two strings case-insensitively using full
1137 |   case folding.</td>
1138 |  </tr>
1146 |  <tr>
1147 |   <td>ICU Java</td>
1148 |   <td>Collator.compare</td>
1149 |   <td></td>
1150 |   <td></td>
1151 |   <td></td>
1152 |  </tr>
1153 |  <tr>
1154 |   <td>POSIX</td>
1155 |   <td>strcoll</td>
1156 |   <td></td>
1157 |   <td></td>
1158 |   <td></td>
1159 |  </tr>
1160 | </tbody></table>
1161 | 
1162 | 
1163 | 
1164 | ## <a id="overflows"></a>Buffer Overflows
1165 | Buffer overflows can occur through improper assumptions about characters versus bytes, and also about string sizes after casing and normalization operations.
1166 | 
1167 | ### <a id="overflow-casing"></a>Upper and Lower Casing
1168 | The following table from UTR 36 illustrates the maximum expansion factors for casing operations on the edge-case characters in Unicode.  These inputs make excellent test cases.
1169 | 
1170 | <table>
1171 |  <thead><tr>
1172 |   <td>Operation </td>
1173 |   <td>UTF </td>
1174 |   <td>Factor </td>
1175 |   <td colspan="2">Sample </td>
1176 |  </tr>
1177 |  </thead>
1178 |  <tbody>
1179 |  <tr>
1180 |   <td rowspan="2">Lower </td>
1181 |   <td>8</td>
1182 |   <td>1.5</td>
1183 |   <td>&#x023A;</td>
1184 |   <td>U+023A</td>
1185 |  </tr>
1186 |  <tr>
1187 |   <td>16, 32</td>
1188 |   <td>1</td>
1189 |   <td>A</td>
1190 |   <td>U+0041</td>
1191 |  </tr>
1192 |  <tr>
1193 |   <td>Upper </td>
1194 |   <td>8, 16, 32 </td>
1195 |   <td>3 </td>
1196 |   <td>&#x0390;</td>
1197 |   <td>U+0390</td>
1198 |  </tr>
1199 | </tbody></table>
1200 | 
1201 | <sup>[source:  Unicode Technical Report #36](https://www.unicode.org/reports/tr36/)</sup>
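
These factors can be reproduced with a short sketch that measures encoded sizes before and after casing. The helper name `expansion_factor` is an assumption for illustration, and the results depend on the Unicode data shipped with the Python runtime.

```python
def expansion_factor(before: str, after: str, encoding: str) -> float:
    """Ratio of encoded sizes after and before a casing operation."""
    return len(after.encode(encoding)) / len(before.encode(encoding))

# The "-le" codec variants are used so that no BOM is added to the output.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

SAMPLES = (
    ("lower", "\u023a", "\u023a".lower()),
    ("upper", "\u0390", "\u0390".upper()),
)

for operation, original, cased in SAMPLES:
    for encoding in ENCODINGS:
        factor = expansion_factor(original, cased, encoding)
        print(f"{operation} U+{ord(original):04X} {encoding}: {factor}")
```

A buffer sized from the input length alone is too small whenever the factor exceeds 1, so allocate from the length of the cased result or from the worst-case factor.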
1202 | 
1203 | ### <a id="overflow-normalization"></a>Normalization
1204 | 
1205 | The following table from UTR 36 illustrates the maximum expansion factors for normalization operations on the edge case characters in Unicode.  These inputs make excellent test cases.
1206 | 
1207 | <table>
1208 |  <thead><tr>
1209 |   <td>Operation </td>
1210 |   <td>UTF </td>
1211 |   <td>Factor </td>
1212 |   <td colspan="2">Sample </td>
1213 |  </tr>
1214 |  </thead>
1215 |  <tbody>
1216 |  <tr>
1217 |   <td rowspan="2">NFC</td>
1218 |   <td>8</td>
1219 |   <td>3X</td>
1220 |   <td>&#x1D160;</td>
1221 |   <td>U+1D160</td>
1222 |  </tr>
1223 |  <tr>
1224 |   <td>16, 32</td>
1225 |   <td>3X</td>
1226 |   <td>&#xFB2C;</td>
1227 |   <td>U+FB2C</td>
1228 |  </tr>
1229 |  <tr>
1230 |   <td rowspan="2">NFD</td>
1231 |   <td>8</td>
1232 |   <td>3X</td>
1233 |   <td>&#x0390;</td>
1234 |   <td>U+0390</td>
1235 |  </tr>
1236 |  <tr>
1237 |   <td>16, 32</td>
1238 |   <td>4X</td>
1239 |   <td>&#x1F82;</td>
1240 |   <td>U+1F82</td>
1241 |  </tr>
1242 |  <tr>
1243 |   <td rowspan="2">NFKC/NFKD</td>
1244 |   <td>8</td>
1245 |   <td>11X</td>
1246 |   <td rowspan="2">&#xFDFA;</td>
1247 |   <td rowspan="2">U+FDFA</td>
1248 |  </tr>
1249 |  <tr>
1250 |   <td>16, 32</td>
1251 |   <td>18X</td>
1252 |  </tr>
1253 | </tbody></table>
1254 | <sup>[source:  Unicode Technical Report #36](https://www.unicode.org/reports/tr36/)</sup>
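
The normalization factors can be checked the same way using Python's standard `unicodedata` module. The helper name `normalization_factor` is an assumption for illustration.

```python
import unicodedata

def normalization_factor(text: str, form: str, encoding: str) -> float:
    """Ratio of encoded sizes after and before normalizing to `form`."""
    normalized = unicodedata.normalize(form, text)
    return len(normalized.encode(encoding)) / len(text.encode(encoding))

# (normalization form, sample character, encoding); "-le" avoids a BOM
CASES = (
    ("NFC", "\U0001d160", "utf-8"),
    ("NFC", "\ufb2c", "utf-16-le"),
    ("NFD", "\u0390", "utf-8"),
    ("NFD", "\u1f82", "utf-16-le"),
    ("NFKC", "\ufdfa", "utf-8"),
    ("NFKC", "\ufdfa", "utf-16-le"),
)

for form, sample, encoding in CASES:
    factor = normalization_factor(sample, form, encoding)
    print(f"{form} U+{ord(sample):04X} {encoding}: {factor}")
```

U+FDFA is the extreme case: a single code point expands to 18 code points under compatibility normalization.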
1255 | 
1256 | 
1257 | 
1258 | ## <a id="syntax"></a>Controlling Syntax
1259 | 
1260 | White space and line feeds affect syntax in parsers such as HTML, XML, and JavaScript. Software that interprets characters such as the 'Ogham space mark' and 'Mongolian vowel separator' as white space can open a path for attacks: an attacker may gain control over the parser and bypass security filters. Several characters in Unicode are assigned the 'white space' category and also the 'white space' binary property. Depending on how software is designed, these characters may literally be treated as a space character U+0020.
1261 | 
1262 | For example, the following illustration shows the special white space properties associated with the <span class="uchar">U+180E MONGOLIAN VOWEL SEPARATOR</span> character.
1263 | 
1264 | TODO: add image
1265 | 
1266 | If a Web browser interprets this character as white space U+0020, then the following HTML fragment would execute script:
1267 | 
1268 | <span class="indent">&lt;a href=#[U+180E]onclick=alert()&gt;</span>
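
The same effect can be sketched with two hypothetical attribute splitters, one that separates on any Unicode white space and one that separates on U+0020 only. The sketch uses U+1680 OGHAM SPACE MARK because whether U+180E counts as white space depends on the Unicode version a runtime ships with.

```python
import unicodedata

def split_on_unicode_space(text: str) -> list:
    """str.split() with no argument splits on every character where isspace() is true."""
    return text.split()

def split_on_ascii_space(text: str) -> list:
    """Splits on U+0020 only."""
    return text.split(" ")

fragment = "href=#\u1680onclick=alert()"

print(split_on_unicode_space(fragment))  # two attributes: the handler is split off
print(split_on_ascii_space(fragment))    # one attribute value

# Inspect how the local runtime classifies a few candidate characters.
for ch in ("\u0020", "\u00a0", "\u1680", "\u180e", "\u3000"):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())
```

A filter and a parser that disagree on this classification will tokenize the same input differently, which is what the attacker relies on.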
1269 | 
1270 | 
1271 | ## <a id="charset"></a>Charset Mismatch
1272 | 
1273 | When software cannot accurately determine the character set of the text it is dealing with, it must either raise an error or make an assumption.  User-agents most commonly must deal with this problem, as they’re faced with interpreting data from a large assortment of character sets.  There are no standards that define how to handle situations of character set mismatch, and vendor implementations vary greatly.
1274 | 
1275 | Consider the following diagram, in which a Web browser receives an HTTP response with an HTTP charset of ISO-8859-1 defined, and a meta tag charset of shift_jis defined in the HTML.
1276 | 
1277 | TODO add image
1278 | 
1279 | When an attacker can control charset declarations, they can control the software’s behavior and in some cases set up an attack.
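
As a small illustration of why the chosen charset matters, the sketch below decodes the same two bytes under both charsets from the scenario above. The byte pair is an assumed example chosen because its second byte is an ASCII backslash.

```python
raw = bytes([0x83, 0x5C])

as_latin1 = raw.decode("iso-8859-1")
as_sjis = raw.decode("shift_jis")

# Under ISO-8859-1 every byte is one character, so the backslash is visible.
print(len(as_latin1), "\\" in as_latin1)

# Under Shift_JIS the pair forms one katakana character and the backslash disappears.
print(len(as_sjis), "\\" in as_sjis)
```

A filter that inspects the bytes as ISO-8859-1 and a consumer that renders them as Shift_JIS therefore see different syntax-relevant characters in the same data.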
1280 | 


--------------------------------------------------------------------------------