├── .gitignore ├── _includes ├── analytics.html ├── footer.html ├── top.html └── header.html ├── img ├── icann.png ├── uchar-A.png ├── spoof-google.png ├── spoof-slash.png ├── uchar-017F.png ├── uchar-180E.png ├── uchar-feff.png ├── dot-dot-slash.png ├── not-two-bytes.png ├── spoof-mozilla.png ├── spoof-slash-FF0F.png ├── content-type-charset.png ├── normalization-turkish-i.png ├── spoof-win-explorer-file.png ├── spoof-win-explorer-folder.png └── normalization-nfkc-nfkd-003C.png ├── _config.yml ├── README.md ├── _layouts └── default.html ├── js └── scale.fix.js ├── params.json ├── index.md ├── css ├── pygment_trac.css └── styles.css ├── page2.md ├── page1.md └── page3.md /.gitignore: -------------------------------------------------------------------------------- 1 | _site/ 2 | -------------------------------------------------------------------------------- /_includes/analytics.html: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /img/icann.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/icann.png -------------------------------------------------------------------------------- /img/uchar-A.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-A.png -------------------------------------------------------------------------------- /img/spoof-google.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-google.png -------------------------------------------------------------------------------- /img/spoof-slash.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-slash.png -------------------------------------------------------------------------------- /img/uchar-017F.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-017F.png -------------------------------------------------------------------------------- /img/uchar-180E.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-180E.png -------------------------------------------------------------------------------- /img/uchar-feff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/uchar-feff.png -------------------------------------------------------------------------------- /img/dot-dot-slash.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/dot-dot-slash.png -------------------------------------------------------------------------------- /img/not-two-bytes.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/not-two-bytes.png -------------------------------------------------------------------------------- /img/spoof-mozilla.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-mozilla.png -------------------------------------------------------------------------------- /img/spoof-slash-FF0F.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-slash-FF0F.png 
-------------------------------------------------------------------------------- /img/content-type-charset.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/content-type-charset.png -------------------------------------------------------------------------------- /img/normalization-turkish-i.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/normalization-turkish-i.png -------------------------------------------------------------------------------- /img/spoof-win-explorer-file.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-win-explorer-file.png -------------------------------------------------------------------------------- /img/spoof-win-explorer-folder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/spoof-win-explorer-folder.png -------------------------------------------------------------------------------- /img/normalization-nfkc-nfkd-003C.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cweb/unicode-security-guide/HEAD/img/normalization-nfkc-nfkd-003C.png -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | url: http://cweb.github.io/unicode-security-guide 2 | baseurl: /unicode-security-guide 3 | author: "Chris Weber" 4 | markdown: kramdown 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 
unicode-security-guide 2 | ====================== 3 | 4 | A repository for the [*Unicode Security Guide*](http://cweb.github.io/unicode-security-guide/). 5 | 6 | Push all commits to the _gh-pages_ branch. 7 | -------------------------------------------------------------------------------- /_includes/footer.html: -------------------------------------------------------------------------------- 1 | 8 | -------------------------------------------------------------------------------- /_layouts/default.html: -------------------------------------------------------------------------------- 1 | {% include top.html %} 2 | 3 | 4 |
5 | {% include header.html %} 6 | 7 |
8 | {{ content }} 9 |
10 | 11 | {% include footer.html %} 12 | {% include analytics.html %} 13 |
14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /_includes/top.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | {{ page.title }} 6 | 7 | 8 | 9 | 10 | 11 | -------------------------------------------------------------------------------- /js/scale.fix.js: -------------------------------------------------------------------------------- 1 | var metas = document.getElementsByTagName('meta'); 2 | var i; 3 | if (navigator.userAgent.match(/iPhone/i)) { 4 | for (i=0; i 2 |

Unicode Security Guide

3 |

1) Home Page

4 |

2) Background

5 |

3) Visual Spoofing

6 |

4) Character 7 | Transformation

8 |

9 |

10 | View the 11 | Project on GitHub cweb/unicode-security-guide

12 | 17 | 18 | -------------------------------------------------------------------------------- /params.json: -------------------------------------------------------------------------------- 1 | {"name":"Unicode-security-guide","tagline":"Unicode Security Guide","body":"### Welcome to GitHub Pages.\r\nThis automatic page generator is the easiest way to create beautiful pages for all of your projects. Author your page content here using GitHub Flavored Markdown, select a template crafted by a designer, and publish. After your page is generated, you can check out the new branch:\r\n\r\n```\r\n$ cd your_repo_root/repo_name\r\n$ git fetch origin\r\n$ git checkout gh-pages\r\n```\r\n\r\nIf you're using the GitHub for Mac, simply sync your repository and you'll see the new branch.\r\n\r\n### Designer Templates\r\nWe've crafted some handsome templates for you to use. Go ahead and continue to layouts to browse through them. You can easily go back to edit your page before publishing. After publishing your page, you can revisit the page generator and switch to another theme. Your Page content will be preserved if it remained markdown format.\r\n\r\n### Rather Drive Stick?\r\nIf you prefer to not use the automatic generator, push a branch named `gh-pages` to your repository to create a page manually. In addition to supporting regular HTML content, GitHub Pages support Jekyll, a simple, blog aware static site generator written by our own Tom Preston-Werner. Jekyll makes it easy to create site-wide headers and footers without having to copy them across every page. It also offers intelligent blog support and other advanced templating features.\r\n\r\n### Authors and Contributors\r\nYou can @mention a GitHub username to generate a link to their profile. The resulting `` element will link to the contributor's GitHub Profile. 
For example: In 2007, Chris Wanstrath (@defunkt), PJ Hyett (@pjhyett), and Tom Preston-Werner (@mojombo) founded GitHub.\r\n\r\n### Support or Contact\r\nHaving trouble with Pages? Check out the documentation at http://help.github.com/pages or contact support@github.com and we’ll help you sort it out.\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."} -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: default 3 | title: Unicode Security Guide 4 | --- 5 | 6 | # Unicode Security Guide 7 | 8 | Welcome to the _Unicode Security Guide_! This guide has been designed to give Web application developers, software engineers, and application security researchers a reference for understanding Unicode-related security issues in operating systems, applications, and the Web. 9 | 10 | The dynamics of Unicode, and character encodings in general, are often misunderstood or poorly implemented, and lead to an array of interesting if not catastrophic security vulnerabilities. 11 | 12 | The content here has been sourced through testing, research, and the following two technical reports from the Unicode Consortium: 13 | 14 | * [Technical Report #36 : Unicode Security Considerations](https://www.unicode.org/reports/tr36/) 15 | * [Technical Report #39 : Unicode Security Mechanisms](https://www.unicode.org/reports/tr39/) 16 | 17 | Beyond these two sources, further research has been ongoing around identifying and inventorying software behaviors. Test cases are being provided in the source code repository. 
18 | 19 | ## Contributions and Acknowledgements 20 | Thank you to the following security-minded practitioners for their valuable feedback on this document: 21 | 22 | * [Bil Corry](https://twitter.com/bilcorry) 23 | * Abraham Kang 24 | 25 | And the following for their research into and documentation of these issues: 26 | 27 | * Unicode Consortium 28 | * Mark Davis 29 | * [Andy Heninger](https://plus.google.com/117524124943387916918) 30 | * [Richard Ishida](http://rishida.net/) 31 | * [Michael Kaplan](https://twitter.com/michkap) 32 | * [Shawn Steele](http://blogs.msdn.com/b/shawnste/) 33 | * Yosuke HASEGAWA 34 | * Eduardo Vela 35 | * David Lindsay 36 | * [Gareth Heyes](http://www.thespanner.co.uk/) 37 | 38 | ## Disclaimers 39 | This guide has been written by application security professionals, and has not been endorsed or reviewed by the Unicode Consortium. It does draw on material from the Consortium, with references, where applicable. 40 | 41 | 42 | -------------------------------------------------------------------------------- /css/pygment_trac.css: -------------------------------------------------------------------------------- 1 | .highlight { background: #ffffff; } 2 | .highlight .c { color: #999988; font-style: italic } /* Comment */ 3 | .highlight .err { color: #a61717; background-color: #e3d2d2 } /* Error */ 4 | .highlight .k { font-weight: bold } /* Keyword */ 5 | .highlight .o { font-weight: bold } /* Operator */ 6 | .highlight .cm { color: #999988; font-style: italic } /* Comment.Multiline */ 7 | .highlight .cp { color: #999999; font-weight: bold } /* Comment.Preproc */ 8 | .highlight .c1 { color: #999988; font-style: italic } /* Comment.Single */ 9 | .highlight .cs { color: #999999; font-weight: bold; font-style: italic } /* Comment.Special */ 10 | .highlight .gd { color: #000000; background-color: #ffdddd } /* Generic.Deleted */ 11 | .highlight .gd .x { color: #000000; background-color: #ffaaaa } /* Generic.Deleted.Specific */ 12 | .highlight .ge { font-style: 
italic } /* Generic.Emph */ 13 | .highlight .gr { color: #aa0000 } /* Generic.Error */ 14 | .highlight .gh { color: #999999 } /* Generic.Heading */ 15 | .highlight .gi { color: #000000; background-color: #ddffdd } /* Generic.Inserted */ 16 | .highlight .gi .x { color: #000000; background-color: #aaffaa } /* Generic.Inserted.Specific */ 17 | .highlight .go { color: #888888 } /* Generic.Output */ 18 | .highlight .gp { color: #555555 } /* Generic.Prompt */ 19 | .highlight .gs { font-weight: bold } /* Generic.Strong */ 20 | .highlight .gu { color: #800080; font-weight: bold; } /* Generic.Subheading */ 21 | .highlight .gt { color: #aa0000 } /* Generic.Traceback */ 22 | .highlight .kc { font-weight: bold } /* Keyword.Constant */ 23 | .highlight .kd { font-weight: bold } /* Keyword.Declaration */ 24 | .highlight .kn { font-weight: bold } /* Keyword.Namespace */ 25 | .highlight .kp { font-weight: bold } /* Keyword.Pseudo */ 26 | .highlight .kr { font-weight: bold } /* Keyword.Reserved */ 27 | .highlight .kt { color: #445588; font-weight: bold } /* Keyword.Type */ 28 | .highlight .m { color: #009999 } /* Literal.Number */ 29 | .highlight .s { color: #d14 } /* Literal.String */ 30 | .highlight .na { color: #008080 } /* Name.Attribute */ 31 | .highlight .nb { color: #0086B3 } /* Name.Builtin */ 32 | .highlight .nc { color: #445588; font-weight: bold } /* Name.Class */ 33 | .highlight .no { color: #008080 } /* Name.Constant */ 34 | .highlight .ni { color: #800080 } /* Name.Entity */ 35 | .highlight .ne { color: #990000; font-weight: bold } /* Name.Exception */ 36 | .highlight .nf { color: #990000; font-weight: bold } /* Name.Function */ 37 | .highlight .nn { color: #555555 } /* Name.Namespace */ 38 | .highlight .nt { color: #000080 } /* Name.Tag */ 39 | .highlight .nv { color: #008080 } /* Name.Variable */ 40 | .highlight .ow { font-weight: bold } /* Operator.Word */ 41 | .highlight .w { color: #bbbbbb } /* Text.Whitespace */ 42 | .highlight .mf { color: #009999 } /* 
Literal.Number.Float */ 43 | .highlight .mh { color: #009999 } /* Literal.Number.Hex */ 44 | .highlight .mi { color: #009999 } /* Literal.Number.Integer */ 45 | .highlight .mo { color: #009999 } /* Literal.Number.Oct */ 46 | .highlight .sb { color: #d14 } /* Literal.String.Backtick */ 47 | .highlight .sc { color: #d14 } /* Literal.String.Char */ 48 | .highlight .sd { color: #d14 } /* Literal.String.Doc */ 49 | .highlight .s2 { color: #d14 } /* Literal.String.Double */ 50 | .highlight .se { color: #d14 } /* Literal.String.Escape */ 51 | .highlight .sh { color: #d14 } /* Literal.String.Heredoc */ 52 | .highlight .si { color: #d14 } /* Literal.String.Interpol */ 53 | .highlight .sx { color: #d14 } /* Literal.String.Other */ 54 | .highlight .sr { color: #009926 } /* Literal.String.Regex */ 55 | .highlight .s1 { color: #d14 } /* Literal.String.Single */ 56 | .highlight .ss { color: #990073 } /* Literal.String.Symbol */ 57 | .highlight .bp { color: #999999 } /* Name.Builtin.Pseudo */ 58 | .highlight .vc { color: #008080 } /* Name.Variable.Class */ 59 | .highlight .vg { color: #008080 } /* Name.Variable.Global */ 60 | .highlight .vi { color: #008080 } /* Name.Variable.Instance */ 61 | .highlight .il { color: #009999 } /* Literal.Number.Integer.Long */ 62 | 63 | .type-csharp .highlight .k { color: #0000FF } 64 | .type-csharp .highlight .kt { color: #0000FF } 65 | .type-csharp .highlight .nf { color: #000000; font-weight: normal } 66 | .type-csharp .highlight .nc { color: #2B91AF } 67 | .type-csharp .highlight .nn { color: #000000 } 68 | .type-csharp .highlight .s { color: #A31515 } 69 | .type-csharp .highlight .sc { color: #A31515 } 70 | -------------------------------------------------------------------------------- /css/styles.css: -------------------------------------------------------------------------------- 1 | body { 2 | padding:50px; 3 | font:13px/1.5 "Helvetica Neue", Helvetica, Arial, sans-serif; 4 | color:#777; 5 | font-weight:300; 6 | } 7 | 8 | p, h1, h2, h3, 
h4, h5, h6 { 9 | color:#222; 10 | margin:0 0 20px; 11 | } 12 | 13 | .indent { 14 | font-size: 1.2em; 15 | border-width: 0 0 0 5px; 16 | border-style: solid; 17 | border-color: silver; 18 | padding: 1em 0em 1em 1em; 19 | margin-left: 1em; 20 | } 21 | 22 | .superscript { 23 | position: relative; 24 | top: -0.5em; 25 | font-size: 80%; 26 | } 27 | 28 | .red { 29 | color: red; 30 | } 31 | 32 | .green { 33 | color: green; 34 | } 35 | 36 | ol, table, pre, dl { 37 | margin:0 0 20px; 38 | } 39 | 40 | 41 | p.zero { 42 | margin:0; 43 | line-height:1.0; 44 | } 45 | 46 | span.uchar { 47 | color: firebrick; 48 | font-family: Lucida Sans Unicode; 49 | } 50 | 51 | h1, h2, h3 { 52 | line-height:1.1; 53 | } 54 | 55 | h1 { 56 | font-size:28px; 57 | } 58 | 59 | h2 { 60 | color:#393939; 61 | } 62 | 63 | h3, h4, h5, h6 { 64 | color:#494949; 65 | } 66 | 67 | a { 68 | color:#39c; 69 | font-weight:400; 70 | text-decoration:none; 71 | } 72 | 73 | a:hover { 74 | color:#069; 75 | } 76 | 77 | a small { 78 | font-size:11px; 79 | color:#777; 80 | margin-top:-0.6em; 81 | display:block; 82 | } 83 | 84 | a:hover small { 85 | color:#777; 86 | } 87 | 88 | .wrapper { 89 | width:1060px; 90 | margin:0; 91 | } 92 | 93 | blockquote { 94 | border-left:1px solid #e5e5e5; 95 | margin:0; 96 | padding:0 0 0 20px; 97 | font-style:italic; 98 | } 99 | 100 | code, pre { 101 | font-family:Monaco, Bitstream Vera Sans Mono, Lucida Console, Terminal; 102 | color:#333; 103 | font-size:12px; 104 | } 105 | 106 | pre { 107 | padding:8px 15px; 108 | background: #f8f8f8; 109 | border-radius:5px; 110 | border:1px solid #e5e5e5; 111 | overflow-x: auto; 112 | } 113 | 114 | table { 115 | width:100%; 116 | border-collapse:collapse; 117 | } 118 | 119 | thead { 120 | background-color: silver; 121 | color: black; 122 | font-weight: 500; 123 | } 124 | 125 | tr { 126 | margin: 1px; 127 | line-height: 1em; 128 | } 129 | 130 | th, td { 131 | text-align:left; 132 | padding:2px 2px; 133 | border-bottom:1px solid #e5e5e5; 134 | } 135 | 
136 | dt { 137 | color:#444; 138 | font-weight:700; 139 | } 140 | 141 | th { 142 | color:#444; 143 | } 144 | 145 | img { 146 | max-width:100%; 147 | } 148 | 149 | img.center { 150 | display: block; 151 | margin-left: auto; 152 | margin-right: auto; 153 | max-width: 600px; 154 | } 155 | 156 | header { 157 | width:270px; 158 | float:left; 159 | position:fixed; 160 | } 161 | 162 | header ul { 163 | list-style:none; 164 | height:40px; 165 | 166 | padding:0; 167 | 168 | background: #eee; 169 | background: -moz-linear-gradient(top, #f8f8f8 0%, #dddddd 100%); 170 | background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#f8f8f8), color-stop(100%,#dddddd)); 171 | background: -webkit-linear-gradient(top, #f8f8f8 0%,#dddddd 100%); 172 | background: -o-linear-gradient(top, #f8f8f8 0%,#dddddd 100%); 173 | background: -ms-linear-gradient(top, #f8f8f8 0%,#dddddd 100%); 174 | background: linear-gradient(top, #f8f8f8 0%,#dddddd 100%); 175 | 176 | border-radius:5px; 177 | border:1px solid #d2d2d2; 178 | box-shadow:inset #fff 0 1px 0, inset rgba(0,0,0,0.03) 0 -1px 0; 179 | width:270px; 180 | } 181 | 182 | header li { 183 | width:89px; 184 | float:left; 185 | border-right:1px solid #d2d2d2; 186 | height:40px; 187 | } 188 | 189 | header li:first-child a { 190 | border-radius:5px 0 0 5px; 191 | } 192 | 193 | header li:last-child a { 194 | border-radius:0 5px 5px 0; 195 | } 196 | 197 | header ul a { 198 | line-height:1; 199 | font-size:11px; 200 | color:#999; 201 | display:block; 202 | text-align:center; 203 | padding-top:6px; 204 | height:34px; 205 | } 206 | 207 | header ul a:hover { 208 | color:#999; 209 | background: -moz-linear-gradient(top, #fff 0%, #ddd 100%); 210 | background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#fff), color-stop(100%,#ddd)); 211 | background: -webkit-linear-gradient(top, #fff 0%,#ddd 100%); 212 | background: -o-linear-gradient(top, #fff 0%,#ddd 100%); 213 | background: -ms-linear-gradient(top, #fff 0%,#ddd 100%); 214 | 
background: linear-gradient(top, #fff 0%,#ddd 100%); 215 | } 216 | 217 | header ul a:active { 218 | -webkit-box-shadow: inset 0px 2px 2px 0px #ddd; 219 | -moz-box-shadow: inset 0px 2px 2px 0px #ddd; 220 | box-shadow: inset 0px 2px 2px 0px #ddd; 221 | } 222 | 223 | strong { 224 | color:#222; 225 | font-weight:700; 226 | } 227 | 228 | header ul li + li { 229 | width:88px; 230 | border-left:1px solid #fff; 231 | } 232 | 233 | header ul li + li + li { 234 | border-right:none; 235 | width:89px; 236 | } 237 | 238 | header ul a strong { 239 | font-size:14px; 240 | display:block; 241 | color:#222; 242 | } 243 | 244 | section { 245 | width:700px; 246 | float:right; 247 | padding-bottom:50px; 248 | } 249 | 250 | small { 251 | font-size:11px; 252 | } 253 | 254 | hr { 255 | border:0; 256 | background:#e5e5e5; 257 | height:1px; 258 | margin:0 0 20px; 259 | } 260 | 261 | footer { 262 | width:270px; 263 | float:left; 264 | position:fixed; 265 | bottom:50px; 266 | } 267 | 268 | @media print, screen and (max-width: 960px) { 269 | 270 | div.wrapper { 271 | width:auto; 272 | margin:0; 273 | } 274 | 275 | header, section, footer { 276 | float:none; 277 | position:static; 278 | width:auto; 279 | } 280 | 281 | header { 282 | padding-right:320px; 283 | } 284 | 285 | section { 286 | border:1px solid #e5e5e5; 287 | border-width:1px 0; 288 | padding:20px 0; 289 | margin:0 0 20px; 290 | } 291 | 292 | header a small { 293 | display:inline; 294 | } 295 | 296 | header ul { 297 | position:absolute; 298 | right:50px; 299 | top:52px; 300 | } 301 | } 302 | 303 | @media print, screen and (max-width: 720px) { 304 | body { 305 | word-wrap:break-word; 306 | } 307 | 308 | header { 309 | padding:0; 310 | } 311 | 312 | header ul, header p.view { 313 | position:static; 314 | } 315 | 316 | pre, code { 317 | word-wrap:normal; 318 | } 319 | } 320 | 321 | @media print, screen and (max-width: 480px) { 322 | body { 323 | padding:15px; 324 | } 325 | 326 | header ul { 327 | display:none; 328 | } 329 | } 330 | 331 
| @media print { 332 | body { 333 | padding:0.4in; 334 | font-size:12pt; 335 | color:#444; 336 | } 337 | } 338 | 339 | 340 | -------------------------------------------------------------------------------- /page2.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: default 3 | title: Visual Spoofing 4 | permalink: visual-spoofing/ 5 | --- 6 | 7 | # Unicode Security Guide 8 | ## _Visual Spoofing_ 9 | 10 | While Unicode has provided an incredible framework for storing, transmitting, and presenting information in many of our world’s native languages, it has also enabled attack vectors for phishing and for evading word filters. Commonly referred to as ‘visual spoofing’, these attack vectors leverage characters from various languages that are visually identical to letters in another language. 11 | 12 | To a human reader, some of the following letters are indistinguishable from one another while others closely resemble one another: 13 | 14 | AΑ А ᗅ ᗋ ᴀ A 15 | 16 | To a computer system, however, each of these letters has a very different meaning. The underlying bits that represent each letter are different from one to the next. 17 | 18 | ## Table of Contents 19 | 20 | * [Prior Research](#prior) 21 | * [Attack Scenarios](#attack) 22 | * [Defensive Options](#defense) 23 | 24 | ## Prior Research 25 | One of the most well-known attacks to exploit visual spoofing was the PayPal.com IDN spoof of 2005. Set up to demonstrate the power of these attack vectors, [Eric Johanson](http://www.shmoo.com/idn/) and The Shmoo Group successfully used a [www.paypal.com](http://www.paypal.com) lookalike domain name to fool visitors into providing personal information. The advisory references original research from 2002 by [Evgeniy Gabrilovich and Alex Gontmakher](http://www.cs.technion.ac.il/~gabr/papers/homograph.html) at the Technion (Israel Institute of Technology). Their original paper described an attack using Microsoft.com as an example. 
26 | 27 | Viktor Krammer, author of the [Quero Toolbar](http://www.quero.at/) for Internet Explorer, also presented additional research on these attack vectors and detection mechanisms in his [2006 presentation](http://www.quero.at/papers/idn_spoofing.pdf). Additionally, the [Unicode Consortium](https://unicode.org) has been active in raising awareness of these issues in its security papers, and in providing recommended solutions. 28 | 29 | ## Summary of Vectors 30 | The phenomenon of 'visual spoofing' may be malicious and deliberate or benign and accidental. There have been cases where a choice of font displayed a sequence of characters in an unintended way, just as there have been cases where Unicode characters did not display properly. The following list attempts to capture the major vectors: 31 | 32 | __Non-Unicode lookalikes__ 33 | 34 | Simple characters or character combinations can look like something else. For example, the letters "r" and "n" together can look like the letter "m". E.g. "rn". Also, the number "0" can look like the letter "O", the number "1" can look like the letter "l", and so on. 35 | 36 | __Unicode Confusables__ 37 | 38 | The Unicode Confusables are discussed in detail later in this document. In short, these are the diverse array of non-ASCII Unicode characters which are easily confused with characters across languages. 39 | 40 | __The Invisibles__ 41 | 42 | Discussed later in this document, these are characters which have no visual appearance and minimal spacing, if any spacing at all. Hence, they are visually nonexistent. 43 | 44 | __Problematic font-rendering__ 45 | 46 | Fonts are ultimately responsible for the visual display of characters, and can sometimes render glyphs confusingly, or as empty white space. There are numerous examples of this, just one of which is described below: 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |
| Character sequence | Should appear as | Might appear as |
|---|---|---|
| U+00B7 U+0041 U+0338 | ·A̸ | A/ |
64 | 65 | __Manipulating combining-marks__ 66 | 67 | Combining marks can be stacked or re-ordered in a myriad of ways. Consider the following table which illustrates just one way that combining marks can be stacked (using one directly after another). The table also shows an example of how combining marks can be re-ordered in a different sequence, but still have the same visual appearance. 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 |
| Character sequence | Appears as |
|---|---|
| U+006F U+0304 | ō |
| U+006F U+0304 U+0304 | ō̄ |
| U+006F U+0336 U+0335 | o̶̵ |
| U+006F U+0335 U+0336 | o̵̶ |
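
The combining-mark behavior in the table above can be explored with Python's standard `unicodedata` module. The following is a minimal sketch, not a detection tool: the helper names `nfc` and `code_points` are made up for this illustration. It suggests that NFC normalization folds `o` plus a single macron into one precomposed character, while the two overlay orderings remain distinct strings even after normalization, because both overlay marks share the same canonical combining class and are therefore never re-ordered.

```python
import unicodedata

# The sequences from the table above, written as escapes so the
# exact code points are unambiguous.
stacked_once = "\u006F\u0304"         # o + COMBINING MACRON
stacked_twice = "\u006F\u0304\u0304"  # o + two macrons
order_a = "\u006F\u0336\u0335"        # o + long stroke overlay + short stroke overlay
order_b = "\u006F\u0335\u0336"        # same marks, opposite order


def nfc(text):
    """Return the NFC-normalized form of text (illustrative helper)."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """Render a string as space-separated U+XXXX labels (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# A single macron composes with the base letter under NFC.
print(code_points(nfc(stacked_once)))   # U+014D
# The second macron has nothing to compose with and stays separate.
print(code_points(nfc(stacked_twice)))  # U+014D U+0304

# Both overlay marks have the same combining class, so canonical
# ordering leaves them where they are.
print(unicodedata.combining("\u0335") == unicodedata.combining("\u0336"))  # True
print(nfc(order_a) == nfc(order_b))     # False
```

The practical consequence is that a plain string comparison, even after normalization, may treat the two visually identical overlay sequences as different values.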
95 | 96 | __Bidi and syntax spoofing__ 97 | Another interesting vector uses the 'bidirectional' properties of certain characters, also known as 'bidi'. 98 | 99 | ## Attack Scenarios 100 | 101 | A variety of scenarios exist where visual spoofing may be used to attack and exploit people. This section looks at a few. 102 | 103 | ### Spoofing Domain Names 104 | Domain names represent an interesting attack vector because their mere visual appearance inspires trust in a brand. The following image represents what visually appear to be two identical domain names; however, the second contains the U+0261 LATIN SMALL LETTER SCRIPT G. 105 | 106 | 107 | 108 | The tricky part about presenting domain names is that they often get only a glance, if any look at all. The following image represents two domain names which might be visually similar 'enough' to fool someone, yet not identical. 109 | 110 | 111 | 112 | Finally, characters that appear to be syntactic elements, such as U+FF89 HALFWIDTH KATAKANA LETTER NO, which resembles the forward-slash path-separator, can be troublesome. In the following image, this character is used in the subdomain label of a domain name, but appears to be a path-separator. 113 | 114 | 115 | 116 | ### Fraudulent Vanity URLs 117 | 118 | A social networking service wants to allow vanity URLs to be registered using international characters such as www.foo.bar/фу but perceives too great a risk from the variety of ways that the URL could be subject to visual fraud and confusion. Because Unicode characters are well-supported in the path portion of a browser’s URL display, a well-crafted vanity URL could easily fool victims and be the landing page for a phishing attack. In fact, it's often unnecessary to use Unicode: in some cases, the number one "1" can appear as the letter "l", and in certain fonts the sequence "rn" can appear as the letter "m". 
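
One rough way a service might flag suspicious vanity names is to check whether a single name mixes letters from more than one script. The sketch below is only an approximation under a stated assumption: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`) is used as a stand-in. The function names `scripts_in` and `is_mixed_script` are hypothetical. This heuristic does nothing for whole-script spoofs or for ASCII lookalikes such as "rn" and "m".

```python
import unicodedata


def scripts_in(text):
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of a character's Unicode name is treated
    as its script. This is a crude heuristic, not the real Script
    property from the Unicode Character Database.
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def is_mixed_script(text):
    """Return True when the letters in text appear to span several scripts."""
    return len(scripts_in(text)) > 1


# All-Latin name: a single script.
print(scripts_in("paypal"))                  # {'LATIN'}
# Second letter replaced by U+0430 CYRILLIC SMALL LETTER A.
print(is_mixed_script("p\u0430ypal"))        # True
# A name written entirely in Cyrillic is not flagged by this check.
print(is_mixed_script("\u0444\u0443"))       # False
```

A production system would more likely rely on the script data and confusable mappings published with Unicode Technical Report 39 rather than on character names.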
119 | 120 | ### Bypassing Profanity Filters 121 | 122 | An email or forum system needs to prevent violent and profane words from being used. It's well-known that there are trivial ways to bypass such filters, including using spacing and punctuation between letters in a word (e.g. c_r_a_p), or slight misspellings which give the same effect (e.g. crrap), to name just a couple. There’s also the possibility of using confusable characters which have no visual side effect (e.g. crap) written entirely in another script (or a mix of scripts). 123 | 124 | ### Spoofing User Interface Dialogs 125 | 126 | Security decisions are often presented to end users in the form of dialog boxes consisting in part of user-supplied input. For example: 127 | 128 | * When a user downloads a file through a Web browser, they’re asked to confirm their decision, often with the filename as a part of the dialog's content. 129 | * When a user tries to launch an untrusted application they may also be presented with a dialog box asking for confirmation. 130 | * A social networking site may ask its users for confirmation before redirecting them to an off-site URL, often with the URL making up the dialog's content. 131 | 132 | In any of these cases, a clever attacker may use special BIDI or other characters that reverse the direction of text, or otherwise manipulate the text in a way that may confuse or fool the end users. 133 | 134 | Consider the following image which shows the Windows Explorer program. What appears to be a plain text file ending in the ".txt" file extension is actually an executable file ending in the ".exe" extension. Fortunately, Windows Explorer recognizes the true file type, which it has listed as "Application". 135 | 136 | 137 | 138 | In another example, the U+FEFF ZERO WIDTH NO-BREAK SPACE character, also known as the Byte-Order Mark, or BOM, acts as an 'invisible' character. 139 | 140 | 141 | 142 | Invisible characters present their own interesting dynamics and applications. 
As seen in the image above, Windows Explorer presents what appears to be two folders with identical names, whereas a default command prompt does not properly display the BOM, and so presents it as an empty box. 143 | 144 | ### Malvertisements 145 | 146 | Advertising networks often need to protect brand name trademarks from being registered or used by anyone other than their owner. This threat might be mitigated through filters, human editorial inspection, or a combination of the two. An attacker could place a malicious phishing ad that bypasses trademark filters by using confusable characters. For example, “Download Microsoft Windows 8 Service Pack 1 here” where the trademarked name 'Microsoft Windows' was crafted using a non-English script, or even using invisible characters. 147 | 148 | ### Forging Internationalized Email 149 | 150 | Email addresses and the SMTP protocol have long been confined to ASCII; however, standards work through the IETF was concluded in 2013 by the Email Address Internationalization Working Group. The EAI effort delivered documentation for integrating UTF-8 into the core email protocols, as well as advice on EAI deployment in client and server software. In preparing for the transition, email client engineers and designers will need to anticipate and handle the case of visually identical email addresses, among other issues. If left unhandled, end users could easily be fooled. Digital certificates would provide a good mechanism for proving authenticity of a message; however, such certificates also support Unicode and are vulnerable to the exact same attacks. 151 | 152 | ## Defensive Options 153 | All does not seem lost. While 154 | 155 | ## The Confusables 156 | Throughout Unicode, the characters that visually resemble one another are referred to as the confusables. The Unicode Consortium has documented this phenomenon in both Technical Report 36 and TR 39. 

It is TR 39 specifically that provides links to the data files comprising the confusables, such as confusables.txt, which provides a mapping for visual confusables.

The Unicode Consortium has also provided Unicode Utilities: Confusables, which takes an input string and produces visually confusable strings generated using the previously mentioned data files.

### Single-Script Confusables


### Mixed-Script Confusables
### Whole-Script Confusables

### The Invisibles


## Internationalized Domain Names in Applications (IDNA)


### IDNA 2003
### IDNA 2008

--------------------------------------------------------------------------------
/page1.md:
--------------------------------------------------------------------------------
---
layout: default
title: Background
permalink: background/
---

# Unicode Security Guide
## _Background_

The Unicode Standard provides a unique number for every character, enabling disparate computing systems to exchange and interpret text in the same way.

## Table of Contents
* [History](#history)
* [Introduction](#intro)
* [Code Points](#cp)
* [Character Encoding](#encoding)
* [Character Escape Sequences and Entity References](#escape)
* [Security Testing Focus Areas](#testing)

## Brief History of Character Encodings
Early in computing history, it became widely clear that a standardized way to represent characters would provide many benefits. Around 1963, IBM standardized EBCDIC in its mainframes, about the same time that ASCII was standardized as a 7-bit character set. EBCDIC used an 8-bit encoding, and the unused eighth bit in ASCII allowed OEMs to apply the extra bit for their proprietary purposes. The following list roughly captures some of the history leading up to Unicode.

* __1991__ - Unicode
* __1990__ - ISO 10646 (UCS)
* __1985__ - ISO-8859-1 (code pages galore!)
* __1981__ - MBCS (e.g. GB2312)
* __1964__ - EBCDIC (non-ASCII compatible)
* __1963__ - ASCII 7-bit (an 8th bit free-for-all soon followed)

The free eighth bit allowed OEMs to ship computers, and later PCs, with customized character encodings specific to a language or region. Computers could ship to Israel, for example, with a tweaked ASCII encoding that supported Hebrew. The divergence in these customized character sets grew into a problem over time, as data interchange became error-prone, if not impossible, when computers didn't share the same character set.

In response to this growth, the International Organization for Standardization (ISO) began developing the ISO-8859 set of character encoding standards in the early 1980s. The ISO-8859 standards were aimed at providing a reliable system for data interchange across computing systems. They provided support for many popular languages, but weren't designed for high-quality typography, which needed symbols such as ligatures.

In the late 1980s Unicode was being designed, around the same time that ISO recognized the need for a ubiquitous character encoding framework, which would later come to be called the Universal Character Set (UCS), or ISO 10646. Version 1.0 of the Unicode Standard was released in 1991, at almost the same time as UCS was made public. Since that time, Unicode has become the de facto character encoding model, and the Unicode Consortium has worked closely with ISO on UCS to ensure compatibility and similar goals.

## Brief Introduction to Unicode
Most people are familiar with [ASCII](http://en.wikipedia.org/wiki/ASCII), its usefulness, and its limitation to 128 characters. Unicode and UCS expanded the available array of characters by separating the concepts of __code points__ and binary __encodings__.

The Unicode framework can presumably represent all of the world's languages and scripts, past, present, and future. That's because the current version 5.1 of the Unicode Standard has space for over 1 million code points. A code point is a unique value within the Unicode code space. A single code point can represent a letter, a special control character (e.g. carriage return), a symbol, or even some other abstract thing.

### Code Points
A code point is a 21-bit scalar value in the current version of Unicode, and is represented using the following type of reference, where NNNN would be a hex value:

    U+NNNN

The valid range for Unicode code points is currently U+0000 to U+10FFFF. This range can be expanded in the future if the Unicode Standard changes. The following image illustrates some of the properties or metadata that accompany a given code point.

Code point U+0041 represents the Latin Capital Letter A. It's no coincidence that this maps directly to ASCII's value 0x41, as the Unicode Standard has always preserved the lower ASCII range to ensure widespread compatibility. Some interesting things to note here are the properties associated with this code point:

* Several categories are assigned, including a general 'category' and a 'script' family.
* A 'lower case' mapping is defined.
* An 'upper case' mapping is defined.
* A 'normalization' mapping is defined.
* Binary properties are assigned.

This short list represents only some of the metadata attached to a code point; there can be much more information. In looking for security issues, however, this short list provides a good starting point.

## Character Encoding
A discussion of characters and strings can quickly dissolve into a soup of terminology, where many terms get mixed up and used inaccurately.
This document will aim to avoid using all of the terminology, and may use some terms inaccurately according to the Unicode Consortium, with the goal of simplicity.

To put it simply, an encoding is the binary representation of some character. It’s ‘bits on the wire’ or ‘data at rest’ in some encoding scheme. Four Unicode Transformation Formats (UTF) are discussed here. The Unicode Standard itself defines UTF-8, UTF-16, and UTF-32, while UTF-7 comes from an RFC:

1. UTF-7
   Defined by RFC 2152. It has since been largely deprecated and its use is not recommended.
1. UTF-8
   A __variable-width encoding__ where each Unicode code point is assigned to an unsigned byte sequence of __1 to 4__ bytes in length. Older versions of the specification allowed for up to __6 bytes__ in length, but that is no longer the case.
1. UTF-16
   A __variable-width encoding__ where each Unicode code point is assigned to an unsigned sequence of __2 or 4__ bytes. The 4-byte sequences are composed of __surrogate pairs__.
1. UTF-32
   A __fixed-width encoding__ where each Unicode code point is assigned to an unsigned sequence of __4 bytes__. UTF-32 employs a fixed mapping using the same numeric value as the code point, so no algorithms are needed.

Of these four, UTF-7 has been deprecated, UTF-8 is the most commonly used on the Web, and both UTF-16 and UTF-32 can be serialized in little-endian or big-endian format.

A character encoding as defined here means the actual bytes used to represent the data, or code point. So, a given code point U+0041 LATIN CAPITAL LETTER A can be encoded using the following bytes in each UTF form:


| UTF Format | Byte sequence |
|---|---|
| UTF-8 | < 41 > |
| UTF-16 Little Endian | < 41 00 > |
| UTF-16 Big Endian | < 00 41 > |
| UTF-32 Little Endian | < 41 00 00 00 > |
| UTF-32 Big Endian | < 00 00 00 41 > |

The lower ASCII character set is preserved by UTF-8 up through U+007F. The following table gives another example, using U+FEFF ZERO WIDTH NO-BREAK SPACE, also known as the Unicode Byte Order Mark.


| UTF Format | Byte sequence |
|---|---|
| UTF-8 | < EF BB BF > |
| UTF-16 Little Endian | < FF FE > |
| UTF-16 Big Endian | < FE FF > |
| UTF-32 Little Endian | < FF FE 00 00 > |
| UTF-32 Big Endian | < 00 00 FE FF > |

Here UTF-8 uses three bytes to represent the code point. One may wonder how a code point greater than U+FFFF would be represented in UTF-16. The answer lies in surrogate pairs, which use two double-byte sequences together. Consider the code point U+10FFFD PRIVATE USE CHARACTER-10FFFD in the following table.


| UTF Format | Byte sequence |
|---|---|
| UTF-8 | < F4 8F BF BD > |
| UTF-16 Little Endian | < FF DB > < FD DF > |
| UTF-16 Big Endian | < DB FF > < DF FD > |
| UTF-32 Little Endian | < FD FF 10 00 > |
| UTF-32 Big Endian | < 00 10 FF FD > |
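
The byte sequences in the three tables above can be reproduced with a short script. This is a minimal sketch using Python's built-in codecs. The helper names `encodings` and `surrogate_pair` are hypothetical and exist only for this illustration.

```python
# Illustrative sketch: show the bytes Python's built-in codecs produce
# for the code points used in the tables above.
CODECS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def encodings(code_point):
    """Return {codec name: space-separated hex bytes} for one code point."""
    ch = chr(code_point)
    return {name: ch.encode(name).hex(" ").upper() for name in CODECS}


def surrogate_pair(code_point):
    """Compute the UTF-16 surrogate pair for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogates")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


for cp in (0x0041, 0xFEFF, 0x10FFFD):
    print(f"U+{cp:04X}")
    for name, hex_bytes in encodings(cp).items():
        print(f"  {name:10} {hex_bytes}")

high, low = surrogate_pair(0x10FFFD)
print(f"surrogates: {high:04X} {low:04X}")
```

The `-le` and `-be` codec variants are used so that Python does not prepend a byte order mark of its own to the output.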

Surrogate pairs combine two 16-bit code units from the reserved code point range U+D800 to U+DFFF, making it possible to represent all of Unicode’s code points in the 16-bit format. For this reason, UTF-16 is considered a variable-width encoding, just as UTF-8 is. UTF-32, however, is considered a fixed-width encoding.

## Character Escape Sequences and Entity References

An alternative to encoding characters is representing them using a symbolic representation rather than a serialization of bytes. This is common in HTTP with URL-encoded data, and in HTML. In HTML, numerical character references (NCR) can be used in either a decimal or hexadecimal form that maps to a Unicode code point.

In fact, CSS (Cascading Style Sheets) and even JavaScript use escape sequences, as do most programming languages. The details of each protocol’s specification are outside the scope of this document; however, examples will be used here for reference.

The following table lists the common escape sequences for U+0041 LATIN CAPITAL LETTER A.


| Context | Character Reference or Escape Sequence |
|---|---|
| URL | `%41` |
| NCR (decimal) | `&#65;` |
| NCR (Hex) | `&#x41;` |
| CSS | `\41` and `\0041` |
| JavaScript | `\x41` and `\u0041` |
| Other | `\u0041` |

The following table gives another example, using U+FEFF ZERO WIDTH NO-BREAK SPACE, also known as the Unicode Byte Order Mark.


| Context | Character Reference or Escape Sequence |
|---|---|
| URL | `%EF%BB%BF` |
| NCR (decimal) | `&#65279;` |
| NCR (Hex) | `&#xFEFF;` |
| CSS | `\FEFF` |
| JavaScript (as bytes) | `\xEF\xBB\xBF` |
| JavaScript (as reference) | `\uFEFF` |
| JSON | `\uFEFF` |
| Microsoft Internet Information Server (IIS) | `%uFEFF` |
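
Several of the escape forms in the two tables above follow mechanically from the code point or from its UTF-8 bytes. The following is a minimal sketch of that derivation. The function name `escape_forms` is hypothetical, and the JavaScript form is limited here to code points up to U+FFFF as a simplifying assumption.

```python
def escape_forms(code_point):
    """Build common escape sequences for a code point in U+0000..U+FFFF."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles code points up to U+FFFF")
    utf8 = chr(code_point).encode("utf-8")
    return {
        # Percent-encode every UTF-8 byte, even ones a URL library would leave alone.
        "URL": "".join(f"%{byte:02X}" for byte in utf8),
        "NCR (decimal)": f"&#{code_point};",
        "NCR (hex)": f"&#x{code_point:X};",
        "CSS": f"\\{code_point:X}",
        "JavaScript": f"\\u{code_point:04X}",
    }


for cp in (0x0041, 0xFEFF):
    print(f"U+{cp:04X}")
    for context, escaped in escape_forms(cp).items():
        print(f"  {context:14} {escaped}")
```

When testing filters, it is often useful to submit the same character in each of these forms, since a filter may decode some representations and not others.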

## Security Testing Focus Areas

This guide has been designed with two general goals in mind. The first is to aid readers in setting goals for a software security assessment. The second, where possible, is to provide data that assists software engineers in developing more secure software. Information such as how framework APIs behave by default and when overridden is subject to change at any time.

Clearly, any protocol or standard can be subject to security vulnerabilities; examples include HTML, HTTP, TCP, and DNS. Character encodings and the Unicode Standard are also exposed to vulnerability. Sometimes vulnerabilities are related to a design flaw in the standard, but more often they’re related to implementation in practice. Many of the phenomena discussed here are not vulnerabilities in the standard. Instead, the following general categories of vulnerability are most common in applications that are not built to anticipate and prevent the relevant attacks:

* Visual Spoofing
* Best-fit mappings
* Charset transcodings and character mappings
* Normalization
* Canonicalization of overlong UTF-8
* Over-consumption
* Character substitution
* Character deletion
* Casing
* Buffer overflows
* Controlling Syntax
* Charset mismatches

Consider the following image as an example. In the case of U+017F LATIN SMALL LETTER LONG S, the upper casing and normalization operations transform the character into a completely different value. Many characters such as this one have explicit mappings defined through the Unicode Standard, indicating what character (or sequence of characters) they should transform to through casing and normalization. Normalization is a defined process discussed later in this document. In some situations, this behavior could be exploited to create cross-site scripting or other attack scenarios.
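
The U+017F behavior described above can be observed directly in Python. This is a sketch under stated assumptions: the filter `naive_filter_allows` is a hypothetical stand-in for a defensive check that runs before a later casing step, and is not taken from any real product.

```python
import unicodedata

LONG_S = "\u017F"  # LATIN SMALL LETTER LONG S


def naive_filter_allows(text):
    """Hypothetical filter: reject input containing 'script' (case-insensitive)."""
    return "script" not in text.lower()


payload = LONG_S + "cript"

# The filter sees no match, because U+017F is not the ASCII letter 's'.
print(naive_filter_allows(payload))

# A later upper-casing step maps U+017F to the ASCII 'S'.
print(payload.upper())

# Compatibility normalization maps U+017F to the ASCII 's'.
print(unicodedata.normalize("NFKC", payload))

# Canonical normalization leaves the character unchanged.
print(unicodedata.normalize("NFC", payload) == payload)
```

The point of the sketch is ordering: any transformation applied after a security check can reintroduce the characters the check was meant to exclude.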

The rest of this guide explores each of these phenomena in more detail, as each relates to software vulnerability mitigation and testing.

--------------------------------------------------------------------------------
/page3.md:
--------------------------------------------------------------------------------
---
layout: default
title: Character Transformations
permalink: character-transformations/
---

# Unicode Security Guide
## _Character Transformations_

This section explores the various ways in which characters and strings can be transformed by software processes. Such transformations are not necessarily vulnerabilities, but could be exploited by clever attackers.

As an example, consider an attacker trying to inject script (i.e. a cross-site scripting, or XSS, attack) into a Web application that utilizes a defensive input filter. The attacker finds that the application performs a lowercase operation on the input after filtering, and that by injecting special characters they can exploit that behavior. That is, the string "script" is prevented by the filter, but the string "scrİpt" is allowed.

## Table of Contents
* [Round Trip](#round-trip)
* [Best Fit Mappings](#best-fit)
* [Charset Transcoding and Character Mappings](#transcoding)
* [Normalization](#normalization)
* [Canonicalization of Non-Shortest Form UTF-8](#canonicalization)
* [Over-Consumption](#overconsumption)
* [Well-formed and Ill-formed Byte Sequences](#formedness)
* [Handling Ill-formed Byte Sequences](#handling)
* [Handling the Unexpected](#unexpected)
* [Unexpected Inputs](#unexpected-inputs)
* [Character Substitution](#unexpected-substitution)
* [Character Deletion](#unexpected-deletion)
* [Upper and Lower Casing](#casing)
* [Buffer Overflows](#overflows)
* [Upper and Lower Casing](#overflow-casing)
* [Normalization](#overflow-normalization)
* [Controlling Syntax](#syntax)
* [Charset Mismatch](#charset)


## Round-trip Conversions: A Common Pattern
In practice, globalized software must be capable of handling many different character sets, and converting data between them. The process for supporting this requirement generally looks like the following:

1. Accept data from any character set, e.g. Unicode, shift_jis, ISO-8859-1.
2. Transform, or convert, data to Unicode for processing and storage.
3. Transform data to the original or another character set for output and display.

In this pattern, Unicode is used as the broker. With support for such a large character repertoire, Unicode will often have a character mapping for both sides of this transaction. To illustrate this, consider the following Web application transaction.

1. An application end user inputs their full name using characters encoded in the shift_jis character set.
2. Before storing it in the database, the application converts the user input to Unicode's UTF-8 format.
3. When visiting the Web page, the user's full name will be returned in UTF-8 format, unless other conditions cause the data to be returned in a different encoding. Such conditions may be based on the Web application's configuration or the user's Web browser language and encoding settings. Under these types of conditions, the Web application will convert the data to the requested encoding.

The round-trip conversions illustrated here can lead to numerous issues that will be discussed further. While this serves as a good example, it isn't the only case where such issues can arise.

## Best-fit Mappings
The "best-fit" phenomenon occurs when a character X gets transformed to an entirely different character Y. This can occur for reasons such as:

* A framework API transforms input to a different character encoding by default.
* Data is marshalled from a wide string type (multi-byte character representation) such as UTF-16, to a non-wide string (single-byte character representation) such as US-ASCII.
* Character X in the source encoding doesn't exist in the destination encoding, so the software attempts to find a 'best-fit' match.

In general, best-fit mappings occur when characters are transcoded between Unicode and another encoding. It's often the case that the source encoding is Unicode and the destination is another charset such as shift_jis; however, it could happen in reverse as well. Best-fit mappings are different from character set transcoding, which is discussed in another section of this guide.

Software vulnerabilities may arise when best-fit mappings occur. To name a few:

* Best-fit mappings are often not reversible, so data is irrevocably lost. For example, a common best-fit operation would transform a U+FF1C FULLWIDTH LESS-THAN SIGN to the U+003C LESS-THAN SIGN, or the ASCII < used in HTML. Once converted down to the ASCII <, there’s no reliable way to convert back to the FULLWIDTH source.
* Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
* Characters can be manipulated to abuse logic in software, such as when the characters can be used to access files on the file system. In this case, a best-fit mapping to characters such as ../ or file:// could be damaging.

For example, consider a Web application that has implemented a filter to prevent XSS (cross-site scripting) attacks. The filter attempts to block most dangerous characters, and operates at an outermost layer in the application. The flow might look like:

1. An input validation filter rejects characters such as <, >, ', and " in a Web application accepting UTF-8 encoded text.
2. An attacker sends in a U+FF1C FULLWIDTH LESS-THAN SIGN in place of the ASCII <.
3. The attacker’s input looks like: ＜script＞
4. After passing through the XSS filter unchanged, the input moves deeper into the application.
5. Another API, perhaps at the data access layer, is configured to use a different character set such as windows-1252.
6. On receiving the input, the data access layer converts the multi-byte UTF-8 text to the single-byte windows-1252 code page, forcing a best-fit conversion to the dangerous characters the original XSS filter was trying to block.
7. The attacker’s input successfully persists to the database.

[Shawn Steele](http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx) describes the security issues well on his blog. It's a highly recommended short read for the level of coverage he provides regarding Microsoft's APIs:

> Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided. Windows and the .Net Framework have the concept of "best-fit" behavior for code pages and encodings. Best fit can be interesting, but often its not a good idea.
> In WideCharToMultiByte() this behavior is controlled by a WC_NO_BEST_FIT_CHARS flag. In .Net you can use the EncoderFallback to control whether or not to get Best Fit behavior. Unfortunately in both cases best fit is the default behavior. In Microsoft .Net 2.0 best fit is also slower.

As a software engineer, it's important to understand the APIs being used directly, and in some cases indirectly (by other processing on the stack). The following table of common library APIs lists known behaviors:


| Library | API | Best-fit default | Can override | Guidance |
|---|---|---|---|---|
| .NET 2.0 | System.Text.Encoding | Yes | Yes | Specify EncoderReplacementFallback in the Encoding constructor. |
| .NET 3.0 | System.Text.Encoding | Yes | Yes | Specify EncoderReplacementFallback in the Encoding constructor. |
| .NET 3.0 | DllImport | Yes | Yes | To properly and more safely deal with this, you can use the MarshalAsAttribute class to specify a LPWStr type instead of a LPStr: `[MarshalAs(UnmanagedType.LPWStr)]` |
| Win32 | WideCharToMultiByte | Yes | Yes | Set the WC_NO_BEST_FIT_CHARS flag. |
| Java | TBD... | | | |
| ICU | TBD... | | | |
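
Python is not covered in the table above. As a hedged point of comparison, the sketch below shows a conversion that refuses to guess: Python's built-in `cp1252` codec raises an error for a character with no mapping, and no best-fit substitute is produced. The function name `to_cp1252_strict` is hypothetical.

```python
def to_cp1252_strict(text):
    """Encode to windows-1252, failing loudly instead of substituting characters."""
    return text.encode("cp1252", errors="strict")


fullwidth_payload = "\uFF1Cscript\uFF1E"  # fullwidth less-than and greater-than signs

try:
    to_cp1252_strict(fullwidth_payload)
    rejected = False
except UnicodeEncodeError:
    # No best-fit mapping is attempted; the unmappable character is an error.
    rejected = True

print("rejected:", rejected)

# With errors="replace" the unmappable characters become '?', not '<' or '>'.
print(fullwidth_payload.encode("cp1252", errors="replace"))
```

Failing closed in this way keeps an outer input filter and an inner conversion layer in agreement about which characters are present.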

Another important note Shawn Steele makes on his blog is that Microsoft does not intend to maintain the best-fit mappings. For these and other security reasons, it's a good idea to avoid best-fit behavior.

The following table lists test cases to run from a black-box, external perspective. By interpreting the output or rendered data, a tester can determine whether a best-fit conversion may be happening. Note that the mapping tables for best-fit conversions are numerous and large, leading to a nearly insurmountable number of permutations. To top it off, best-fit behavior varies between vendors, making for an inconsistent playing field that does not lend itself well to automation. For this reason, the focus here is on data that is known to either normalize or best-fit. The table below is not comprehensive by any means, and is provided with the understanding that something is better than nothing.


| Target char | Target code point | Test code point | Name |
|---|---|---|---|
| o | \u006F | \u2134 | SCRIPT SMALL O |
| o | \u006F | \u014D | LATIN SMALL LETTER O WITH MACRON |
| s | \u0073 | \u017F | LATIN SMALL LETTER LONG S |
| I | \u0049 | \u0131 | LATIN SMALL LETTER DOTLESS I |
| i | \u0069 | \u0129 | LATIN SMALL LETTER I WITH TILDE |
| K | \u004B | \u212A | KELVIN SIGN |
| k | \u006B | \u0137 | LATIN SMALL LETTER K WITH CEDILLA |
| A | \u0041 | \uFF21 | FULLWIDTH LATIN CAPITAL LETTER A |
| a | \u0061 | \u03B1 | GREEK SMALL LETTER ALPHA |
| " | \u0022 | \u02BA | MODIFIER LETTER DOUBLE PRIME |
| " | \u0022 | \u030E | COMBINING DOUBLE VERTICAL LINE ABOVE |
| " | \u0022 | \uFF02 | FULLWIDTH QUOTATION MARK |
| ' | \u0027 | \u02B9 | MODIFIER LETTER PRIME |
| ' | \u0027 | \u030D | COMBINING VERTICAL LINE ABOVE |
| ' | \u0027 | \uFF07 | FULLWIDTH APOSTROPHE |
| < | \u003C | \uFF1C | FULLWIDTH LESS-THAN SIGN |
| < | \u003C | \uFE64 | SMALL LESS-THAN SIGN |
| < | \u003C | \u2329 | LEFT-POINTING ANGLE BRACKET |
| < | \u003C | \u3008 | LEFT ANGLE BRACKET |
| < | \u003C | \u00AB | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK |
| > | \u003E | \u00BB | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
| > | \u003E | \u3009 | RIGHT ANGLE BRACKET |
| > | \u003E | \u232A | RIGHT-POINTING ANGLE BRACKET |
| > | \u003E | \uFE65 | SMALL GREATER-THAN SIGN |
| > | \u003E | \uFF1E | FULLWIDTH GREATER-THAN SIGN |
| : | \u003A | \u2236 | RATIO |
| : | \u003A | \u0589 | ARMENIAN FULL STOP |
| : | \u003A | \uFE13 | PRESENTATION FORM FOR VERTICAL COLON |
| : | \u003A | \uFE55 | SMALL COLON |
| : | \u003A | \uFF1A | FULLWIDTH COLON |

These test cases are largely derived from the public best-fit mappings provided by the Unicode Consortium. The mappings are made available to software vendors, but that does not necessarily mean they were implemented as documented. In fact, any software vendor, such as Microsoft, IBM, or Oracle, can implement these mappings as it desires.

## Charset Transcoding and Character Mappings

Sometimes characters and strings are transcoded from a source character set into a destination character set. On the surface this phenomenon may seem similar to best-fit mappings, but the process is quite different. In general, when software transcodes data from source charset X to destination charset Y, it follows either a data-driven mapping table or an algorithmic formula.

For the most part this process is data-driven. While the mapping tables are generally based on published standards, there may be differences between vendors. ICU maintains a list of its character set mapping tables online. Also, ICU's Converter Explorer tool lets you browse the maintained charset mapping tables.

Data may be transcoded directly from a source charset to a destination charset; however, it's also common to use Unicode as the broker. In the latter case the software will first transcode the source charset to Unicode, and from there to the destination charset. Some vendors, such as Microsoft, are known to leverage the Private Use Area (PUA) when transcoding to Unicode, when a direct mapping cannot be found or when a source byte sequence is invalid or illegal. It's important to be aware of a few pitfalls during the transcoding process.

* When data is transcoded to the PUA, converting it again from the PUA may have unexpected consequences.
* Data can change length, particularly if transcoding to or from a single-byte charset leads to a multi-byte character in the other charset.
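
The length change mentioned in the last pitfall is easy to observe when Unicode is used as the broker. The following is a minimal sketch. The helper `transcode` is a hypothetical name, and the sample strings are chosen only for illustration.

```python
def transcode(data, source_charset, dest_charset):
    """Transcode bytes between charsets, using a Unicode string as the broker."""
    text = data.decode(source_charset)   # source charset -> Unicode
    return text.encode(dest_charset)     # Unicode -> destination charset


latin1_bytes = "caf\u00E9".encode("latin-1")           # 4 bytes, one per character
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
print(len(latin1_bytes), len(utf8_bytes))              # the UTF-8 form is longer

sjis_bytes = "\u65E5\u672C".encode("shift_jis")        # two kanji, 2 bytes each
utf8_kanji = transcode(sjis_bytes, "shift_jis", "utf-8")
print(len(sjis_bytes), len(utf8_kanji))                # 3 bytes each in UTF-8
```

Code that sizes a destination buffer from the length of the source bytes would be wrong in both cases, which is one way transcoding contributes to the buffer overflow issues listed earlier in this guide.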

As a software engineer building a mechanism for transcoding data between charsets, it's important to understand these pitfalls and handle these unexpected cases gracefully.

Software vulnerabilities can arise through charset transcodings. To name a few:

* Transcoding data is not always reversible, so data can be irrevocably lost.
* Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
* Characters can be manipulated to abuse logic in software. For example, characters transcoded into ../ or file:// would prove detrimental in file handling operations.

## Normalization

In Unicode, Normalization of characters and strings follows a specification defined in the Unicode Standard Annex #15: Unicode Normalization Forms. The details of Normalization are not for the faint of heart and will not be discussed in this guide. For engineers and testers, it's at least important to understand that there are four Normalization forms defined:

* NFD - Canonical Decomposition
* NFC - Canonical Decomposition, followed by Canonical Composition
* NFKD - Compatibility Decomposition
* NFKC - Compatibility Decomposition, followed by Canonical Composition

When testing for security vulnerabilities, we're often most interested in the compatibility forms (NFKC, NFKD), but occasionally the canonical forms (NFC, NFD) will produce interesting transformations as well. Cases where characters, and sequences of characters, transform into something different from the original source might be used to bypass filters or produce other exploits. Consider the following image, which depicts the result of normalizing with either NFKC or NFKD for the character U+FE64 SMALL LESS-THAN SIGN.

In the above example, the character U+FE64 will transform into U+003C, which might lead to a security vulnerability in HTML applications. Consider the next example, which shows the result of either NFD or NFKD decomposition applied to the "Turkish I" character U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.

As a software engineer, it becomes evident that Unicode normalization plays an important role, and that it is not always an explicit choice. Oftentimes normalization is applied implicitly by the underlying framework, platform, or Web browser. It's important to understand the APIs being used directly, and in some cases indirectly (by other processing on the stack).

### Normalization Defaults in Common Libraries
The following table of common library APIs lists known behaviors:

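
| Library | API | Default | Can override | Notes |
|---|---|---|---|---|
| .NET | System.Text.Encoding | NFC | Yes | |
| Win32 | | | | |
| Java | | | | |
| ICU | | | | |
| Ruby | | | | |
| Python | | | | |
| PHP | | | | |
| Perl | | | | |
| JavaScript | String.prototype.normalize() | NFC | Yes | MDN |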

### Normalization in Web Browser URLs
The following table captures how Web browsers normalize URLs. Differences in normalization and character transformations can lead to incompatibility as well as security vulnerability.
source


| Description | MSIE 9 | FF 5.0 | Chrome 12 | Safari 5 | Opera 11.5 |
|---|---|---|---|---|---|
| Applies normalization in the path | No | No | No | Yes - NFC | No |
| Applies normalization in the query | No | No | No | Yes - NFC | No |
| Applies normalization in the fragment | No | No | Yes - NFC | Yes - NFC | No |

### Normalization Test Cases
The following table lists test cases to run from a black-box, external perspective. By interpreting the output or rendered data, a tester can determine whether a normalization transformation may be happening.


| Target char | Target code point | Test code point | Name |
|---|---|---|---|
| o | \u006F | \u2134 | SCRIPT SMALL O |
| s | \u0073 | \u017F | LATIN SMALL LETTER LONG S |
| K | \u004B | \u212A | KELVIN SIGN |
| A | \u0041 | \uFF21 | FULLWIDTH LATIN CAPITAL LETTER A |
| " | \u0022 | \uFF02 | FULLWIDTH QUOTATION MARK |
| ' | \u0027 | \uFF07 | FULLWIDTH APOSTROPHE |
| < | \u003C | \uFF1C | FULLWIDTH LESS-THAN SIGN |
| < | \u003C | \uFE64 | SMALL LESS-THAN SIGN |
| > | \u003E | \uFE65 | SMALL GREATER-THAN SIGN |
| > | \u003E | \uFF1E | FULLWIDTH GREATER-THAN SIGN |
| : | \u003A | \uFE13 | PRESENTATION FORM FOR VERTICAL COLON |
| : | \u003A | \uFE55 | SMALL COLON |
| : | \u003A | \uFF1A | FULLWIDTH COLON |
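
The test cases in the table above can be checked locally before they are sent to a target. This is a minimal sketch using Python's standard `unicodedata` module. The dictionary name `NORMALIZATION_CASES` and the function `check_cases` are hypothetical names used only for this illustration.

```python
import unicodedata

# Test code point -> the ASCII target it is expected to become under NFKC.
NORMALIZATION_CASES = {
    "\u2134": "o", "\u017F": "s", "\u212A": "K", "\uFF21": "A",
    "\uFF02": '"', "\uFF07": "'",
    "\uFF1C": "<", "\uFE64": "<",
    "\uFE65": ">", "\uFF1E": ">",
    "\uFE13": ":", "\uFE55": ":", "\uFF1A": ":",
}


def check_cases(form):
    """Return the test characters that normalize to their target under `form`."""
    return [
        test_char
        for test_char, target in NORMALIZATION_CASES.items()
        if unicodedata.normalize(form, test_char) == target
    ]


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    hits = check_cases(form)
    print(form, [f"U+{ord(ch):04X}" for ch in hits])
```

Running the loop makes the difference between the canonical and compatibility forms visible: most of these characters only change under NFKC and NFKD, which is why those forms get the most attention in security testing.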

TODO If you've determined that input is being normalized but need different characters to exploit the logic, you may use the accompanying test case database.

## Canonicalization of Non-Shortest Form UTF-8
The UTF-8 encoding algorithm allows for a single code point to be represented in multiple ways. That is, while the Latin letter 'A' is normally represented using the byte 0x41 in UTF-8, its non-shortest form, or overlong, encoding would be any of the following:

* 0xC1 0x81
* 0xE0 0x81 0x81
* 0xF0 0x80 0x81 0x81
* etc...

Earlier versions of the Unicode Standard applied Postel's law, or the robustness principle of 'be conservative in what you do, be liberal in what you accept from others.' While the 'generation' of non-shortest form UTF-8 was forbidden, the 'interpretation' of it was allowed. That changed with Corrigendum #1 to Unicode Standard version 3.0, when the requirement changed to prohibit both interpretation and generation. In fact, both the 'generation' and 'interpretation' of non-shortest form UTF-8 are currently prohibited by the standard, with one qualification: the prohibition on 'interpretation' applies to the Basic Multilingual Plane (BMP) code points between U+0000 and U+FFFF. In terms of the common security vulnerabilities discussed in this document, that qualification has no bearing, as the ASCII range of characters is not exempt.

Given the history of security vulnerabilities around overlong UTF-8, many frameworks have defaulted to the more secure position of disallowing these forms from being either generated or interpreted. However, it seems that some software still interprets non-shortest form UTF-8 for BMP characters, including ASCII. A common pattern in software follows:

> Process A performs security checks, but does not check for non-shortest forms.

> Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms.
> The UTF-16 text may then contain characters that should have been filtered out by process A. [source](https://unicode.org/versions/corrigendum1.html)

The overlong form of a UTF-8 byte sequence is currently considered an illegal byte sequence. It's therefore a good test case to attempt in software such as Web applications, browsers, and databases.

Some notes about canonicalization and UTF-8 encoded data:

* The ASCII range (0x00 to 0x7F) is preserved in UTF-8.
* The UTF-8 bit layout can mechanically represent a code point from U+0000 through U+10FFFF using more bytes than necessary, which is what leads to the non-shortest form problem.
* The Unicode Standard (3.0 and later) requires that a code point be serialized in UTF-8 using a byte sequence of one to four bytes in length. [Corrigendum #1: UTF-8 Shortest Form](https://unicode.org/versions/corrigendum1.html) introduced this conformance requirement.

__Non-shortest form UTF-8__ has been the vector for critical vulnerabilities in the past. One example is the [Microsoft IIS 4.0 and 5.0 directory traversal vulnerability](http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx) of 2000, which was rediscovered in the product's [WebDAV component in 2009](http://blog.zoller.lu/2009/05/iis-6-webdac-auth-bypass-and-data.html).

Some of the common security vulnerabilities that use non-shortest form UTF-8 as an attack vector include:

* Directory/folder traversal.
* Bypassing folder and file access filters.
* Bypassing HTML and XSS filters.
* Bypassing WAF and NIDS-type devices.

As a developer trying to protect against this, it becomes important to understand the APIs being used directly, and in some cases indirectly (by other processing on the stack).
The following table of common library APIs lists known behaviors:

| Library | API | Allows non-shortest UTF-8 | Can override | Notes |
|---|---|---|---|---|
| .NET 2.0 | System.Text.Encoding | | | |
| .NET 3.0 | System.Text.Encoding | | | |
| ICU | System.Text.Encoding | | | |

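One quick way to learn how a particular decoder treats these forms is to feed it the overlong encodings of 'A' listed earlier. The sketch below does that with Python's built-in UTF-8 codec, chosen only as an example of a strict decoder; the helper name `accepts` is made up for illustration.

```python
def accepts(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 codec decodes the bytes."""
    try:
        raw.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


shortest = b"\x41"
overlong_forms = [
    b"\xc1\x81",          # two-byte overlong 'A'
    b"\xe0\x81\x81",      # three-byte overlong 'A'
    b"\xf0\x80\x81\x81",  # four-byte overlong 'A'
]

# Map each overlong form (as hex) to whether the decoder accepted it.
results = {form.hex(): accepts(form) for form in overlong_forms}
print(accepts(shortest), results)
```

A decoder that reports acceptance for any of the overlong forms is worth investigating further.
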
As a tester/bug hunter looking for the vulnerabilities, the following table lists test cases to run from a black-box, external perspective. The data in this table presents the first few non-shortest forms (__NSF__) of UTF-8 as URL-encoded data %NN. If you need __raw bytes__ instead, these same hex values apply. All of the target chars in the first column are ASCII.

| Target | NSF 1 | NSF 2 | NSF 3 | Notes |
|---|---|---|---|---|
| `A` | %C1%81 | %E0%81%81 | %F0%80%81%81 | Latin A useful as a base test case. |
| `"` | %C0%A2 | %E0%80%A2 | %F0%80%80%A2 | Double quote |
| `'` | %C0%A7 | %E0%80%A7 | %F0%80%80%A7 | Single quote |
| `<` | %C0%BC | %E0%80%BC | %F0%80%80%BC | Less-than sign |
| `>` | %C0%BE | %E0%80%BE | %F0%80%80%BE | Greater-than sign |
| `.` | %C0%AE | %E0%80%AE | %F0%80%80%AE | Full stop |
| `/` | %C0%AF | %E0%80%AF | %F0%80%80%AF | Solidus |
| `\` | %C1%9C | %E0%81%9C | %F0%80%81%9C | Reverse solidus |

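Test cases like the ones in the table above can also be generated rather than typed by hand. The following is a minimal sketch, assuming only ASCII targets; the function names `overlong` and `url_encoded` are hypothetical helpers, not part of any library.

```python
# Marker bits for the lead byte of a 2-, 3-, and 4-byte UTF-8 sequence.
LEAD_MARKERS = {2: 0xC0, 3: 0xE0, 4: 0xF0}


def overlong(code_point: int, length: int) -> bytes:
    """Build an ill-formed, non-shortest UTF-8 sequence for an ASCII code point."""
    if not 0 <= code_point <= 0x7F:
        raise ValueError("this sketch only handles ASCII targets")
    if length not in LEAD_MARKERS:
        raise ValueError("length must be 2, 3, or 4")
    trail = []
    value = code_point
    for _ in range(length - 1):
        trail.append(0x80 | (value & 0x3F))  # low six bits go into a trail byte
        value >>= 6
    lead = LEAD_MARKERS[length] | value      # remaining bits go into the lead byte
    return bytes([lead] + trail[::-1])


def url_encoded(raw: bytes) -> str:
    """Render raw bytes as %NN escapes."""
    return "".join(f"%{b:02X}" for b in raw)


for target in "A\"'<>./\\":
    forms = [url_encoded(overlong(ord(target), n)) for n in (2, 3, 4)]
    print(target, *forms)
```

The same helper can be pointed at any other ASCII character a filter is expected to block.
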
## Over-consumption

The Unicode Transformation Formats (e.g. UTF-8 and UTF-16) serialize code points into legal, or well-formed, sequences of code units, which for UTF-8 are single bytes. For example, consider the following code points and their corresponding well-formed byte sequences in UTF-8 format.

| Code point | Description | UTF-8 byte sequence |
|---|---|---|
| U+0041 | LATIN CAPITAL LETTER A | 0x41 |
| U+FF21 | FULLWIDTH LATIN CAPITAL LETTER A | 0xEF 0xBC 0xA1 |
| U+00C0 | LATIN CAPITAL LETTER A WITH GRAVE | 0xC3 0x80 |

And following are the same code points in their corresponding well-formed UTF-16 (little endian) format, where the low-order byte of each 16-bit code unit comes first.

| Code point | Description | UTF-16LE byte sequence |
|---|---|---|
| U+0041 | LATIN CAPITAL LETTER A | 0x41 0x00 |
| U+FF21 | FULLWIDTH LATIN CAPITAL LETTER A | 0x21 0xFF |
| U+00C0 | LATIN CAPITAL LETTER A WITH GRAVE | 0xC0 0x00 |

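The byte sequences for these three code points can be checked with any conformant encoder. The sketch below uses Python's built-in codecs as one such encoder; the `hex_bytes` helper is only there for readable output.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


samples = {
    "U+0041": "\u0041",  # LATIN CAPITAL LETTER A
    "U+FF21": "\uff21",  # FULLWIDTH LATIN CAPITAL LETTER A
    "U+00C0": "\u00c0",  # LATIN CAPITAL LETTER A WITH GRAVE
}

for label, ch in samples.items():
    utf8 = hex_bytes(ch.encode("utf-8"))
    utf16le = hex_bytes(ch.encode("utf-16-le"))
    print(f"{label}  UTF-8: {utf8:<10} UTF-16LE: {utf16le}")
```
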
### Well-formed and Ill-formed Byte Sequences
Consider a UTF-8 decoder consuming a stream of data from a file. It encounters a well-formed byte sequence like:

    <41 C3 80 41>

This sequence is made up of three well-formed _sub-sequences_. First is the `<41>`, second is the `<C3 80>`, and third is the `<41>`. The second subsequence `<C3 80>` is a two-byte sequence. The lead byte C3 indicates a two-byte sequence, and the trailing byte 80 is a valid trailing byte. The table below indicates these relationships. Now consider that the UTF-8 decoder encounters an __ill-formed byte sequence__:

    <41 C2 C3 80 41>

Taken apart, there are three minimally well-formed subsequences `<41>`, `<C3 80>`, and `<41>`. However, the `<C2>` is ill-formed because it isn't followed by a valid trailing byte, which is required per the table below.

| Code point | First byte | Second byte | Third byte | Fourth byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |

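The well-formedness table translates almost directly into a lookup-driven check. The following is a minimal sketch, not a production validator: `WELL_FORMED` and `first_ill_formed` are names invented here, and the row for lead byte F4 restricts the second byte to 80..8F, as the Unicode Standard does, so that nothing above U+10FFFF is accepted.

```python
# Each row: (lead byte range, allowed range for the second byte, total length).
# Third and fourth bytes are always in 80..BF.
WELL_FORMED = [
    ((0x00, 0x7F), None, 1),
    ((0xC2, 0xDF), (0x80, 0xBF), 2),
    ((0xE0, 0xE0), (0xA0, 0xBF), 3),
    ((0xE1, 0xEC), (0x80, 0xBF), 3),
    ((0xED, 0xED), (0x80, 0x9F), 3),
    ((0xEE, 0xEF), (0x80, 0xBF), 3),
    ((0xF0, 0xF0), (0x90, 0xBF), 4),
    ((0xF1, 0xF3), (0x80, 0xBF), 4),
    ((0xF4, 0xF4), (0x80, 0x8F), 4),
]


def first_ill_formed(data: bytes):
    """Return the offset of the first ill-formed byte, or None if none is found."""
    i = 0
    while i < len(data):
        row = next((r for r in WELL_FORMED if r[0][0] <= data[i] <= r[0][1]), None)
        if row is None:
            return i  # C0, C1, F5..FF, or a stray trailing byte
        _, second, length = row
        seq = data[i:i + length]
        if len(seq) < length:
            return i  # truncated sequence
        for pos in range(1, length):
            lo, hi = second if pos == 1 else (0x80, 0xBF)
            if not lo <= seq[pos] <= hi:
                return i  # lead byte not followed by a valid trailing byte
        i += length
    return None


print(first_ill_formed(bytes.fromhex("41C38041")))    # well-formed input
print(first_ill_formed(bytes.fromhex("41C2C38041")))  # ill-formed input
```

On error the check reports the offset of the lead byte and stops; it never skips over the byte that failed validation, which is the behavior an over-consuming decoder gets wrong.
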
The table above shows the legal and valid UTF-8 byte sequences, as defined by the Unicode Standard 5.0. The lower ASCII range 00..7F has always been preserved in UTF-8. Multi-byte sequences start at code point U+0080 and run from two to four bytes. For example, code point U+0700 is encoded in UTF-8 as a two-byte sequence, with the lead byte somewhere in the range of C2..DF.


### Handling Ill-formed Byte Sequences
Over-consumption of well-formed byte sequences has been the vector for critical vulnerabilities. These generally expose widespread issues when they affect a widely used library. One example was found in the [International Components for Unicode (ICU)](http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-0153) in 2009, which would leave almost any Web application exposed to cross-site scripting (XSS) threats, since software such as Apple's Safari Web browser exposed the flaw. Even Web applications with strong HTML/XSS filters can be vulnerable when the Web browser is non-conformant.

The following input illustrates the over-consumption attack vector, where an attacker controls the img element's src attribute, followed by a text fragment in the HTML. The [0xC2] represents the attacker's UTF-8 lead byte. It is followed by an invalid trailing byte, the double quote " that closes the src attribute, and an over-consuming decoder swallows that quote along with the lead byte. The HTML text portion including the onerror text is also attacker-controlled input. The entire payload becomes:

    <img src="#[0xC2]"> " onerror="alert(1)"</ br>

The resultant string after over-consumption:

    <img src="#> " onerror="alert(1)"</ br>

Although the above is a broken fragment of HTML because the img element is not properly closed, most browsers will render it as an img element with an onerror event handler.

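The difference between an over-consuming decoder and a conformant one can be reproduced in a few lines. In the sketch below, `naive_decode` is a deliberately flawed decoder written for this illustration (it is not taken from ICU or any other library), and Python's built-in codec with `errors="replace"` stands in for a decoder that handles the error safely.

```python
def naive_decode(data: bytes) -> str:
    """Deliberately flawed: treats a two-byte lead plus the next byte as one unit."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC2 <= b <= 0xDF:
            unit = data[i:i + 2]
            try:
                out.append(unit.decode("utf-8"))
            except UnicodeDecodeError:
                pass  # FLAW: the whole unit is dropped, including the innocent second byte
            i += 2
        else:
            out.append("\ufffd")  # other lead bytes are out of scope for this sketch
            i += 1
    return "".join(out)


payload = b'<img src="#\xc2"> " onerror="alert(1)"</ br>'

unsafe = naive_decode(payload)
safe = payload.decode("utf-8", errors="replace")

print(unsafe)  # the quote that closed the src attribute is gone
print(safe)    # the quote survives; only the bad lead byte becomes U+FFFD
```

The safe result keeps the attribute boundary intact, so the onerror text stays outside the tag.
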
Some of the common security vulnerabilities that exploit an over-consumption flaw as an attack vector include:

* Bypassing folder and file access filters.
* Bypassing parser-based filters such as HTML and XSS filters.
* Bypassing detection signatures in WAF and NIDS-type devices.

As a developer trying to protect against this, it again becomes important to understand how the APIs being used will handle ill-formed byte sequences. The following table of common library APIs lists known behaviors:

| Library | API | Allows ill-formed UTF-8 | Can override | Notes |
|---|---|---|---|---|
| .NET 2.0 | System.Text.Encoding | No | No | |
| .NET 3.0 | UTF8Encoding | No | No | |
| ICU | System.Text.Encoding | Yes | Yes | |

As a tester/bug hunter looking for the vulnerabilities, the following table lists test cases to run from a black-box, external perspective. The data in this table presents byte sequences that could elicit __over-consumption__. You can prepend a % to each byte value __to create a URL-encoded value__ for use in testing. This would be applicable for passing ill-formed byte sequences to a Web application.

| Source bytes | Expected safe result | Desired unsafe result | Notes |
|---|---|---|---|
| C2 22 3C | 22 3C | 3C | Error handling of C2 overconsumed the trailing 22. |

"%C0%A2%E0%80%A2Double quote
Over-consumption typically happens at a layer lower than the one most developers work at. It's more likely to be in the frameworks, the browsers, the database, etc. If designing a character set or Unicode layer, be sure to include an error condition for cases where valid lead bytes are followed by invalid trailing bytes.

## Handling the Unexpected
Through error handling, filtering, or other cases of input validation, problematic characters or raw bytes might be replaced or deleted. In these cases, it's important that the resultant string or byte sequence does not introduce a vulnerability. This problem is not specific to Unicode by any means, and can occur with any character set. However, as will be discussed, Unicode has a good solution.

### Unexpected Inputs
TODO
#### Unassigned Code Points
U+2073
#### Illegal Code Points
e.g. half of a surrogate pair

### Character Substitution
The following input illustrates a dangerous character substitution. In this case, the application uses input validation to detect when a string contains characters such as < and then sanitizes such characters by replacing them with a . (period, or full stop). Internally, the application fetches files from a file share in the form:

    file://sharename/protected/user-01/files

By exploiting the character substitution logic, for example by supplying << so that it is rewritten to .., an attacker could perform directory traversal attacks on the application:

    file://sharename/protected/user-01/../user-002/files

### Character Deletion
An application may choose to delete characters when invalid, illegal, or unexpected data is encountered. This can also be problematic if not handled carefully. In general, it's safer to replace with Unicode's REPLACEMENT CHARACTER U+FFFD than it is to delete.

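To see why deletion is riskier than replacement, consider a sketch in which a filter runs first and a later stage removes a character the filter did not account for. Everything here is hypothetical: the filter, the two sanitizers, and the choice of U+FEFF as the character being removed are assumptions made for illustration.

```python
BOM = "\ufeff"  # stands in for any character a later stage silently removes


def xss_filter_rejects(text: str) -> bool:
    """Hypothetical filter: rejects input containing a literal <script> tag."""
    return "<script>" in text.lower()


def sanitize_by_deleting(text: str) -> str:
    """Risky: removing the character can join the text on either side of it."""
    return text.replace(BOM, "")


def sanitize_by_replacing(text: str) -> str:
    """Safer: U+FFFD keeps the surrounding text apart."""
    return text.replace(BOM, "\ufffd")


payload = "<scr" + BOM + "ipt>"

passed_filter = not xss_filter_rejects(payload)
after_delete = sanitize_by_deleting(payload)
after_replace = sanitize_by_replacing(payload)

print(passed_filter, repr(after_delete), repr(after_replace))
```

The payload passes the filter, and deletion then reassembles the tag the filter was meant to block, while replacement leaves a string that is still not a script tag.
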
Consider a Web browser that deletes certain special characters, such as a mid-stream Unicode BOM, when it encounters them during HTML parsing. An attacker injects the following HTML, which includes the Unicode BOM represented by U+FEFF. The presence of this character allows the attacker's input to bypass the Web application's cross-site scripting filter, which rejects an occurrence of `<script>`.

    <scr[U+FEFF]ipt>

The Unicode BOM has special meaning in the standard, and in most software. The following image illustrates some of the special properties associated with this character:

TODO add image

The Unicode BOM is recommended input for most software test cases, and can be especially useful when testing text parsers such as HTML and XML.

#### Guidance

Handle error conditions securely by replacing with the Unicode REPLACEMENT CHARACTER U+FFFD. If that's impractical for some reason, then choose a safe replacement that doesn't have syntactical meaning in the protocol being used. Some common examples include ? and #.

## Upper and Lower Casing
Strings are transformed through upper and lower casing operations, and sometimes in ways that weren't intended. This behavior can be exploited if performed at the wrong time. For example, if a casing operation is performed anywhere in the stack after a security check, then a special character like U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE could be used to bypass a cross-site scripting filter.

    toLower("İ") == "i"

A filter that looks for the literal text "script" does not match the input below, yet a later casing operation that behaves as above turns it into exactly that string:

    toLower("scrİpt") == "script"

Another aspect of casing operations is that the length of characters and strings can change, depending on the input. The following should never be assumed:

    len(x) == len(toLower(x))

Common frameworks handle string comparison in different ways. The following table captures the behavior of classes intended for case-sensitive and case-insensitive string comparison.

| Library | API | Is Dangerous | Can override | Notes |
|---|---|---|---|---|
| .NET 1.0 | StringComparer | | | |
| .NET 2.0 | StringComparer | | | |
| .NET 3.0 | StringComparer | | | |
| Win32 | CompareStringOrdinal | | | |
| Win32 | lstrcmpi | | | |
| Win32 | CompareStringEx | | | |
| ICU C | ucol_strcoll | | | |
| ICU C | ucol_strcollIter | | | Allows for comparing two strings that are supplied as character iterators (UCharIterator). This is useful when you need to compare differently encoded strings using strcoll. |
| ICU C++ | Collator::Compare | | | |
| ICU C | u_strCaseCompare | | | Compare two strings case-insensitively using full case folding. |
| ICU C | u_strcasecmp | | | Compare two strings case-insensitively using full case folding. |
| ICU C | u_strncasecmp | | | Compare two strings case-insensitively using full case folding. |
| ICU Java | caseCompare | | | Compare two strings case-insensitively using full case folding. |
| ICU Java | Collator.compare | | | |
| POSIX | strcoll | | | |

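How a casing operation treats U+0130 depends on the runtime and locale, so it is worth probing the one in use. The sketch below uses Python's `str.lower` and `str.upper`, which apply Unicode's full case mappings; under those mappings U+0130 lowercases to "i" followed by U+0307 COMBINING DOT ABOVE rather than to a bare "i", so the exact filter bypass described above needs a runtime that maps it to a plain "i". The length changes, however, are easy to observe. The helper `length_preserved` is a name invented for this example.

```python
def length_preserved(text: str) -> bool:
    """Check the assumption that casing never changes a string's length."""
    return len(text) == len(text.lower()) == len(text.upper())


samples = {
    "U+0041": "\u0041",  # LATIN CAPITAL LETTER A
    "U+0130": "\u0130",  # LATIN CAPITAL LETTER I WITH DOT ABOVE
    "U+00DF": "\u00df",  # LATIN SMALL LETTER SHARP S
    "U+0390": "\u0390",  # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
}

report = {label: length_preserved(ch) for label, ch in samples.items()}
print(report)

# Safer ordering: apply the casing operation first, then run the check
# on the string that will actually be used.
candidate = "scr\u0130pt".lower()
print("script" in candidate, len(candidate))
```

Running security checks on the final, already-cased string removes the window in which a later transformation can change what the check saw.
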
## Buffer Overflows
Buffer overflows can occur through improper assumptions about characters versus bytes, and also about string sizes after casing and normalization operations.

### Upper and Lower Casing
The following table from UTR 36 illustrates the maximum expansion factors for casing operations on the edge-case characters in Unicode. These inputs make excellent test cases.

| Operation | UTF | Factor | Sample | |
|---|---|---|---|---|
| Lower | 8 | 1.5 | Ⱥ | U+023A |
| | 16, 32 | 1 | A | U+0041 |
| Upper | 8, 16, 32 | 3 | ΐ | U+0390 |

[source: Unicode Technical Report #36](https://www.unicode.org/reports/tr36/)

### Normalization

The following table from UTR 36 illustrates the maximum expansion factors for normalization operations on the edge-case characters in Unicode. These inputs make excellent test cases.

| Operation | UTF | Factor | Sample | |
|---|---|---|---|---|
| NFC | 8 | 3X | 𝅘𝅥𝅮 | U+1D160 |
| | 16, 32 | 3X | שּׁ | U+FB2C |
| NFD | 8 | 3X | ΐ | U+0390 |
| | 16, 32 | 4X | ᾂ | U+1F82 |
| NFKC/NFKD | 8 | 11X | ﷺ | U+FDFA |
| | 16, 32 | 18X | | |

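The expansion factors for casing and normalization can be measured directly, which is a useful sanity check before sizing a buffer. The sketch below is one way to do it with Python's `unicodedata` module; the `factor` helper and the choice of `utf-16-le` to represent 16-bit code units are assumptions of this example, and the results depend on the Unicode data shipped with the interpreter.

```python
import unicodedata
from functools import partial


def factor(ch: str, transform, encoding: str) -> float:
    """Ratio of encoded size after the transform to encoded size before it."""
    return len(transform(ch).encode(encoding)) / len(ch.encode(encoding))


nfc = partial(unicodedata.normalize, "NFC")
nfd = partial(unicodedata.normalize, "NFD")
nfkc = partial(unicodedata.normalize, "NFKC")

rows = [
    ("Lower", "\u023a", str.lower, "utf-8"),
    ("Upper", "\u0390", str.upper, "utf-8"),
    ("NFC", "\U0001d160", nfc, "utf-8"),
    ("NFC", "\ufb2c", nfc, "utf-16-le"),
    ("NFD", "\u0390", nfd, "utf-8"),
    ("NFD", "\u1f82", nfd, "utf-16-le"),
    ("NFKC", "\ufdfa", nfkc, "utf-8"),
    ("NFKC", "\ufdfa", nfkc, "utf-16-le"),
]

factors = {
    (name, f"U+{ord(ch):04X}", enc): factor(ch, fn, enc)
    for name, ch, fn, enc in rows
}

for (name, code_point, enc), value in factors.items():
    print(f"{name:<6} {code_point:<8} {enc:<10} x{value:g}")
```

Allocating output buffers from the measured worst case, rather than from the input length, avoids the overflow described in this section.
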
[source: Unicode Technical Report #36](https://www.unicode.org/reports/tr36/)



## Controlling Syntax

White space and line feeds affect syntax in parsers such as HTML, XML, and JavaScript. By interpreting characters such as the 'Ogham space mark' and 'Mongolian vowel separator' as white space, software can allow attacks through the system. This could give attackers control over the parser, and enable attacks that might bypass security filters. Several characters in Unicode are assigned the 'white space' category and also the 'white space' binary property. Depending on how software is designed, these characters may literally be treated as a space character U+0020.

For example, the following illustration shows the special white space properties associated with the U+180E MONGOLIAN VOWEL SEPARATOR character.

TODO: add image

If a Web browser interprets this character as white space U+0020, then the following HTML fragment would execute script:

    <a href=#[U+180E]onclick=alert()>


## Charset Mismatch

When software cannot accurately determine the character set of the text it is dealing with, it must decide either to raise an error or to make an assumption. User-agents most commonly must deal with this problem, as they're faced with interpreting data from a large assortment of character sets. There are no standards that define how to handle situations of character set mismatch, and vendor implementations vary greatly.

Consider the following diagram, in which a Web browser receives an HTTP response with an HTTP charset of ISO-8859-1 defined, and a meta tag charset of shift_jis defined in the HTML.

TODO add image

When an attacker can control charset declarations, they can control the software's behavior and in some cases set up an attack.

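The practical effect of a charset mismatch is that two components read different characters from the same bytes. The sketch below decodes one two-byte sequence under the two charsets named above; the byte values are chosen for illustration and Python's built-in codecs stand in for the two components.

```python
raw = b"\x83\x5c"

# A component that trusts the HTTP header sees two characters,
# the second of which is a backslash.
as_latin1 = raw.decode("iso-8859-1")

# A component that trusts the meta tag sees one double-byte character,
# U+30BD KATAKANA LETTER SO, and no backslash at all.
as_sjis = raw.decode("shift_jis")

print([f"U+{ord(c):04X}" for c in as_latin1])
print([f"U+{ord(c):04X}" for c in as_sjis])
```

A filter that inspects the text under one charset can therefore approve input that means something else once another component decodes it under the other.
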
1280 | --------------------------------------------------------------------------------