├── .npmrc ├── compare-codepoint-talks └── compare-by-codepoint.pdf ├── .github └── workflows │ ├── build.yml │ └── deploy.yml ├── package.json ├── .gitignore ├── LICENSE ├── spec.emu └── README.md /.npmrc: -------------------------------------------------------------------------------- 1 | package-lock=false 2 | -------------------------------------------------------------------------------- /compare-codepoint-talks/compare-by-codepoint.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tc39/proposal-compare-strings-by-codepoint/main/compare-codepoint-talks/compare-by-codepoint.pdf -------------------------------------------------------------------------------- /.github/workflows/build.yml: -------------------------------------------------------------------------------- 1 | name: Build spec 2 | 3 | on: [pull_request, push] 4 | 5 | jobs: 6 | build: 7 | runs-on: ubuntu-latest 8 | 9 | steps: 10 | - uses: actions/checkout@v4 11 | - uses: ljharb/actions/node/install@main 12 | name: 'nvm install lts/* && npm install' 13 | with: 14 | node-version: lts/* 15 | - run: npm run build 16 | -------------------------------------------------------------------------------- /.github/workflows/deploy.yml: -------------------------------------------------------------------------------- 1 | name: Deploy gh-pages 2 | 3 | on: 4 | push: 5 | branches: 6 | - main 7 | 8 | jobs: 9 | deploy: 10 | runs-on: ubuntu-latest 11 | 12 | steps: 13 | - uses: actions/checkout@v4 14 | - uses: ljharb/actions/node/install@main 15 | name: 'nvm install lts/* && npm install' 16 | with: 17 | node-version: lts/* 18 | - run: npm run build 19 | - uses: JamesIves/github-pages-deploy-action@v4 20 | with: 21 | branch: gh-pages 22 | folder: build 23 | clean: true 24 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 
| "private": true, 3 | "name": "template-for-proposals", 4 | "description": "A repository template for ECMAScript proposals.", 5 | "scripts": { 6 | "start": "npm run build-loose -- --watch", 7 | "build": "npm run build-loose -- --strict", 8 | "build-loose": "node -e 'fs.mkdirSync(\"build\", { recursive: true })' && ecmarkup --load-biblio @tc39/ecma262-biblio --verbose spec.emu build/index.html --lint-spec" 9 | }, 10 | "homepage": "https://github.com/tc39/template-for-proposals#readme", 11 | "repository": { 12 | "type": "git", 13 | "url": "git+https://github.com/tc39/template-for-proposals.git" 14 | }, 15 | "license": "MIT", 16 | "devDependencies": { 17 | "@tc39/ecma262-biblio": "2.1.2862", 18 | "ecmarkup": "^21.2.0" 19 | }, 20 | "engines": { 21 | "node": ">= 18" 22 | } 23 | } 24 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Logs 2 | logs 3 | *.log 4 | npm-debug.log* 5 | 6 | # Runtime data 7 | pids 8 | *.pid 9 | *.seed 10 | 11 | # Directory for instrumented libs generated by jscoverage/JSCover 12 | lib-cov 13 | 14 | # Coverage directory used by tools like istanbul 15 | coverage 16 | 17 | # nyc test coverage 18 | .nyc_output 19 | 20 | # Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files) 21 | .grunt 22 | 23 | # node-waf configuration 24 | .lock-wscript 25 | 26 | # Compiled binary addons (http://nodejs.org/api/addons.html) 27 | build/Release 28 | 29 | # Dependency directories 30 | node_modules 31 | jspm_packages 32 | 33 | # Optional npm cache directory 34 | .npm 35 | 36 | # Optional REPL history 37 | .node_repl_history 38 | 39 | # Only apps should have lockfiles 40 | yarn.lock 41 | package-lock.json 42 | npm-shrinkwrap.json 43 | pnpm-lock.yaml 44 | 45 | # Build directory 46 | build 47 | -------------------------------------------------------------------------------- /LICENSE: 
-------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 ECMA TC39 and contributors 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /spec.emu: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 |
 7 | title: Proposal Title Goes Here
 8 | stage: -1
 9 | contributors: Your Name(s) Here
10 | 
11 | 12 | 13 |

This is an emu-clause

14 |

This is an algorithm:

15 | 16 | 1. Let _proposal_ be *undefined*. 17 | 1. If IsAccepted(_proposal_) is *true*, then 18 | 1. Let _stage_ be *0*. 19 | 1. Else, 20 | 1. Let _stage_ be *-1*. 21 | 1. Return ? ToString(_stage_). 22 | 23 |
24 | 25 | 26 |

27 | IsAccepted ( 28 | _proposal_: an ECMAScript language value 29 | ): a Boolean 30 |

31 |
32 |
description
33 |
Tells you if the proposal was accepted
34 |
35 | 36 | 1. If _proposal_ is not a String, or is not accepted, return *false*. 37 | 1. Return *true*. 38 | 39 |
40 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Compare Strings by Codepoint 2 | 3 | A TC39 proposal for a new method to compare strings by Unicode codepoints. 4 | 5 | ## Status 6 | 7 | [The TC39 Process](https://tc39.es/process-document/) 8 | 9 | **Stage**: 1 10 | 11 | **Champions**: 12 | - Mathieu Hofman (@mhofman) 13 | - Mark S. Miller (@erights) 14 | - Christopher Hiller (@boneskull) 15 | 16 | ## Motivation 17 | 18 | ### Background on strings in JavaScript 19 | 20 | JavaScript exposes strings as a sequence of 16-bit / UCS-2 characters. For well-formed strings, these characters are UTF-16 code units. UTF-16 uses surrogate pairs for any Unicode codepoint outside the Basic Multilingual Plane. 21 | 22 | That means that a Unicode character with a codepoint in the [0x010000 - 0x10FFFF] range is represented as 2 code units: first a leading surrogate in the [0xD800 - 0xDBFF] range, then a trailing surrogate in the [0xDC00 - 0xDFFF] range. 23 | 24 | For further reading, see @mathiasbynens's post on [JavaScript’s internal character encoding](https://mathiasbynens.be/notes/javascript-encoding). 25 | 26 | ### Effect of string encoding on JavaScript programs 27 | 28 | This encoding choice is observed by JavaScript programs in the following main cases: 29 | - Indexed access to strings. This extends to all String APIs involving offsets or length 30 | - Matching strings using RegExp without the `u` or `v` flag 31 | - Comparing strings 32 | 33 | Unlike indexed access, iterators on strings do operate on codepoints. Similarly, the `u` and `v` RegExp flags enable matching whole Unicode codepoints instead of code units / surrogate halves. However, no String API exists that allows comparing strings by codepoints. 
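
The difference between code units and codepoints described above can be observed directly. The following is a minimal sketch using only existing String and RegExp features; the variable names are illustrative and not part of the proposal.

```js
// U+1D5BA (Mathematical Sans-Serif Small A) lies outside the BMP,
// so it is stored as a surrogate pair of two code units.
const astral = '\u{1d5ba}';
// U+FF42 (Fullwidth Latin Small Letter B) fits in a single code unit.
const bmp = '\u{ff42}';

// Indexed access and length observe code units.
const unitCount = astral.length;          // 2
const leading = astral.charCodeAt(0);     // 0xD835, a leading surrogate
const trailing = astral.charCodeAt(1);    // 0xDDBA, a trailing surrogate

// String iteration observes whole codepoints.
const codePoints = [...astral].map((ch) => ch.codePointAt(0)); // [0x1D5BA]

// Without the `u` flag, `.` matches one code unit; with it, one codepoint.
const matchesWithoutU = /^.$/.test(astral); // false
const matchesWithU = /^.$/u.test(astral);   // true

// Relational operators compare code units, so the leading surrogate 0xD835
// is compared against 0xFF42, even though 0x1D5BA > 0xFF42 as codepoints.
const astralSortsFirst = astral < bmp; // true
```

The last line shows the mismatch this proposal is concerned with: the codepoint order of the two strings is the opposite of what `<` reports.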
34 | 35 | Because JavaScript compares strings by their 16-bit code units, any codepoint in the range [0xE000 - 0xFFFF] will sort after a leading surrogate used to encode the first half of a codepoint in the [0x010000 - 0x10FFFF] range, even though its codepoint is lower. 36 | 37 | ### Interoperability with other systems 38 | 39 | This comparison behavior puts JavaScript at odds with other languages or systems that end up comparing strings by their Unicode codepoints. That includes any language or system that uses UTF-8 as its string encoding and relies on byte comparison (UTF-8 byte order preserves codepoint order). 40 | 41 | Examples of languages using UTF-8 encoding are Swift and Golang. SQLite by default encodes strings using UTF-8 as well. Because of this, the sort order of strings in these systems may not match the sort order of the same strings in JavaScript. 42 | 43 | ### Locale independent compare 44 | 45 | Data is not solely presented to human users. Some data processing involves sorting, and needs to be deterministic and portable. A comparison by Unicode codepoints satisfies those requirements where a locale-based comparison does not. 46 | 47 | ## Proposal 48 | 49 | A `String.codePointCompare(a, b)` function (actual name TBD) that compares 2 strings by their codepoints. The function can be used as an argument for `Array.prototype.sort`. 
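
The interoperability point made earlier, that comparing UTF-8 bytes yields codepoint order, can be checked with a small sketch. `utf8ByteCompare` is a hypothetical helper written for illustration, not the proposed API. It assumes a host that provides the `TextEncoder` global (browsers, Node.js) and well-formed input, since `TextEncoder` replaces unmatched surrogates with U+FFFD.

```js
// Hypothetical helper, not part of the proposal: orders two strings by the
// bytes of their UTF-8 encodings, as a UTF-8 based system would.
function utf8ByteCompare(left, right) {
  const encoder = new TextEncoder();
  const l = encoder.encode(left);
  const r = encoder.encode(right);
  const shared = Math.min(l.length, r.length);
  for (let i = 0; i < shared; i += 1) {
    if (l[i] !== r[i]) return l[i] < r[i] ? -1 : 1;
  }
  // All shared bytes are equal: the shorter encoding is a prefix and sorts first.
  return Math.sign(l.length - r.length);
}

const sample = [
  '\u{ff42}',  // UTF-8 bytes: EF BD 82
  '\u{1d5ba}', // UTF-8 bytes: F0 9D 96 BA
  '\u{63}',    // UTF-8 bytes: 63
];

const byUtf8Bytes = [...sample].sort(utf8ByteCompare); // U+63, U+FF42, U+1D5BA
const byCodeUnits = [...sample].sort();                // U+63, U+1D5BA, U+FF42
```

Encoding both strings on every comparison is wasteful, so this is only meant to make the ordering difference visible, not to serve as a practical comparator.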
50 | 51 | ### Example 52 | 53 | ```js 54 | const arr = [ 55 | '\u{ff42}', // Fullwidth Latin Small Letter B 56 | '\u{1d5ba}', // Mathematical Sans-Serif Small A 57 | '\u{63}', // Latin Small Letter C 58 | ]; 59 | 60 | console.log('native compare', [...arr].sort()); // [ 'c', '𝖺', 'ｂ' ] 61 | console.log('locale compare', [...arr].sort((a, b) => a.localeCompare(b))); // [ '𝖺', 'ｂ', 'c' ] 62 | console.log('null locale compare', [...arr].sort(new Intl.Collator('zxx').compare)); // [ '𝖺', 'ｂ', 'c' ] 63 | console.log('codepoint compare', [...arr].sort(String.codePointCompare)); // [ 'c', 'ｂ', '𝖺' ] 64 | ``` 65 | 66 | ## Alternatives considered 67 | 68 | ### Manual iteration 69 | 70 | It's possible to write a comparator that manually iterates over the strings, and this is the status quo today. It can be implemented either by manually advancing String iterators in lock-step, or by using indexed access and retrieving the code units. 71 | 72 |
73 | codePointCompare shim 74 | 75 | ```js 76 | function codePointCompare(left, right) { 77 | const leftIter = left[Symbol.iterator](); 78 | const rightIter = right[Symbol.iterator](); 79 | for (;;) { 80 | const { value: leftChar } = leftIter.next(); 81 | const { value: rightChar } = rightIter.next(); 82 | if (leftChar === undefined && rightChar === undefined) { 83 | return 0; 84 | } else if (leftChar === undefined) { 85 | // left is a prefix of right. 86 | return -1; 87 | } else if (rightChar === undefined) { 88 | // right is a prefix of left. 89 | return 1; 90 | } 91 | const leftCodepoint = leftChar.codePointAt(0); 92 | const rightCodepoint = rightChar.codePointAt(0); 93 | if (leftCodepoint < rightCodepoint) return -1; 94 | if (leftCodepoint > rightCodepoint) return 1; 95 | } 96 | }; 97 | ``` 98 |
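
The indexed-access variant mentioned in this section can be sketched as follows. This is an illustrative sketch, not the proposal's specified behavior: `codePointCompareIndexed` and `rank` are hypothetical names, and the code assumes both inputs are well-formed strings. For strings with unmatched surrogates it can give a different result than the iterator-based shim.

```js
// Hypothetical helper: maps a code unit to a rank that follows codepoint order.
// Surrogates encode codepoints >= 0x10000, so they must rank above 0xE000-0xFFFF.
function rank(unit) {
  if (unit < 0xd800) return unit;          // below the surrogate block: unchanged
  if (unit >= 0xe000) return unit - 0x800; // 0xE000-0xFFFF -> 0xD800-0xF7FF
  return unit + 0x2000;                    // 0xD800-0xDFFF -> 0xF800-0xFFFF
}

// Hypothetical name; assumes well-formed strings (no unmatched surrogates).
function codePointCompareIndexed(left, right) {
  const shared = Math.min(left.length, right.length);
  for (let i = 0; i < shared; i += 1) {
    const l = left.charCodeAt(i);
    const r = right.charCodeAt(i);
    if (l !== r) {
      // The first differing code unit decides the order once re-ranked.
      return rank(l) < rank(r) ? -1 : 1;
    }
  }
  // One string is a prefix of the other (or they are equal).
  return Math.sign(left.length - right.length);
}

const sorted = ['\u{ff42}', '\u{1d5ba}', '\u{63}'].sort(codePointCompareIndexed);
// sorted is U+63, U+FF42, U+1D5BA
```

Compared with the iterator-based shim, this version avoids creating iterators and intermediate one-character strings, at the cost of leaving the handling of malformed strings unaddressed.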
99 | 100 | ### Null locale collation 101 | 102 | There are already alternative comparators in the language, e.g. `String.prototype.localeCompare` or `Intl.Collator.prototype.compare`. While these operate on codepoints, they take into consideration the locale and collapse characters in the same equivalence class. This is also the case for the `zxx` "locale" in the [Stable Formatting proposal](https://github.com/tc39/proposal-stable-formatting). 103 | 104 | While we could imagine a collation option for a null locale that does not perform any equivalence class or grapheme logic, and simply enables comparing strings by their Unicode codepoints, this seems to [go counter](https://github.com/tc39/proposal-stable-formatting/issues/13) to the core of Intl. 105 | 106 | Furthermore, `Intl` is a normative optional part of the spec, and this would require JS engines that opt out to implement enough of `Intl` just to offer a comparison that is agnostic of internationalization concerns. 107 | 108 | ## Q&A 109 | 110 | ### What should the comparator return if it encounters malformed strings? 111 | 112 | An unmatched surrogate in a string could fall back to being compared by its code unit value. There may be alternative behaviors that could be considered. 113 | 114 | ### What about comparison operators like `<` or `>`? 115 | 116 | Because changing their behavior would be a breaking change, they would continue comparing strings lexicographically by code unit. However, a program could instead check the sign of the result of calling the proposed codepoint comparator function with the 2 strings. 117 | 118 | ### Should we change the default `Array.prototype.sort` comparison? 119 | 120 | As with the operators, this would be a breaking change. Anyone interested in portability can fairly easily provide the comparator function. 121 | 122 | ### Can you provide an example of a system where a portable comparison is needed? 
123 | 124 | The Agoric platform implements collections that use a well defined sort order for keys instead of using insertion order. These collections can be either "heap-only", or backed by a SQLite DB. To provide a consistent iteration order independent of the backing store, the heap implementation needs to use a comparator compatible with the sort order implemented by SQLite. 125 | 126 | ## Presentations 127 | 128 | - [2025-04 plenary](https://docs.google.com/presentation/d/1eTuB1jjgb2_xG_zMNmkhleJx1F0QviMEwkkBUL9ezPQ) ([pdf slides](compare-codepoint-talks/compare-by-codepoint.pdf)) 129 | 130 | ## See also 131 | 132 | https://github.com/endojs/endo/pull/2008 133 | 134 | https://github.com/Agoric/agoric-sdk/issues/10335 135 | 136 | https://github.com/Agoric/agoric-sdk/pull/10299 137 | 138 | https://es.discourse.group/t/builtin-ord-compare-method-for-primitives/724/25 139 | 140 | --------------------------------------------------------------------------------