├── .npmrc
├── compare-codepoint-talks
│   └── compare-by-codepoint.pdf
├── .github
│   └── workflows
│       ├── build.yml
│       └── deploy.yml
├── package.json
├── .gitignore
├── LICENSE
├── spec.emu
└── README.md
/.npmrc:
--------------------------------------------------------------------------------
1 | package-lock=false
2 |
--------------------------------------------------------------------------------
/compare-codepoint-talks/compare-by-codepoint.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/tc39/proposal-compare-strings-by-codepoint/main/compare-codepoint-talks/compare-by-codepoint.pdf
--------------------------------------------------------------------------------
/.github/workflows/build.yml:
--------------------------------------------------------------------------------
1 | name: Build spec
2 |
3 | on: [pull_request, push]
4 |
5 | jobs:
6 |   build:
7 |     runs-on: ubuntu-latest
8 |
9 |     steps:
10 |       - uses: actions/checkout@v4
11 |       - uses: ljharb/actions/node/install@main
12 |         name: 'nvm install lts/* && npm install'
13 |         with:
14 |           node-version: lts/*
15 |       - run: npm run build
16 |
--------------------------------------------------------------------------------
/.github/workflows/deploy.yml:
--------------------------------------------------------------------------------
1 | name: Deploy gh-pages
2 |
3 | on:
4 |   push:
5 |     branches:
6 |       - main
7 |
8 | jobs:
9 |   deploy:
10 |     runs-on: ubuntu-latest
11 |
12 |     steps:
13 |       - uses: actions/checkout@v4
14 |       - uses: ljharb/actions/node/install@main
15 |         name: 'nvm install lts/* && npm install'
16 |         with:
17 |           node-version: lts/*
18 |       - run: npm run build
19 |       - uses: JamesIves/github-pages-deploy-action@v4
20 |         with:
21 |           branch: gh-pages
22 |           folder: build
23 |           clean: true
24 |
--------------------------------------------------------------------------------
/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "private": true,
3 | "name": "template-for-proposals",
4 | "description": "A repository template for ECMAScript proposals.",
5 | "scripts": {
6 | "start": "npm run build-loose -- --watch",
7 | "build": "npm run build-loose -- --strict",
8 | "build-loose": "node -e 'fs.mkdirSync(\"build\", { recursive: true })' && ecmarkup --load-biblio @tc39/ecma262-biblio --verbose spec.emu build/index.html --lint-spec"
9 | },
10 | "homepage": "https://github.com/tc39/template-for-proposals#readme",
11 | "repository": {
12 | "type": "git",
13 | "url": "git+https://github.com/tc39/template-for-proposals.git"
14 | },
15 | "license": "MIT",
16 | "devDependencies": {
17 | "@tc39/ecma262-biblio": "2.1.2862",
18 | "ecmarkup": "^21.2.0"
19 | },
20 | "engines": {
21 | "node": ">= 18"
22 | }
23 | }
24 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Logs
2 | logs
3 | *.log
4 | npm-debug.log*
5 |
6 | # Runtime data
7 | pids
8 | *.pid
9 | *.seed
10 |
11 | # Directory for instrumented libs generated by jscoverage/JSCover
12 | lib-cov
13 |
14 | # Coverage directory used by tools like istanbul
15 | coverage
16 |
17 | # nyc test coverage
18 | .nyc_output
19 |
20 | # Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
21 | .grunt
22 |
23 | # node-waf configuration
24 | .lock-wscript
25 |
26 | # Compiled binary addons (http://nodejs.org/api/addons.html)
27 | build/Release
28 |
29 | # Dependency directories
30 | node_modules
31 | jspm_packages
32 |
33 | # Optional npm cache directory
34 | .npm
35 |
36 | # Optional REPL history
37 | .node_repl_history
38 |
39 | # Only apps should have lockfiles
40 | yarn.lock
41 | package-lock.json
42 | npm-shrinkwrap.json
43 | pnpm-lock.yaml
44 |
45 | # Build directory
46 | build
47 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 ECMA TC39 and contributors
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/spec.emu:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 | title: Proposal Title Goes Here
8 | stage: -1
9 | contributors: Your Name(s) Here
10 |
11 |
12 |
13 |
This is an emu-clause
14 |
This is an algorithm:
15 |
16 | 1. Let _proposal_ be *undefined*.
17 | 1. If IsAccepted(_proposal_) is *true*, then
18 | 1. Let _stage_ be *0*ℤ.
19 | 1. Else,
20 | 1. Let _stage_ be *-1*ℤ.
21 | 1. Return ? ToString(_stage_).
22 |
23 |
24 |
25 |
26 |
27 | IsAccepted (
28 | _proposal_: an ECMAScript language value
29 | ): a Boolean
30 |
31 |
32 |
description
33 |
Tells you if the proposal was accepted
34 |
35 |
36 | 1. If _proposal_ is not a String, or is not accepted, return *false*.
37 | 1. Return *true*.
38 |
39 |
40 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Compare Strings by Codepoint
2 |
3 | A TC39 proposal for a new method to compare strings by Unicode codepoints.
4 |
5 | ## Status
6 |
7 | [The TC39 Process](https://tc39.es/process-document/)
8 |
9 | **Stage**: 1
10 |
11 | **Champions**:
12 | - Mathieu Hofman (@mhofman)
13 | - Mark S. Miller (@erights)
14 | - Christopher Hiller (@boneskull)
15 |
16 | ## Motivation
17 |
18 | ### Background on strings in JavaScript
19 |
20 | JavaScript exposes strings as sequences of 16-bit code units (UCS-2 characters). For well-formed strings, these are UTF-16 code units. UTF-16 uses surrogate pairs to encode any Unicode codepoint outside the Basic Multilingual Plane.
21 |
22 | That means that a Unicode character with a codepoint in the [0x010000 - 0x10FFFF] range is represented as 2 code units: first a leading surrogate in the [0xD800 - 0xDBFF] range, then a trailing surrogate in the [0xDC00 - 0xDFFF] range.
23 |
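As a small illustration of the surrogate pair encoding described above, the following sketch inspects U+1D5BA (a codepoint picked arbitrarily for this example) using only standard String methods, then recombines the two code units by hand:

```js
// U+1D5BA lies outside the Basic Multilingual Plane.
const s = '\u{1d5ba}';

// One codepoint, but two 16-bit code units.
const codeUnits = [s.charCodeAt(0), s.charCodeAt(1)];
console.log(s.length); // 2
console.log(codeUnits.map((u) => u.toString(16))); // [ 'd835', 'ddba' ]

// Recombine the surrogate pair by hand to recover the codepoint.
const [lead, trail] = codeUnits;
const codepoint = 0x10000 + ((lead - 0xd800) << 10) + (trail - 0xdc00);
console.log(codepoint === s.codePointAt(0)); // true
```
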
24 | For further reading, see @mathiasbynens's post on [JavaScript’s internal character encoding](https://mathiasbynens.be/notes/javascript-encoding).
25 |
26 | ### Effect of string encoding on JavaScript programs
27 |
28 | This encoding choice is observed by JavaScript programs in the following main cases:
29 | - Indexed access to a string. This extends to all String APIs involving offsets or length
30 | - Matching strings using a RegExp without the `u` or `v` flag
31 | - Comparing strings
32 |
33 | Unlike indexed access, iterators on strings do operate on codepoints. Similarly, the `u` and `v` RegExp flags enable matching whole Unicode codepoints instead of code units / surrogate halves. However, no String API exists that allows comparing strings by codepoints.
34 |
35 | Because JavaScript compares strings by their 16-bit code units, any codepoint in the range [0xE000 - 0xFFFF] sorts after the leading surrogate that encodes the first half of a codepoint in the [0x010000 - 0x10FFFF] range, even though its codepoint value is lower.
36 |
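The cases above can be observed directly. This is a minimal sketch; the two sample characters (and the names `bmp` and `astral`) are chosen only for illustration:

```js
const bmp = '\u{ff42}';     // in the [0xE000 - 0xFFFF] range, one code unit
const astral = '\u{1d5ba}'; // in the [0x010000 - 0x10FFFF] range, two code units

// Indexed access and length observe code units.
console.log(astral.length);          // 2
console.log(astral[0] === '\ud835'); // true: a lone leading surrogate

// String iteration observes codepoints.
console.log([...astral].length); // 1

// The relational operators compare code units: 0xFF42 > 0xD835.
console.log(bmp < astral); // false

// By codepoint the order is reversed: 0xFF42 < 0x1D5BA.
console.log(bmp.codePointAt(0) < astral.codePointAt(0)); // true
```
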
37 | ### Interoperability with other systems
38 |
39 | This comparison behavior puts JavaScript at odds with other languages or systems that end up comparing strings by their Unicode codepoints. That includes any language or system that uses UTF-8 as its string encoding and relies on byte comparison (UTF-8 preserves codepoint sort order).
40 |
41 | Examples of languages using UTF-8 encoding are Swift and Golang. SQLite also encodes strings using UTF-8 by default. Because of this, the sort order of strings in these systems may not match the sort order of the same strings in JavaScript.
42 |
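To illustrate how a UTF-8 based system would order strings, here is a rough sketch that compares strings by their UTF-8 bytes using the standard `TextEncoder`. The helper name `utf8ByteCompare` is hypothetical, and the sketch assumes well-formed strings, since `TextEncoder` replaces unpaired surrogates with U+FFFD:

```js
// Hypothetical helper: compares two strings by their UTF-8 encoded bytes.
// Assumes well-formed input (TextEncoder replaces lone surrogates).
function utf8ByteCompare(left, right) {
  const encoder = new TextEncoder();
  const l = encoder.encode(left);
  const r = encoder.encode(right);
  const limit = Math.min(l.length, r.length);
  for (let i = 0; i < limit; i++) {
    if (l[i] !== r[i]) return l[i] < r[i] ? -1 : 1;
  }
  // One byte sequence is a prefix of the other: the shorter sorts first.
  return Math.sign(l.length - r.length);
}

const samples = ['\u{ff42}', '\u{1d5ba}', '\u{63}'];
const byBytes = [...samples].sort(utf8ByteCompare); // U+0063, U+FF42, U+1D5BA
const byCodeUnits = [...samples].sort();            // U+0063, U+1D5BA, U+FF42
console.log(byBytes, byCodeUnits);
```

The byte order matches the codepoint order, while the default code unit order swaps the last two entries.
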
43 | ### Locale-independent compare
44 |
45 | Data is not solely presented to human users. Some data processing involves sorting, and needs to be deterministic and portable. A comparison by Unicode codepoints satisfies those requirements where a locale-based comparison does not.
46 |
47 | ## Proposal
48 |
49 | A `String.codePointCompare(a, b)` function (actual name TBD) that compares two strings by their codepoints. The function can be passed as the comparator argument to `Array.prototype.sort`.
50 |
51 | ### Example
52 |
53 | ```js
54 | const arr = [
55 |   '\u{ff42}', // Fullwidth Latin Small Letter B
56 |   '\u{1d5ba}', // Mathematical Sans-Serif Small A
57 |   '\u{63}', // Latin Small Letter C
58 | ];
59 |
60 | console.log('native compare', [...arr].sort()); // [ 'c', '𝖺', 'ｂ' ]
61 | console.log('locale compare', [...arr].sort((a, b) => a.localeCompare(b))); // [ '𝖺', 'ｂ', 'c' ]
62 | console.log('null locale compare', [...arr].sort(new Intl.Collator('zxx').compare)); // [ '𝖺', 'ｂ', 'c' ]
63 | console.log('codepoint compare', [...arr].sort(String.codePointCompare)); // [ 'c', 'ｂ', '𝖺' ]
64 | ```
65 |
66 | ## Alternatives considered
67 |
68 | ### Manual iteration
69 |
70 | It's possible to write a comparator that manually iterates over the strings, and this is the status quo today. This can be implemented either by manually advancing String iterators in lock-step, or by using indexed access and retrieving the code units.
71 |
72 |
73 | codePointCompare shim
74 |
75 | ```js
76 | function codePointCompare(left, right) {
77 |   const leftIter = left[Symbol.iterator]();
78 |   const rightIter = right[Symbol.iterator]();
79 |   for (;;) {
80 |     const { value: leftChar } = leftIter.next();
81 |     const { value: rightChar } = rightIter.next();
82 |     if (leftChar === undefined && rightChar === undefined) {
83 |       return 0;
84 |     } else if (leftChar === undefined) {
85 |       // left is a prefix of right.
86 |       return -1;
87 |     } else if (rightChar === undefined) {
88 |       // right is a prefix of left.
89 |       return 1;
90 |     }
91 |     const leftCodepoint = leftChar.codePointAt(0);
92 |     const rightCodepoint = rightChar.codePointAt(0);
93 |     if (leftCodepoint < rightCodepoint) return -1;
94 |     if (leftCodepoint > rightCodepoint) return 1;
95 |   }
96 | }
97 | ```
98 |
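The indexed access approach mentioned above could look roughly like the following. This is only a sketch: the name `codePointCompareIndexed` is hypothetical, and it relies on `String.prototype.codePointAt` to combine surrogate pairs rather than decoding raw code units by hand. An unpaired surrogate is compared by its code unit value, which is one possible behavior among others.

```js
// Hypothetical indexed-access variant of a codepoint comparator.
function codePointCompareIndexed(left, right) {
  const limit = Math.min(left.length, right.length);
  let i = 0;
  while (i < limit) {
    // codePointAt combines a surrogate pair starting at index i, and
    // returns the code unit itself for an unpaired surrogate.
    const l = left.codePointAt(i);
    const r = right.codePointAt(i);
    if (l !== r) return l < r ? -1 : 1;
    // Both strings are identical so far: skip 2 code units for a pair.
    i += l > 0xffff ? 2 : 1;
  }
  // One string is a prefix of the other: the shorter sorts first.
  return Math.sign(left.length - right.length);
}

const sorted = ['\u{ff42}', '\u{1d5ba}', '\u{63}'].sort(codePointCompareIndexed);
console.log(sorted); // [ 'c', 'ｂ', '𝖺' ]
```
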
99 |
100 | ### Null locale collation
101 |
102 | There are already alternative comparators in the language, e.g. `String.prototype.localeCompare` or `Intl.Collator.prototype.compare`. While these operate on codepoints, they take into consideration the locale and collapse characters in the same equivalence class. This is also the case for the `zxx` "locale" in the [Stable Formatting proposal](https://github.com/tc39/proposal-stable-formatting).
103 |
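As one illustration of such collapsing, canonically equivalent sequences are distinct at the codepoint level but are generally treated as equal by locale-aware comparators. The sketch below assumes an engine with ICU-backed collation (as shipped in common browsers and Node.js); the variable names are for illustration only:

```js
const precomposed = '\u{e9}';  // U+00E9 Latin Small Letter E with Acute
const decomposed = 'e\u{301}'; // U+0065 followed by U+0301 Combining Acute Accent

// The strings differ by codepoint (and by code unit).
console.log(precomposed === decomposed); // false

// Locale-aware comparators are expected to report them as equal (0).
const viaLocaleCompare = precomposed.localeCompare(decomposed);
const viaCollator = new Intl.Collator().compare(precomposed, decomposed);
console.log(viaLocaleCompare, viaCollator);
```
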
104 | While we could imagine a collation option for a null locale that does not perform any equivalence class or grapheme logic, and simply enables comparing strings by their Unicode codepoints, this seems to [go counter](https://github.com/tc39/proposal-stable-formatting/issues/13) to the core of Intl.
105 |
106 | Furthermore, `Intl` is a normative optional part of the spec, and this would require JS engines that opt out of it to implement enough of `Intl` just to offer a comparison that is agnostic of internationalization concerns.
107 |
108 | ## Q&A
109 |
110 | ### What should the comparator return if it encounters malformed strings?
111 |
112 | An unmatched surrogate in a string could fall back to being compared by its code unit value. There may be alternative behaviors that could be considered.
113 |
114 | ### What about comparison operators like `<` or `>`?
115 |
116 | Because changing their behavior would be a breaking change, they would continue comparing strings lexicographically by code unit. However, a program could instead check the sign of the result of calling the proposed codepoint comparator function with the two strings.
117 |
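A sketch of that pattern follows. Since the proposed function does not exist yet, a local `codePointCompare` stand-in is defined here; it is an assumption for illustration and not the proposed API:

```js
// Hypothetical stand-in for the proposed comparator.
function codePointCompare(a, b) {
  // Array.from iterates strings by codepoint.
  const as = Array.from(a, (ch) => ch.codePointAt(0));
  const bs = Array.from(b, (ch) => ch.codePointAt(0));
  const limit = Math.min(as.length, bs.length);
  for (let i = 0; i < limit; i++) {
    if (as[i] !== bs[i]) return as[i] < bs[i] ? -1 : 1;
  }
  return Math.sign(as.length - bs.length);
}

const a = '\u{ff42}';
const b = '\u{1d5ba}';

console.log(a < b);                      // false: code unit order
console.log(codePointCompare(a, b) < 0); // true: codepoint order
```
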
118 | ### Should we change the default `Array.prototype.sort` comparison?
119 |
120 | As with the operators, this would be a breaking change. Anyone interested in portability can fairly easily provide the comparator function.
121 |
122 | ### Can you provide an example of a system where a portable comparison is needed?
123 |
124 | The Agoric platform implements collections that use a well-defined sort order for keys instead of using insertion order. These collections can be either "heap-only", or backed by a SQLite DB. To provide a consistent iteration order independent of the backing store, the heap implementation needs to use a comparator compatible with the sort order implemented by SQLite.
125 |
126 | ## Presentations
127 |
128 | - [2025-04 plenary](https://docs.google.com/presentation/d/1eTuB1jjgb2_xG_zMNmkhleJx1F0QviMEwkkBUL9ezPQ) ([pdf slides](compare-codepoint-talks/compare-by-codepoint.pdf))
129 |
130 | ## See also
131 |
132 | https://github.com/endojs/endo/pull/2008
133 |
134 | https://github.com/Agoric/agoric-sdk/issues/10335
135 |
136 | https://github.com/Agoric/agoric-sdk/pull/10299
137 |
138 | https://es.discourse.group/t/builtin-ord-compare-method-for-primitives/724/25
139 |
140 |
--------------------------------------------------------------------------------