├── .github └── workflows │ ├── bb.yml │ └── main.yml ├── .gitignore ├── .npmrc ├── logo.svg ├── package.json └── readme.md /.github/workflows/bb.yml: -------------------------------------------------------------------------------- 1 | jobs: 2 | main: 3 | runs-on: ubuntu-latest 4 | steps: 5 | - uses: unifiedjs/beep-boop-beta@main 6 | with: 7 | repo-token: ${{secrets.GITHUB_TOKEN}} 8 | name: bb 9 | on: 10 | issues: 11 | types: [closed, edited, labeled, opened, reopened, unlabeled] 12 | pull_request_target: 13 | types: [closed, edited, labeled, opened, reopened, unlabeled] 14 | -------------------------------------------------------------------------------- /.github/workflows/main.yml: -------------------------------------------------------------------------------- 1 | jobs: 2 | main: 3 | runs-on: ubuntu-latest 4 | steps: 5 | - uses: actions/checkout@v4 6 | - uses: actions/setup-node@v4 7 | with: 8 | node-version: node 9 | - run: npm install 10 | - run: npm test 11 | name: main 12 | on: 13 | - pull_request 14 | - push 15 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | *.log 3 | node_modules/ 4 | -------------------------------------------------------------------------------- /.npmrc: -------------------------------------------------------------------------------- 1 | ignore-scripts=true 2 | package-lock=false 3 | -------------------------------------------------------------------------------- /logo.svg: -------------------------------------------------------------------------------- 1 | 2 | 10 | 11 | 12 | 13 | 14 | -------------------------------------------------------------------------------- /package.json: -------------------------------------------------------------------------------- 1 | { 2 | "author": "Titus Wormer (wooorm.com)", 3 | "bugs": "https://github.com/syntax-tree/nlcst/issues", 4 | "contributors": [ 5 | 
"Eugene Sharygin ", 6 | "Titus Wormer (wooorm.com)" 7 | ], 8 | "description": "natural language concrete syntax tree", 9 | "devDependencies": { 10 | "remark-cli": "^12.0.0", 11 | "remark-preset-wooorm": "^10.0.0" 12 | }, 13 | "keywords": [], 14 | "license": "MIT", 15 | "name": "nlcst", 16 | "private": true, 17 | "remarkConfig": { 18 | "plugins": [ 19 | "remark-preset-wooorm" 20 | ] 21 | }, 22 | "repository": "syntax-tree/nlcst", 23 | "scripts": { 24 | "format": "remark . -qfo", 25 | "test": "npm run format" 26 | }, 27 | "version": "0.0.0" 28 | } 29 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # ![nlcst][logo] 2 | 3 | **N**atural **L**anguage **C**oncrete **S**yntax **T**ree format. 4 | 5 | *** 6 | 7 | **nlcst** is a specification for representing natural language in a [syntax 8 | tree][syntax-tree]. 9 | It implements the **[unist][]** spec. 10 | 11 | This document may not be released. 12 | See [releases][] for released documents. 13 | The latest released version is [`1.0.2`][latest]. 
14 | 15 | ## Contents 16 | 17 | * [Introduction](#introduction) 18 | * [Where this specification fits](#where-this-specification-fits) 19 | * [Types](#types) 20 | * [Nodes (abstract)](#nodes-abstract) 21 | * [`Literal`](#literal) 22 | * [`Parent`](#parent) 23 | * [Nodes](#nodes) 24 | * [`Paragraph`](#paragraph) 25 | * [`Punctuation`](#punctuation) 26 | * [`Root`](#root) 27 | * [`Sentence`](#sentence) 28 | * [`Source`](#source) 29 | * [`Symbol`](#symbol) 30 | * [`Text`](#text) 31 | * [`WhiteSpace`](#whitespace) 32 | * [`Word`](#word) 33 | * [Glossary](#glossary) 34 | * [List of utilities](#list-of-utilities) 35 | * [Related](#related) 36 | * [References](#references) 37 | * [Contribute](#contribute) 38 | * [Acknowledgments](#acknowledgments) 39 | * [License](#license) 40 | 41 | ## Introduction 42 | 43 | This document defines a format for representing natural language as a [concrete 44 | syntax tree][syntax-tree]. 45 | Development of nlcst started in May 2014, 46 | in the now deprecated [textom][] project for [retext][], 47 | before [unist][] existed. 48 | This specification is written in a [Web IDL][webidl]-like grammar. 49 | 50 | ### Where this specification fits 51 | 52 | nlcst extends [unist][], 53 | a format for syntax trees, 54 | to benefit from its [ecosystem of utilities][utilities]. 55 | 56 | nlcst relates to [JavaScript][] in that it has an [ecosystem of 57 | utilities][list-of-utilities] for working with compliant syntax trees in 58 | JavaScript. 59 | However, 60 | nlcst is not limited to JavaScript and can be used in other programming 61 | languages. 62 | 63 | nlcst relates to the [unified][] and [retext][] projects in that nlcst syntax 64 | trees are used throughout their ecosystems. 
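Because nlcst nodes are unist nodes, generic tree-walking code applies to them directly. The following is a minimal sketch in TypeScript, not one of the utilities listed in this document: the `SketchLiteral`, `SketchParent`, and `SketchNode` types and the `collectText` function are hypothetical names made up for this example, and the node types used in the tree are the ones defined later in this document.

```typescript
// Hypothetical minimal shapes for illustration only;
// the published types are in `@types/nlcst`.
interface SketchLiteral {
  type: string
  value: string
}

interface SketchParent {
  type: string
  children: SketchNode[]
}

type SketchNode = SketchLiteral | SketchParent

// A hand-written tree for the text "Hi!".
const tree: SketchNode = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'Hi'}]},
            {type: 'PunctuationNode', value: '!'}
          ]
        }
      ]
    }
  ]
}

// Depth-first walk that concatenates the value of every literal node.
function collectText(node: SketchNode): string {
  if ('children' in node) {
    return node.children.map((child) => collectText(child)).join('')
  }

  return node.value
}

console.log(collectText(tree)) // prints "Hi!"
```

The walk only relies on the `children` and `value` fields, which is why code written against unist in general can also operate on nlcst trees.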
65 | 66 | ## Types 67 | 68 | If you are using TypeScript, 69 | you can use the nlcst types by installing them with npm: 70 | 71 | ```sh 72 | npm install @types/nlcst 73 | ``` 74 | 75 | ## Nodes (abstract) 76 | 77 | ### `Literal` 78 | 79 | ```idl 80 | interface Literal <: UnistLiteral { 81 | value: string 82 | } 83 | ``` 84 | 85 | **Literal** ([**UnistLiteral**][dfn-unist-literal]) represents a node in nlcst 86 | containing a value. 87 | 88 | Its `value` field is a `string`. 89 | 90 | ### `Parent` 91 | 92 | ```idl 93 | interface Parent <: UnistParent { 94 | children: [Paragraph | Punctuation | Sentence | Source | Symbol | Text | WhiteSpace | Word] 95 | } 96 | ``` 97 | 98 | **Parent** ([**UnistParent**][dfn-unist-parent]) represents a node in nlcst 99 | containing other nodes (said to be [*children*][term-child]). 100 | 101 | Its content is limited to only other nlcst content. 102 | 103 | ## Nodes 104 | 105 | ### `Paragraph` 106 | 107 | ```idl 108 | interface Paragraph <: Parent { 109 | type: 'ParagraphNode' 110 | children: [Sentence | Source | WhiteSpace] 111 | } 112 | ``` 113 | 114 | **Paragraph** ([**Parent**][dfn-parent]) represents a unit of discourse dealing 115 | with a particular point or idea. 116 | 117 | **Paragraph** can be used in a [**root**][dfn-root] node. 118 | It can contain [**sentence**][dfn-sentence], 119 | [**whitespace**][dfn-whitespace], 120 | and [**source**][dfn-source] nodes. 121 | 122 | ### `Punctuation` 123 | 124 | ```idl 125 | interface Punctuation <: Literal { 126 | type: 'PunctuationNode' 127 | } 128 | ``` 129 | 130 | **Punctuation** ([**Literal**][dfn-literal]) represents typographical devices 131 | which aid understanding and correct reading of other grammatical units. 132 | 133 | **Punctuation** can be used in [**sentence**][dfn-sentence] or 134 | [**word**][dfn-word] nodes. 
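The content model of **Paragraph** above can be checked mechanically. The following is a minimal sketch in TypeScript, assuming nodes are plain objects; `paragraphChildTypes` and `isValidParagraph` are hypothetical names for illustration and are not part of nlcst or of `nlcst-test`.

```typescript
// Node types allowed as children of a paragraph, taken from the IDL above.
const paragraphChildTypes = new Set<string>([
  'SentenceNode',
  'SourceNode',
  'WhiteSpaceNode'
])

// Hypothetical helper: checks only the `type` of the node and of its direct
// children, it does not validate nested content.
function isValidParagraph(node: {
  type: string
  children?: Array<{type: string}>
}): boolean {
  return (
    node.type === 'ParagraphNode' &&
    Array.isArray(node.children) &&
    node.children.every((child) => paragraphChildTypes.has(child.type))
  )
}

const ok = {
  type: 'ParagraphNode',
  children: [
    {type: 'SentenceNode', children: []},
    {type: 'WhiteSpaceNode', value: '\n'}
  ]
}

// Punctuation belongs in sentence or word nodes, not directly in a paragraph.
const bad = {
  type: 'ParagraphNode',
  children: [{type: 'PunctuationNode', value: '.'}]
}

console.log(isValidParagraph(ok)) // prints true
console.log(isValidParagraph(bad)) // prints false
```

Keeping the allowed types in a set makes it straightforward to write the same kind of check for the other parent nodes by swapping in their `children` lists.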
135 | 136 | ### `Root` 137 | 138 | ```idl 139 | interface Root <: Parent { 140 | type: 'RootNode' 141 | } 142 | ``` 143 | 144 | **Root** ([**Parent**][dfn-parent]) represents a document. 145 | 146 | **Root** can be used as the [*root*][term-root] of a [*tree*][term-tree], 147 | never as a [*child*][term-child]. 148 | Its content model is not limited: 149 | it can contain any nlcst content, 150 | with the restriction that all content must be of the same category. 151 | 152 | ### `Sentence` 153 | 154 | ```idl 155 | interface Sentence <: Parent { 156 | type: 'SentenceNode' 157 | children: [Punctuation | Source | Symbol | WhiteSpace | Word] 158 | } 159 | ``` 160 | 161 | **Sentence** ([**Parent**][dfn-parent]) represents a grouping of grammatically 162 | linked words 163 | that in principle expresses a complete thought, 164 | although it may make little sense taken in isolation out of context. 165 | 166 | **Sentence** can be used in a [**paragraph**][dfn-paragraph] node. 167 | It can contain [**word**][dfn-word], 168 | [**symbol**][dfn-symbol], 169 | [**punctuation**][dfn-punctuation], 170 | [**whitespace**][dfn-whitespace], 171 | and [**source**][dfn-source] nodes. 172 | 173 | ### `Source` 174 | 175 | ```idl 176 | interface Source <: Literal { 177 | type: 'SourceNode' 178 | } 179 | ``` 180 | 181 | **Source** ([**Literal**][dfn-literal]) represents an external (ungrammatical) 182 | value embedded into a grammatical unit: a hyperlink, 183 | code, 184 | and such. 185 | 186 | **Source** can be used in [**root**][dfn-root], 187 | [**paragraph**][dfn-paragraph], 188 | [**sentence**][dfn-sentence], 189 | or [**word**][dfn-word] nodes. 190 | 191 | ### `Symbol` 192 | 193 | ```idl 194 | interface Symbol <: Literal { 195 | type: 'SymbolNode' 196 | } 197 | ``` 198 | 199 | **Symbol** ([**Literal**][dfn-literal]) represents typographical devices 200 | different from characters which represent sounds (like letters and numerals), 201 | white space, 202 | or punctuation. 
203 | 204 | **Symbol** can be used in [**sentence**][dfn-sentence] or [**word**][dfn-word] 205 | nodes. 206 | 207 | ### `Text` 208 | 209 | ```idl 210 | interface Text <: Literal { 211 | type: 'TextNode' 212 | } 213 | ``` 214 | 215 | **Text** ([**Literal**][dfn-literal]) represents actual content in nlcst 216 | documents: one or more characters. 217 | 218 | **Text** can be used in [**word**][dfn-word] nodes. 219 | 220 | ### `WhiteSpace` 221 | 222 | ```idl 223 | interface WhiteSpace <: Literal { 224 | type: 'WhiteSpaceNode' 225 | } 226 | ``` 227 | 228 | **WhiteSpace** ([**Literal**][dfn-literal]) represents typographical devices 229 | devoid of content, 230 | separating other units. 231 | 232 | **WhiteSpace** can be used in [**root**][dfn-root], 233 | [**paragraph**][dfn-paragraph], 234 | or [**sentence**][dfn-sentence] nodes. 235 | 236 | ### `Word` 237 | 238 | ```idl 239 | interface Word <: Parent { 240 | type: 'WordNode' 241 | children: [Punctuation | Source | Symbol | Text] 242 | } 243 | ``` 244 | 245 | **Word** ([**Parent**][dfn-parent]) represents the smallest element that may be 246 | uttered in isolation with semantic or pragmatic content. 247 | 248 | **Word** can be used in a [**sentence**][dfn-sentence] node. 249 | It can contain [**text**][dfn-text], 250 | [**symbol**][dfn-symbol], 251 | [**punctuation**][dfn-punctuation], 252 | and [**source**][dfn-source] nodes. 253 | 254 | ## Glossary 255 | 256 | See the [unist glossary][glossary]. 257 | 258 | ## List of utilities 259 | 260 | See the [unist list of utilities][utilities] for more utilities. 
261 | 262 | * [`nlcst-affix-emoticon-modifier`](https://github.com/syntax-tree/nlcst-affix-emoticon-modifier) 263 | — merge affix emoticons into the previous sentence 264 | * [`nlcst-emoji-modifier`](https://github.com/syntax-tree/nlcst-emoji-modifier) 265 | — support emoji 266 | * [`nlcst-emoticon-modifier`](https://github.com/syntax-tree/nlcst-emoticon-modifier) 267 | — support emoticons 268 | * [`nlcst-is-literal`](https://github.com/syntax-tree/nlcst-is-literal) 269 | — check whether a node is meant literally 270 | * [`nlcst-normalize`](https://github.com/syntax-tree/nlcst-normalize) 271 | — normalize a word for easier comparison 272 | * [`nlcst-search`](https://github.com/syntax-tree/nlcst-search) 273 | — search for patterns 274 | * [`nlcst-to-string`](https://github.com/syntax-tree/nlcst-to-string) 275 | — serialize a node 276 | * [`nlcst-test`](https://github.com/syntax-tree/nlcst-test) 277 | — validate a node 278 | * [`mdast-util-to-nlcst`](https://github.com/syntax-tree/mdast-util-to-nlcst) 279 | — transform mdast to nlcst 280 | * [`hast-util-to-nlcst`](https://github.com/syntax-tree/hast-util-to-nlcst) 281 | — transform hast to nlcst 282 | 283 | ## Related 284 | 285 | * [mdast](https://github.com/syntax-tree/mdast) 286 | — Markdown Abstract Syntax Tree format 287 | * [hast](https://github.com/syntax-tree/hast) 288 | — Hypertext Abstract Syntax Tree format 289 | * [xast](https://github.com/syntax-tree/xast) 290 | — Extensible Abstract Syntax Tree 291 | 292 | ## References 293 | 294 | * **unist**: 295 | [Universal Syntax Tree][unist]. 296 | T. Wormer; et al. 297 | * **JavaScript**: 298 | [ECMAScript Language Specification][javascript]. 299 | Ecma International. 300 | * **Web IDL**: 301 | [Web IDL][webidl], 302 | C. McCormack. 303 | W3C. 304 | 305 | ## Contribute 306 | 307 | See [`contributing.md`][contributing] in [`syntax-tree/.github`][health] for 308 | ways to get started. 309 | See [`support.md`][support] for ways to get help. 
310 | Ideas for new utilities and tools can be posted in [`syntax-tree/ideas`][ideas]. 311 | 312 | A curated list of awesome syntax-tree, 313 | unist, 314 | mdast, 315 | hast, 316 | xast, 317 | and nlcst resources can be found in [awesome syntax-tree][awesome]. 318 | 319 | This project has a [code of conduct][coc]. 320 | By interacting with this repository, 321 | organization, 322 | or community you agree to abide by its terms. 323 | 324 | ## Acknowledgments 325 | 326 | The initial release of this project was authored by 327 | [**@wooorm**](https://github.com/wooorm). 328 | 329 | Thanks to 330 | [**@nwtn**](https://github.com/nwtn), 331 | [**@tmcw**](https://github.com/tmcw), 332 | [**@muraken720**](https://github.com/muraken720), 333 | and [**@dozoisch**](https://github.com/dozoisch) 334 | for contributing to nlcst and related projects! 335 | 336 | ## License 337 | 338 | [CC-BY-4.0][license] © [Titus Wormer][author] 339 | 340 | 341 | 342 | [license]: https://creativecommons.org/licenses/by/4.0/ 343 | 344 | [author]: https://wooorm.com 345 | 346 | [logo]: https://raw.githubusercontent.com/syntax-tree/nlcst/a89561d/logo.svg?sanitize=true 347 | 348 | [health]: https://github.com/syntax-tree/.github 349 | 350 | [contributing]: https://github.com/syntax-tree/.github/blob/HEAD/contributing.md 351 | 352 | [support]: https://github.com/syntax-tree/.github/blob/HEAD/support.md 353 | 354 | [coc]: https://github.com/syntax-tree/.github/blob/HEAD/code-of-conduct.md 355 | 356 | [awesome]: https://github.com/syntax-tree/awesome-syntax-tree 357 | 358 | [ideas]: https://github.com/syntax-tree/ideas 359 | 360 | [releases]: https://github.com/syntax-tree/nlcst/releases 361 | 362 | [latest]: https://github.com/syntax-tree/nlcst/releases/tag/1.0.2 363 | 364 | [list-of-utilities]: #list-of-utilities 365 | 366 | [dfn-unist-parent]: https://github.com/syntax-tree/unist#parent 367 | 368 | [dfn-unist-literal]: https://github.com/syntax-tree/unist#literal 369 | 370 | [dfn-parent]: #parent 
371 | 372 | [dfn-literal]: #literal 373 | 374 | [dfn-root]: #root 375 | 376 | [dfn-paragraph]: #paragraph 377 | 378 | [dfn-sentence]: #sentence 379 | 380 | [dfn-word]: #word 381 | 382 | [dfn-symbol]: #symbol 383 | 384 | [dfn-punctuation]: #punctuation 385 | 386 | [dfn-whitespace]: #whitespace 387 | 388 | [dfn-text]: #text 389 | 390 | [dfn-source]: #source 391 | 392 | [term-tree]: https://github.com/syntax-tree/unist#tree 393 | 394 | [term-child]: https://github.com/syntax-tree/unist#child 395 | 396 | [term-root]: https://github.com/syntax-tree/unist#root 397 | 398 | [unist]: https://github.com/syntax-tree/unist 399 | 400 | [syntax-tree]: https://github.com/syntax-tree/unist#syntax-tree 401 | 402 | [javascript]: https://www.ecma-international.org/ecma-262/9.0/index.html 403 | 404 | [webidl]: https://heycam.github.io/webidl/ 405 | 406 | [glossary]: https://github.com/syntax-tree/unist#glossary 407 | 408 | [utilities]: https://github.com/syntax-tree/unist#list-of-utilities 409 | 410 | [unified]: https://github.com/unifiedjs/unified 411 | 412 | [retext]: https://github.com/retextjs/retext 413 | 414 | [textom]: https://github.com/wooorm/textom 415 | --------------------------------------------------------------------------------