edn
===

extensible data notation [eed-n]

# Rationale

**edn** is an extensible data notation. A superset of **edn** is used by Clojure to represent
programs, and it is used by Datomic and other applications as a data transfer format. This spec
describes **edn** in isolation from those and other specific use cases, to help facilitate
implementation of readers and writers in other languages, and for other uses.

**edn** supports a rich set of built-in elements, and the definition of extension elements in terms
of the others. Users of data formats without such facilities must rely on either convention or
context to convey elements not included in the base set. This greatly complicates application
logic, betraying the apparent simplicity of the format. **edn** is simple, yet powerful enough to
meet the demands of applications without convention or complex context-sensitive logic.

**edn** is a system for the conveyance of _values_. It is not a type system, and has no schemas.
Nor is it a system for representing objects - there are no reference types, nor should a consumer
have an expectation that two equivalent elements in some body of **edn** will yield distinct object
identities when read, unless a reader implementation goes out of its way to make such a promise.
Thus the resulting values should be considered immutable, and a reader implementation should yield
values that ensure this, to the extent possible.

**edn** is a set of definitions for acceptable _elements_. A use of **edn** might be a stream or
file containing elements, but it could be as small as the conveyance of a single element in e.g. an
HTTP query param.

There is no enclosing element at the top level.
Thus **edn** is suitable for streaming and
interactive applications.

The base set of elements in **edn** is meant to cover the basic set of data structures common to
most programming languages. While **edn** specifies how those elements are formatted in text, it
does not dictate the representation that results on the consumer side. A well-behaved reader
library should endeavor to map the elements to programming language types with similar semantics.

# Spec

Currently this specification is casual, as we gather feedback from implementors. A more rigorous
definition, e.g. a BNF grammar, will follow.

## General considerations

**edn** elements, streams and files should be encoded using [UTF-8](http://en.wikipedia.org/wiki/UTF-8).

Elements are generally separated by whitespace. Whitespace, other than within strings, is not
otherwise significant, nor need redundant whitespace be preserved during transmissions. Commas `,`
are also considered whitespace, other than within strings.

The delimiters `{ } ( ) [ ]` need not be separated from adjacent elements by whitespace.

### # dispatch character

Tokens beginning with `#` are reserved. The character following `#` determines the behavior. The
dispatches `#{` (sets), `#_` (discard), and `#` followed by an alphabetic character (tags) are
defined below. `#` is not a delimiter.

## Built-in elements

### nil

`nil` represents nil, null or nothing. It should be read as an object with similar meaning on the
target platform.

### booleans

`true` and `false` should be mapped to booleans.

If a platform has canonic values for true and false, it is a further semantic of booleans that all
instances of `true` yield that (identical) value, and similarly for `false`.

### strings

Strings are enclosed in `"double quotes"`. They may span multiple lines.
Standard C/Java escape
characters `\t, \r, \n, \\ and \"` are supported.

### characters

Characters are preceded by a backslash: `\c`, `\newline`, `\return`, `\space` and `\tab` yield the
corresponding characters. Unicode characters are represented with `\uNNNN` as in Java. Backslash
cannot be followed by whitespace.

### symbols

Symbols are used to represent identifiers, and should map to something other than strings, if
possible.

Symbols begin with a non-numeric character and can contain alphanumeric characters and
`. * + ! - _ ? $ % & = < >`. If `-`, `+` or `.` are the first character, the second character (if
any) must be non-numeric. Additionally, `: #` are allowed as constituent characters in symbols
other than as the first character.

`/` has special meaning in symbols. It can be used once only in the middle of a symbol to separate
the _prefix_ (often a namespace) from the _name_, e.g. `my-namespace/foo`. `/` by itself is a legal
symbol, but otherwise neither the _prefix_ nor the _name_ part can be empty when the symbol
contains `/`.

If a symbol has a _prefix_ and `/`, the following _name_ component should follow the
first-character restrictions for symbols as a whole. This is to avoid ambiguity in reading contexts
where prefixes might be presumed as implicitly included namespaces and elided thereafter.

### keywords

Keywords are identifiers that typically designate themselves. They are semantically akin to
enumeration values. Keywords follow the rules of symbols, except they can (and must) begin with
`:`, e.g. `:fred` or `:my/fred`. If the target platform does not have a keyword type distinct
from a symbol type, the same type can be used without conflict, since the mandatory leading `:` of
keywords is disallowed for symbols.
Per the symbol rules above, `:/` and `:/anything` are not legal keywords.
A keyword cannot begin with `::`.

If the target platform supports some notion of interning, it is a further semantic of keywords that
all instances of the same keyword yield the identical object.

### integers

Integers consist of the digits `0` - `9`, optionally prefixed by `-` to indicate a negative number, or
(redundantly) by `+`. No integer other than 0 may begin with 0. 64-bit (signed integer) precision is
expected. An integer can have the suffix `N` to indicate that arbitrary precision is desired. `-0` is a
valid integer not distinct from 0.

    integer
      int
      int N
    digit
      0-9
    int
      digit
      1-9 digits
      + digit
      + 1-9 digits
      - digit
      - 1-9 digits

### floating point numbers

64-bit (double) precision is expected.

    floating-point-number
      int M
      int frac
      int exp
      int frac exp
    digit
      0-9
    int
      digit
      1-9 digits
      + digit
      + 1-9 digits
      - digit
      - 1-9 digits
    frac
      . digits
    exp
      ex digits
    digits
      digit
      digit digits
    ex
      e
      e+
      e-
      E
      E+
      E-

In addition, a floating-point number may have the suffix `M` to indicate that exact precision is
desired.

### lists

A list is a sequence of values. Lists are represented by zero or more elements enclosed in
parentheses `()`. Note that lists can be heterogeneous.

    (a b 42)

### vectors

A vector is a sequence of values that supports random access. Vectors are represented by zero or
more elements enclosed in square brackets `[]`. Note that vectors can be heterogeneous.

    [a b 42]

### maps

A map is a collection of associations between keys and values.
Maps are represented by zero or more
key and value pairs enclosed in curly braces `{}`. Each key should appear at most once. No
semantics should be associated with the order in which the pairs appear.

    {:a 1, "foo" :bar, [1 2 3] four}

Note that keys and values can be elements of any type. The use of commas above is optional, as they
are parsed as whitespace.

### sets

A set is a collection of unique values. Sets are represented by zero or more elements enclosed in
curly braces preceded by `#`, i.e. `#{}`. No semantics should be associated with the order in which
the elements appear. Note that sets can be heterogeneous.

    #{a b [1 2 3]}

## tagged elements

**edn** supports extensibility through a simple mechanism. `#` followed immediately by a symbol
starting with an alphabetic character indicates that _that symbol_ is a **_tag_**. A tag indicates
the semantic interpretation of _the following element_. It is envisioned that a reader
implementation will allow clients to register handlers for specific tags. Upon encountering a tag,
the reader will first read the next element (which may itself be or comprise other tagged elements),
then pass the result to the corresponding handler for further interpretation, and the result of the
handler will be the data value yielded by the tag + tagged element, i.e. reading a tag and tagged
element yields one value. This value is the value to be returned to the program and is not further
interpreted as **edn** data by the reader.

This process will bottom out on elements either understood or built-in.

Thus you can build new distinct readable elements out of (and only out of) other readable elements,
keeping extenders and extension consumers out of the text business.
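
The inside-out handler process described above can be sketched in Python. This is only an illustration under stated assumptions, not part of **edn**: the `Tagged` node, the `resolve` function, the `myapp/Upper` tag and both handlers are hypothetical, keywords are simplified to plain strings, and a tag without a handler is simply reported as an error.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Tagged:
    """Hypothetical parse-tree node: a tag symbol plus the element that follows it."""
    tag: str
    element: Any


Handler = Callable[[Any], Any]


def resolve(node: Any, handlers: Dict[str, Handler]) -> Any:
    """Resolve tagged elements inside-out: read the element first, then call the handler."""
    if isinstance(node, Tagged):
        # The tagged element may itself contain tagged elements, so resolve it first.
        inner = resolve(node.element, handlers)
        if node.tag not in handlers:
            # Assumption: unknown tags are an error in this sketch.
            raise ValueError(f"no handler registered for tag #{node.tag}")
        # The handler's result is the single value yielded by tag + element.
        return handlers[node.tag](inner)
    if isinstance(node, list):
        return [resolve(item, handlers) for item in node]
    if isinstance(node, dict):
        return {resolve(k, handlers): resolve(v, handlers) for k, v in node.items()}
    return node  # built-in scalar: the process bottoms out here


# Hypothetical handlers registered by a client of the reader.
handlers: Dict[str, Handler] = {
    "myapp/Person": lambda m: (m[":first"], m[":last"]),
    "myapp/Upper": lambda s: s.upper(),
}

# Stands in for: #myapp/Person {:first #myapp/Upper "fred" :last "Mertz"}
tree = Tagged("myapp/Person", {":first": Tagged("myapp/Upper", "fred"), ":last": "Mertz"})
person = resolve(tree, handlers)
```

The value returned by a handler is handed back as-is and is not interpreted again, which mirrors the rule that a tag and its tagged element together yield one value.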

The semantics of a tag, and the type and interpretation of the tagged element are defined by the
steward of the tag.

    #myapp/Person {:first "Fred" :last "Mertz"}

If a reader encounters a tag for which no handler is registered, the implementation can either
report an error, call a designated 'unknown element' handler, or create a well-known generic
representation that contains both the tag and the tagged element, as it sees fit. Note that the
non-error strategies allow for readers which are capable of reading any and all **edn**, in spite
of being unaware of the details of any extensions present.

### rules for tags

Tag symbols without a prefix are reserved by **edn** for built-ins defined using the tag system.

User tags _**must**_ contain a prefix component, which must be owned by the user (e.g. trademark or
domain) or known unique in the communication context.

A tag _may_ specify more than one format for the tagged element, e.g. both a string and a vector
representation.

Tags themselves are not elements. It is an error to have a tag without a corresponding tagged
element.

## built-in tagged elements

### #inst "rfc-3339-format"

An instant in time. The tagged element is a string in
[RFC-3339](http://www.ietf.org/rfc/rfc3339.txt) format.

`#inst "1985-04-12T23:20:50.52Z"`

### #uuid "f81d4fae-7dec-11d0-a765-00a0c91e6bf6"

A [UUID](http://en.wikipedia.org/wiki/Universally_unique_identifier). The tagged element is a
canonical UUID string representation.

## comments

If a `;` character is encountered outside of a string, that character and all subsequent characters
to the next newline should be ignored.
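
As a rough illustration of the comment rule, the Python sketch below removes comments from a piece of **edn** text before it is parsed. The function name `strip_comments` is hypothetical, and the handling of character literals such as `\;` (kept verbatim rather than treated as a comment start) is an assumption of this sketch, following from the rule that a backslash introduces a character.

```python
def strip_comments(text: str) -> str:
    """Drop everything from a ';' outside a string up to (not including) the next newline."""
    out = []
    in_string = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Escape sequence inside a string: keep the escaped character as-is.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == "\\" and i + 1 < n:
            # Assumption: a character literal such as \; or \" is kept verbatim.
            out.append(ch)
            out.append(text[i + 1])
            i += 2
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif ch == ";":
            # Comment: skip to the next newline, which is itself preserved.
            while i < n and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A real reader would more likely skip comments while tokenizing instead of in a separate pass; the separate pass is used here only to keep the example short.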

## discard

`#` followed immediately by `_` is the discard sequence, indicating that the next element (whether
separated from `#_` by whitespace or not) should be read and discarded. Note that the next element
must still be a readable element. A reader should not call user-supplied tag handlers during the
processing of the element to be discarded.

    [a b #_foo 42] => [a b 42]

The discard sequence is not an element. It is an error to have a discard sequence without a
following element.

## equality

Sets and maps have requirements that their elements and keys respectively be unique, which requires
a mechanism for determining when two values are not unique (i.e. are equal).

nil, booleans, strings, characters, and symbols are equal to values of the same type with the same
**edn** representation.

integers and floating point numbers should be considered equal to values only of the same
magnitude, _type, and precision_. Commingling numeric types and precision in map/set key/elements,
or constituents therein, is not advised.

sequences (lists and vectors) are equal to other sequences whose count of elements is the same, and
for which each corresponding pair of elements (by ordinal) is equal.

sets are equal if they have the same count of elements and, for every element in one set, an equal
element is in the other.

maps are equal if they have the same number of entries, and for every key/value entry in one map an
equal key is present and mapped to an equal value in the other.

tagged elements must define their own equality semantics. `#uuid` elements are equal if their canonic
representations are equal. `#inst` elements are equal if their representation strings designate the
same timestamp per [RFC-3339](http://www.ietf.org/rfc/rfc3339.txt).
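
The equality rules for built-in elements can be sketched as a single Python predicate. Everything here is an assumption made for illustration: edn lists and vectors are mapped to Python tuples and lists, sets and maps to the hypothetical wrappers `EdnSet` and `EdnMap`, keywords to plain strings, and the `N` and `M` precision suffixes, characters, symbols and tagged elements are not modeled.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class EdnSet:
    """Hypothetical container for a read set; items are assumed already unique."""
    items: Tuple[Any, ...]


@dataclass(frozen=True)
class EdnMap:
    """Hypothetical container for a read map, stored as (key, value) pairs."""
    entries: Tuple[Tuple[Any, Any], ...]


def edn_equal(a: Any, b: Any) -> bool:
    # Sequences: a list and a vector are equal when counts match and elements pair up.
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(edn_equal(x, y) for x, y in zip(a, b))
    # Sets: same count, and every element of one has an equal element in the other.
    if isinstance(a, EdnSet) and isinstance(b, EdnSet):
        return len(a.items) == len(b.items) and all(
            any(edn_equal(x, y) for y in b.items) for x in a.items
        )
    # Maps: same number of entries, and every key maps to an equal value in the other.
    if isinstance(a, EdnMap) and isinstance(b, EdnMap):
        return len(a.entries) == len(b.entries) and all(
            any(edn_equal(k1, k2) and edn_equal(v1, v2) for k2, v2 in b.entries)
            for k1, v1 in a.entries
        )
    # Scalars: require the same type, so the integer 1 never equals the float 1.0,
    # and a boolean never equals a number.
    return type(a) is type(b) and a == b
```

The explicit type check matters in Python because `1 == 1.0` and `True == 1` both hold there, whereas the rules above treat such values as unequal.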