├── .gitignore ├── LICENSE ├── README.md ├── smile-design-goals.md ├── smile-faq.md └── smile-specification.md /.gitignore: -------------------------------------------------------------------------------- 1 | # use glob syntax. 2 | syntax: glob 3 | 4 | *.class 5 | *~ 6 | *.bak 7 | *.off 8 | *.old 9 | .DS_Store 10 | 11 | # building 12 | target 13 | 14 | # Eclipse 15 | .classpath 16 | .project 17 | .settings 18 | 19 | # IDEA 20 | .idea 21 | *.iml 22 | *.ipr 23 | *.iws 24 | /target 25 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 2-Clause License 2 | 3 | Copyright (c) 2017, FasterXML, LLC 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 17 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 18 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 19 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 20 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 21 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 22 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 23 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 24 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 25 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Smile Data Format 2 | 3 | "Smile" is a binary data format that defines a binary equivalent of standard 4 | [JSON](http://en.wikipedia.org/wiki/JSON) data format (*). 5 | 6 | Format was specified in 2010 by [Jackson](../../../jackson) JSON processor development team. 7 | First compliant implementation was included as a Jackson backend for Jackson version 1.6, 8 | released in September 2010. 9 | 10 | (*) with following exceptions 11 | 12 | * Number magnitude and precision are limited by length-indicators: so 13 | while for most practical purposes limits are never reached, there are theoretical limits. Specifically, "Big Integers" and "Big Decimals" (matching Java `java.math.BigInteger` and `java.math.BigDecimal`) are limited to encoded byte-lengths representable by 32-bit positive integers (about 2 GB). 14 | 15 | ## Specification 16 | 17 | Design documentation includes: 18 | 19 | * [Smile Format Specification](smile-specification.md) describes format itself; how it works and what a compliant parser/generator implementations needs to do. 20 | * [Smile Format Design Goals](smile-design-goals.md) explains rationale for design decisions concerning specification. 
21 | 22 | ## Community 23 | 24 | * [Smile format Google group](http://groups.google.com/group/smile-format-discussion) 25 | 26 | ## Documentation 27 | 28 | * [SmileFAQ](smile-faq.md) 29 | * [Smile Wikipedia entry](https://en.wikipedia.org/wiki/Smile_(data_interchange_format)) 30 | * User documentation: 31 | * [Understanding Smile data format](https://medium.com/code-with-ayush/understanding-smile-a-data-format-based-on-json-29972a37d376) 32 | 33 | ## Implementations 34 | 35 | Smile Codecs 36 | 37 | * [Clojure](http://clojure.org) 38 | * [Cheshire](https://github.com/dakrone/cheshire) library offers support via Jackson `jackson-dataformat-smile` 39 | * C 40 | * [libsmile](https://github.com/pierre/libsmile) is a small C-library for reading and writing Smile data. 41 | * Go 42 | * [go-smile](https://github.com/zencoder/go-smile) Smile decoder written in Golang. 43 | * Java 44 | * [Jackson](../../../jackson) provides Smile support through [jackson-dataformat-smile](../../../jackson-dataformats-binary) modules) format codec 45 | * Full support: including streaming access, data binding and tree model (100% parity with textual JSON) 46 | * [Jackson 2.9](https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9) added "non-blocking" ((aka "asynchronous") decoding for JSON and Smile format backends 47 | * [Protostuff](http://github.com/protostuff/protostuff) project supports Smile both as a low-level data format, and as format used for its RPC implementation 48 | * Javascript 49 | * [smile-js](https://github.com/ngyewch/smile-js) Smile decoder written in Javascript (only decodes (reads), does not encode (write)) 50 | * Kotlin Multiplatform 51 | * [kotlinx-serialization-smile](https://github.com/vooft/kotlinx-serialization-smile) pure Kotlin Multiplatform implementation for `kotlinx-serialization`, supports JVM, Native, JS, etc 52 | * Python 53 | * [NewSmile](https://pypi.org/project/newsmile/) Another Smile Format Decoder/Encoder for Python 3 54 | * 
[PySmile](https://github.com/jhosmer/PySmile) Python codec 55 | * Rust 56 | * [serde-smile](https://github.com/sfackler/serde-smile) Serde serializer and deserializer written in Rust. 57 | 58 | Frameworks, Systems that use Smile codec (encoder and decoder) 59 | 60 | * [Elastic Search](http://www.elastic.co) uses Smile as transport format supports access using Smile encoding. 61 | * [Apache Solr](http://lucene.apache.org/solr) can use Smile as the response format with the `wt=smile` parameter. 62 | 63 | ## Related Publications 64 | 65 | Here are some external articles, blog posts, research papers that may be of interest: 66 | 67 | * [A Survey of JSON-compatible Binary Serialization Specifications](https://arxiv.org/abs/2201.02089) 68 | * [A Benchmark of JSON-compatible Binary Serialization Specifications](https://arxiv.org/abs/2201.03051) 69 | -------------------------------------------------------------------------------- /smile-design-goals.md: -------------------------------------------------------------------------------- 1 | # Smile Format: design goals 2 | 3 | Since there is no standard binary representation of JSON data model -- not even real proposals (closest thing is 4 | [BSON](http://bsonspec.org/), which despite its name, is NOT fully JSON compatible) -- 5 | [Jackson](../../../jackson) JSON processor team decided to specify a format that is just that. 6 | Work started in 2010, and an initial version was completed by the same year, as well as first compliant parser and 7 | generator implementation as part of Jackson 1.6 release. 8 | 9 | This document covers design rationale for the data format; format itself is explained on 10 | [Smile Format specification](smile-specification.md) 11 | 12 | ## 0. 
Background 13 | 14 | One of opportunities for improving both space- and time efficiency of data transfer with JSON-based tools would be defining a binary format that has same logical information model -- similar to various binary XML efforts, compared to canonical textual XML -- and that could thereby be accessed through existing JSON APIs, offering equivalent set of functionality. 15 | 16 | ## 1. Goals 17 | 18 | Design goals for the desired format are: 19 | 20 | * MUST support all logical JSON events; should not add extensions beyond what existing Jackson API offers. 21 | * SHOULD be reasonably space-efficient (compact); to reduce "fluff"; especially to degree this can help processing efficiency. 22 | * SHOULD be both efficient to read (deserialize) AND write (serialize): many binary formats make significant sacrifices in write speed. 23 | * SHOULD avoid overly complex structures and algorithms 24 | * SHOULD avoid fragility: 25 | * Keep enough redundancy to allow for some self-consistency checks; specifically, there should be byte sequences that are invalid, to make detection of invalid content possible. 26 | * Add explicit format version number (nothing drastic, even a nibble will do) 27 | * MUST support auto-detection: Use a unique header ("magic cookie") 28 | * SHOULD allow simple concatenation of content; that is, concatenation of valid properly nested sequences (that is, no open start object or start array markers) must be legal format in itself 29 | * Aimed to allow chunked content output 30 | * SHOULD allow proper streaming: that is, amount of required buffering can not exceed some fixed constant, size of which is related to low-level buffering, not to length of content to encoded. 
Note, however, that in cases where Jackson API itself imposes limits (case for embedded binary data; as well as for String values to output), implementation can make use of this existing limitation 31 | * This specifically prevents pervasive use of length-prefix for Strings: to know byte-length of a Java String, one would need to either do additional passes (first to calculate length, second to encode), or to buffer encoded output in memory. 32 | * SHOULD support simple framing through the use of byte 0xFF (end marker) -- ability to separate binary-encoded content segments from each other AND efficiently scan to find these boundaries _without_ having to parse/decode contents. Framing is easy to do with textual JSON by using something as simple as linefeed, since linefeeds are always quoted in String values, as long as no indentation is used. 33 | * Means that effort should be made to avoid use of end marker byte when encoding content 34 | 35 | ## 2. Non-goals 36 | 37 | To further help design the format, following things were explicitly considered to be non-goals: 38 | 39 | * SHOULD NOT over-optimize for minimal compactness: that's what compression can be used for. 40 | * SHOULD NOT extend model beyond basic JSON data types; with exception of natively supporting binary content (which for textual JSON is supported using Base64 encoded text). 41 | * NEED NOT design for random-access: low-level Jackson APIs are designed for sequential access; can implement "low-hanging fruits" if that makes sense. 42 | * NEED NOT support extensive configurability: only settings that are considered obviously useful (or requested) should be added. 43 | 44 | Guidelines derived from above: 45 | 46 | * If goals of space-efficiency (compactness) and time-efficiency (processing speed) conflict, prefer time-efficiency over space-efficiency. 
47 | -------------------------------------------------------------------------------- /smile-faq.md: -------------------------------------------------------------------------------- 1 | # Smile Format: FAQ 2 | 3 | ## Where did the name come from? 4 | 5 | Name came from the half-serious idea of using a smiley as the "magic cookie", part of short 6 | header that can be used to detect Smile format encoded documents. 7 | 8 | ## What is the purpose of "Safe [binary] Encoding"? 9 | 10 | The main idea is to allow use of specific marker, `0xFF` for framing 11 | purpose -- that is, as a document separator, to allow "blind" 12 | splitting of a stream of Smile-encoded documents into chunks of 13 | documents. 14 | "Blind" here means that splitter can start at ANY byte position, 15 | looking for split marker, without having to consider document 16 | structure or token boundaries: it just "blindly" focuses on finding 17 | next (or, previous) `0xFF` and knows that MUST be a document boundary. 18 | 19 | Currently (version 1.0) byte value of `0xFE` is also avoided if (but 20 | only if!) using Safe Binary Encoding setting. 21 | 22 | Note: if not writing Binary data at all, this setting has no 23 | relevance (and byte values of `0xFF` and `0xFE` are not used at all either). 24 | -------------------------------------------------------------------------------- /smile-specification.md: -------------------------------------------------------------------------------- 1 | # Efficient JSON-compatible binary format: "Smile" 2 | 3 | "Smile" is an efficient JSON-compatible binary data format, initially developed by Jackson JSON processor project team. 4 | Its logical data model is same as that of JSON, so it can be considered a "Binary JSON" format. 5 | 6 | For design on "Project Smile", which implements support for this format, see 7 | [Smile format Design goals](smile-design-goals.md). 
8 | 9 | This page covers the current data format specification, which is planned to eventually be standardized through a formal process (most likely as an IETF RFC). 10 | 11 | ## Document version 12 | 13 | ### Version history 14 | 15 | * Current: 1.0.6 (18-Apr-2025) 16 | * Previous: 1.0.5 (26-Jan-2022) 17 | * First official: 1.0.0 (12-Sep-2010) 18 | * First draft: (24-Jun-2010) 19 | 20 | ### Update history 21 | 22 | * 2025-05-18: Explained existing (implemented by Jackson codec) but previously undocumented requirement to skip Shared String/Key Name references ending with `0xFE` / `0xFF` bytes. 23 | * Version 1.0.5 -> 1.0.6 24 | * 2022-02-20: Clarify handling of "unused" bits (see issue #17) primarily regarding encoding of floating-point numbers, but more generally for all unused bits. 25 | * 2022-01-26: Important fix to encoding of 7-bit encoded (safe) binary, wrt padding of the last byte. 26 | * Version 1.0.4 -> 1.0.5 27 | * 2021-03-18: Minor markup fixes, clarification to "simple literals, numbers" section 28 | * 2020-11-19: Added notes on length indicators for "Tiny" and "Short" String values (ASCII and Unicode) (contributed by @jviotti) 29 | * 2020-11-16: Replacing accidental "Small ASCII" and "Small Unicode" references to canonical "Short ASCII" and "Short Unicode" (contributed by @jviotti) 30 | * 2020-10-24: Minor clarification on encoding of 32-bit IEEE floating point value 31 | * 2016-02-23: Minor clarification on acceptable lengths for long non-shared names (different minimum for ascii, non-ascii) 32 | * 2014-05-21: Fixed a typo: end-of-string marker is `0xFC` and NOT `0xFE` as stated in one place. 33 | * 2014-03-28: Added improvement ideas for possible chunked variants of binary data, Strings. 34 | * 2013-05-12: Important document fix: value codes `0xE8` and `0xEC` were mixed in document (Java and C codecs implement correctly) -- bumped version to 1.0.4 to signify this clarification. 
(big thanks to `gbooker@github` for pointing out this discrepancy) 35 | * 2013-03-06: Fixed a minor flaw in description of Zigzag encoding (multiply by two, not one) 36 | * 2012-02-21: Added description of zigzag-encoding for VInts 37 | * 2011-11-29: Clarified that values `0xF0` - `0xF7` are "reserved for future use" in value mode (but used in key mode) 38 | * 2011-07-27: Fixed 2 erroneous references to "5 MSB" which should be "5 LSB" (thanks Pierre!) 39 | * 2011-07-15: Formatting improvements; add a note about recommended MIME Type for Smile encoded data 40 | * 2011-02-16: Fix a minor typo in section 2.3 ("Short Unicode names"); range `0xF8` - `0xFE` is NOT used for encoding 41 | * 2010-09-12: Mark as 1.0.0; add some future plan notes. 42 | * 2010-08-26: Rearrange token byte values to make better use of `0xF8` - `0xFF` area for framing. 43 | * 2010-08-20: Remove references to initially planned in-frame compression; add one more header flag, extend version info to 4 bits 44 | 45 | ## External Considerations 46 | 47 | ### MIME Type 48 | 49 | There is no formal or official MIME type registered for Smile content, but the current best practice (as of July 2011) is to use: 50 | 51 | application/x-jackson-smile 52 | 53 | since this is used by multiple existing projects. 54 | 55 | ## High-level format 56 | 57 | At a high level, content encoded using this format consists of a simple sequence of sections, each of which consists of: 58 | 59 | * A 4-byte header (described below) that can be used to identify content that uses the format, as well as its version and any per-section configuration settings there may be. 60 | * Sequence (0 to N) of tokens that are properly nested (all start-object/start-array tokens are matched with equivalent close tokens) within sequence. 61 | * Optional end marker, `0xFF`, can be used: if encountered, it is treated the same as end-of-stream. This is added as a convenience feature to help with framing. 
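
The framing role of the optional `0xFF` end marker can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `split_frames` is hypothetical, and the input is assumed to contain no raw binary data, since raw binary may contain any byte value, including `0xFF`.

```python
END_MARKER = 0xFF  # optional end-of-content marker


def split_frames(stream: bytes) -> list[bytes]:
    """Split concatenated Smile sections on the 0xFF end marker.

    Assumption: no raw (unescaped) binary data is present, so 0xFF can
    only ever appear as an end marker.
    """
    frames = []
    start = 0
    for pos, byte in enumerate(stream):
        if byte == END_MARKER:
            frames.append(stream[start:pos])
            start = pos + 1
    if start < len(stream):
        # Trailing content without an end marker ends at end-of-stream.
        frames.append(stream[start:])
    return frames
```

The splitter never inspects token structure; it only scans for the marker byte, which is the "blind" splitting that the FAQ describes.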
62 | 63 | Header consists of: 64 | 65 | * Constant byte #0: `0x3A` (ASCII ':') 66 | * Constant byte #1: `0x29` (ASCII ')') 67 | * Constant byte #2: `0x0A` (ASCII linefeed, '\n') 68 | * Variable byte #3, consisting of bits: 69 | * Bits 4-7 (4 MSB): 4-bit version number; `0x00` for current version (note: it is possible that some bits may be reused if necessary) 70 | * Bit 3: Reserved 71 | * Bit 2 (mask `0x04`): Whether "raw binary" (unescaped 8-bit) values may be present in content 72 | * Bit 1 (mask `0x02`): Whether "shared String value" checking was enabled during encoding -- if header missing, default value of `false` must be assumed for decoding (meaning parser need not store decoded String values for back referencing) 73 | * Bit 0 (mask `0x01`): Whether "shared property name" checking was enabled during encoding -- if header missing, default value of `true` must be assumed for decoding (meaning parser MUST store seen property names for possible back references) 74 | 75 | The first 2 bytes form a simple smiley and the 3rd byte is a (Unix) linefeed: this makes command-line-tool based identification simple; the choice of bytes is not significant beyond visual appearance. The fourth byte contains a minimal versioning marker and additional configuration bits. 76 | 77 | ## Low-level Format 78 | 79 | Each section described above consists of a set of tokens that form a properly nested JSON value. Tokens are used in two basic modes: value mode (in which tokens are "value tokens"), and property-name mode ("key tokens"). Property-name mode is used within JSON Object values to denote property names, and alternates between name / value tokens. 80 | 81 | Token lengths vary from a single byte (most common) to 9 bytes. In each case, first byte determines type, and additional bytes are used if and as indicated by the type byte. Type byte value ranges overlap between value and key tokens; but not all type bytes are legal in both modes. 
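
As an illustration of the 4-byte header layout described in the "High-level format" section, the following Python sketch checks the three constant bytes and unpacks the flag bits of the fourth. The names `SmileHeader` and `parse_header` are hypothetical and not part of any existing codec.

```python
from typing import NamedTuple


class SmileHeader(NamedTuple):
    version: int                 # bits 4-7 of byte #3
    raw_binary: bool             # bit 2 (mask 0x04)
    shared_string_values: bool   # bit 1 (mask 0x02)
    shared_property_names: bool  # bit 0 (mask 0x01)


def parse_header(data: bytes) -> SmileHeader:
    """Parse the 4-byte Smile header at the start of `data`."""
    # Bytes #0-#2 are the constants 0x3A 0x29 0x0A, i.e. ":)" plus linefeed.
    if len(data) < 4 or data[0:3] != b":)\n":
        raise ValueError("missing Smile header")
    flags = data[3]
    return SmileHeader(
        version=flags >> 4,
        raw_binary=bool(flags & 0x04),
        shared_string_values=bool(flags & 0x02),
        shared_property_names=bool(flags & 0x01),
    )
```

For example, a fourth byte of `0x05` would indicate version 0 with raw binary and shared property names enabled, and shared String values disabled.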
82 | 83 | Use of certain byte values is limited: 84 | 85 | * Values `0xFD` through `0xFF` are not used as token type markers, key markers, or in values; with exception of optional raw binary data (which can contain any values). Instead they are used to: 86 | * `0xFF` can be used as logical data end marker; this use is intended to be compatible with Web Sockets usage 87 | * `0xFE` is reserved for future use, and not used for anything currently. 88 | * `0XFD` is used as type marker for raw binary data, to allow for uniquely identifying raw binary data sections (note too that content header will have to explicitly enable support; without this content can not contain raw binary data sections) 89 | * `0xFC` is used as String end-marker (similar to use of zero byte with C strings) for long Strings that do not use length prefix. 90 | * Since number encodings never use values `0xC0` - `0xFF`, and UTF-8 does not use values `0xF8` - `0xFF`, these are only uses within Smile format (except for possible raw binary data) 91 | * Values `0xF8` - `0xFB` are only used for type tokens `START_ARRAY`, `END_ARRAY`, `START_OBJECT` and `END_OBJECT` (respectively); they are not used for String or numeric values of field names and can otherwise only occur in raw binary data sections. 92 | * Value `0x00` has no specific handling (can occur in variable length numeric values, as UTF-8 null characters and so on). 
93 | * `0x3A` is not used as type byte in either mode, since it is the first byte of 4-byte header sequence, and may thus be encountered after value tokens (and although it can not occur within key mode, it is reserved to increase chances of detecting corrupted content) 94 | * Value can occur within many kinds of values (vints, String values) 95 | 96 | ### Tokens: general 97 | 98 | Some general notes on tokens: 99 | 100 | * (2022-02-20) Unused bits in encoded bytes: 101 | * SHOULD be encoded as `0` bits by encoder 102 | * MUST be ignored by decoders for purposes of decoding itself (MUST NOT affect result of decoding even if `1`) 103 | * MAY, however, be verified by decoder but if so MUST NOT fail decoding by default; decoders MAY however report non-compliant `1` bits as warnings 104 | * Decoders MAY additionally expose optional "strict" mode in which such non-compliant bit encoding does result in an error and decoding failure 105 | * Strings are encoded using standard UTF-8 encoding; length is indicated either by using: 106 | * 6-bit byte length prefix, for lengths 1 - 63 (0 is not used since there is separate token) 107 | * End-of-String marker byte (`0xFC`) for variable length Strings. 108 | * Integral numeric values up to Java long (64-bit) are handled using `ZigZag`-encoded VInts (see Appendix for details): 109 | * sequence of 1 to 10 bytes that can represent all 64-bit numbers. 110 | * VInts are big endian, meaning that most-significant bytes come first 111 | * All bytes except for the last one have their MSB clear, leaving 7 data bits 112 | * Last byte has its MSB (bit #7) set, but bit #6 NOT set (to avoid possibility of collision with `0xFF`), leaving 6 data bits. 113 | * This means that 2 byte VInt has 13 data bits, for example; and minimum number of bytes to represent a Java long (64 bits) is 10; 9 bytes would give 62 bits (8 * 7 + 6). 
114 | * Signed VInt values are handled using "zigzag" encoding, where sign bit is shifted to be the least-significant bit, and value is shifted left by one (i.e. multiplied by two). 115 | * Unsigned VInts used as length indicators do NOT use zigzag encoding (since it is only needed to help with encoding of negative values) 116 | * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0` 117 | * Length indicators are done using VInts (for binary data, ("big") integer/decimal values) 118 | * All length indicators define _actual_ length of data; not possibly encoded length (in case of "safe" encoding, encoded data is longer, and that length can be calculated from payload data length) 119 | * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0` 120 | * Floating point values (IEEE 32 and 64-bit) are encoded using fixed-length big-endian encoding (7 bits used to avoid use of reserved bytes like `0xFF`): 121 | * Data is "right-aligned", meaning padding is prepended to the first byte (and its MSB). 122 | * For example, the 32-bit float `29.9510` is encoded as `0x26 0x37 0x3E 0x0F 0x04`. We get to this encoding by taking the IEEE 754 32-bit binary representation of the number 29.9510, (1) writing the least-significant 7 bits, (2) right-shifting 7 bits, and repeating the process until encoding the entire bit-string (5 times for a 32-bit float). As a result, `0x26` = `29.9510 & 0x7F`, `0x37` = `(29.9510 >> 7) & 0x7F`, `0x3E` = `(29.9510 >> 14) & 0x7F`, `0x0F` = `(29.9510 >> 21) & 0x7F`, and `0x04` = `(29.9510 >> 28) & 0x7F`. 123 | * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0` 124 | * "Big" decimal/integer values use "safe" binary encoding 125 | * "Safe" binary encoding simply uses 7 LSB (sign bit, MSB, is left as 0). 
126 | * The last encoded byte contains 1 - 7 bits: if less than 7, data is "right-aligned", contained in Least-Significant Bits; there will be 0-6 MSB padding bits. 127 | * For example: when encoding 4 bytes (32 bits), the first 4 full (7-bit) encoded bytes (`0vvvvvvv`) -- encoding 28 most-significant bits -- are followed by one incomplete byte containing the last 4 value bits: `0000vvvv`. 128 | * NOTE: before version 1.0.5 above statement claimed incorrect alignment (claiming padding would be for the LSB of output byte; instead of LSB containing actual value bits) 129 | * "Unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`. 130 | 131 | ### Tokens: value mode 132 | 133 | Value is the default mode for tokens for main-level ("root") output context and JSON Array context. It is also used between JSON Object property name tokens (see next section). 134 | 135 | Conceptually tokens are divided in 8 classes, class defined by 3 MSB of the first byte: 136 | 137 | * `0x00` - `0x1F`: Short Shared Value String reference (single byte) 138 | * `0x20` - `0x3F`: Simple literals, numbers 139 | * `0x40` - `0x5F`: Tiny ASCII (1 - 32 bytes == chars) 140 | * `0x60` - `0x7F`: Short ASCII (33 - 64 bytes == chars) 141 | * `0x80` - `0x9F`: Tiny Unicode (2 - 33 bytes; <= 33 characters) 142 | * `0xA0` - `0xBF`: Short Unicode (34 - 64 bytes; <= 64 characters) 143 | * `0xC0` - `0xDF`: Small integers (single byte) 144 | * `0xE0` - `0xFF`: Binary / Long text / structure markers (`0xF0` - `0xF7` is unused, reserved for future use -- but note, used in key mode) 145 | 146 | These token classes are described below. 147 | 148 | #### Token class: Short Shared Value String reference 149 | 150 | Prefix: `0x00`; covers byte values `0x01` - `0x1F` (`0x00` not used as value type token) 151 | 152 | * 5 LSB used to get reference value of 1 - 31; 0 is not used with this version (reserved for future use) 153 | * Back reference resolved as explained in section 4. 
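
The primitive encodings from the "Tokens: general" notes (zigzag, VInt, and 7-bit "safe" binary) are used by several of the token classes that follow, so a small Python sketch may help. All function names here are hypothetical, and the code is an illustration of the rules as stated in this document, not an extract from any codec.

```python
def zigzag_encode(value: int) -> int:
    """Move the sign into the LSB: 0, -1, 1, -2 become 0, 1, 2, 3."""
    return (value << 1) if value >= 0 else ((-value << 1) - 1)


def zigzag_decode(encoded: int) -> int:
    """Inverse of zigzag_encode."""
    return (encoded >> 1) ^ -(encoded & 1)


def encode_vint(value: int) -> bytes:
    """Encode a non-negative integer as a big-endian VInt."""
    if value < 0:
        raise ValueError("zigzag-encode signed values first")
    # Last byte: MSB set, bit #6 clear, 6 data bits.
    out = [0x80 | (value & 0x3F)]
    value >>= 6
    # Preceding bytes: MSB clear, 7 data bits each.
    while value:
        out.append(value & 0x7F)
        value >>= 7
    return bytes(reversed(out))


def decode_vint(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a VInt at `offset`; return (value, offset after the VInt)."""
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        if byte & 0x80:  # MSB set marks the last byte
            return (value << 6) | (byte & 0x3F), offset
        value = (value << 7) | byte


def encode_7bit(data: bytes) -> bytes:
    """'Safe' binary: 7 value bits per output byte, most-significant first."""
    out = bytearray()
    acc = 0    # pending bits not yet written
    nbits = 0  # number of pending bits
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1
    if nbits:
        # Last partial group is right-aligned (padding in the MSBs).
        out.append(acc)
    return bytes(out)
```

With these helpers, the signed value 100 zigzag-encodes to 200 and is written as the two VInt bytes `0x03 0x88`; the four raw bytes `0x12 0x34 0x56 0x78` become the five safe-encoded bytes `0x09 0x0D 0x0A 0x67 0x08`, the last one carrying the final 4 value bits right-aligned.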
154 | 155 | #### Token class: Simple literals, numbers 156 | 157 | Prefix: `0x20`; covers byte values `0x20` - `0x3F`, although not all values are used 158 | 159 | * Literals (simple, non-structured) 160 | * `0x20`: "" (empty String) 161 | * `0x21`: null 162 | * `0x22` / `0x23`: `false` / `true` 163 | * Numbers: 164 | * `0x24` - `0x27` Integral numbers; 2 LSB (`0x03`) contain subtype 165 | * `0x24` - 32-bit integer; zigzag encoded, 1 - 5 data bytes 166 | * `0x25` - 64-bit integer; zigzag encoded, 5 - 10 data bytes 167 | * `0x26` - `BigInteger` 168 | * Encoded as token indicator followed by 7-bit escaped binary (with Unsigned VInt (no-zigzag encoding) as length indicator) that represent magnitude value (byte array) 169 | * `0x27` - reserved for future use 170 | * `0x28` - `0x2B` floating point numbers; 2 LSB (`0x03`) contain subtype 171 | * `0x28`: 32-bit float 172 | * `0x29`: 64-bit double 173 | * `0x2A`: `BigDecimal` 174 | * Encoded as token indicator followed by zigzag encoded scale (32-bit), followed by 7-bit escaped binary (with Unsigned VInt (no-zigzag encoding) as length indicator) that represent magnitude value (byte array) of integral part. 175 | * `0x2B` - reserved for future use 176 | * Note that possible "unused" bits in the last encoded byte should be handled as per earlier general note: left as `0`, ignored on decoding. 177 | * Reserved for future use, avoided (decoding error if found) 178 | * `0x2C` - `0x2F` reserved for future use (non-overlapping with keys) 179 | * `0x30` - `0x3F` overlapping with key mode and/or header (`0x3A`) 180 | 181 | Rest of the possible values are reserved for future use and not used currently. 182 | 183 | #### Token classes: Tiny ASCII, Short ASCII 184 | 185 | Prefixes: `0x40` / `0x60`; covers all byte values between `0x40` and `0x7F`. 186 | 187 | * `0x40` - `0x5F`: Tiny ASCII 188 | * String with specified length; all bytes in ASCII range. 
189 | * 5 LSB used to indicate lengths from 1 to 32 (bytes == chars) 190 | * **Note**: The character-length of the ASCII string is therefore what the 5 LSB encodes + 1 191 | * `0x60` - `0x7F`: Short ASCII 192 | * String with specified length; all bytes in ASCII range 193 | * 5 LSB used to indicate lengths from 33 to 64 (bytes == chars) 194 | * **Note**: The character-length of the ASCII string is therefore what the 5 LSB encodes + 33 195 | 196 | #### Token classes: Tiny Unicode, Short Unicode 197 | 198 | Prefixes: `0x80` / `0xA0`; covers all byte values between `0x80` and `0xBF`; except that `0x80` is not encodable (since there is no 1 byte long multi-byte-character String) 199 | 200 | * `0x80` - `0x9F` 201 | * String with specified length; bytes NOT guaranteed to be in ASCII range 202 | * 5 LSB used to indicate _byte_ lengths from 2 to 33 (with character length possibly less due to multi-byte characters) 203 | * Length 1 can not be expressed, since only ASCII characters have single byte encoding (which means it should be encoded with "Tiny ASCII") 204 | * **Note**: The byte-length of the UTF-8 string is therefore what the 5 LSB encodes + 2 (different from the "Tiny ASCII" encoding) 205 | * `0xA0` - `0xBF` 206 | * 5 LSB used to indicate _byte_ lengths from 34 to 65 (with character length possibly less due to multi-byte characters) 207 | * **Note**: The byte-length of the UTF-8 string is therefore what the 5 LSB encodes + 34 (different from the "Short ASCII" encoding) 208 | 209 | #### Token class: Small integers 210 | 211 | Prefix: `0xC0`; covers byte values `0xC0` - `0xDF`, all values used. 
212 | 213 | * Zigzag encoded 214 | * 5 LSB used to get values from -16 to +15 215 | 216 | #### Token class: Misc; binary / text / structure markers 217 | 218 | Prefix: `0xE0`; covers byte values `0xE0` - `0xEF`, `0xF8` - `0xFF`: `0xF8` - `0xFF` not used with this format version (reserved for future use) 219 | 220 | Note, too, that value `0x36` could be viewed as "real" `END_OBJECT`; but is not included here since it is only encountered in "key mode" (where you either get a key name, or `END_OBJECT` marker) 221 | 222 | This class is further divided in 8 sub-section, using value of bits #2, #3 and #4 (0x1C) as follows: 223 | 224 | * `0xE0`: Long (variable length) ASCII text 225 | * 2 LSB (`0x03`): reserved for future use 226 | * NOTE: these values are NOT back-referencable, so they do not participate in back-reference resolution (indexes/tables not updated) 227 | * `0xE4`: Long (variable length) Unicode text 228 | * 2 LSB (`0x03`): reserved for future use 229 | * NOTE: these values are NOT back-referencable, so they do not participate in back-reference resolution (indexes/tables not updated) 230 | * `0xE8`: Binary, 7-bit encoded 231 | * 2 LSB (`0x03`): reserved for future use 232 | * followed by VInt length indicator, then data in 7/8 encoding (only 7 LSB of each byte used [sign bit always 0]; 8 such bytes are used to encode 7 "raw" bytes) 233 | * Due to alignment the last byte may contain fewer than 7 bits: if so, the LSB bits contain data and up to 7 MSB may be left as 0 (the highest bit, sign bit, is always 0). 234 | * NOTE: before version 1.0.5, some documentation suggested padding would be for LSB -- this is NOT the case. 
235 | * `0xEC`: Shared String reference, Long 236 | * 2 LSB (`0x03`): used as 2 MSB of index 237 | * followed by byte used as 8 LSB of index 238 | * NOTE: this byte MUST NOT BE `0xFE` or `0xFF` -- generator MUST ensure avoidance (meaning that a small number of non-Shared Strings can not be referenced at all) 239 | * Resulting 10-bit index used as is; values 0-30 are not to be used (instead, Short reference must be used) 240 | * Back references are ONLY made to "short" and "tiny" ASCII/Unicode Strings, so generator and parser only need to retain references to these Strings and not "long" (aka variable length) Strings. 241 | * `0xF0` - `0xF7`: not used, reserved for future use (NOTE: used in key mode) 242 | * `0xF8` - `0xFB`: Structural markers 243 | * `0xF8`: `START_ARRAY` 244 | * `0xF9`: `END_ARRAY` 245 | * `0xFA`: `START_OBJECT` 246 | * `0xFB`: reserved in token mode (but is `END_OBJECT` in key mode) -- this just because object end marker comes as alternative to property name. 247 | * `0xFC`: Used as end-of-String marker 248 | * `0xFD`: Binary (raw) 249 | * followed by VInt length indicator, then raw data 250 | * `0xFE`: reserved for future use 251 | * `0xFF`: end-of-content marker (not used in content itself) 252 | 253 | ### Tokens: key mode 254 | 255 | Key mode tokens are only used within JSON Object values; if so, they alternate between value tokens (first a key token; followed by either single-value value token or multi-token JSON Object/Array value). A single token denotes end of JSON Object value; all the other tokens are used for expressing JSON Object property name. 256 | 257 | Most tokens are single byte: exceptions are 2-byte "long shared String" token, and variable-length "long Unicode String" tokens. 
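
The 2-byte long shared String reference mentioned above can be unpacked as in the following Python sketch. The function name `decode_long_shared_ref` and its `base` parameter are hypothetical; `base` defaults to the value-mode prefix `0xEC`, and passing `0x30` would apply the same bit layout to the key-mode long shared name reference. Checks for reserved low index values are deliberately left out of this sketch.

```python
def decode_long_shared_ref(first: int, second: int, base: int = 0xEC) -> int:
    """Return the 10-bit index encoded by a 2-byte long shared reference.

    `first` is the token byte (base | 2 MSB of the index) and `second`
    holds the 8 LSB of the index.
    """
    if first & 0xFC != base:
        raise ValueError("not a long shared reference token")
    if second in (0xFE, 0xFF):
        # The second byte MUST NOT be 0xFE or 0xFF.
        raise ValueError("invalid second byte in long shared reference")
    return ((first & 0x03) << 8) | second
```

Because the second byte can never be `0xFE` or `0xFF`, the indexes whose low 8 bits equal those values are unreachable, which is why a generator has to skip them.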
258 | 259 | Byte ranges are divided into 4 main sections (64 byte values each): 260 | 261 | * `0x00` - `0x3F`: miscellaneous 262 | * `0x00` - `0x1F`: not used, reserved for future versions 263 | * `0x20`: Special constant name "" (empty String) 264 | * `0x21` - `0x2F`: reserved for future use (unused for now to reduce overlap between values) 265 | * `0x30` - `0x33`: "Long" shared key name reference (2 byte token); 2 LSBs of the first byte are used as 2 MSB of 10-bit reference (up to 1024) values to a shared name: second byte used for 8 LSB of 10-bit reference. 266 | * Note: combined values of 0 through 64 are reserved, since there is more optimal representation -- encoder is not to produce such "short long" values; decoder should check that these are not encountered. Future format versions may choose to use these for specific use. 267 | * NOTE: second byte MUST NOT BE `0xFE` or `0xFF` -- generator MUST ensure avoidance (meaning that a small number of non-Shared Names can not be referenced at all) 268 | * `0x34`: Long (not-yet-shared) Unicode name. Variable-length String; token byte is followed by 64 or more bytes, followed by end-of-String marker byte. 269 | * Note: encoding of Strings shorter than 56 bytes should NOT be done using this type: if such sequence is detected it MAY be considered an error. Further, for ASCII names, Strings with length of 56-64 should also use short String notation 270 | * `0x35` - `0x39`: not used, reserved for future versions 271 | * `0x3A`: Not used; would be part of header sequence (which is NOT allowed in key mode!) 272 | * `0x3B` - `0x3F`: not used, reserved for future versions 273 | * `0x40` - `0x7F`: "Short" shared key name reference; names 0 through 63.
274 | * `0x80` - `0xBF`: Short ASCII names 275 | * Names consisting of 1 - 64 bytes, all of which represent UTF-8 ASCII characters (MSB not set) -- special case to potentially allow faster decoding 276 | * `0xC0` - `0xF7`: Short Unicode names 277 | * Names consisting of 2 - 57 bytes that can potentially contain UTF-8 multi-byte sequences: encoders are NOT required to guarantee there is one, but for decoding efficiency reasons are recommended to check (that is: decoders on many platforms will be able to handle ASCII-sequences more efficiently than general UTF-8 names) 278 | * `0xF8` - `0xFA`: reserved (avoid overlap with `START_ARRAY`/`END_ARRAY`, `START_OBJECT`) 279 | * `0xFB`: `END_OBJECT` marker 280 | * `0xFC` - `0xFF`: reserved for framing, not used in key mode (used in value mode) 281 | 282 | #### Resolved Shared String references 283 | 284 | Shared Strings refer to already encoded/decoded key names or value strings. The method used for indicating which of "already seen" String values to use is designed to allow for: 285 | 286 | * Efficient encoding AND decoding (without necessarily favoring either) 287 | * NOTE: data structures differ, however; encoder usually requires (hash) lookup whereas decoder can use simple index-accessible Array or List so typically encoder has somewhat higher overhead 288 | * To allow keeping only limited amount of buffering (of already handled names) by both encoder and decoder; this is especially beneficial to avoid unnecessary overhead for cases where there are few back references (mostly or completely unique values) 289 | 290 | Mechanism for resolving value string references differs from that used for key name references, so the two are explained separately below. 291 | 292 | #### Shared value Strings 293 | 294 | Support for shared value Strings is optional, in that generator can choose to either check for shareable value Strings or omit the checks. 
295 | Format header will indicate which option generator chose: if header is missing, default value of `false` (no checks done for shared value Strings; no back-references exist in encoded content) must be assumed. 296 | 297 | One basic limitation is that the encoded byte length of a String value that can be referenced is 64 bytes or less. Longer Strings can not be referenced. This is done as a performance optimization, as longer Strings are less likely to be shareable; and also because equality checks for longer Strings are more costly. 298 | As a result, parser should only keep references for eligible Strings during parsing. 299 | 300 | Reference length allowed by format is 10 bits, which means that encoder can use references in place of the most recent 1024 potentially shareable (referenceable) value Strings. 301 | 302 | For both encoding (writing) and decoding (parsing), same basic tumbling-window algorithm is used: when a potentially eligible String value is to be written, generator can check whether it has already written such a String, and has retained reference. If so, reference value (between 0 and 1023 -- but with limits, see "Avoiding references" below) can be written instead of String value. 303 | If no such String has been written (as per generator's knowledge -- it is not required to even check this and lookup data structure may be lossy and only recognize some formerly written Strings), value is to be written. 304 | If its encoded length indicates that it is indeed shareable (which can not be known before writing, as check is based on byte length, not character length!), decoder is to add value into its shareable String buffer -- as long as buffer size does not exceed that of 1024 values. If it already has 1024 values, it MUST clear out buffer and start from first entry. This means that reference values are NOT relative back references, but rather offsets from beginning of reference buffer.
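The tumbling-window bookkeeping described above can be sketched as follows. This is a minimal illustration and not the reference implementation: the class and method names (`SharedStringEncoderTable`, `SharedStringDecoderTable`, `note_written`, `note_decoded`, `resolve`) are hypothetical, special cases such as the empty String are not handled, and the encoder uses the "do not store lookup entries" approach for indexes whose low byte is `0xFE` or `0xFF`.

```python
MAX_SHARED = 1024        # 10-bit reference space
MAX_SHARED_BYTES = 64    # longest encoded String that can be shared


def _shareable(text):
    # Eligibility is based on encoded byte length, not character count.
    return len(text.encode("utf-8")) <= MAX_SHARED_BYTES


class SharedStringEncoderTable:
    """Hypothetical encoder-side lookup: String -> index in current window."""

    def __init__(self):
        self._index_of = {}
        self._count = 0

    def reference_for(self, text):
        """Index to write as a back-reference, or None to write the value."""
        return self._index_of.get(text)

    def note_written(self, text):
        """Call after a String value has been written in full."""
        if not _shareable(text):
            return
        if self._count == MAX_SHARED:
            # Window is full: clear and restart from the first entry.
            self._index_of.clear()
            self._count = 0
        index = self._count
        self._count += 1
        # The index is consumed either way, but indexes whose low byte
        # is 0xFE or 0xFF are never stored, so never referenced.
        if (index & 0xFF) < 0xFE:
            self._index_of[text] = index


class SharedStringDecoderTable:
    """Hypothetical decoder-side buffer: index -> String."""

    def __init__(self):
        self._seen = []

    def note_decoded(self, text):
        """Call after a full (non-reference) String value has been decoded."""
        if not _shareable(text):
            return
        if len(self._seen) == MAX_SHARED:
            self._seen.clear()
        self._seen.append(text)

    def resolve(self, index):
        """Resolve a back-reference as an offset from the buffer start."""
        if not 0 <= index < len(self._seen):
            raise ValueError("invalid shared String reference: %d" % index)
        return self._seen[index]
```

Both sides advance in lockstep because each one counts every eligible full String and neither adds an entry when a back-reference is written or read.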
305 | 306 | Similarly, parser has to keep track of decoded short (byte length <= 64 bytes) Strings seen so far, and have buffer of up to 1024 such values; clearing out buffer when it fills is done same way as during content generation. 307 | Any shared string value references are resolved against this buffer. 308 | 309 | Note: when a shared String is written or parsed, no entry is added to the shared value buffer (since one must already be in it) 310 | 311 | #### Shared key name Strings 312 | 313 | Support for shared property names is optional, in that generator can choose to either check for shareable property names or omit the checks. 314 | Format header will indicate which option generator chose: if header is missing, default value of `true` (checks for shared property names are made, and encoded content MAY contain back-references to shared names) must be assumed. 315 | 316 | Shared key resolution is done same way as shared String value resolution, but buffers used are separate. Buffer sizes are same, 1024. 317 | 318 | #### Avoiding references `0x??FE` and `0x??FF` 319 | 320 | In order to avoid encoding bytes with values of `0xFE` and `0xFF` (similar to "Safe Binary Encoding"), a small number of otherwise referencable Value and Key Name Strings MUST NOT BE referenced by encoders. 321 | 322 | Basically, of all possible values for long references -- `0x0040` - `0x03FF` -- ones where lower byte is `0xFE` or `0xFF` must be avoided by generator (high order byte can never conflict). 323 | So, references like `0x00FE`, `0x00FF`, `0x1FE`, ... `0x3FF` MUST NOT BE used during encoding. 324 | Generators can implement this block in different ways but possibly the simplest is to simply not store lookup entries for these indexes. 325 | 326 | On the decoder side, decoder must keep track of these indexes (in the sense that non-shared Value/Key String has specific back reference index) but should not receive any back references.
327 | Decoder may (but does not have to) verify that no such references are found. 328 | It may also choose to not keep track of such non-referencable Value/Key name Strings on decoding. 329 | 330 | NOTE: while this requirement has been implemented by some Codecs (Jackson, in particular), it was not formally documented prior to specification version 1.0.6. It is considered a requirement of Smile v1 format encoders. 331 | 332 | ----- 333 | 334 | ## Future improvement ideas 335 | 336 | **NOTE**: version 1.0 does **NOT** support any of features presented in this section; they are documented as ideas for future work. 337 | 338 | ### In-frame compression? 339 | 340 | Although there were initial plans to allow in-frame (in-content) compression for individual values, it was decided that support would not be added for initial version, mostly since it was felt that compression of the whole document typically yields better results. For some use cases this may not be true, however; especially when semi-random access is desired. 341 | 342 | Since enough type bits were left reserved for binary and long-text types, support may be added for future versions. 343 | 344 | ### Longer length-prefixed data? 345 | 346 | Given that encoders may be able to determine byte-length for value strings longer than 64 bytes (current limit for "short" strings), it might make sense to add value types with 2-byte prefix (or maybe just 1-byte prefix and additional length information after first fixed 64 bytes, since that allows output at constant location). Performance measurements should be made to ensure that such an option would improve performance as that would be main expected benefit.
347 | 348 | ### Pre-defined shared values (back-refs) 349 | 350 | For messages with little redundancy, but small set of always used names (from schema), it would be possible to do something similar to what deflate/gzip allows: defining "pre-amble", to allow back-references to pre-defined set of names and text values. 351 | For example, it would be possible to specify 64 names and/or shared string values for both serializer and deserializer to allow back-references to this pre-defined set of names and/or string values. This would both improve performance and reduce size of content. 352 | 353 | ### Filler value(s) 354 | 355 | It might make sense to allocate a "no-op" value or values to allow for padding of messages. 356 | This would be useful for things like: 357 | 358 | * Allow rounding up message size, for example to align entries in memory 359 | * Leave slack for possible in-place additions or modifications (like always allocating fixed space for String values) 360 | 361 | This would be a simple addition. 362 | 363 | ### Chunked values 364 | 365 | (note: inspiration for this came from [CBOR](https://tools.ietf.org/html/rfc7049) format) 366 | 367 | As an alternative for either requiring full content length (binary data), or end marker (long Strings, Objects, arrays), 368 | and to specifically allow better buffering during encoding, it might make sense to allow "chunked" variants wherein 369 | long content is encoded in chunks, size of which is individually indicated with length prefix, but whose total size 370 | need not be calculated. This would work well for including large data incrementally, and it could also allow for 371 | more efficient and flexible decoding.
372 | 373 | ----- 374 | 375 | ### Appendix A: External definitions 376 | 377 | #### ZigZag encoding for VInts 378 | 379 | Smile uses `ZigZag` encoding (defined for [protobuf format](http://code.google.com/apis/protocolbuffers/docs/encoding.html); 380 | see [StackOverflow question](http://stackoverflow.com/questions/2210923/zig-zag-decoding) for an example), 381 | which is a variant of generic [VInts](http://en.wikipedia.org/wiki/Variable-length_quantity) (Variable-length INTegers). 382 | 383 | Encoding is done logically as a two-step process: 384 | 385 | 1. Use `ZigZag` encoding to convert signed values to unsigned values: essentially this will "move the sign bit" as the LSB. 386 | 2. Encode remaining bits of unsigned integral number, starting with the most significant bits: the last byte is indicated by setting the sign bit; all the other bytes have sign bit clear. 387 | * Last byte has only 6 data bits; second-highest bit MUST be clear (to ensure that value `0xFF` is never used for encoding; values `0xC0` - `0xFF` are not used for the last byte). 388 | * Other bytes have 7 data bits. 389 | --------------------------------------------------------------------------------
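The two-step ZigZag/VInt process from Appendix A can be sketched as below. This is a minimal illustration based only on the description in the appendix: the function names (`zigzag_encode`, `zigzag_decode`, `encode_vint`, `decode_vint`) are hypothetical, and Python's unbounded integers stand in for the fixed-width types a real codec would use.

```python
def zigzag_encode(n):
    """Step 1: map a signed value to unsigned, moving the sign to the LSB."""
    return (n << 1) if n >= 0 else ((-n) << 1) - 1


def zigzag_decode(u):
    """Inverse of zigzag_encode."""
    return (u >> 1) ^ -(u & 1)


def encode_vint(u):
    """Step 2: write an unsigned value, most significant bits first.

    The last byte has the sign bit set and carries 6 data bits;
    every earlier byte has the sign bit clear and carries 7 data bits.
    """
    if u < 0:
        raise ValueError("VInt payload must be unsigned")
    out = [0x80 | (u & 0x3F)]      # last byte: always in 0x80 - 0xBF
    u >>= 6
    while u:
        out.append(u & 0x7F)       # earlier bytes, collected in reverse
        u >>= 7
    out.reverse()
    return bytes(out)


def decode_vint(data, pos=0):
    """Read one VInt starting at pos; return (value, next position)."""
    value = 0
    while True:
        b = data[pos]
        pos += 1
        if b & 0x80:               # sign bit set: this is the last byte
            if b & 0x40:
                raise ValueError("second-highest bit of last byte must be clear")
            return (value << 6) | (b & 0x3F), pos
        value = (value << 7) | b
```

For example, `encode_vint(zigzag_encode(-1))` produces the single byte `0x81`, and `decode_vint` followed by `zigzag_decode` recovers `-1`. Because the last byte never exceeds `0xBF`, the reserved values `0xFE` and `0xFF` cannot appear in it.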