├── img ├── logo.png ├── bench-pdb.png ├── bench-3j3q.png ├── bench-perf-parse.png └── bench-perf-read.png ├── benchmark.xlsx ├── benchmark.md ├── README.md ├── principle.md └── encoding.md /img/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/logo.png -------------------------------------------------------------------------------- /benchmark.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/benchmark.xlsx -------------------------------------------------------------------------------- /img/bench-pdb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-pdb.png -------------------------------------------------------------------------------- /img/bench-3j3q.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-3j3q.png -------------------------------------------------------------------------------- /img/bench-perf-parse.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-perf-parse.png -------------------------------------------------------------------------------- /img/bench-perf-read.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-perf-read.png -------------------------------------------------------------------------------- /benchmark.md: -------------------------------------------------------------------------------- 1 | Benchmark 2 | ========= 3 | 4 | The BinaryCIF format has been applied to encode macromolecular data stored using [mmCIF](http://mmcif.wwpdb.org/) 5 | data 
format (mmCIF is a schema of categories and fields that describe a macromolecular structure 6 | stored using the CIF format). The raw data for the benchmark are included in this repository. 7 | 8 | - The "CoordinateServer" results were obtained using the corresponding query of the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer). 9 | - The "MMTF" results were obtained using [MMTF](https://mmtf.rcsb.org) version 1.0. 10 | 11 | ## HIV-1 Capsid size 12 | 13 | Encoding the currently largest structure in the PDB.org archive, the [HIV-1 Capsid](https://pdbe.org/3j3q) 14 | with 2.44M atoms, BinaryCIF achieves very good results. 15 | 16 | ![3j3q size](img/bench-3j3q.png) 17 | 18 | 19 | ## Whole PDB archive size 20 | 21 | - (*) RCSB PDB: 122333 entries, some of which returned 404; recompressed using the same compression level as the other data. 22 | - (**) reduced = alpha + phosphate trace + HET 23 | 24 | 25 | ![PDB Size](img/bench-pdb.png) 26 | 27 | ## Read and write performance 28 | 29 | This is the performance of the BinaryCIF implementations in the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js), 30 | [LiteMol](https://github.com/dsehnal/LiteMol), and the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer). 31 | 32 | ![Parse](img/bench-perf-parse.png) 33 | 34 | ![Write](img/bench-perf-read.png) 35 | 36 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![Version 0.3.0](https://img.shields.io/badge/Version-0.3.0-blue.svg?style=flat) 2 | 3 | ![BinaryCIF](img/logo.png) 4 | 5 | BinaryCIF is a data format that stores text-based CIF files using 6 | a more efficient binary encoding. It enables both lossless and lossy 7 | compression of the original CIF file. 
BinaryCIF is currently mainly used by RCSB PDB and PDBe and is supported by the [Mol*](https://github.com/molstar/molstar) and [LiteMol](https://github.com/dsehnal/LiteMol) viewers. 8 | 9 | Some aspects of the BinaryCIF format, namely using [MessagePack](https://msgpack.org/) as the container 10 | and the use of the fixed point, run length, delta, and integer packing encodings, were 11 | inspired by the [MMTF data format](http://mmtf.rcsb.org). 12 | 13 | Table of contents 14 | ================= 15 | 16 | * [Implementations](#implementations) 17 | * [Principles](#principles) 18 | * [Use Cases](#use-cases) 19 | - [CoordinateServer](#coordinateserver) 20 | - [DensityServer](#densityserver) 21 | 22 | Implementations 23 | ================= 24 | 25 | BinaryCIF implementations are currently available in TypeScript (JavaScript), Java, and Python. 26 | 27 | - The TypeScript implementation is part of the [Mol* project](https://github.com/molstar/molstar) (the original implementation of the BinaryCIF format is the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js)). 28 | - [Mol* ciftools](https://github.com/molstar/ciftools) is available as a standalone library and set of tools for converting text CIF to BinaryCIF. 29 | - The Java implementation is available at [rcsb/ciftools-java](https://github.com/rcsb/ciftools-java). 
30 | - Python implementations: 31 | - [rcsb/py-mmcif](https://github.com/rcsb/py-mmcif) and a higher-level library wrapping the same package: [rcsb/py-rcsb_utils_io](https://github.com/rcsb/py-rcsb_utils_io) 32 | - [Biotite](https://github.com/biotite-dev/biotite) since version `0.40.0` (see [release notes](https://github.com/biotite-dev/biotite/releases/tag/v0.40.0)) 33 | 34 | Principles 35 | ========== 36 | 37 | * [Basic Principles](principle.md) 38 | * [BinaryCIF Format](encoding.md) 39 | * [Benchmark](benchmark.md) 40 | 41 | Use Cases 42 | ========= 43 | 44 | ## CoordinateServer 45 | 46 | BinaryCIF is supported by the [CoordinateServer](https://cs.litemol.org), a web service for 47 | delivering subsets of 3D macromolecular data stored in the [mmCIF format](http://mmcif.wwpdb.org/). 48 | 49 | The server can return data in both the text and binary versions of the CIF format, 50 | with the binary representation being a lot more efficient (see the [benchmark](benchmark.md)). 51 | 52 | ## DensityServer 53 | 54 | BinaryCIF is supported by the [DensityServer](https://ds.litemol.org), a web service for accessing subsets of volumetric density data that automatically downsamples the data depending on the volume of the requested region to reduce the bandwidth requirements and provide near-instant access to even the largest data sets. 55 | 56 | ------------------- 57 | 58 | ## Contributing 59 | Just open an issue or make a pull request. All contributions are welcome. 
60 | 61 | ## Funding 62 | 63 | Funding sources include but are not limited to: 64 | * [RCSB PDB](https://www.rcsb.org) funded by a grant [DBI-1338415; PI: SK Burley] from the NSF, the NIH, and the US DoE 65 | * [PDBe, EMBL-EBI](https://pdbe.org) 66 | * [CEITEC](https://www.ceitec.eu/) 67 | -------------------------------------------------------------------------------- /principle.md: -------------------------------------------------------------------------------- 1 | Basic Principles 2 | ================ 3 | 4 | In this chapter, the basic ideas behind BinaryCIF are discussed. 5 | 6 | [CIF](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) is a text-based format for storing tabular data. 7 | The data is stored row by row using this syntax: 8 | 9 | ``` 10 | loop_ 11 | _category.field1 12 | _category.field2 13 | ... 14 | _category.fieldK 15 | value-1_1 value-1_2 ... value-1_K 16 | ... 17 | value-N_1 value-N_2 ... value-N_K 18 | ``` 19 | 20 | For example, the table called ``atoms`` with columns ``type, id, element, x, y, z`` 21 | 22 | |type|id|element|x|y|z| 23 | |:---:|---:|:---:|---:|---:|---:| 24 | |ATOM|1|C|0|0|0| 25 | |ATOM|2|C|1|0|0| 26 | |ATOM|3|O|0|1|0| 27 | |HETATM|4|Fe|0|0|1| 28 | 29 | would be stored in CIF as 30 | 31 | ``` 32 | loop_ 33 | _atoms.type 34 | _atoms.id 35 | _atoms.element 36 | _atoms.x 37 | _atoms.y 38 | _atoms.z 39 | ATOM 1 C 0 0 0 40 | ATOM 2 C 1 0 0 41 | ATOM 3 O 0 1 0 42 | HETATM 4 Fe 0 0 1 43 | ``` 44 | 45 | If we compressed the rows using dictionary compression, it would identify 46 | the string ATOM as a repeated substring and represent the rows along the lines of 47 | 48 | ``` 49 | A = ATOM 50 | 51 | {A} 1 C 0 0 0 52 | {A} 2 C 1 0 0 53 | {A} 3 O 0 1 0 54 | HETATM 4 Fe 0 0 1 55 | ``` 56 | 57 | where ``{A}`` is a dictionary reference to the string ``ATOM``. At first, it would seem 58 | that this is an efficient solution. 
However, the problem with this data representation is that 59 | it is actually hard to compress because related data is not next to each other. 60 | 61 | Fortunately, we can do much better than this: we can transpose the tabular data 62 | and store it *per column* instead of *per row*: 63 | 64 | ``` 65 | _atoms.type: ATOM ATOM ATOM HETATM 66 | _atoms.id: 1 2 3 4 67 | _atoms.element: C C O Fe 68 | _atoms.x: 0 1 0 0 69 | _atoms.y: 0 0 1 0 70 | _atoms.z: 0 0 0 1 71 | ``` 72 | 73 | Now, we can compress all the repeating ATOM values using a method called run-length encoding: 74 | 75 | ``` 76 | _atoms.type: {ATOM, 3} HETATM 77 | ``` 78 | 79 | Here, ``{ATOM, 3}`` means *repeat the string* ``ATOM`` *3 times*. If the value ATOM repeats 80 | 1 million times (which is quite common), this approach saves us a lot of space. 81 | 82 | Similarly, we can apply different encoding schemes to other types of data. 83 | For example, the sequence 84 | 85 | ``` 86 | 1 2 3 4 5 ... n 87 | ``` 88 | 89 | can be encoded using delta encoding as 90 | 91 | ``` 92 | 1 1 1 1 1 ... 93 | ``` 94 | 95 | meaning we start with 1, then add 1 to the previous value, ending up with 2, then add 1 to the 96 | previous value as well, getting 3, etc. At this point, we can use the run-length encoding 97 | approach from the ATOM example and end up with 98 | 99 | ``` 100 | {1, n} 101 | ``` 102 | 103 | to represent the original sequence of integers from 1 to n. 104 | 105 | The final step is to use binary instead of text encoding to store our data to make it more 106 | space efficient. 
For example, storing the number 1234 as text requires 4 bytes: 107 | 108 | ``` 109 | "1" "2" "3" "4" 110 | 111 | 0x31 0x32 0x33 0x34 112 | ``` 113 | 114 | However, storing the number as a 16-bit integer requires only 2 bytes: 115 | 116 | ``` 117 | 4 * 256 + 210 118 | 119 | 0x04 0xD2 120 | ``` 121 | 122 | Applying the different encoding methods, the representation of our ``atoms`` table becomes 123 | 124 | ``` 125 | _atoms.type: {ATOM, 3} HETATM 126 | _atoms.id: {1, 4} 127 | _atoms.element: {C, 2} O Fe 128 | _atoms.x: 0x00 0x01 0x00 0x00 129 | _atoms.y: 0x00 0x00 0x01 0x00 130 | _atoms.z: 0x00 0x00 0x00 0x01 131 | ``` -------------------------------------------------------------------------------- /encoding.md: -------------------------------------------------------------------------------- 1 | BinaryCIF Format 2 | ================ 3 | 4 | ## Data Layout 5 | 6 | A [CIF file](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) ([example](https://www.ebi.ac.uk/pdbe/static/entry/1tqn_updated.cif)) contains: 7 | 8 | * One or more data blocks 9 | * Each data block has one or more categories 10 | * Each category has one or more fields 11 | * Each field contains *data* 12 | 13 | To represent this hierarchy, the basic shape of the BinaryCIF file defines the following 14 | interfaces: 15 | 16 | ``` 17 | File { 18 | version: string 19 | encoder: string 20 | dataBlocks: DataBlock[] 21 | } 22 | 23 | DataBlock { 24 | header: string 25 | categories: Category[] 26 | } 27 | 28 | Category { 29 | name: string 30 | rowCount: number 31 | columns: Column[] 32 | } 33 | 34 | Column { 35 | name: string 36 | data: Data 37 | mask: Data | undefined 38 | } 39 | 40 | Data { 41 | data: Uint8Array 42 | encoding: Encoding[] 43 | } 44 | ``` 45 | 46 | The most interesting part is the ``Data`` interface where the actual data is stored. 
47 | The interface has two properties: ``data``, which is just an array of bytes (the binary data), 48 | and ``encoding``, an array of *encodings* that describes the transformations that were applied to the 49 | source data to obtain the final binary result stored in the ``data`` field. 50 | 51 | Additionally, the ``Column`` interface defines a ``mask`` property used to determine 52 | if a certain value is present, not present (``.`` token in CIF), or unknown (``?`` token in CIF). 53 | 54 | Currently, BinaryCIF supports these encoding methods: 55 | 56 | ``` 57 | type Encoding = 58 | | ByteArray 59 | | FixedPoint 60 | | IntervalQuantization 61 | | RunLength 62 | | Delta 63 | | IntegerPacking 64 | | StringArray 65 | ``` 66 | 67 | ## Encoding Methods 68 | 69 | ### Byte Array 70 | 71 | Encodes an array of numbers of a specified type as raw bytes. 72 | 73 | ``` 74 | ByteArray { 75 | kind = "ByteArray" 76 | type: Int8 | Int16 | Int32 | Uint8 | Uint16 | Uint32 | Float32 | Float64 77 | } 78 | ``` 79 | 80 | ### Fixed Point 81 | 82 | Converts an array of floating point numbers to an array of 32-bit integers by multiplying 83 | each value by a given factor and converting the result to an integer. 84 | 85 | ``` 86 | FixedPoint { 87 | kind = "FixedPoint" 88 | factor: number 89 | srcType: Float32 | Float64 90 | } 91 | ``` 92 | 93 | #### Example 94 | 95 | ``` 96 | [1.2, 1.23, 0.123] 97 | ---FixedPoint---> 98 | { factor = 100 } [120, 123, 12] 99 | ``` 100 | 101 | ### Interval Quantization 102 | 103 | Converts an array of floating point numbers to an array of 32-bit integers where 104 | the values are quantized within a given interval into a specified number of 105 | discrete steps. Values lower than the minimum value or greater than the 106 | maximum are represented by the respective boundary values. 
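As a rough illustration of the quantization just described, the sketch below maps each value to the index of the nearest of `numSteps` evenly spaced levels between `min` and `max`, and clamps values outside the interval to the boundary levels. The helper names `intervalQuantize` and `intervalDequantize` are hypothetical, and the step width `(max - min) / (numSteps - 1)` is an assumption chosen to agree with the example in this section rather than a statement about any particular implementation.

```typescript
// Hypothetical helpers; names and signatures are illustrative only.
// Assumption: levels are evenly spaced, step = (max - min) / (numSteps - 1).
function intervalQuantize(values: number[], min: number, max: number, numSteps: number): Int32Array {
    if (numSteps < 2 || max <= min) throw new Error("expected numSteps >= 2 and max > min");
    const step = (max - min) / (numSteps - 1);
    return Int32Array.from(values, (v) => {
        if (v <= min) return 0;            // clamp to the lower boundary
        if (v >= max) return numSteps - 1; // clamp to the upper boundary
        return Math.round((v - min) / step);
    });
}

// Inverse mapping: returns the level each index stands for (lossy by design).
function intervalDequantize(data: Int32Array, min: number, max: number, numSteps: number): Float64Array {
    const step = (max - min) / (numSteps - 1);
    return Float64Array.from(data, (q) => min + q * step);
}

console.log(Array.from(intervalQuantize([0.5, 1, 1.5, 2, 3, 1.345], 1, 2, 3))); // [0, 0, 1, 2, 2, 1]
```

Because several inputs collapse onto the same level, decoding returns the level value (for instance 1.5 for an input of 1.345), which is why this encoding is only suitable when lossy storage is acceptable.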
107 | 108 | ``` 109 | IntervalQuantization { 110 | kind = "IntervalQuantization" 111 | min: number 112 | max: number 113 | numSteps: number 114 | srcType: Float32 | Float64 115 | } 116 | ``` 117 | 118 | #### Example 119 | 120 | ``` 121 | [0.5, 1, 1.5, 2, 3, 1.345] 122 | ---IntervalQuantization---> 123 | { min = 1, max = 2, numSteps = 3 } [0, 0, 1, 2, 2, 1] 124 | ``` 125 | 126 | ### Run Length 127 | 128 | Represents each run of repeated integer values in the input as a pair of ``(value, number of repeats)`` 129 | and stores the result sequentially as an array of 32-bit integers. Additionally, 130 | stores the size of the original array to make decoding easier. 131 | 132 | ``` 133 | RunLength { 134 | kind = "RunLength" 135 | srcType: int[] 136 | srcSize: number 137 | } 138 | ``` 139 | 140 | #### Example 141 | 142 | ``` 143 | [1, 1, 1, 2, 3, 3] 144 | ---RunLength---> 145 | { srcSize = 6 } [1, 3, 2, 1, 3, 2] 146 | ``` 147 | 148 | ### Delta 149 | 150 | Stores the input integer array as an array of consecutive differences. 151 | 152 | ``` 153 | Delta { 154 | kind = "Delta" 155 | origin: number 156 | srcType: int[] 157 | } 158 | ``` 159 | 160 | Because delta encoding is often used in conjunction with integer packing, 161 | the ``origin`` property is present. This is to optimize the case 162 | where the first value is large, but the differences are small. 163 | 164 | #### Example 165 | 166 | ``` 167 | [1000, 1003, 1005, 1006] 168 | ---Delta---> 169 | { origin = 1000, srcType = Int32 } [0, 3, 2, 1] 170 | ``` 171 | 172 | ### Integer Packing 173 | 174 | Stores a 32-bit integer array using 8- or 16-bit values. Includes the size 175 | of the input array for easier decoding. The encoding is more effective 176 | when only unsigned values are provided. 177 | 178 | If a value exceeds the maximum value for the 8- or 16-bit type, the maximum value may be output numerous times to add up to the value. 
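The overflow rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`packingLimits`, `packIntegers`, `unpackIntegers`) are hypothetical, and the handling of negative values (repeating the minimum of the signed type, mirroring the rule for the maximum) is an assumption, since the text only describes the maximum.

```typescript
// Hypothetical helpers; names are illustrative and not taken from any library.
// Assumption: for signed output both the minimum and the maximum of the target
// type act as "keep adding" markers; for unsigned output only the maximum does.
function packingLimits(byteCount: 1 | 2, isUnsigned: boolean): { lower: number; upper: number } {
    const bits = 8 * byteCount;
    return isUnsigned
        ? { lower: 0, upper: 2 ** bits - 1 }
        : { lower: -(2 ** (bits - 1)), upper: 2 ** (bits - 1) - 1 };
}

function packIntegers(src: number[], byteCount: 1 | 2, isUnsigned: boolean): number[] {
    const { lower, upper } = packingLimits(byteCount, isUnsigned);
    const out: number[] = [];
    for (const value of src) {
        if (isUnsigned && value < 0) throw new Error("negative value in unsigned packing");
        let rest = value;
        while (rest >= upper) { out.push(upper); rest -= upper; }                // too large: emit max, carry on
        while (!isUnsigned && rest <= lower) { out.push(lower); rest -= lower; } // too small: emit min, carry on
        out.push(rest); // the remainder always terminates the value
    }
    return out;
}

function unpackIntegers(packed: number[], byteCount: 1 | 2, isUnsigned: boolean): number[] {
    const { lower, upper } = packingLimits(byteCount, isUnsigned);
    const out: number[] = [];
    let sum = 0;
    for (const t of packed) {
        sum += t;
        const isMarker = t === upper || (!isUnsigned && t === lower);
        if (!isMarker) { out.push(sum); sum = 0; } // a non-marker element ends the current value
    }
    return out;
}

console.log(packIntegers([1, 2, -3, 128], 1, false)); // [1, 2, -3, 127, 1]
```

Note that a value exactly equal to the maximum is emitted as the maximum followed by `0`, so the decoder can always tell a marker from a final element.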
179 | 180 | ``` 181 | IntegerPacking { 182 | kind = "IntegerPacking" 183 | byteCount: number 184 | srcSize: number 185 | isUnsigned: boolean 186 | } 187 | ``` 188 | 189 | #### Example (8-bit type output) 190 | 191 | ``` 192 | [1, 2, -3, 128] 193 | ---IntegerPacking---> 194 | { byteCount = 1, srcSize = 4, isUnsigned = false } [1, 2, -3, 127, 1] 195 | ``` 196 | 197 | ### String Array 198 | 199 | Stores an array of strings as a concatenation of all unique strings, an array of offsets 200 | describing substrings, and indices into the offset array 201 | that map each input string to its substring. 202 | 203 | ``` 204 | StringArray { 205 | kind = "StringArray" 206 | dataEncoding: Encoding[] 207 | stringData: string 208 | offsetEncoding: Encoding[] 209 | offsets: Uint8Array 210 | } 211 | ``` 212 | 213 | #### Example 214 | 215 | ``` 216 | ['a','AB','a'] 217 | ---StringArray---> 218 | { stringData = 'aAB', offsets = [0, 1, 3] } [0, 1, 0] 219 | ``` 220 | 221 | Encoding Process 222 | ---------------- 223 | 224 | To encode the data, a sequence of encoding transformations needs to be specified 225 | for each input column. For example, to encode the ``_atoms.id`` column 226 | from the [Basic Principles](principle.md) chapter, we could specify the encoding as ``[Delta, RunLength, IntegerPacking]``: 227 | 228 | ``` 229 | [1, 2, 3, 4] 230 | ---Delta---> 231 | { srcType = Int32 } [1, 1, 1, 1] 232 | ---RunLength---> 233 | { srcSize = 4 } [1, 4] 234 | ---IntegerPacking---> 235 | { byteCount = 1, srcSize = 2 } [1, 4] 236 | ``` 237 | 238 | **Little-endian** byte order is used everywhere to encode the data. 239 | 240 | Once each column has been encoded and the ``File`` data structure built, the 241 | [MessagePack](https://msgpack.org/) format (which is more or less a binary encoding of the standard 242 | JSON format) is used to produce the final binary result. 243 | 244 | Optionally, the data can be compressed using standard methods such as Gzip to reduce 245 | the size further. 
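The `[Delta, RunLength, IntegerPacking]` chain from the listing above can be sketched as three small functions applied in order. All names here are hypothetical, the MessagePack and Gzip steps are left out, and two assumptions are made: a delta `origin` of `0` (which reproduces the `[1, 1, 1, 1]` intermediate result shown above), and packed values small enough that the overflow handling of integer packing is never triggered.

```typescript
// Hypothetical helpers for illustration; names are not part of the format.
function deltaEncode(src: number[], origin: number): number[] {
    let previous = origin;
    return src.map((value) => {
        const diff = value - previous; // difference to the preceding value (or to origin)
        previous = value;
        return diff;
    });
}

function runLengthEncode(src: number[]): number[] {
    const out: number[] = [];
    let current = 0;
    let count = 0;
    for (const value of src) {
        if (count > 0 && value === current) {
            count++; // extend the current run
        } else {
            if (count > 0) out.push(current, count); // flush the finished run
            current = value;
            count = 1;
        }
    }
    if (count > 0) out.push(current, count);
    return out;
}

// Assumption: every value lies strictly inside the signed 8-bit range, so each
// value maps to exactly one byte and no overflow markers are needed.
function toInt8Bytes(src: number[]): Uint8Array {
    for (const v of src) {
        if (!Number.isInteger(v) || v <= -128 || v >= 127) throw new Error("value needs overflow handling: " + v);
    }
    return new Uint8Array(Int8Array.from(src).buffer);
}

const ids = [1, 2, 3, 4];
const deltas = deltaEncode(ids, 0);   // [1, 1, 1, 1]
const runs = runLengthEncode(deltas); // [1, 4]
const bytes = toInt8Bytes(runs);      // two bytes: 0x01 0x04
console.log(deltas, runs, bytes);
```

With `origin` set to the first input value instead, `deltaEncode([1000, 1003, 1005, 1006], 1000)` yields `[0, 3, 2, 1]`, matching the example in the Delta section.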
246 | 247 | Decoding Process 248 | ---------------- 249 | 250 | To decode BinaryCIF data, the MessagePack data are decoded first and then, 251 | for each column, the binary data are decoded by applying the inverses of the transformations 252 | specified in the ``encoding`` array in reverse order. So to decode the encoding specified by 253 | ``[Delta, RunLength, IntegerPacking]`` we would first apply the decoding 254 | of ``IntegerPacking``, then ``RunLength``, and finally ``Delta``. 255 | 256 | Masks 257 | ----- 258 | 259 | The `Column.mask: Data | undefined` property determines whether each entry is a regular CIF value, a `.` token (not present), or a `?` token (unknown). 260 | 261 | - An `undefined` mask means all values are defined. 262 | - Value `0` corresponds to a value being present. 263 | - Value `1` corresponds to `.` token. 264 | - Value `2` corresponds to `?` token. 265 | 266 | #### Example 267 | 268 | The CIF field `x` 269 | 270 | ``` 271 | loop_ 272 | _category.x 273 | 1 274 | . 275 | 2 276 | ? 277 | ``` 278 | 279 | could be encoded as 280 | 281 | ```js 282 | Column { 283 | name: 'x' 284 | data: [1,0,2,0] // encoded 285 | mask: [0,1,0,2] // encoded 286 | } 287 | ``` --------------------------------------------------------------------------------
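To make the decoding order and the mask semantics concrete, here is a minimal sketch that undoes a list of encodings from last to first and then overlays a mask. It is an illustration under assumptions, not a reference decoder: the names `EncodingStep`, `decodeStep`, `decodeData`, and `applyMask` are hypothetical, only three encodings are covered, the MessagePack and byte array layers are skipped (the functions work on plain integer arrays), and the signed integer packing rule assumes the minimum value is a marker in the same way as the maximum.

```typescript
// Hypothetical types and helpers; not the API of any BinaryCIF library.
type EncodingStep =
    | { kind: "Delta"; origin: number }
    | { kind: "RunLength"; srcSize: number }
    | { kind: "IntegerPacking"; byteCount: 1 | 2; isUnsigned: boolean; srcSize: number };

function decodeStep(data: number[], enc: EncodingStep): number[] {
    switch (enc.kind) {
        case "Delta": {
            let previous = enc.origin;
            return data.map((diff) => (previous += diff)); // running sum restores the values
        }
        case "RunLength": {
            const out: number[] = [];
            let value = 0;
            data.forEach((x, i) => {
                if (i % 2 === 0) value = x;                       // even slots hold the value
                else for (let k = 0; k < x; k++) out.push(value); // odd slots hold the repeat count
            });
            return out;
        }
        case "IntegerPacking": {
            const bits = 8 * enc.byteCount;
            const upper = enc.isUnsigned ? 2 ** bits - 1 : 2 ** (bits - 1) - 1;
            const lower = enc.isUnsigned ? 0 : -upper - 1;
            const out: number[] = [];
            let sum = 0;
            for (const t of data) {
                sum += t;
                const isMarker = t === upper || (!enc.isUnsigned && t === lower);
                if (!isMarker) { out.push(sum); sum = 0; }
            }
            return out;
        }
    }
}

// Apply the inverse transformations starting from the LAST entry of the list.
function decodeData(data: number[], encoding: EncodingStep[]): number[] {
    return encoding.slice().reverse().reduce<number[]>((acc, enc) => decodeStep(acc, enc), data);
}

type CifValue = number | "." | "?";

// Mask semantics from the list above: 0 = present, 1 = ".", 2 = "?".
function applyMask(values: number[], mask?: number[]): CifValue[] {
    if (mask === undefined) return values; // no mask: every value is present
    return values.map((v, i): CifValue => {
        const m = mask[i];
        return m === 1 ? "." : m === 2 ? "?" : v;
    });
}

const decodedIds = decodeData([1, 4], [
    { kind: "Delta", origin: 0 },
    { kind: "RunLength", srcSize: 4 },
    { kind: "IntegerPacking", byteCount: 1, isUnsigned: false, srcSize: 2 },
]);
console.log(decodedIds);                                // [1, 2, 3, 4]
console.log(applyMask([1, 0, 2, 0], [0, 1, 0, 2]));     // [1, ".", 2, "?"]
```

The values stored at masked positions are placeholders and are ignored once the mask is applied; the ones used here are illustrative.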