├── img
    ├── logo.png
    ├── bench-pdb.png
    ├── bench-3j3q.png
    ├── bench-perf-parse.png
    └── bench-perf-read.png
├── benchmark.xlsx
├── benchmark.md
├── README.md
├── principle.md
└── encoding.md


/img/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/logo.png


--------------------------------------------------------------------------------
/benchmark.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/benchmark.xlsx


--------------------------------------------------------------------------------
/img/bench-pdb.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-pdb.png


--------------------------------------------------------------------------------
/img/bench-3j3q.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-3j3q.png


--------------------------------------------------------------------------------
/img/bench-perf-parse.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-perf-parse.png


--------------------------------------------------------------------------------
/img/bench-perf-read.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/molstar/BinaryCIF/HEAD/img/bench-perf-read.png


--------------------------------------------------------------------------------
/benchmark.md:
--------------------------------------------------------------------------------
 1 | Benchmark
 2 | =========
 3 | 
 4 | The BinaryCIF format has been applied to encode macromolecular data stored using the [mmCIF](http://mmcif.wwpdb.org/)
 5 | data format (mmCIF is a schema of categories and fields that describe a macromolecular structure
 6 | stored using the CIF format). The raw data for the benchmark are included in this repository.
 7 | 
 8 | - The "CoordinateServer" results were obtained using the corresponding query of the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer).
 9 | - The "MMTF" results were obtained using [MMTF](https://mmtf.rcsb.org) version 1.0.
10 | 
11 | ## HIV-1 Capsid size
12 | 
13 | Encoding the currently largest structure in the PDB.org archive, the [HIV-1 Capsid](https://pdbe.org/3j3q)
14 | with 2.44M atoms, BinaryCIF achieves very good results. 
15 | 
16 | ![3j3q size](img/bench-3j3q.png)
17 | 
18 | 
19 | ## Whole PDB archive size
20 | 
21 | - (*) RCSB PDB: 122333 Entries, some 404'ed; recompressed using the same compression level as the other data.
22 | - (**) reduced = alpha + phosphate trace + HET
23 | 
24 | 
25 | ![PDB Size](img/bench-pdb.png)
26 | 
27 | ## Read and write performance
28 | 
29 | This is the performance of the BinaryCIF implementations in the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js),
30 | [LiteMol](https://github.com/dsehnal/LiteMol), and the [CoordinateServer](https://webchemdev.ncbr.muni.cz/CoordinateServer).
31 | 
32 | ![Parse](img/bench-perf-parse.png)
33 | 
34 | ![Write](img/bench-perf-read.png)
35 | 
36 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ![Version 0.3.0](https://img.shields.io/badge/Version-0.3.0-blue.svg?style=flat)
 2 | 
 3 | ![BinaryCIF](img/logo.png)
 4 | 
 5 | BinaryCIF is a data format that stores text-based CIF files using 
 6 | a more efficient binary encoding. It enables both lossless and lossy
 7 | compression of the original CIF file. BinaryCIF is currently mainly used by RCSB PDB and PDBe and is supported by the [Mol*](https://github.com/molstar/molstar) and [LiteMol](https://github.com/dsehnal/LiteMol) viewers.
 8 | 
 9 | Some aspects of the BinaryCIF format, namely using [MessagePack](https://msgpack.org/) as the container
10 | and the use of the fixed point, run length, delta, and integer packing encodings, were
11 | inspired by the [MMTF data format](http://mmtf.rcsb.org).
12 | 
13 | Table of contents
14 | =================
15 | 
16 | * [Implementations](#implementations)
17 | * [Principles](#principles)
18 | * [Use Cases](#use-cases)
19 |     - [CoordinateServer](#coordinateserver)
20 |     - [DensityServer](#densityserver)
21 | 
22 | Implementations
23 | =================
24 | 
25 | BinaryCIF implementations are currently available in TypeScript (JavaScript), Java, and Python.
26 | 
27 | - TypeScript implementation is part of the [Mol* project](https://github.com/molstar/molstar) (the original implementation of the BinaryCIF format is the [CIFTools.js library](https://github.com/dsehnal/CIFTools.js)).
28 | - [Mol* ciftools](https://github.com/molstar/ciftools) are available as a standalone library/tools for conversion of text CIF to BinaryCIF.
29 | - Java implementation is available at [rcsb/ciftools-java](https://github.com/rcsb/ciftools-java).
30 | - Python implementations:
31 |    - [rcsb/py-mmcif](https://github.com/rcsb/py-mmcif) and a higher-level library wrapping the same package: [rcsb/py-rcsb_utils_io](https://github.com/rcsb/py-rcsb_utils_io)
32 |    - [Biotite](https://github.com/biotite-dev/biotite) since version `0.40.0` (see [release notes](https://github.com/biotite-dev/biotite/releases/tag/v0.40.0))
33 | 
34 | Principles
35 | ==========
36 | 
37 | * [Basic Principles](principle.md)
38 | * [BinaryCIF Format](encoding.md)
39 | * [Benchmark](benchmark.md)
40 | 
41 | Use Cases
42 | =========
43 | 
44 | ## CoordinateServer
45 | 
46 | BinaryCIF is supported by the [CoordinateServer](https://cs.litemol.org), a web service for 
47 | delivering subsets of 3D macromolecular data stored in the [mmCIF format](http://mmcif.wwpdb.org/).
48 | 
49 | The server can return data both in the text and binary version of the CIF format,
50 | with the binary representation being a lot more efficient (see the [benchmark](benchmark.md)).
51 | 
52 | ## DensityServer
53 | 
54 | BinaryCIF is supported by the [DensityServer](https://ds.litemol.org), a web service for accessing subsets of volumetric density data, that automatically downsamples the data depending on the volume of the requested region to reduce the bandwidth requirements and provide near-instant access to even the largest data sets. 
55 | 
56 | -------------------
57 | 
58 | ## Contributing
59 | Just open an issue or make a pull request. All contributions are welcome.
60 | 
61 | ## Funding
62 | 
63 | Funding sources include but are not limited to:
64 | * [RCSB PDB](https://www.rcsb.org) funding by a grant [DBI-1338415; PI: SK Burley] from the NSF, the NIH, and the US DoE
65 | * [PDBe, EMBL-EBI](https://pdbe.org)
66 | * [CEITEC](https://www.ceitec.eu/)
67 | 


--------------------------------------------------------------------------------
/principle.md:
--------------------------------------------------------------------------------
  1 | Basic Principles
  2 | ================
  3 | 
  4 | This chapter discusses the basic ideas behind BinaryCIF. 
  5 | 
  6 | [CIF](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) is a text-based format for storing tabular data. 
  7 | The data is stored row by row using this syntax:
  8 | 
  9 | ```
 10 | loop_
 11 | _category.field1
 12 | _category.field2
 13 | ...
 14 | _category.fieldK
 15 | value-1_1 value-1_2 ... value-1_K
 16 | ...
 17 | value-N_1 value-N_2 ... value-N_K
 18 | ```
 19 | 
 20 | For example, the table called ``atoms`` with columns ``type, id, element, x, y, z``
 21 | 
 22 | |type|id|element|x|y|z|
 23 | |:---:|---:|:---:|---:|---:|---:|
 24 | |ATOM|1|C|0|0|0|
 25 | |ATOM|2|C|1|0|0|
 26 | |ATOM|3|O|0|1|0|
 27 | |HETATM|4|Fe|0|0|1|
 28 | 
 29 | would be stored in CIF as 
 30 | 
 31 | ```
 32 | loop_
 33 | _atoms.type
 34 | _atoms.id
 35 | _atoms.element
 36 | _atoms.x
 37 | _atoms.y
 38 | _atoms.z
 39 | ATOM 1 C 0 0 0
 40 | ATOM 2 C 1 0 0
 41 | ATOM 3 O 0 1 0
 42 | HETATM 4 Fe 0 0 1
 43 | ```
 44 | 
 45 | If we compress the rows using dictionary compression, the compressor would identify 
 46 | the string ATOM as a repeated substring and represent the rows along the lines of
 47 | 
 48 | ```
 49 | A = ATOM
 50 | 
 51 | {A} 1 C 0 0 0
 52 | {A} 2 C 1 0 0
 53 | {A} 3 O 0 1 0
 54 | HETATM 4 Fe 0 0 1 
 55 | ```
 56 | 
 57 | where ``{A}`` is a dictionary reference to the string ``ATOM``. At first, it would seem 
 58 | that this is an efficient solution. However, the problem with this data representation is that 
 59 | it is actually hard to compress because related data is not next to each other. 
 60 | 
 61 | Fortunately, we can do much better than this: we can transpose the tabular data 
 62 | and store them *per column* instead of *per row*:
 63 | 
 64 | ```
 65 | _atoms.type:    ATOM ATOM ATOM HETATM
 66 | _atoms.id:      1 2 3 4
 67 | _atoms.element: C C O Fe
 68 | _atoms.x:       0 1 0 0
 69 | _atoms.y:       0 0 1 0
 70 | _atoms.z:       0 0 0 1
 71 | ```
 72 | 
 73 | Now, we can compress all the repeating ATOM values using a method called run-length encoding:
 74 | 
 75 | ```
 76 | _atoms.type: {ATOM, 3} HETATM
 77 | ```
 78 | 
 79 | Here, ``{ATOM, 3}`` means *repeat the string* ``ATOM`` *3 times*. If the value ATOM repeats 
 80 | 1 million times (which is quite common), this approach saves us a lot of space.
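
The run-length idea above can be sketched in a few lines of Python. This is only an illustration of the principle, not the BinaryCIF encoder itself; the function names are hypothetical, and unlike the shorthand above every value (including a single ``HETATM``) is stored as a ``(value, count)`` pair.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of consecutive equal values into a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


column = ["ATOM", "ATOM", "ATOM", "HETATM"]
encoded = run_length_encode(column)
print(encoded)  # [('ATOM', 3), ('HETATM', 1)]
```

The encoded size depends on the number of runs rather than the number of rows, which is why a column with one million consecutive ``ATOM`` values collapses to a single pair.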
 81 | 
 82 | Similarly, we can apply different encoding schemes to other types of data. 
 83 | For example, the sequence
 84 | 
 85 | ```
 86 | 1 2 3 4 5 ... n
 87 | ``` 
 88 | 
 89 | can be encoded using delta encoding as 
 90 | 
 91 | ```
 92 | 1 1 1 1 1 ...
 93 | ```
 94 | 
 95 | meaning we start with 1, then add 1 to the previous value, ending up with 2, then add 1 to the
 96 | previous value again, getting 3, etc. At this point, we can use the run-length encoding
 97 | approach from the ATOM example and end up with 
 98 | 
 99 | ```
100 | {1, n}
101 | ```
102 | 
103 | to represent the original sequence of integers from 1 to n.
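
As a rough sketch of how the two steps combine (hypothetical helper names, and the first value is treated as a difference from 0, as in the text above):

```python
def delta_encode(values):
    """Replace each value by its difference from the previous one (the first is relative to 0)."""
    previous = 0
    deltas = []
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def run_lengths(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    pairs = []
    for value in values:
        if pairs and pairs[-1][0] == value:
            pairs[-1][1] += 1
        else:
            pairs.append([value, 1])
    return pairs


ids = list(range(1, 1001))   # 1 2 3 ... 1000
deltas = delta_encode(ids)   # 1 1 1 ... 1
print(run_lengths(deltas))   # [[1, 1000]]
```

A thousand sequential identifiers shrink to one pair, and the same holds for any n.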
104 | 
105 | The final step is to use binary instead of text encoding to store our data to make it more 
106 | space efficient. For example, storing the number 1234 as text requires 4 bytes:
107 | 
108 | ```
109 | "1"  "2"  "3"  "4"
110 | 
111 | 0x31 0x32 0x33 0x34
112 | ```
113 | 
114 | However, storing the number as a 16-bit integer requires only 2 bytes:
115 | 
116 | ```
117 | 4 * 256  +   210 
118 | 
119 |   0x04      0xD2
120 | ```
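
The size difference can be checked with Python's standard ``struct`` module. The byte order shown above (``0x04 0xD2``) is big endian and serves only as an illustration; the format description states that BinaryCIF stores data as little endian, in which case the two bytes appear in the opposite order.

```python
import struct

as_text = str(1234).encode("ascii")       # four ASCII characters
big_endian = struct.pack(">H", 1234)      # unsigned 16-bit, most significant byte first
little_endian = struct.pack("<H", 1234)   # unsigned 16-bit, least significant byte first

print(len(as_text), as_text.hex())              # 4 31323334
print(len(big_endian), big_endian.hex())        # 2 04d2
print(len(little_endian), little_endian.hex())  # 2 d204
```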
121 | 
122 | Applying the different encoding methods, the representation of our ``atoms`` table becomes
123 | 
124 | ```
125 | _atoms.type:    {ATOM, 3} HETATM
126 | _atoms.id:      {1, 4}
127 | _atoms.element: {C, 2} O Fe
128 | _atoms.x:       0x00 0x01 0x00 0x00
129 | _atoms.y:       0x00 0x00 0x01 0x00
130 | _atoms.z:       0x00 0x00 0x00 0x01 
131 | ```


--------------------------------------------------------------------------------
/encoding.md:
--------------------------------------------------------------------------------
  1 | BinaryCIF Format
  2 | ================
  3 | 
  4 | ## Data Layout
  5 | 
  6 | A [CIF file](http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax) ([example](https://www.ebi.ac.uk/pdbe/static/entry/1tqn_updated.cif)) contains:
  7 | 
  8 | * One or more data blocks
  9 | * Each data block has one or more categories
 10 | * Each category has one or more fields
 11 | * Each field contains *data*
 12 | 
 13 | To represent this hierarchy, the basic shape of the BinaryCIF file defines the following 
 14 | interfaces:
 15 | 
 16 | ```
 17 | File {
 18 |     version: string
 19 |     encoder: string
 20 |     dataBlocks: DataBlock[]
 21 | }
 22 | 
 23 | DataBlock {
 24 |     header: string
 25 |     categories: Category[]
 26 | }
 27 | 
 28 | Category {
 29 |     name: string
 30 |     rowCount: number
 31 |     columns: Column[]
 32 | }
 33 | 
 34 | Column {
 35 |     name: string
 36 |     data: Data
 37 |     mask: Data | undefined
 38 | }
 39 | 
 40 | Data {
 41 |     data: Uint8Array
 42 |     encoding: Encoding[]
 43 | }
 44 | ```
 45 | 
 46 | The most interesting part is the ``Data`` interface where the actual data is stored. 
 47 | The interface has two properties: ``data`` which is just an array of bytes (the binary data) 
 48 | and an array of *encodings* that describes the transformations that were applied to the 
 49 | source data to obtain the final binary result stored in the ``data`` field.
 50 | 
 51 | Additionally, the ``Column`` interface defines a ``mask`` property used to determine
 52 | if a certain value is present, not present (``.`` token in CIF), or unknown (``?`` token in CIF). 
 53 | 
 54 | Currently, BinaryCIF supports these encoding methods:
 55 | 
 56 | ```
 57 | type Encoding = 
 58 |     | ByteArray 
 59 |     | FixedPoint
 60 |     | IntervalQuantization 
 61 |     | RunLength 
 62 |     | Delta 
 63 |     | IntegerPacking 
 64 |     | StringArray
 65 | ``` 
 66 | 
 67 | ## Encoding Methods
 68 | 
 69 | ### Byte Array
 70 | 
 71 | Encodes an array of numbers of specified types as raw bytes.
 72 | 
 73 | ```
 74 | ByteArray {
 75 |     kind = "ByteArray"
 76 |     type: Int8 | Int16 | Int32 | Uint8 | Uint16 | Uint32 | Float32 | Float64
 77 | }
 78 | ```
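
A minimal Python sketch of this step, assuming the little-endian byte order stated later in this document. The type names are used here as plain strings for readability; how the ``type`` property is actually serialized is not shown in this sketch, and the function names are hypothetical.

```python
import struct

# Map each type name from the ByteArray definition to a struct format character.
FORMATS = {
    "Int8": "b", "Int16": "h", "Int32": "i",
    "Uint8": "B", "Uint16": "H", "Uint32": "I",
    "Float32": "f", "Float64": "d",
}


def byte_array_encode(values, type_name):
    """Pack the numbers as consecutive little-endian values of the given type."""
    return struct.pack("<%d%s" % (len(values), FORMATS[type_name]), *values)


def byte_array_decode(data, type_name):
    """Unpack raw bytes back into a list of numbers of the given type."""
    fmt = FORMATS[type_name]
    count = len(data) // struct.calcsize("<" + fmt)
    return list(struct.unpack("<%d%s" % (count, fmt), data))


raw = byte_array_encode([1, 2, 3], "Int16")
print(raw.hex())                        # 010002000300
print(byte_array_decode(raw, "Int16"))  # [1, 2, 3]
```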
 79 | 
 80 | ### Fixed Point
 81 | 
 82 | Converts an array of floating point numbers to an array of 32-bit integers multiplied 
 83 | by a given factor. 
 84 | 
 85 | ```
 86 | FixedPoint {
 87 |     kind = "FixedPoint"
 88 |     factor: number
 89 |     srcType: Float32 | Float64
 90 | }
 91 | ```
 92 | 
 93 | #### Example
 94 | 
 95 | ```
 96 | [1.2, 1.23, 0.123] 
 97 | ---FixedPoint---> 
 98 | { factor = 100 } [120, 123, 12] 
 99 | ``` 
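
The example can be reproduced with the sketch below. It assumes values are rounded to the nearest integer after multiplication, which matches the example output but is an assumption about the rounding rule; the function names are hypothetical.

```python
def fixed_point_encode(values, factor):
    """Multiply by the factor and round to the nearest integer (lossy)."""
    return [round(value * factor) for value in values]


def fixed_point_decode(values, factor):
    """Divide by the factor to recover approximate floating point values."""
    return [value / factor for value in values]


encoded = fixed_point_encode([1.2, 1.23, 0.123], 100)
print(encoded)                           # [120, 123, 12]
print(fixed_point_decode(encoded, 100))  # [1.2, 1.23, 0.12]
```

The last value shows the lossy nature of the encoding: digits beyond the precision implied by ``factor`` are dropped.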
100 | 
101 | ### Interval Quantization
102 | 
103 | Converts an array of floating point numbers to an array of 32-bit integers where 
104 | the values are quantized within a given interval into specified number of
105 | discrete steps. Values lower than the minimum value or greater than the 
106 | maximum are represented by the respective boundary values.
107 | 
108 | ```
109 | IntervalQuantization {
110 |     kind = "IntervalQuantization"
111 |     min: number,
112 |     max: number,
113 |     numSteps: number,
114 |     srcType: Float32 | Float64
115 | }
116 | ```
117 | 
118 | #### Example
119 | 
120 | ```
121 | [0.5, 1, 1.5, 2, 3, 1.345 ] 
122 | ---IntervalQuantization---> 
123 | { min = 1, max = 2, numSteps = 3 } [0, 0, 1, 2, 2, 1] 
124 | ``` 
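
One way to reproduce the example is sketched below. It assumes the step size is ``(max - min) / (numSteps - 1)`` and that values are rounded to the nearest step; both assumptions agree with the example output but are not spelled out in the definition above. The function names are hypothetical.

```python
def interval_quantize(values, min_value, max_value, num_steps):
    """Clamp each value to [min, max] and map it to the nearest of num_steps levels."""
    step = (max_value - min_value) / (num_steps - 1)
    out = []
    for value in values:
        clamped = min(max(value, min_value), max_value)
        out.append(int((clamped - min_value) / step + 0.5))  # round half up
    return out


def interval_dequantize(indices, min_value, max_value, num_steps):
    """Map step indices back to the floating point value of each level."""
    step = (max_value - min_value) / (num_steps - 1)
    return [min_value + index * step for index in indices]


encoded = interval_quantize([0.5, 1, 1.5, 2, 3, 1.345], 1, 2, 3)
print(encoded)                                # [0, 0, 1, 2, 2, 1]
print(interval_dequantize(encoded, 1, 2, 3))  # [1.0, 1.0, 1.5, 2.0, 2.0, 1.5]
```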
125 | 
126 | ### Run Length
127 | 
128 | Represents each integer value in the input as a pair of ``(value, number of repeats)``
129 | and stores the result sequentially as an array of 32-bit integers. Additionally,
130 | stores the size of the original array to make decoding easier.
131 | 
132 | ```
133 | RunLength {
134 |     kind = "RunLength"
135 |     srcType: int[]
136 |     srcSize: number
137 | }
138 | ```
139 | 
140 | #### Example
141 | 
142 | ```
143 | [1, 1, 1, 2, 3, 3] 
144 | ---RunLength---> 
145 | { srcSize = 6 } [1, 3, 2, 1, 3, 2]
146 | ```
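
A small sketch of this flat ``value, count, value, count, ...`` layout, using plain Python lists in place of 32-bit integer arrays (function names are hypothetical):

```python
def run_length_encode(values):
    """Return (flat [value, count, ...] list, original size)."""
    packed = []
    for value in values:
        if packed and packed[-2] == value:
            packed[-1] += 1          # extend the current run
        else:
            packed.extend([value, 1])  # start a new run
    return packed, len(values)


def run_length_decode(packed, src_size):
    """Expand the pairs; src_size lets a decoder check or preallocate the output."""
    out = []
    for i in range(0, len(packed), 2):
        out.extend([packed[i]] * packed[i + 1])
    if len(out) != src_size:
        raise ValueError("decoded length does not match srcSize")
    return out


packed, src_size = run_length_encode([1, 1, 1, 2, 3, 3])
print(packed, src_size)                     # [1, 3, 2, 1, 3, 2] 6
print(run_length_decode(packed, src_size))  # [1, 1, 1, 2, 3, 3]
```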
147 | 
148 | ### Delta
149 | 
150 | Stores the input integer array as an array of consecutive differences. 
151 | 
152 | ```
153 | Delta {
154 |     kind = "Delta"
155 |     origin: number
156 |     srcType: int[]
157 | }
158 | ```
159 | 
160 | Because delta encoding is often used in conjunction with integer packing,
161 | the ``origin`` property is present. This is to optimize the case
162 | where the first value is large, but the differences are small. 
163 | 
164 | #### Example
165 | 
166 | ```
167 | [1000, 1003, 1005, 1006] 
168 | ---Delta---> 
169 | { origin = 1000, srcType = Int32 } [0, 3, 2, 1]
170 | ```
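
The example corresponds to the following sketch, which assumes the encoder picks the first value as ``origin`` (as the example does); other choices of ``origin`` would decode equally well. Function names are hypothetical.

```python
def delta_encode(values):
    """Return (origin, differences); the first difference is always 0."""
    if not values:
        return 0, []
    origin = values[0]
    deltas = [0]
    for previous, current in zip(values, values[1:]):
        deltas.append(current - previous)
    return origin, deltas


def delta_decode(origin, deltas):
    """Accumulate the differences, starting from origin."""
    out = []
    current = origin
    for delta in deltas:
        current += delta
        out.append(current)
    return out


origin, deltas = delta_encode([1000, 1003, 1005, 1006])
print(origin, deltas)                # 1000 [0, 3, 2, 1]
print(delta_decode(origin, deltas))  # [1000, 1003, 1005, 1006]
```

Keeping the large first value in ``origin`` means every stored difference fits in 8 bits here, which is what makes the subsequent integer packing effective.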
171 | 
172 | ### Integer Packing
173 | 
174 | Stores a 32-bit integer array using 8- or 16-bit values. Includes the size 
175 | of the input array for easier decoding. The encoding is more effective 
176 | when only unsigned values are provided. 
177 | 
178 | If a value exceeds the maximum value of the 8- or 16-bit type, the maximum value is output as many times as needed, followed by the remainder; the decoder sums these consecutive values to recover the original value.
179 | 
180 | ```
181 | IntegerPacking {
182 |     kind = "IntegerPacking"
183 |     byteCount: number
184 |     srcSize: number
185 |     isUnsigned: boolean
186 | }
187 | ```
188 | 
189 | #### Example (8-bit type output)
190 | 
191 | ```
192 | [1, 2, -3, 128] 
193 | ---IntegerPacking---> 
194 | { byteCount = 1, srcSize = 4, isUnsigned = false } [1, 2, -3, 127, 1]
195 | ```
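
The overflow handling can be sketched as follows. The treatment of negative values (repeating the minimum of the signed type) and of values exactly equal to a limit (followed by an explicit ``0``) is an assumption that mirrors the positive case shown in the example; function names are hypothetical.

```python
def _limits(byte_count, is_unsigned):
    bits = 8 * byte_count
    upper = (1 << bits) - 1 if is_unsigned else (1 << (bits - 1)) - 1
    return upper, -upper - 1


def integer_packing_encode(values, byte_count, is_unsigned):
    """Split values that do not fit into repeated limit values plus a remainder."""
    upper, lower = _limits(byte_count, is_unsigned)
    packed = []
    for value in values:
        if is_unsigned and value < 0:
            raise ValueError("negative value in unsigned packing")
        if value >= 0:
            while value >= upper:
                packed.append(upper)
                value -= upper
        else:
            while value <= lower:
                packed.append(lower)
                value -= lower
        packed.append(value)  # remainder (possibly 0) terminates the value
    return packed


def integer_packing_decode(packed, byte_count, is_unsigned):
    """Sum runs of limit values together with the element that follows them."""
    upper, lower = _limits(byte_count, is_unsigned)
    out, total = [], 0
    for value in packed:
        total += value
        if value != upper and value != lower:
            out.append(total)
            total = 0
    return out


packed = integer_packing_encode([1, 2, -3, 128], 1, False)
print(packed)                                    # [1, 2, -3, 127, 1]
print(integer_packing_decode(packed, 1, False))  # [1, 2, -3, 128]
```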
196 | 
197 | ### String Array
198 | 
199 | Stores an array of strings as a concatenation of all unique strings, an array of offsets
200 | describing where each unique substring starts and ends, and an array of indices that maps
201 | each input string to its substring.
202 | 
203 | ```
204 | StringArray {
205 |     kind = "StringArray"
206 |     dataEncoding: Encoding[]
207 |     stringData: string
208 |     offsetEncoding: Encoding[]
209 |     offsets: Uint8Array
210 | }
211 | ```
212 | 
213 | #### Example
214 | 
215 | ```
216 | ['a','AB','a'] 
217 | ---StringArray---> 
218 | { stringData = 'aAB', offsets = [0, 1, 3] } [0, 1, 0]
219 | ```
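
The example can be reproduced with the sketch below. It leaves the offsets and indices as plain lists; in an actual file they would be further encoded as described by ``offsetEncoding`` and ``dataEncoding``. Function names are hypothetical.

```python
def string_array_encode(strings):
    """Return (stringData, offsets, indices) for a list of strings."""
    index_of = {}     # unique string -> position among the unique strings
    string_data = ""
    offsets = [0]
    indices = []
    for s in strings:
        if s not in index_of:
            index_of[s] = len(index_of)
            string_data += s
            offsets.append(len(string_data))  # end of this string, start of the next
        indices.append(index_of[s])
    return string_data, offsets, indices


def string_array_decode(string_data, offsets, indices):
    """Slice the unique strings out of stringData and look each index up."""
    unique = [string_data[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
    return [unique[i] for i in indices]


string_data, offsets, indices = string_array_encode(["a", "AB", "a"])
print(string_data, offsets, indices)                       # aAB [0, 1, 3] [0, 1, 0]
print(string_array_decode(string_data, offsets, indices))  # ['a', 'AB', 'a']
```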
220 | 
221 | Encoding Process
222 | ----------------
223 | 
224 | To encode the data, a sequence of encoding transformations needs to be specified
225 | for each input column. For example, to encode the ``_atoms.id`` column
226 | from the [Basic Principles](principle.md) chapter, we could specify the encoding as ``[Delta, RunLength, IntegerPacking]``:
227 | 
228 | ```
229 | [1, 2, 3, 4]
230 | ---Delta--->
231 | { origin = 0, srcType = Int32 } [1, 1, 1, 1]
232 | ---RunLength--->
233 | { srcSize = 4 } [1, 4]
234 | ---IntegerPacking--->
235 | { byteCount = 1, srcSize = 2 } [1, 4]
236 | ```
237 | 
238 | **Little endian** is used everywhere to encode the data.
239 | 
240 | Once each column has been encoded and the ``File`` data structure built, the 
241 | [MessagePack](https://msgpack.org/) format (which is more or less a binary encoding of the standard
242 | JSON format) is used to produce the final binary result. 
243 | 
244 | Optionally, the data can be compressed using standard methods such as Gzip to achieve
245 | further compression.
246 | 
247 | Decoding Process
248 | ----------------
249 | 
250 | To decode the BinaryCIF data, first the MessagePack data are decoded and then 
251 | for each column, the binary data are decoded applying inverses of the transformations
252 | specified in the ``encoding`` array in reverse order. So to decode the encoding specified by
253 | ``[Delta, RunLength, IntegerPacking]`` we would first apply the decoding
254 | of ``IntegerPacking``, then ``RunLength``, and finally ``Delta``.
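
A simplified sketch of this reverse traversal is shown below. It works on plain Python lists and dictionaries rather than MessagePack-decoded bytes, covers only the three encodings from the example, and assumes ``origin = 0`` for the ``Delta`` step so that the stored differences are ``[1, 1, 1, 1]``; all function and variable names are hypothetical.

```python
def decode_integer_packing(data, enc):
    bits = 8 * enc["byteCount"]
    upper = (1 << bits) - 1 if enc["isUnsigned"] else (1 << (bits - 1)) - 1
    lower = -upper - 1
    out, total = [], 0
    for value in data:
        total += value
        if value != upper and value != lower:  # a non-limit value ends the sum
            out.append(total)
            total = 0
    return out


def decode_run_length(data, enc):
    out = []
    for i in range(0, len(data), 2):
        out.extend([data[i]] * data[i + 1])
    return out


def decode_delta(data, enc):
    out, current = [], enc["origin"]
    for delta in data:
        current += delta
        out.append(current)
    return out


DECODERS = {
    "IntegerPacking": decode_integer_packing,
    "RunLength": decode_run_length,
    "Delta": decode_delta,
}


def decode(data, encodings):
    """Apply the inverse of each encoding, last one first."""
    for enc in reversed(encodings):
        data = DECODERS[enc["kind"]](data, enc)
    return data


encodings = [
    {"kind": "Delta", "origin": 0, "srcType": "Int32"},
    {"kind": "RunLength", "srcType": "Int32", "srcSize": 4},
    {"kind": "IntegerPacking", "byteCount": 1, "srcSize": 2, "isUnsigned": True},
]
print(decode([1, 4], encodings))  # [1, 2, 3, 4]
```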
255 | 
256 | Masks
257 | -----
258 | 
259 | The `Column.mask: Data | undefined` property records, for each row, whether the field holds a regular CIF value, the `.` token (not present), or the `?` token (unknown). 
260 | 
261 | - An `undefined` mask means all values are present.
262 | - Value `0` corresponds to a value being present.
263 | - Value `1` corresponds to `.` token.
264 | - Value `2` corresponds to `?` token.
265 | 
266 | #### Example
267 | 
268 | The CIF field `x`
269 | 
270 | ```
271 | loop_
272 | _category.x
273 | 1
274 | .
275 | 2
276 | ?
277 | ```
278 | 
279 | could be encoded as 
280 | 
281 | ```js
282 | Column {
283 |   name: 'x'
284 |   data: [1,0,2,0]  // encoded
285 |   mask: [0,1,0,2]  // encoded 
286 | }
287 | ```
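
Once both arrays are decoded, combining them is straightforward. The sketch below uses plain lists and a hypothetical `apply_mask` helper; it returns the CIF tokens `.` and `?` as strings for masked rows, which is one possible convention and not part of the format.

```python
def apply_mask(data, mask):
    """Combine decoded column data with its decoded mask (None means no mask)."""
    if mask is None:
        return list(data)          # every value is present
    special = {1: ".", 2: "?"}     # 1 = not present, 2 = unknown
    return [special.get(flag, value) for value, flag in zip(data, mask)]


print(apply_mask([1, 0, 2, 0], [0, 1, 0, 2]))  # [1, '.', 2, '?']
print(apply_mask([1, 0, 2, 0], None))          # [1, 0, 2, 0]
```

The placeholder values stored in `data` for masked rows (here `0`) carry no meaning and are ignored by the reader.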


--------------------------------------------------------------------------------