├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
BSD 2-Clause License

Copyright (c) 2018, BUStools
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# The BUS format specification

The BUS format is a binary format for storing intermediate results for single cell RNA-Seq datasets. This repository details the specification of the format.

The motivation and example usage of the BUS format are described in

P Melsted, V Ntranos, L Pachter, [The Barcode, UMI, Set format and BUStools](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz279/5487510), Bioinformatics, btz279, 2019.


### Tools

#### BUS generation

- [kallisto](https://pachterlab.github.io/kallisto) version 0.45.0 and later

#### BUS file manipulation

- [bustools](https://github.com/BUStools/bustools)

#### BUS parsing

- BUS [R notebooks](https://github.com/BUStools/BUS_notebooks_R) and [Python notebooks](https://github.com/BUStools/BUS_notebooks_python)

### Format specification

A BUS file is a binary file consisting of a header followed by zero or more BUS records. The BUS header consists of the following elements, in order:

|Field name | Description | Type | Value |
|-----------|-------------|------|-------|
| magic | fixed magic string | char[4] | BUS\0 |
| version | BUS format version | uint32_t | |
| bc_len | Barcode length [1-32] | uint32_t | |
| umi_len | UMI length [1-32] | uint32_t | |
| tlen | Length of plain text header | uint32_t | |
| text | Plain text header | char[tlen] | |

The encoding used for the text header is ASCII.

BUS records are stored directly after the header in the following format. The fields of a record occupy 28 bytes, and each record is padded to 32 bytes by adding 32 unused bits at its end.

|Field name | Description | Type |
|-----------|-------------|------|
| barcode | 2-bit encoded barcode | uint64_t |
| umi | 2-bit encoded UMI | uint64_t |
| ec | equivalence class | int32_t |
| count | fragment count | uint32_t |
| flags | flags | uint32_t |

The flags field can be used to store extra information; the BUS format specification neither restricts nor defines its content.

The 2-bit encoding maps `A,C,G,T` to `00,01,10,11`, with the first nucleotide stored in the most significant of the bits used. For example, the barcode `GCCA` corresponds to the bit code `10010100`, or the integer `148`. All integers are encoded as little-endian.

--------------------------------------------------------------------------------
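
The header and record layouts specified in README.md above can be exercised with a short round trip. The following is a minimal sketch in Python using only the standard library, not the reference implementation used by kallisto or bustools. The function names (`encode_2bit`, `decode_2bit`, `write_bus`, `read_bus`) are hypothetical, the version value `1` is a placeholder because the specification does not list version numbers, and the 32 unused bits are assumed to be written as zero bytes after `flags`.

```python
import io
import struct

NUCLEOTIDES = "ACGT"  # index gives the 2-bit code: A=00, C=01, G=10, T=11

# Little-endian layouts taken from the two tables in the specification.
HEADER_FIXED = struct.Struct("<4sIIII")  # magic, version, bc_len, umi_len, tlen
RECORD = struct.Struct("<QQiII4x")       # barcode, umi, ec, count, flags, 4 pad bytes


def encode_2bit(seq):
    """Pack a nucleotide string into an integer, first base in the highest used bits."""
    value = 0
    for base in seq:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value


def decode_2bit(value, length):
    """Unpack `length` nucleotides from a 2-bit encoded integer."""
    shifts = range(2 * (length - 1), -1, -2)
    return "".join(NUCLEOTIDES[(value >> shift) & 3] for shift in shifts)


def write_bus(stream, version, bc_len, umi_len, text, records):
    """Write a header followed by (barcode, umi, ec, count, flags) records."""
    text_bytes = text.encode("ascii")
    stream.write(HEADER_FIXED.pack(b"BUS\0", version, bc_len, umi_len, len(text_bytes)))
    stream.write(text_bytes)
    for barcode, umi, ec, count, flags in records:
        stream.write(RECORD.pack(encode_2bit(barcode), encode_2bit(umi), ec, count, flags))


def read_bus(stream):
    """Read the header and all records; returns (header dict, list of record tuples)."""
    magic, version, bc_len, umi_len, tlen = HEADER_FIXED.unpack(stream.read(HEADER_FIXED.size))
    if magic != b"BUS\0":
        raise ValueError("not a BUS file")
    header = {
        "version": version,
        "bc_len": bc_len,
        "umi_len": umi_len,
        "text": stream.read(tlen).decode("ascii"),
    }
    records = []
    while True:
        chunk = stream.read(RECORD.size)
        if not chunk:
            break
        if len(chunk) < RECORD.size:
            raise ValueError("truncated BUS record")
        barcode, umi, ec, count, flags = RECORD.unpack(chunk)
        records.append((decode_2bit(barcode, bc_len), decode_2bit(umi, umi_len), ec, count, flags))
    return header, records


# Round trip through an in-memory file (version=1 is a placeholder value).
buf = io.BytesIO()
write_bus(buf, version=1, bc_len=4, umi_len=3, text="example",
          records=[("GCCA", "TTA", 7, 2, 0)])
buf.seek(0)
header, records = read_bus(buf)
print(encode_2bit("GCCA"))  # 148, the example value given in the specification
print(header, records)
```

Using `struct` with the `<` prefix disables native alignment, so the record size comes directly from the field list plus the explicit `4x` padding, which makes the 32-byte record size easy to verify against the table.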