├── requirements.txt ├── tests ├── test_input_1_a.bam.bai ├── test_input_1_b.bam.bai ├── test_input_1_c.bam.bai └── bai_indexer_test.py ├── .travis.yml ├── .gitignore ├── setup.py ├── README.md ├── bai_indexer └── __init__.py └── LICENSE /requirements.txt: -------------------------------------------------------------------------------- 1 | nose 2 | -------------------------------------------------------------------------------- /tests/test_input_1_a.bam.bai: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hammerlab/bai-indexer/HEAD/tests/test_input_1_a.bam.bai -------------------------------------------------------------------------------- /tests/test_input_1_b.bam.bai: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hammerlab/bai-indexer/HEAD/tests/test_input_1_b.bam.bai -------------------------------------------------------------------------------- /tests/test_input_1_c.bam.bai: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hammerlab/bai-indexer/HEAD/tests/test_input_1_c.bam.bai -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | python: 3 | - "2.7" 4 | 5 | install: "pip install -r requirements.txt" 6 | script: nosetests 7 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | .Python 10 | env/ 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | lib/ 17 | lib64/ 18 | parts/ 19 | sdist/ 20 | var/ 21 | *.egg-info/ 22 | .installed.cfg 
23 | *.egg 24 | 25 | # PyInstaller 26 | # Usually these files are written by a python script from a template 27 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 28 | *.manifest 29 | *.spec 30 | 31 | # Installer logs 32 | pip-log.txt 33 | pip-delete-this-directory.txt 34 | 35 | # Unit test / coverage reports 36 | htmlcov/ 37 | .tox/ 38 | .coverage 39 | .cache 40 | nosetests.xml 41 | coverage.xml 42 | 43 | # Translations 44 | *.mo 45 | *.pot 46 | 47 | # Django stuff: 48 | *.log 49 | 50 | # Sphinx documentation 51 | docs/_build/ 52 | 53 | # PyBuilder 54 | target/ 55 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | 3 | try: 4 | import pypandoc 5 | description = pypandoc.convert('README.md', 'rst') 6 | except (IOError, ImportError): 7 | description = '' 8 | 9 | setup(name='bai-indexer', 10 | version='0.1.1', 11 | description='An index for your BAM Index (BAI)', 12 | long_description=description, 13 | author='Dan Vanderkam', 14 | author_email='danvdk@gmail.com', 15 | url='https://github.com/hammerlab/bai-indexer/', 16 | entry_points={ 17 | 'console_scripts': [ 18 | 'bai-indexer = bai_indexer:run', 19 | ], 20 | }, 21 | packages=find_packages(exclude=['tests*']), 22 | install_requires=[], 23 | classifiers=[ 24 | 'Environment :: Console', 25 | 'Development Status :: 4 - Beta', 26 | 'Intended Audience :: Healthcare Industry', 27 | 'Intended Audience :: Information Technology', 28 | 'License :: OSI Approved :: Apache Software License', 29 | 'Topic :: Scientific/Engineering :: Bio-Informatics' 30 | ], 31 | ) 32 | -------------------------------------------------------------------------------- /tests/bai_indexer_test.py: -------------------------------------------------------------------------------- 1 | from nose.tools import * 2 | 3 | from bai_indexer import 
index_stream 4 | 5 | import StringIO 6 | 7 | 8 | def test_file_a(): 9 | eq_({ 10 | 'chunks': [(8, 88), (88, 168), (168, 248), (248, 256)], 11 | 'minBlockIndex': 217 12 | }, index_stream(open('tests/test_input_1_a.bam.bai'))) 13 | 14 | 15 | def test_file_b(): 16 | eq_({ 17 | 'chunks': [(8, 16), (16, 96), (96, 176), (176, 184)], 18 | 'minBlockIndex': 224 19 | }, index_stream(open('tests/test_input_1_b.bam.bai'))) 20 | 21 | 22 | def test_file_c(): 23 | eq_({ 24 | 'chunks': [(8, 88), (88, 168)], 25 | 'minBlockIndex': 177 26 | }, index_stream(open('tests/test_input_1_c.bam.bai'))) 27 | 28 | 29 | def test_stdin(): 30 | # sys.stdin only has a read() method. 31 | stream = open('tests/test_input_1_c.bam.bai') 32 | class FakeStdin(object): 33 | read = stream.read 34 | 35 | eq_({ 36 | 'chunks': [(8, 88), (88, 168)], 37 | 'minBlockIndex': 177 38 | }, index_stream(FakeStdin())) 39 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Build Status](https://travis-ci.org/hammerlab/bai-indexer.svg?branch=master)](https://travis-ci.org/hammerlab/bai-indexer) 2 | 3 | bai-indexer 4 | =========== 5 | 6 | Build an index for your BAM Index (BAI). 7 | 8 | Background 9 | ---------- 10 | 11 | [BAM][1] is a common file format for storing aligned reads from a gene 12 | sequencing machine. These files can get enormous (100+ GB), so it's helpful to 13 | have an index to support fast lookup. 14 | 15 | [Samtools][2] defines a file format for a BAM index and provides a simple 16 | command for generating one: 17 | 18 | ``` 19 | samtools index file.bam file.bam.bai 20 | ``` 21 | 22 | Unfortunately, these BAM Index (BAI) files can _also_ grow very large, often to 23 | 10 MB or more. When using a genome browser like [IGV][3] or [BioDalliance][4], 24 | loading a large BAI file over a slow network is the unavoidable first step in 25 | displaying alignment tracks. 
26 | 27 | bai-indexer solves this problem by building an index of your BAM Index. This is 28 | a small JSON file which maps reference ID (i.e. chromosome number) to a byte 29 | range within the BAI file. By loading this JSON index first, a viewer can load only the 30 | small subset of the BAM index that it actually needs. 31 | 32 | Usage 33 | ----- 34 | 35 | pip install bai-indexer 36 | 37 | bai-indexer path/to/file.bam.bai > path/to/file.bam.bai.json 38 | 39 | Format 40 | ------ 41 | 42 | The JSON index looks like this: 43 | 44 | ```json 45 | { 46 | "chunks": [ 47 | [8, 716520], 48 | [716520, 1463832], 49 | [1463832, 2070072], 50 | ... 51 | ], 52 | "minBlockIndex": 1234 53 | } 54 | ``` 55 | 56 | The first chunk (`[8, 716520]`) specifies the byte range in the BAI file which 57 | describes the first ref (most likely `chr1` for a human genome). This is a 58 | half-open `[start, stop)` interval. 59 | 60 | The `minBlockIndex` field specifies the position of the first block in the BAM 61 | file. Everything before this position is headers. 62 | 63 | Development 64 | ----------- 65 | 66 | After setting up a virtualenv, you can get going by running: 67 | 68 | ```bash 69 | pip install -r requirements.txt 70 | nosetests 71 | ``` 72 | 73 | 74 | [1]: https://github.com/samtools/hts-specs 75 | [2]: http://www.htslib.org/ 76 | [3]: http://www.broadinstitute.org/igv/ 77 | [4]: http://www.biodalliance.org/ 78 | -------------------------------------------------------------------------------- /bai_indexer/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Print out start:stop locations for each reference in a BAI file. 
3 | 4 | Usage: 5 | bai_indexer.py /path/to/file.bam.bai > /path/to/file.bam.bai.json 6 | """ 7 | 8 | import json 9 | import struct 10 | import sys 11 | 12 | # -- helper functions for reading binary data from a stream -- 13 | 14 | def _unpack(stream, fmt): 15 | size = struct.calcsize(fmt) 16 | buf = stream.read(size) 17 | return struct.unpack(fmt, buf)[0] 18 | 19 | 20 | def _read_int32(stream): 21 | return _unpack(stream, '