├── LICENSE ├── README.md ├── dataset-notebooks ├── 10x_hgmm_100_python │ └── 10x_hgmm_100.ipynb ├── 10x_hgmm_6k_v2chem_python │ └── 10x_hgmm_6k_v2chem.ipynb ├── 10x_neuron_1k_v2chem_python │ ├── 10x_neuron_1k_v2chem.ipynb │ └── busparser.py ├── 10x_neuron_1k_v3chem_python │ └── 10x_neuron_1k_v3chem.ipynb ├── 10x_pbmc_1k_v2chem_python │ └── 10x_pbmc_1k_v2chem.ipynb ├── 10x_pbmc_1k_v3chem_python │ └── 10x_pbmc_1k_v3chem.ipynb ├── 10x_t4k_TCC │ └── bus2tcc-gene.ipynb ├── cell_hashing_citeseq_GSM2895284 │ └── extract_hashtags_HTO_data_SRR8281307.ipynb ├── celseq1_GSE62270_python │ └── celseq1_organoid.ipynb ├── dropseq_GSE63472_python │ └── dropseq_visual_cortex.ipynb ├── indrops_GSM2746895_python │ └── indrops_brain_activity.ipynb └── seqwell_GSE92495_python │ └── kallisto_seqwell_pbmc.ipynb └── utils ├── get_dicts-from-ref.ipynb └── transcript2gene.py /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 2-Clause License 2 | 3 | Copyright (c) 2018, BUStools 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 17 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 18 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 19 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 20 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 21 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 22 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 23 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 24 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 25 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # BUS python notebooks 2 | 3 | This repository contains example python notebooks for parsing and processing BUS format single-cell RNA-seq files. 4 | To run the notebooks, make sure you have `kallisto >= 0.45` and `bustools` installed. The source and binaries are available at: 5 | 6 | [kallisto](https://pachterlab.github.io/kallisto/download) 7 | 8 | [bustools](https://github.com/BUStools/bustools) 9 | 10 | ## Getting started 11 | 12 | We recommend beginners work through this notebook: 13 | #### [10x v2 chemistry - 6k Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells](https://github.com/BUStools/BUS_notebooks_python/blob/master/dataset-notebooks/10x_hgmm_6k_v2chem_python/10x_hgmm_6k_v2chem.ipynb) 14 | 15 | ## Complete notebooks 16 | These notebooks can be used to completely process datasets, starting with downloading the raw data all the way to basic QC plots. They are intended as tutorials on the use of the BUS format. 
17 | 18 | #### [10x v2 chemistry - 100 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/10x_hgmm_100_python/10x_hgmm_100.ipynb) 19 | 20 | #### [10x v2 chemistry - 1k Brain Cells from an E18 Mouse](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/10x_neuron_1k_v2chem_python/10x_neuron_1k_v2chem.ipynb) 21 | 22 | #### [10x v3 chemistry - 1k Brain Cells from an E18 Mouse](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/10x_neuron_1k_v3chem_python/10x_neuron_1k_v3chem.ipynb) 23 | 24 | 25 | 26 | 27 | ## Notebooks in progress 28 | These notebooks are still a work in progress and may not have all the code needed to download and process data automatically. 29 | 30 | #### [10x v2 chemistry - 1k PBMCs from a Healthy Donor](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/10x_pbmc_1k_v2chem_python/10x_pbmc_1k_v2chem.ipynb) 31 | 32 | #### [10x v3 chemistry - 1k PBMCs from a Healthy Donor](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/10x_pbmc_1k_v3chem_python/10x_pbmc_1k_v3chem.ipynb) 33 | 34 | #### [dropseq - GSE63472 mouse retina](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/dropseq_GSE63472_python/dropseq_visual_cortex.ipynb) 35 | 36 | #### [celseq1 - GSE62270 mouse intestinal cells](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/celseq1_GSE62270_python/celseq1_organoid.ipynb) 37 | 38 | #### [inDrops - GSE102827 mouse visual cortex](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/indrops_GSM2746895_python/indrops_brain_activity.ipynb) 39 | 40 | #### [Seq-Well - GSE92495 HEK/3T3 mixing, PBMCs, and TB-exposed Macrophages](https://github.com/BUStools/bustools-notebooks/blob/master/dataset-notebooks/seqwell_GSE92495_python/kallisto_seqwell_pbmc.ipynb) 41 | 42 | #### 
[10x_v2_chemistry - 4k Pan T Cells from a Healthy Donor (From BUS to TCC and GC Matrix example)](https://github.com/BUStools/BUS_notebooks_python/blob/master/dataset-notebooks/10x_t4k_TCC/bus2tcc-gene.ipynb) 43 | -------------------------------------------------------------------------------- /dataset-notebooks/10x_hgmm_6k_v2chem_python/10x_hgmm_6k_v2chem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 7, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import matplotlib\n", 10 | "import numpy as np\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "import sys, collections, os, argparse\n", 13 | "%matplotlib inline \n", 14 | "%config InlineBackend.figure_format = 'retina'\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "# Download the 10x Dataset `6k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells`\n", 22 | "\n", 23 | "10x datasets are available at\n", 24 | "https://support.10xgenomics.com/single-cell-gene-expression/datasets\n", 25 | "\n", 26 | "The page for the `6k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells` dataset is\n", 27 | "https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/hgmm_6k\n", 28 | "\n", 29 | "The FASTQ files (38G) can be downloaded with `curl` directly from http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/hgmm_6k/hgmm_6k_fastqs.tar\n", 30 | "\n", 31 | "In the cell below we check if the dataset file `hgmm_6k_fastqs.tar` already exists. 
If not, we download the dataset to the same directory as this notebook.\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 8, 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "Dataset already downloaded!\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "#Check if the file was downloaded already before doing curl:\n", 49 | "if not (os.path.isfile('./hgmm_6k_fastqs.tar')): \n", 50 | " # the `!` means we're running a command line statement (rather than python) \n", 51 | " !curl -O http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/2.1.0/hgmm_6k/hgmm_6k_fastqs.tar\n", 52 | "else: print('Dataset already downloaded!')" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "## Untar the FASTQ files into the `fastqs` folder\n", 60 | "Note that this dataset was sequenced across 8 10x lanes (L001-L008).\n", 61 | "Hence it has 24 files: read 1 (R1), read 2 (R2), and index (I1) for each lane " 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 9, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "fastqs/\r\n", 74 | "tar: fastqs: skipping existing file\r\n", 75 | "fastqs/hgmm_6k_S1_L001_I1_001.fastq.gz\r\n", 76 | "tar: fastqs/hgmm_6k_S1_L001_I1_001.fastq.gz: skipping existing file\r\n", 77 | "fastqs/hgmm_6k_S1_L001_R1_001.fastq.gz\r\n", 78 | "tar: fastqs/hgmm_6k_S1_L001_R1_001.fastq.gz: skipping existing file\r\n", 79 | "fastqs/hgmm_6k_S1_L001_R2_001.fastq.gz\r\n", 80 | "tar: fastqs/hgmm_6k_S1_L001_R2_001.fastq.gz: skipping existing file\r\n", 81 | "fastqs/hgmm_6k_S1_L002_I1_001.fastq.gz\r\n", 82 | "tar: fastqs/hgmm_6k_S1_L002_I1_001.fastq.gz: skipping existing file\r\n", 83 | "fastqs/hgmm_6k_S1_L002_R1_001.fastq.gz\r\n", 84 | "tar: fastqs/hgmm_6k_S1_L002_R1_001.fastq.gz: skipping existing file\r\n", 85 | "fastqs/hgmm_6k_S1_L002_R2_001.fastq.gz\r\n", 86 
| "tar: fastqs/hgmm_6k_S1_L002_R2_001.fastq.gz: skipping existing file\r\n", 87 | "fastqs/hgmm_6k_S1_L003_I1_001.fastq.gz\r\n", 88 | "tar: fastqs/hgmm_6k_S1_L003_I1_001.fastq.gz: skipping existing file\r\n", 89 | "fastqs/hgmm_6k_S1_L003_R1_001.fastq.gz\r\n", 90 | "tar: fastqs/hgmm_6k_S1_L003_R1_001.fastq.gz: skipping existing file\r\n", 91 | "fastqs/hgmm_6k_S1_L003_R2_001.fastq.gz\r\n", 92 | "tar: fastqs/hgmm_6k_S1_L003_R2_001.fastq.gz: skipping existing file\r\n", 93 | "fastqs/hgmm_6k_S1_L004_I1_001.fastq.gz\r\n", 94 | "tar: fastqs/hgmm_6k_S1_L004_I1_001.fastq.gz: skipping existing file\r\n", 95 | "fastqs/hgmm_6k_S1_L004_R1_001.fastq.gz\r\n", 96 | "tar: fastqs/hgmm_6k_S1_L004_R1_001.fastq.gz: skipping existing file\r\n", 97 | "fastqs/hgmm_6k_S1_L004_R2_001.fastq.gz\r\n", 98 | "tar: fastqs/hgmm_6k_S1_L004_R2_001.fastq.gz: skipping existing file\r\n", 99 | "fastqs/hgmm_6k_S1_L005_I1_001.fastq.gz\r\n", 100 | "tar: fastqs/hgmm_6k_S1_L005_I1_001.fastq.gz: skipping existing file\r\n", 101 | "fastqs/hgmm_6k_S1_L005_R1_001.fastq.gz\r\n", 102 | "tar: fastqs/hgmm_6k_S1_L005_R1_001.fastq.gz: skipping existing file\r\n", 103 | "fastqs/hgmm_6k_S1_L005_R2_001.fastq.gz\r\n", 104 | "tar: fastqs/hgmm_6k_S1_L005_R2_001.fastq.gz: skipping existing file\r\n", 105 | "fastqs/hgmm_6k_S1_L006_I1_001.fastq.gz\r\n", 106 | "tar: fastqs/hgmm_6k_S1_L006_I1_001.fastq.gz: skipping existing file\r\n", 107 | "fastqs/hgmm_6k_S1_L006_R1_001.fastq.gz\r\n", 108 | "tar: fastqs/hgmm_6k_S1_L006_R1_001.fastq.gz: skipping existing file\r\n", 109 | "fastqs/hgmm_6k_S1_L006_R2_001.fastq.gz\r\n", 110 | "tar: fastqs/hgmm_6k_S1_L006_R2_001.fastq.gz: skipping existing file\r\n", 111 | "fastqs/hgmm_6k_S1_L007_I1_001.fastq.gz\r\n", 112 | "tar: fastqs/hgmm_6k_S1_L007_I1_001.fastq.gz: skipping existing file\r\n", 113 | "fastqs/hgmm_6k_S1_L007_R1_001.fastq.gz\r\n", 114 | "tar: fastqs/hgmm_6k_S1_L007_R1_001.fastq.gz: skipping existing file\r\n", 115 | "fastqs/hgmm_6k_S1_L007_R2_001.fastq.gz\r\n", 116 | "tar: 
fastqs/hgmm_6k_S1_L007_R2_001.fastq.gz: skipping existing file\r\n", 117 | "fastqs/hgmm_6k_S1_L008_I1_001.fastq.gz\r\n", 118 | "tar: fastqs/hgmm_6k_S1_L008_I1_001.fastq.gz: skipping existing file\r\n", 119 | "fastqs/hgmm_6k_S1_L008_R1_001.fastq.gz\r\n", 120 | "tar: fastqs/hgmm_6k_S1_L008_R1_001.fastq.gz: skipping existing file\r\n", 121 | "fastqs/hgmm_6k_S1_L008_R2_001.fastq.gz\r\n", 122 | "tar: fastqs/hgmm_6k_S1_L008_R2_001.fastq.gz: skipping existing file\r\n" 123 | ] 124 | } 125 | ], 126 | "source": [ 127 | "!tar --skip-old-files -xvf ./hgmm_6k_fastqs.tar" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "# Building the kallisto index\n", 135 | "\n", 136 | "First make sure that kallisto is installed and that the version is at least 0.45\n", 137 | "\n", 138 | "If it's not installed, see instructions at https://pachterlab.github.io/kallisto/download" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 10, 144 | "metadata": {}, 145 | "outputs": [ 146 | { 147 | "name": "stdout", 148 | "output_type": "stream", 149 | "text": [ 150 | "kallisto, version 0.45.0\r\n" 151 | ] 152 | } 153 | ], 154 | "source": [ 155 | "!kallisto version " 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "First we build the kallisto index for the dataset. \n", 163 | "For this index in particular, because this is a species mixing experiment, we have to download the human and mouse transcriptomes, concatenate them, and then build the index.\n", 164 | "Building the index takes a few minutes and needs to be done only once.\n", 165 | "\n", 166 | "### Download human and mouse reference transcriptomes from ensembl\n", 167 | "In order to do that we first download the human and mouse transcriptomes from ensembl. 
You can see the reference genomes they have at https://uswest.ensembl.org/info/data/ftp/index.html" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 11, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "Mouse transcriptome already downloaded!\n", 180 | "Human transcriptome already downloaded!\n" 181 | ] 182 | } 183 | ], 184 | "source": [ 185 | "#Check if the file was downloaded already before doing curl:\n", 186 | "if not (os.path.isfile('Mus_musculus.GRCm38.cdna.all.fa.gz')): \n", 187 | " # the `!` means we're running a command line statement (rather than python) \n", 188 | " !curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz\n", 189 | "else: print('Mouse transcriptome already downloaded!')\n", 190 | "\n", 191 | "if not (os.path.isfile('Homo_sapiens.GRCh38.cdna.all.fa.gz')): \n", 192 | " # the `!` means we're running a command line statement (rather than python) \n", 193 | " !curl -O ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz\n", 194 | "else: print('Human transcriptome already downloaded!')\n" 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "execution_count": 12, 200 | "metadata": {}, 201 | "outputs": [ 202 | { 203 | "name": "stdout", 204 | "output_type": "stream", 205 | "text": [ 206 | "Human and mouse transcriptomes concatenated!\n" 207 | ] 208 | } 209 | ], 210 | "source": [ 211 | "#concatenate the human and mouse transcriptomes\n", 212 | "!zcat Homo_sapiens.GRCh38.cdna.all.fa.gz Mus_musculus.GRCm38.cdna.all.fa.gz | gzip -1 - > human_mouse_contatenated_transcriptome.fa.gz\n", 213 | "print('Human and mouse transcriptomes concatenated!')" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 13, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "name": "stdout", 223 | "output_type": "stream", 224 | "text": [ 225 | 
"Human-mouse transcript index already exist!\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "### Now we can build the index\n", 231 | "if not (os.path.isfile('human_mouse_transcriptome_index.idx')): \n", 232 | " !kallisto index -i human_mouse_transcriptome_index.idx human_mouse_contatenated_transcriptome.fa.gz\n", 233 | "else: print ('Human-mouse transcript index already exist!')" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "# Preparing the transcript_to_gene.tsv file to process the single cell data with kallisto bus" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "Depending on which transcriptome you used, you will need to create a file translating transcripts to genes. This notebook assumes the file is named `transcript_to_gene.tsv`; for ensembl transcriptomes it can be generated using biomart.\n", 248 | "\n", 249 | "The general format of `transcript_to_gene.tsv` is\n", 250 | "\n", 251 | "```\n", 252 | "ENST00000632684.1\tENSG00000282431.1\n", 253 | "ENST00000434970.2\tENSG00000237235.2\n", 254 | "ENST00000448914.1\tENSG00000228985.1\n", 255 | "ENST00000415118.1\tENSG00000223997.1\n", 256 | "ENST00000631435.1\tENSG00000282253.1\n", 257 | "...\n", 258 | "```\n", 259 | "\n", 260 | "To create the `transcript_to_gene.tsv` file we fetch and parse the mouse and human GTF files from ensembl.\n", 261 | "\n", 262 | "The reference GTF files are available at https://uswest.ensembl.org/info/data/ftp/index.html\n", 263 | "\n", 264 | "The mouse files we use are at ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus and the human files are at ftp://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 14, 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "name": "stdout", 274 | "output_type": "stream", 275 | "text": [ 276 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 277 | 
" Dload Upload Total Spent Left Speed\n", 278 | "100 28.0M 100 28.0M 0 0 1593k 0 
0:00:18 0:00:18 --:--:-- 2902k\n", 279 | " % Total % Received % Xferd Average Speed Time Time Time Current\n", 280 | " Dload Upload Total Spent Left Speed\n", 281 | "100 41.6M 100 41.6M 0 0 1303k 0 0:00:32 0:00:32 --:--:-- 1751k\n" 282 | ] 283 | } 284 | ], 285 | "source": [ 286 | "#Check if the file (compressed or already unzipped) is present before doing curl:\n", 287 | "if not (os.path.isfile('Mus_musculus.GRCm38.94.gtf.gz') or os.path.isfile('Mus_musculus.GRCm38.94.gtf')): \n", 288 | " # the `!` means we're running a command line statement (rather than python) \n", 289 | " !curl -O ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz\n", 290 | "else: print('Mouse GTF already downloaded!')\n", 291 | " \n", 292 | " \n", 293 | "#Check if the file (compressed or already unzipped) is present before doing curl:\n", 294 | "if not (os.path.isfile('Homo_sapiens.GRCh38.94.gtf.gz') or os.path.isfile('Homo_sapiens.GRCh38.94.gtf')): \n", 295 | " # the `!` means we're running a command line statement (rather than python) \n", 296 | " !curl -O ftp://ftp.ensembl.org/pub/release-94/gtf/homo_sapiens/Homo_sapiens.GRCh38.94.gtf.gz\n", 297 | "else: print('Human GTF already downloaded!')\n" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 15, 303 | "metadata": {}, 304 | "outputs": [ 305 | { 306 | "name": "stdout", 307 | "output_type": "stream", 308 | "text": [ 309 | "./Mus_musculus.GRCm38.94.gtf.gz:\t 96.2% -- replaced with ./Mus_musculus.GRCm38.94.gtf\n", 310 | "./Homo_sapiens.GRCh38.94.gtf.gz:\t 96.2% -- replaced with ./Homo_sapiens.GRCh38.94.gtf\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "# Unzip the files\n", 316 | "!gunzip -v -f ./Mus_musculus.GRCm38.94.gtf.gz\n", 317 | "!gunzip -v -f ./Homo_sapiens.GRCh38.94.gtf.gz" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 16, 323 | "metadata": {}, 324 | "outputs": [ 325 | { 326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | 
"Human and mouse GTF files 
concatenated!\n" 330 | ] 331 | } 332 | ], 333 | "source": [ 334 | "# concatenate the GTF files\n", 335 | "!cat ./Mus_musculus.GRCm38.94.gtf ./Homo_sapiens.GRCh38.94.gtf > ./human_mouse_contatenated_GTF.gtf\n", 336 | "print('Human and mouse GTF files concatenated!')" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "## Create transcript_to_gene.tsv\n", 344 | "\n", 345 | "Now we can use the cells below to parse the GTF file and keep only the transcript mapping as a tsv file in the format below.\n", 346 | "```\n", 347 | "ENST00000632684.1\tENSG00000282431.1\n", 348 | "ENST00000434970.2\tENSG00000237235.2\n", 349 | "ENST00000448914.1\tENSG00000228985.1\n", 350 | "```" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 17, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "def create_transcript_list(input, use_name = True, use_version = True):\n", 360 | " r = {}\n", 361 | " for line in input:\n", 362 | " if len(line) == 0 or line[0] == '#':\n", 363 | " continue\n", 364 | " l = line.strip().split('\\t')\n", 365 | " if l[2] == 'transcript':\n", 366 | " info = l[8]\n", 367 | " d = {}\n", 368 | " for x in info.split('; '):\n", 369 | " x = x.strip()\n", 370 | " p = x.find(' ')\n", 371 | " if p == -1:\n", 372 | " continue\n", 373 | " k = x[:p]\n", 374 | " p = x.find('\"',p)\n", 375 | " p2 = x.find('\"',p+1)\n", 376 | " v = x[p+1:p2]\n", 377 | " d[k] = v\n", 378 | "\n", 379 | "\n", 380 | " if 'transcript_id' not in d or 'gene_id' not in d:\n", 381 | " continue\n", 382 | "\n", 383 | " tid = d['transcript_id']\n", 384 | " gid = d['gene_id']\n", 385 | " if use_version:\n", 386 | " if 'transcript_version' not in d or 'gene_version' not in d:\n", 387 | " continue\n", 388 | "\n", 389 | " tid += '.' + d['transcript_version']\n", 390 | " gid += '.' 
+ d['gene_version']\n", 391 | " gname = None\n", 392 | " if use_name:\n", 393 | " if 'gene_name' not in d:\n", 394 | " continue\n", 395 | " gname = d['gene_name']\n", 396 | "\n", 397 | " if tid in r:\n", 398 | " continue\n", 399 | "\n", 400 | " r[tid] = (gid, gname)\n", 401 | " return r\n", 402 | "\n", 403 | "\n", 404 | "\n", 405 | "def print_output(output, r, use_name = True):\n", 406 | " for tid in r:\n", 407 | " if use_name:\n", 408 | " output.write(\"%s\\t%s\\t%s\\n\"%(tid, r[tid][0], r[tid][1]))\n", 409 | " else:\n", 410 | " output.write(\"%s\\t%s\\n\"%(tid, r[tid][0]))" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 18, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "name": "stdout", 420 | "output_type": "stream", 421 | "text": [ 422 | "Created human_mouse_transcript_to_gene.tsv file\n" 423 | ] 424 | } 425 | ], 426 | "source": [ 427 | "with open('./human_mouse_contatenated_GTF.gtf') as file:\n", 428 | " r = create_transcript_list(file, use_name = True, use_version = True)\n", 429 | "with open('human_mouse_transcript_to_gene.tsv', \"w+\") as output:\n", 430 | " print_output(output, r, use_name = True)\n", 431 | "print('Created human_mouse_transcript_to_gene.tsv file')" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 19, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "name": "stdout", 441 | "output_type": "stream", 442 | "text": [ 443 | "Created mouse_transcript_to_gene.tsv file\n" 444 | ] 445 | } 446 | ], 447 | "source": [ 448 | "with open('./Mus_musculus.GRCm38.94.gtf') as file:\n", 449 | " r = create_transcript_list(file, use_name = True, use_version = True)\n", 450 | "with open('mouse_transcript_to_gene.tsv', \"w+\") as output:\n", 451 | " print_output(output, r, use_name = True)\n", 452 | "print('Created mouse_transcript_to_gene.tsv file')" 453 | ] 454 | }, 455 | { 456 | "cell_type": "code", 457 | "execution_count": 20, 458 | "metadata": {}, 459 | "outputs": [ 460 | { 461 | "name": 
"stdout", 462 | "output_type": "stream", 463 | "text": [ 464 | "Created human_transcript_to_gene.tsv file\n" 465 | ] 466 | } 467 | ], 468 | "source": [ 469 | "with open('./Homo_sapiens.GRCh38.94.gtf') as file:\n", 470 | " r = create_transcript_list(file, use_name = True, use_version = True)\n", 471 | "with open('human_transcript_to_gene.tsv', \"w+\") as output:\n", 472 | " print_output(output, r, use_name = True)\n", 473 | "print('Created human_transcript_to_gene.tsv file')" 474 | ] 475 | }, 476 | { 477 | "cell_type": "markdown", 478 | "metadata": {}, 479 | "source": [ 480 | "# Run kallisto bus\n", 481 | "kallisto bus supports several single cell sequencing technologies, as you can see below. We'll be using 10xv2 " 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": 21, 487 | "metadata": {}, 488 | "outputs": [ 489 | { 490 | "name": "stdout", 491 | "output_type": "stream", 492 | "text": [ 493 | "List of supported single cell technologies\r\n", 494 | "\r\n", 495 | "short name description\r\n", 496 | "---------- -----------\r\n", 497 | "10Xv1 10X chemistry version 1\r\n", 498 | "10Xv2 10X chemistry verison 2\r\n", 499 | "DropSeq DropSeq\r\n", 500 | "inDrop inDrop\r\n", 501 | "CELSeq CEL-Seq\r\n", 502 | "CELSeq2 CEL-Seq version 2\r\n", 503 | "SCRBSeq SCRB-Seq\r\n", 504 | "\r\n" 505 | ] 506 | } 507 | ], 508 | "source": [ 509 | "!kallisto bus --list" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "We are using paired end reads (R1 and R2 files) with 8 samples" 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": 37, 522 | "metadata": {}, 523 | "outputs": [ 524 | { 525 | "name": "stdout", 526 | "output_type": "stream", 527 | "text": [ 528 | "\n", 529 | "[index] k-mer length: 31\n", 530 | "[index] number of targets: 302,896\n", 531 | "[index] number of k-mers: 206,125,466\n", 532 | "[index] number of equivalence classes: 1,252,306\n", 533 | "[quant] will process sample 
1: ./fastqs/hgmm_6k_S1_L001_R1_001.fastq.gz\n", 534 | " ./fastqs/hgmm_6k_S1_L001_R2_001.fastq.gz\n", 535 | "[quant] will process sample 2: ./fastqs/hgmm_6k_S1_L002_R1_001.fastq.gz\n", 536 | " ./fastqs/hgmm_6k_S1_L002_R2_001.fastq.gz\n", 537 | "[quant] will process sample 3: ./fastqs/hgmm_6k_S1_L003_R1_001.fastq.gz\n", 538 | " ./fastqs/hgmm_6k_S1_L003_R2_001.fastq.gz\n", 539 | "[quant] will process sample 4: ./fastqs/hgmm_6k_S1_L004_R1_001.fastq.gz\n", 540 | " ./fastqs/hgmm_6k_S1_L004_R2_001.fastq.gz\n", 541 | "[quant] will process sample 5: ./fastqs/hgmm_6k_S1_L005_R1_001.fastq.gz\n", 542 | " ./fastqs/hgmm_6k_S1_L005_R2_001.fastq.gz\n", 543 | "[quant] will process sample 6: ./fastqs/hgmm_6k_S1_L006_R1_001.fastq.gz\n", 544 | " ./fastqs/hgmm_6k_S1_L006_R2_001.fastq.gz\n", 545 | "[quant] will process sample 7: ./fastqs/hgmm_6k_S1_L007_R1_001.fastq.gz\n", 546 | " ./fastqs/hgmm_6k_S1_L007_R2_001.fastq.gz\n", 547 | "[quant] will process sample 8: ./fastqs/hgmm_6k_S1_L008_R1_001.fastq.gz\n", 548 | " ./fastqs/hgmm_6k_S1_L008_R2_001.fastq.gz\n", 549 | "[quant] finding pseudoalignments for the reads ... 
done\n", 550 | "[quant] processed 381,992,071 reads, 312,541,673 reads pseudoaligned\n" 551 | ] 552 | } 553 | ], 554 | "source": [ 555 | "!kallisto bus -i human_mouse_transcriptome_index.idx -o out_hgmm_6k -x 10xv2 -t 8 \\\n", 556 | "./fastqs/hgmm_6k_S1_L001_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L001_R2_001.fastq.gz \\\n", 557 | "./fastqs/hgmm_6k_S1_L002_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L002_R2_001.fastq.gz \\\n", 558 | "./fastqs/hgmm_6k_S1_L003_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L003_R2_001.fastq.gz \\\n", 559 | "./fastqs/hgmm_6k_S1_L004_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L004_R2_001.fastq.gz \\\n", 560 | "./fastqs/hgmm_6k_S1_L005_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L005_R2_001.fastq.gz \\\n", 561 | "./fastqs/hgmm_6k_S1_L006_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L006_R2_001.fastq.gz \\\n", 562 | "./fastqs/hgmm_6k_S1_L007_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L007_R2_001.fastq.gz \\\n", 563 | "./fastqs/hgmm_6k_S1_L008_R1_001.fastq.gz ./fastqs/hgmm_6k_S1_L008_R2_001.fastq.gz " 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "metadata": {}, 569 | "source": [ 570 | "### The `matrix.ec` file\n", 571 | "\n", 572 | "The `matrix.ec` is generated by kallisto and connects the equivalence class ids to sets of transcripts. 
The format looks like\n", 573 | "~~~\n", 574 | "0\t0\n", 575 | "1\t1\n", 576 | "2\t2\n", 577 | "3\t3\n", 578 | "4\t4\n", 579 | "...\n", 580 | "\n", 581 | "884398\t26558,53383,53384,69915,69931,85319,109252,125730\n", 582 | "884399\t7750,35941,114698,119265\n", 583 | "884400\t9585,70083,92571,138545,138546\n", 584 | "884401\t90512,90513,134202,159456\n", 585 | "~~~" 586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": 38, 591 | "metadata": {}, 592 | "outputs": [], 593 | "source": [ 594 | "#load transcript to gene file\n", 595 | "tr2g = {}\n", 596 | "trlist = []\n", 597 | "with open('./human_mouse_transcript_to_gene.tsv') as f:\n", 598 | " for line in f:\n", 599 | " l = line.split()\n", 600 | " tr2g[l[0]] = l[1]\n", 601 | " trlist.append(l[0])\n", 602 | "\n", 603 | "genes = list(set(tr2g[t] for t in tr2g))\n", 604 | "\n", 605 | "# load equivalence classes\n", 606 | "ecs = {}\n", 607 | "with open('./out_hgmm_6k/matrix.ec') as f:\n", 608 | " for line in f:\n", 609 | " l = line.split()\n", 610 | " ec = int(l[0])\n", 611 | " trs = [int(x) for x in l[1].split(',')]\n", 612 | " ecs[ec] = trs\n", 613 | " \n", 614 | "def ec2g(ec):\n", 615 | " if ec in ecs:\n", 616 | " return list(set(tr2g[trlist[t]] for t in ecs[ec])) \n", 617 | " else:\n", 618 | " return []" 619 | ] 620 | }, 621 | { 622 | "cell_type": "markdown", 623 | "metadata": {}, 624 | "source": [ 625 | "### Processing the BUS file\n", 626 | "\n", 627 | "For these notebooks we will work with the text file that `BUStools` produces, rather than the raw `BUS` file. 
\n", 628 | "To install `BUStools` see https://github.com/BUStools/bustools\n", 629 | "\n", 630 | "We discard any barcodes with fewer than 10 UMIs \n", 631 | "\n", 632 | "To produce the text file, starting with the `output.bus` file produced by kallisto, we first sort it with bustools:\n", 633 | "```\n", 634 | "bustools sort -o output.sorted output.bus\n", 635 | "```\n", 636 | "Then we convert it to txt:\n", 637 | "```\n", 638 | "bustools text -o output.sorted.txt output.sorted\n", 639 | "```\n", 640 | "\n" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": 39, 646 | "metadata": {}, 647 | "outputs": [ 648 | { 649 | "name": "stdout", 650 | "output_type": "stream", 651 | "text": [ 652 | "Read in 312541673 number of busrecords\n", 653 | "All sorted\n" 654 | ] 655 | } 656 | ], 657 | "source": [ 658 | "#sort bus file\n", 659 | "!bustools sort -o ./out_hgmm_6k/output_sorted.bus ./out_hgmm_6k/output.bus" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 40, 665 | "metadata": {}, 666 | "outputs": [ 667 | { 668 | "name": "stdout", 669 | "output_type": "stream", 670 | "text": [ 671 | "Read in 262641790 number of busrecords\r\n" 672 | ] 673 | } 674 | ], 675 | "source": [ 676 | "# convert the sorted busfile to txt\n", 677 | "!bustools text -o ./out_hgmm_6k/output_sorted.txt ./out_hgmm_6k/output_sorted.bus" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": {}, 683 | "source": [ 684 | "# Loading the generated data " 685 | ] 686 | }, 687 | { 688 | "cell_type": "code", 689 | "execution_count": 41, 690 | "metadata": {}, 691 | "outputs": [], 692 | "source": [ 693 | "#load human_mouse transcripts\n", 694 | "\n", 695 | "tr2g = {}\n", 696 | "trlist = []\n", 697 | "with open('./human_mouse_transcript_to_gene.tsv') as f:\n", 698 | " for line in f:\n", 699 | " l = line.split()\n", 700 | " tr2g[l[0]] = l[1]\n", 701 | " trlist.append(l[0])\n", 702 | "\n", 703 | "genes = list(set(tr2g[t] for t in tr2g))\n", 704 | 
"\n", 705 | "# load equivalence classes\n", 706 | "ecs = {}\n", 707 | "with open('./out_hgmm_6k/matrix.ec') as f:\n", 708 | " for line in f:\n", 709 | " l = line.split()\n", 710 | " ec = int(l[0])\n", 711 | " trs = [int(x) for x in l[1].split(',')]\n", 712 | " ecs[ec] = trs\n", 713 | " \n", 714 | "def ec2g(ec):\n", 715 | " if ec in ecs:\n", 716 | " return list(set(tr2g[trlist[t]] for t in ecs[ec])) \n", 717 | " else:\n", 718 | " return []" 719 | ] 720 | }, 721 | { 722 | "cell_type": "code", 723 | "execution_count": null, 724 | "metadata": {}, 725 | "outputs": [], 726 | "source": [ 727 | "# load kallisto bus output dataset\n", 728 | "\n", 729 | "cell_gene = collections.defaultdict(lambda: collections.defaultdict(float))\n", 730 | "pbar=None\n", 731 | "pumi=None\n", 732 | "with open('./out_hgmm_6k/output_sorted.txt') as f:\n", 733 | " gs = set()\n", 734 | " for line in f:\n", 735 | " l = line.split()\n", 736 | " barcode,umi,ec,count = line.split()\n", 737 | " ec = int(ec)\n", 738 | " \n", 739 | " if barcode == pbar:\n", 740 | " # same barcode\n", 741 | " if umi == pumi:\n", 742 | " # same UMI, let's update with intersection of genelist\n", 743 | " gl = ec2g(ec)\n", 744 | " gs.intersection_update(gl)\n", 745 | " else:\n", 746 | " # new UMI, process the previous gene set\n", 747 | " for g in gs:\n", 748 | " cell_gene[barcode][g] += 1.0/len(gs)\n", 749 | " # record new umi, reset gene set\n", 750 | " pumi = umi\n", 751 | " gs = set(ec2g(ec))\n", 752 | " else:\n", 753 | " # work with previous gene list\n", 754 | " for g in gs:\n", 755 | " cell_gene[pbar][g] += 1.0/len(gs)\n", 756 | " \n", 757 | " if sum(cell_gene[pbar][g] for g in cell_gene[pbar]) < 10:\n", 758 | " del cell_gene[pbar]\n", 759 | " \n", 760 | " pbar = barcode\n", 761 | " pumi = umi\n", 762 | " \n", 763 | " gs = set(ec2g(ec))\n", 764 | " #remember the last gene\n", 765 | " for g in gs:\n", 766 | " cell_gene[pbar][g] += 1.0/len(gs)\n", 767 | " \n", 768 | " if sum(cell_gene[pbar][g] for g in cell_gene[pbar]) 
< 10:\n", 769 | " del cell_gene[pbar]\n", 770 | "\n" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": 49, 776 | "metadata": {}, 777 | "outputs": [], 778 | "source": [ 779 | "barcode_hist = collections.defaultdict(int)\n", 780 | "for barcode in cell_gene:\n", 781 | " cg = cell_gene[barcode]\n", 782 | " s = len([cg[g] for g in cg])\n", 783 | " barcode_hist[barcode] += s\n", 784 | " " 785 | ] 786 | }, 787 | { 788 | "cell_type": "markdown", 789 | "metadata": {}, 790 | "source": [ 791 | "# Take a look at the detected barcodes and genes" 792 | ] 793 | }, 794 | { 795 | "cell_type": "code", 796 | "execution_count": 50, 797 | "metadata": { 798 | "scrolled": true 799 | }, 800 | "outputs": [ 801 | { 802 | "name": "stdout", 803 | "output_type": "stream", 804 | "text": [ 805 | "376418\n" 806 | ] 807 | }, 808 | { 809 | "data": { 810 | "image/png": "\n", 811 | "text/plain": [ 812 | "
" 813 | ] 814 | }, 815 | "metadata": { 816 | "image/png": { 817 | "height": 372, 818 | "width": 556 819 | } 820 | }, 821 | "output_type": "display_data" 822 | } 823 | ], 824 | "source": [ 825 | "threshold = 0 # this filters the data by gene count\n", 826 | "bcv = [x for b,x in barcode_hist.items() if x > 0] \n", 827 | "_ = plt.hist(bcv,bins=100, log=True)\n", 828 | "plt.rcParams[\"figure.figsize\"] = [9,6]\n", 829 | "plt.xlabel(\"Number of gene counts\")\n", 830 | "plt.ylabel(\"Number of barcodes\")\n", 831 | "plt.grid(True)\n", 832 | "print(len(bcv))" 833 | ] 834 | }, 835 | { 836 | "cell_type": "markdown", 837 | "metadata": {}, 838 | "source": [ 839 | "# Export count data as `.mtx`" 840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": 51, 845 | "metadata": {}, 846 | "outputs": [], 847 | "source": [ 848 | "outfile = './out_hgmm_6k/matrix.mtx'\n", 849 | "\n", 850 | "gene_to_id = dict((g,i+1) for i,g in enumerate(genes))\n", 851 | "barcodes_to_use = [b for b,x in barcode_hist.items() if x > 500 and x < 10000]\n", 852 | "\n", 853 | "num_entries = 0\n", 854 | "for barcode in barcodes_to_use:\n", 855 | " num_entries += len([x for x in cell_gene[barcode].values() if round(x)>0])\n", 856 | "\n" 857 | ] 858 | }, 859 | { 860 | "cell_type": "code", 861 | "execution_count": 52, 862 | "metadata": {}, 863 | "outputs": [], 864 | "source": [ 865 | "with open(outfile, 'w') as of:\n", 866 | " of.write('%%MatrixMarket matrix coordinate real general\\n%\\n')\n", 867 | " #number of genes\n", 868 | " of.write(\"%d %d %d\\n\"%(len(genes), len(barcodes_to_use), num_entries))\n", 869 | " bcid = 0\n", 870 | " for barcode in barcodes_to_use:\n", 871 | " bcid += 1\n", 872 | " cg = cell_gene[barcode]\n", 873 | " gl = [(gene_to_id[g],round(cg[g])) for g in cg if round(cg[g]) > 0]\n", 874 | " gl.sort()\n", 875 | " for x in gl:\n", 876 | " of.write(\"%d %d %d\\n\"%(x[0],bcid,x[1]))\n", 877 | " " 878 | ] 879 | }, 880 | { 881 | "cell_type": "code", 882 | "execution_count": 
53, 883 | "metadata": {}, 884 | "outputs": [], 885 | "source": [ 886 | "gene_names = {}\n", 887 | "with open('./human_mouse_transcript_to_gene.tsv') as f:\n", 888 | " f.readline()\n", 889 | " for line in f:\n", 890 | " g,t,gn = line.split()\n", 891 | " gene_names[g] = gn\n" 892 | ] 893 | }, 894 | { 895 | "cell_type": "code", 896 | "execution_count": 54, 897 | "metadata": {}, 898 | "outputs": [], 899 | "source": [ 900 | "id_to_genes = dict((i,g) for (g,i) in gene_to_id.items())\n", 901 | "gl = []\n", 902 | "for i in range(1,len(genes)+1):\n", 903 | " g = id_to_genes[i]\n", 904 | " gid = g[:g.find('.')]\n", 905 | " if gid in gene_names:\n", 906 | " gn = gene_names[gid]\n", 907 | " else:\n", 908 | " gn = ''\n", 909 | " gl.append((g,gn))\n", 910 | "\n", 911 | "with open('./out_hgmm_6k/genes.tsv','w') as of:\n", 912 | " for g,gn in gl:\n", 913 | " of.write(\"%s\\t%s\\n\"%(g,gn))\n", 914 | " \n", 915 | "with open('./out_hgmm_6k/barcodes.tsv','w') as of:\n", 916 | " of.write('\\n'.join(x + '-1' for x in barcodes_to_use))\n", 917 | " of.write('\\n')" 918 | ] 919 | } 920 | ], 921 | "metadata": { 922 | "kernelspec": { 923 | "display_name": "Python 3", 924 | "language": "python", 925 | "name": "python3" 926 | }, 927 | "language_info": { 928 | "codemirror_mode": { 929 | "name": "ipython", 930 | "version": 3 931 | }, 932 | "file_extension": ".py", 933 | "mimetype": "text/x-python", 934 | "name": "python", 935 | "nbconvert_exporter": "python", 936 | "pygments_lexer": "ipython3", 937 | "version": "3.6.6" 938 | } 939 | }, 940 | "nbformat": 4, 941 | "nbformat_minor": 2 942 | } 943 | -------------------------------------------------------------------------------- /dataset-notebooks/10x_neuron_1k_v2chem_python/busparser.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | import argparse 3 | parser = argparse.ArgumentParser() 4 | parser.add_argument("bus_dir", help=": kallisto bus output directory") 5 | 
parser.add_argument("t2g", help=": a 3-column file with transcript, gene_ID and gene name information") 6 | parser.add_argument("gene_min", help=": minimal number of genes detected", type=int) 7 | parser.add_argument("gene_max", help=": maximal number of genes detected", type=int) 8 | 9 | args = parser.parse_args() 10 | #setup working directory 11 | import os 12 | os.chdir(args.bus_dir) 13 | 14 | from subprocess import call 15 | import numpy as np 16 | import matplotlib.pyplot as plt 17 | import pandas as pd 18 | import sys, collections 19 | 20 | tr2g = {} 21 | trlist = [] 22 | with open(args.t2g) as f: 23 | for line in f: 24 | l = line.split() 25 | tr2g[l[0]] = l[1] 26 | trlist.append(l[0]) 27 | 28 | genes = list(set(tr2g[t] for t in tr2g)) 29 | 30 | # load equivalence classes 31 | ecs = {} 32 | with open('./matrix.ec') as f: 33 | for line in f: 34 | l = line.split() 35 | ec = int(l[0]) 36 | trs = [int(x) for x in l[1].split(',')] 37 | ecs[ec] = trs 38 | 39 | def ec2g(ec): 40 | if ec in ecs: 41 | return list(set(tr2g[trlist[t]] for t in ecs[ec])) 42 | else: 43 | return [] 44 | 45 | cell_gene = collections.defaultdict(lambda: collections.defaultdict(float)) 46 | pbar=None 47 | pumi=None 48 | with open('./output.sorted.txt') as f: 49 | gs = set() 50 | for line in f: 51 | l = line.split() 52 | barcode,umi,ec,count = line.split() 53 | ec = int(ec) 54 | 55 | if barcode == pbar: 56 | # same barcode 57 | if umi == pumi: 58 | # same UMI, let's update with intersection of genelist 59 | gl = ec2g(ec) 60 | gs.intersection_update(gl) 61 | else: 62 | # new UMI, process the previous gene set 63 | for g in gs: 64 | cell_gene[barcode][g] += 1.0/len(gs) 65 | # record new umi, reset gene set 66 | pumi = umi 67 | gs = set(ec2g(ec)) 68 | else: 69 | # work with previous gene list 70 | for g in gs: 71 | cell_gene[pbar][g] += 1.0/len(gs) 72 | 73 | if sum(cell_gene[pbar][g] for g in cell_gene[pbar]) < 10: 74 | del cell_gene[pbar] 75 | 76 | pbar = barcode 77 | pumi = umi 78 | 79 | gs = 
set(ec2g(ec)) 80 | 81 | for g in gs: 82 | cell_gene[pbar][g] += 1.0/len(gs) 83 | 84 | if sum(cell_gene[pbar][g] for g in cell_gene[pbar]) < 10: 85 | del cell_gene[pbar] 86 | 87 | barcode_hist = collections.defaultdict(int) 88 | for barcode in cell_gene: 89 | cg = cell_gene[barcode] 90 | s = len([cg[g] for g in cg]) 91 | barcode_hist[barcode] += s 92 | 93 | #Output a gene count histogram 94 | bcv = [x for b,x in barcode_hist.items() if x > args.gene_min and x < args.gene_max] 95 | plt.switch_backend('agg') 96 | fig = plt.figure() 97 | ax = fig.add_subplot(111) 98 | ax.hist(bcv,bins=100) 99 | ax.set_title("Histogram") 100 | plt.xlabel("number of genes detected") 101 | plt.ylabel("number of barcodes") 102 | fig.savefig('gene_hist.png') 103 | 104 | outfile = './matrix.mtx' 105 | 106 | gene_to_id = dict((g,i+1) for i,g in enumerate(genes)) 107 | barcodes_to_use = [b for b,x in barcode_hist.items() if x > args.gene_min and x < args.gene_max] 108 | 109 | num_entries = 0 110 | for barcode in barcodes_to_use: 111 | num_entries += len([x for x in cell_gene[barcode].values() if x>0]) 112 | 113 | with open(outfile, 'w') as of: 114 | of.write('%%MatrixMarket matrix coordinate real general\n%\n') 115 | #number of genes 116 | of.write("%d %d %d\n"%(len(genes), len(barcodes_to_use), round(num_entries))) 117 | bcid = 0 118 | for barcode in barcodes_to_use: 119 | bcid += 1 120 | cg = cell_gene[barcode] 121 | gl = [(gene_to_id[g],cg[g]) for g in cg if cg[g] > 0] 122 | gl.sort() 123 | for x in gl: 124 | of.write("%d %d %f\n"%(x[0],bcid,x[1])) 125 | 126 | gene_names = {} 127 | with open(args.t2g) as f: 128 | f.readline() 129 | for line in f: 130 | t,g,gn = line.split() 131 | gene_names[g] = gn 132 | 133 | id_to_genes = dict((i,g) for (g,i) in gene_to_id.items()) 134 | gl = [] 135 | for i in range(1,len(genes)+1): 136 | g = id_to_genes[i] 137 | gid = g 138 | # gid = g[:g.find('.')] 139 | if gid in gene_names: 140 | gn = gene_names[gid] 141 | else: 142 | gn = '' 143 | gl.append((g,gn)) 
144 | 145 | with open('./genes.tsv','w') as of: 146 | for g,gn in gl: 147 | of.write("%s\t%s\n"%(g,gn)) 148 | 149 | with open('./barcodes.tsv','w') as of: 150 | of.write('\n'.join(x + '' for x in barcodes_to_use)) 151 | of.write('\n') 152 | 153 | -------------------------------------------------------------------------------- /dataset-notebooks/10x_neuron_1k_v3chem_python/10x_neuron_1k_v3chem.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import matplotlib\n", 10 | "import numpy as np\n", 11 | "import matplotlib.pyplot as plt\n", 12 | "import sys, collections, os, argparse\n", 13 | "%matplotlib inline " 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Download the 10x Dataset `1k Brain Cells from an E18 Mouse (v3 chemistry)`\n", 21 | "\n", 22 | "10x datasets are available at\n", 23 | "https://support.10xgenomics.com/single-cell-gene-expression/datasets\n", 24 | "\n", 25 | "The page for the `1k Brain Cells from an E18 Mouse (v3 chemistry)` dataset is\n", 26 | "https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/neuron_1k_v3\n", 27 | "\n", 28 | "But the FASTQ files (7.4GB) can be downloaded with `wget` directly (without giving them email info) from http://cf.10xgenomics.com/samples/cell-exp/3.0.0/neuron_1k_v3/neuron_1k_v3_fastqs.tar\n", 29 | "\n", 30 | "In the cell below we check if the dataset file `neuron_1k_v3_fastqs.tar` already exists. 
If not, we download the dataset to the same directory as this notebook.\n" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 2, 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | "--2018-12-03 17:29:29-- http://cf.10xgenomics.com/samples/cell-exp/3.0.0/neuron_1k_v3/neuron_1k_v3_fastqs.tar\n", 43 | "Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 13.35.99.77, 13.35.99.80, 13.35.99.113, ...\n", 44 | "Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|13.35.99.77|:80... connected.\n", 45 | "HTTP request sent, awaiting response... 200 OK\n", 46 | "Length: 7086786560 (6.6G) [application/x-tar]\n", 47 | "Saving to: ‘neuron_1k_v3_fastqs.tar’\n", 48 | "\n", 49 | "100%[====================================>] 7,086,786,560 85.3MB/s in 78s \n", 50 | "\n", 51 | "2018-12-03 17:30:52 (86.8 MB/s) - ‘neuron_1k_v3_fastqs.tar’ saved [7086786560/7086786560]\n", 52 | "\n" 53 | ] 54 | } 55 | ], 56 | "source": [ 57 | "#Check if the file was downloaded already before doing wget:\n", 58 | "if not (os.path.isfile('./neuron_1k_v3_fastqs.tar')): \n", 59 | " # the `!` means we're running a command line statement (rather than python) \n", 60 | " !wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/neuron_1k_v3/neuron_1k_v3_fastqs.tar\n", 61 | "else: print('Dataset already downloaded!')\n" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "This dataset was run on two lanes, so it contains two pairs of FASTQ files. kallisto bus reads gzipped FASTQ files directly and accepts several read pairs in one call, so further down we pass both lanes as they are, without uncompressing or concatenating them." 
69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 3, 74 | "metadata": {}, 75 | "outputs": [ 76 | { 77 | "name": "stdout", 78 | "output_type": "stream", 79 | "text": [ 80 | "neuron_1k_v3_fastqs/\n", 81 | "neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_I1_001.fastq.gz\n", 82 | "neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R2_001.fastq.gz\n", 83 | "neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R1_001.fastq.gz\n", 84 | "neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R1_001.fastq.gz\n", 85 | "neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_I1_001.fastq.gz\n", 86 | "neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R2_001.fastq.gz\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "# now we untar the fastq files into the neuron_1k_v3_fastqs folder\n", 92 | "!tar -xvf ./neuron_1k_v3_fastqs.tar" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "# Building the kallisto index\n", 100 | "\n", 101 | "First make sure that kallisto is installed and that the version is 0.45 or later.\n", 102 | "\n", 103 | "If it's not installed, see instructions at https://pachterlab.github.io/kallisto/download" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 4, 109 | "metadata": {}, 110 | "outputs": [ 111 | { 112 | "name": "stdout", 113 | "output_type": "stream", 114 | "text": [ 115 | "kallisto, version 0.45.0\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "!kallisto version " 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "First we build the kallisto index for the dataset. 
\n", 128 | "The index is built from the published reference transcriptome for each organism.\n", 129 | "Building the index takes a few minutes and needs to be done only once for each organism.\n", 130 | "\n", 131 | "### Download reference transcriptome from Ensembl\n", 132 | "To do that, we first download the mouse transcriptome from Ensembl. The available reference files are listed at https://uswest.ensembl.org/info/data/ftp/index.html" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "name": "stdout", 142 | "output_type": "stream", 143 | "text": [ 144 | "--2018-12-03 17:31:00-- ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz\n", 145 | " => ‘Mus_musculus.GRCm38.cdna.all.fa.gz’\n", 146 | "Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8\n", 147 | "Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.\n", 148 | "Logging in as anonymous ... Logged in!\n", 149 | "==> SYST ... done. ==> PWD ... done.\n", 150 | "==> TYPE I ... done. ==> CWD (1) /pub/release-94/fasta/mus_musculus/cdna ... done.\n", 151 | "==> SIZE Mus_musculus.GRCm38.cdna.all.fa.gz ... 50809568\n", 152 | "==> PASV ... done. ==> RETR Mus_musculus.GRCm38.cdna.all.fa.gz ... 
done.\n", 153 | "Length: 50809568 (48M) (unauthoritative)\n", 154 | "\n", 155 | "100%[======================================>] 50,809,568 3.24MB/s in 15s \n", 156 | "\n", 157 | "2018-12-03 17:31:22 (3.26 MB/s) - ‘Mus_musculus.GRCm38.cdna.all.fa.gz’ saved [50809568]\n", 158 | "\n" 159 | ] 160 | } 161 | ], 162 | "source": [ 163 | "#Check if the file was downloaded already before doing wget:\n", 164 | "if not (os.path.isfile('Mus_musculus.GRCm38.cdna.all.fa.gz')): \n", 165 | " # the `!` means we're running a command line statement (rather than python) \n", 166 | " !wget ftp://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz\n", 167 | "else: print('Mouse transcriptome already downloaded!')\n" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 6, 173 | "metadata": {}, 174 | "outputs": [ 175 | { 176 | "name": "stdout", 177 | "output_type": "stream", 178 | "text": [ 179 | "\n", 180 | "[build] loading fasta file Mus_musculus.GRCm38.cdna.all.fa.gz\n", 181 | "[build] k-mer length: 31\n", 182 | "[build] warning: clipped off poly-A tail (longer than 10)\n", 183 | " from 600 target sequences\n", 184 | "[build] warning: replaced 3 non-ACGUT characters in the input sequence\n", 185 | " with pseudorandom nucleotides\n", 186 | "[build] counting k-mers ... done.\n", 187 | "[build] building target de Bruijn graph ... done \n", 188 | "[build] creating equivalence classes ... 
done\n", 189 | "[build] target de Bruijn graph has 711215 contigs and contains 98989067 k-mers \n", 190 | "\n" 191 | ] 192 | } 193 | ], 194 | "source": [ 195 | "### Now we can build the index\n", 196 | "if not (os.path.isfile('mouse_transcripts.idx')): \n", 197 | " !kallisto index -i mouse_transcripts.idx Mus_musculus.GRCm38.cdna.all.fa.gz\n", 198 | "else: print ('Mouse transcript index already exists!')" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "# Preparing the transcript_to_gene.tsv file to process the single cell data with kallisto bus" 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "Depending on which transcriptome you used, you will need to create a file that maps transcripts to genes. This notebook assumes the file is named `transcript_to_gene.tsv`. For Ensembl transcriptomes it can be generated using BioMart.\n", 213 | "\n", 214 | "The general format of `transcript_to_gene.tsv` is\n", 215 | "\n", 216 | "```\n", 217 | "ENST00000632684.1\tENSG00000282431.1\n", 218 | "ENST00000434970.2\tENSG00000237235.2\n", 219 | "ENST00000448914.1\tENSG00000228985.1\n", 220 | "ENST00000415118.1\tENSG00000223997.1\n", 221 | "ENST00000631435.1\tENSG00000282253.1\n", 222 | "...\n", 223 | "```\n", 224 | "\n", 225 | "To create `transcript_to_gene.tsv` we fetch and parse the mouse GTF file from Ensembl.\n", 226 | "\n", 227 | "The reference GTF files are available at https://uswest.ensembl.org/info/data/ftp/index.html\n", 228 | "\n", 229 | "The mouse files, which we use here, are at ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 7, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "name": "stdout", 239 | "output_type": "stream", 240 | "text": [ 241 | "--2018-12-03 17:36:33-- ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz\n", 242 | " => 
‘Mus_musculus.GRCm38.94.gtf.gz’\n", 243 | "Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.8\n", 244 | "Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.8|:21... connected.\n", 245 | "Logging in as anonymous ... Logged in!\n", 246 | "==> SYST ... done. ==> PWD ... done.\n", 247 | "==> TYPE I ... done. ==> CWD (1) /pub/release-94/gtf/mus_musculus ... done.\n", 248 | "==> SIZE Mus_musculus.GRCm38.94.gtf.gz ... 29397390\n", 249 | "==> PASV ... done. ==> RETR Mus_musculus.GRCm38.94.gtf.gz ... done.\n", 250 | "Length: 29397390 (28M) (unauthoritative)\n", 251 | "\n", 252 | "100%[======================================>] 29,397,390 2.01MB/s in 15s \n", 253 | "\n", 254 | "2018-12-03 17:36:55 (1.90 MB/s) - ‘Mus_musculus.GRCm38.94.gtf.gz’ saved [29397390]\n", 255 | "\n" 256 | ] 257 | } 258 | ], 259 | "source": [ 260 | "#Check if the file was downloaded already before doing wget:\n", 261 | "if not (os.path.isfile('Mus_musculus.GRCm38.94.gtf.gz') or os.path.isfile('Mus_musculus.GRCm38.94.gtf')): \n", 262 | " # the `!` means we're running a command line statement (rather than python) \n", 263 | " !wget ftp://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz\n", 264 | "else: print('Mouse GTF file already downloaded!')\n" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 8, 270 | "metadata": {}, 271 | "outputs": [], 272 | "source": [ 273 | "# Unzip the file\n", 274 | "!gunzip ./Mus_musculus.GRCm38.94.gtf.gz" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "## Create transcript_to_gene.tsv\n", 282 | "\n", 283 | "Now we can use the cells below to parse the GTF file and keep only the transcript-to-gene mapping, written as a tsv file in the format below.\n", 284 | "```\n", 285 | "ENST00000632684.1\tENSG00000282431.1\n", 286 | "ENST00000434970.2\tENSG00000237235.2\n", 287 | "ENST00000448914.1\tENSG00000228985.1\n", 288 | "```" 289 | ] 290 | }, 291 | { 292 | "cell_type": 
"code", 293 | "execution_count": 9, 294 | "metadata": {}, 295 | "outputs": [], 296 | "source": [ 297 | "def create_transcript_list(input, use_name = False, use_version = True):\n", 298 | " r = {}\n", 299 | " for line in input:\n", 300 | " if len(line) == 0 or line[0] == '#':\n", 301 | " continue\n", 302 | " l = line.strip().split('\\t')\n", 303 | " if l[2] == 'transcript':\n", 304 | " info = l[8]\n", 305 | " d = {}\n", 306 | " for x in info.split('; '):\n", 307 | " x = x.strip()\n", 308 | " p = x.find(' ')\n", 309 | " if p == -1:\n", 310 | " continue\n", 311 | " k = x[:p]\n", 312 | " p = x.find('\"',p)\n", 313 | " p2 = x.find('\"',p+1)\n", 314 | " v = x[p+1:p2]\n", 315 | " d[k] = v\n", 316 | "\n", 317 | "\n", 318 | " if 'transcript_id' not in d or 'gene_id' not in d:\n", 319 | " continue\n", 320 | "\n", 321 | " tid = d['transcript_id']\n", 322 | " gid = d['gene_id']\n", 323 | " if use_version:\n", 324 | " if 'transcript_version' not in d or 'gene_version' not in d:\n", 325 | " continue\n", 326 | "\n", 327 | " tid += '.' + d['transcript_version']\n", 328 | " gid += '.' 
+ d['gene_version']\n", 329 | " gname = None\n", 330 | " if use_name:\n", 331 | " if 'gene_name' not in d:\n", 332 | " continue\n", 333 | " gname = d['gene_name']\n", 334 | "\n", 335 | " if tid in r:\n", 336 | " continue\n", 337 | "\n", 338 | " r[tid] = (gid, gname)\n", 339 | " return r\n", 340 | "\n", 341 | "\n", 342 | "\n", 343 | "def print_output(output, r, use_name = True):\n", 344 | " for tid in r:\n", 345 | " if use_name:\n", 346 | " output.write(\"%s\\t%s\\t%s\\n\"%(tid, r[tid][0], r[tid][1]))\n", 347 | " else:\n", 348 | " output.write(\"%s\\t%s\\n\"%(tid, r[tid][0]))" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 10, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "Created transcript_to_gene.tsv file\n" 361 | ] 362 | } 363 | ], 364 | "source": [ 365 | "with open('./Mus_musculus.GRCm38.94.gtf') as file:\n", 366 | " r = create_transcript_list(file, use_name = False, use_version = True)\n", 367 | "with open('transcript_to_gene.tsv', \"w+\") as output:\n", 368 | " print_output(output, r, use_name = False)\n", 369 | "print('Created transcript_to_gene.tsv file')" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "# Run kallisto bus\n", 377 | "kallisto bus supports several single cell sequencing technologies, as you can see below. 
We'll be using 10xv3 " 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": 16, 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "name": "stdout", 387 | "output_type": "stream", 388 | "text": [ 389 | "List of supported single cell technologies\n", 390 | "\n", 391 | "short name description\n", 392 | "---------- -----------\n", 393 | "10xv1 10x version 1 chemistry\n", 394 | "10xv2 10x version 2 chemistry\n", 395 | "10xv3 10x version 3 chemistry\n", 396 | "CELSeq CEL-Seq\n", 397 | "CELSeq2 CEL-Seq version 2\n", 398 | "DropSeq DropSeq\n", 399 | "inDrops inDrops\n", 400 | "SCRBSeq SCRB-Seq\n", 401 | "SureCell SureCell for ddSEQ\n", 402 | "\n" 403 | ] 404 | } 405 | ], 406 | "source": [ 407 | "!kallisto bus --list" 408 | ] 409 | }, 410 | { 411 | "cell_type": "code", 412 | "execution_count": 21, 413 | "metadata": {}, 414 | "outputs": [ 415 | { 416 | "name": "stdout", 417 | "output_type": "stream", 418 | "text": [ 419 | "\n", 420 | "[index] k-mer length: 31\n", 421 | "[index] number of targets: 115,270\n", 422 | "[index] number of k-mers: 98,989,067\n", 423 | "[index] number of equivalence classes: 419,171\n", 424 | "[quant] will process sample 1: ./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R1_001.fastq.gz\n", 425 | " ./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R2_001.fastq.gz\n", 426 | "[quant] will process sample 2: ./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R1_001.fastq.gz\n", 427 | " ./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R2_001.fastq.gz\n", 428 | "[quant] finding pseudoalignments for the reads ... 
done\n", 429 | "[quant] processed 92,902,231 reads, 58,058,974 reads pseudoaligned\n" 430 | ] 431 | } 432 | ], 433 | "source": [ 434 | "!kallisto bus -i mouse_transcripts.idx -o out_1k_mouse_brain_v3 -x 10xv3 -t 4 \\\n", 435 | "./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R1_001.fastq.gz \\\n", 436 | "./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L001_R2_001.fastq.gz \\\n", 437 | "./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R1_001.fastq.gz \\\n", 438 | "./neuron_1k_v3_fastqs/neuron_1k_v3_S1_L002_R2_001.fastq.gz" 439 | ] 440 | }, 441 | { 442 | "cell_type": "markdown", 443 | "metadata": {}, 444 | "source": [ 445 | "### The `matrix.ec` file\n", 446 | "\n", 447 | "The `matrix.ec` file is generated by kallisto and maps each equivalence class id to a set of transcripts. The format looks like\n", 448 | "~~~\n", 449 | "0\t0\n", 450 | "1\t1\n", 451 | "2\t2\n", 452 | "3\t3\n", 453 | "4\t4\n", 454 | "...\n", 455 | "\n", 456 | "884398\t26558,53383,53384,69915,69931,85319,109252,125730\n", 457 | "884399\t7750,35941,114698,119265\n", 458 | "884400\t9585,70083,92571,138545,138546\n", 459 | "884401\t90512,90513,134202,159456\n", 460 | "~~~" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": 22, 466 | "metadata": {}, 467 | "outputs": [], 468 | "source": [ 469 | "#load transcript to gene file\n", 470 | "tr2g = {}\n", 471 | "trlist = []\n", 472 | "with open('./transcript_to_gene.tsv') as f:\n", 473 | " for line in f:\n", 474 | " l = line.split()\n", 475 | " tr2g[l[0]] = l[1]\n", 476 | " trlist.append(l[0])\n", 477 | "\n", 478 | "genes = list(set(tr2g[t] for t in tr2g))\n", 479 | "\n", 480 | "# load equivalence classes\n", 481 | "ecs = {}\n", 482 | "with open('./out_1k_mouse_brain_v3/matrix.ec') as f:\n", 483 | " for line in f:\n", 484 | " l = line.split()\n", 485 | " ec = int(l[0])\n", 486 | " trs = [int(x) for x in l[1].split(',')]\n", 487 | " ecs[ec] = trs\n", 488 | " \n", 489 | "def ec2g(ec):\n", 490 | " if ec in ecs:\n", 491 | " return list(set(tr2g[trlist[t]] for 
t in ecs[ec])) \n", 492 | " else:\n", 493 | " return []" 494 | ] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "metadata": {}, 499 | "source": [ 500 | "### Processing the BUS file\n", 501 | "\n", 502 | "For these notebooks we will work with the text file that `BUStools` produces, rather than the raw `BUS` file. \n", 503 | "To install `BUStools` see https://github.com/BUStools/bustools\n", 504 | "\n", 505 | "We discard any barcodes with fewer than 10 UMIs. \n", 506 | "\n", 507 | "To produce the text file, starting with the `output.bus` file produced by kallisto, we first sort it with bustools:\n", 508 | "```\n", 509 | "bustools sort -o output.sorted output.bus\n", 510 | "```\n", 511 | "Then we convert it to txt:\n", 512 | "```\n", 513 | "bustools text -o output.sorted.txt output.sorted\n", 514 | "```\n", 515 | "\n" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 23, 521 | "metadata": {}, 522 | "outputs": [ 523 | { 524 | "name": "stdout", 525 | "output_type": "stream", 526 | "text": [ 527 | "Read in 58058974 number of busrecords\n", 528 | "All sorted\n" 529 | ] 530 | } 531 | ], 532 | "source": [ 533 | "#sort bus file\n", 534 | "!bustools sort -o ./out_1k_mouse_brain_v3/output.sorted ./out_1k_mouse_brain_v3/output.bus" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 24, 540 | "metadata": {}, 541 | "outputs": [ 542 | { 543 | "name": "stdout", 544 | "output_type": "stream", 545 | "text": [ 546 | "Read in 27302856 number of busrecords\n" 547 | ] 548 | } 549 | ], 550 | "source": [ 551 | "# convert the sorted busfile to txt\n", 552 | "!bustools text -o ./out_1k_mouse_brain_v3/output.sorted.txt ./out_1k_mouse_brain_v3/output.sorted" 553 | ] 554 | }, 555 | { 556 | "cell_type": "markdown", 557 | "metadata": {}, 558 | "source": [ 559 | "# Plot the bus file results" 560 | ] 561 | }, 562 | { 563 | "cell_type": "code", 564 | "execution_count": 25, 565 | "metadata": {}, 566 | "outputs": [], 567 | "source": [ 568 | 
"import csv\n", 569 | "from collections import defaultdict\n", 570 | "\n", 571 | "# precompute because this is constant per ec\n", 572 | "ec2g = {ec:frozenset(tr2g[trlist[t]] for t in ecs[ec]) for ec in ecs}\n", 573 | "\n", 574 | "# first pass: collect gene sets\n", 575 | "bcu_gs = dict()\n", 576 | "\n", 577 | "with open('./out_1k_mouse_brain_v3/output.sorted.txt') as f:\n", 578 | " rdr = csv.reader(f, delimiter='\\t')\n", 579 | " for bar,umi,ec,_ in rdr:\n", 580 | " gs = ec2g[int(ec)]\n", 581 | "\n", 582 | " if (bar,umi) in bcu_gs:\n", 583 | " bcu_gs[bar,umi].intersection_update(gs)\n", 584 | " else:\n", 585 | " bcu_gs[bar,umi] = set(gs)\n", 586 | "\n", 587 | "# second pass: compute gene counts\n", 588 | "cell_gene = defaultdict(lambda: defaultdict(float))\n", 589 | "\n", 590 | "for (bar,umi),gs in bcu_gs.items():\n", 591 | " for g in gs:\n", 592 | " cell_gene[bar][g] += 1.0 / len(gs)\n", 593 | "\n", 594 | "# finally: filter out barcodes below threshold\n", 595 | "cell_gene = {bar:cell_gene[bar] for bar in cell_gene\n", 596 | " if sum(cell_gene[bar].values()) >= 10.0}" 597 | ] 598 | }, 599 | { 600 | "cell_type": "code", 601 | "execution_count": 26, 602 | "metadata": {}, 603 | "outputs": [], 604 | "source": [ 605 | "barcode_hist = collections.defaultdict(int)\n", 606 | "for barcode in cell_gene:\n", 607 | " cg = cell_gene[barcode]\n", 608 | " s = len([cg[g] for g in cg])\n", 609 | " barcode_hist[barcode] += s\n", 610 | " " 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "### Download the 10x whitelist" 618 | ] 619 | }, 620 | { 621 | "cell_type": "code", 622 | "execution_count": 27, 623 | "metadata": {}, 624 | "outputs": [ 625 | { 626 | "name": "stdout", 627 | "output_type": "stream", 628 | "text": [ 629 | "--2018-12-03 17:59:45-- https://github.com/10XGenomics/supernova/blob/master/tenkit/lib/python/tenkit/barcodes/737K-august-2016.txt\n", 630 | "Resolving github.com (github.com)... 
192.30.255.112, 192.30.255.113\n", 631 | "Connecting to github.com (github.com)|192.30.255.112|:443... connected.\n", 632 | "HTTP request sent, awaiting response... 200 OK\n", 633 | "Length: unspecified [text/html]\n", 634 | "Saving to: ‘737K-august-2016.txt’\n", 635 | "\n", 636 | " [ <=> ] 56,530 --.-K/s in 0.06s \n", 637 | "\n", 638 | "2018-12-03 17:59:50 (991 KB/s) - ‘737K-august-2016.txt’ saved [56530]\n", 639 | "\n" 640 | ] 641 | } 642 | ], 643 | "source": [ 644 | "!wget https://raw.githubusercontent.com/10XGenomics/supernova/master/tenkit/lib/python/tenkit/barcodes/737K-august-2016.txt" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 28, 650 | "metadata": {}, 651 | "outputs": [], 652 | "source": [ 653 | "whitelist = set(x.strip() for x in open('737K-august-2016.txt'))" 654 | ] 655 | }, 656 | { 657 | "cell_type": "markdown", 658 | "metadata": {}, 659 | "source": [ 660 | "### Plot counts" 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": 29, 666 | "metadata": {}, 667 | "outputs": [ 668 | { 669 | "name": "stdout", 670 | "output_type": "stream", 671 | "text": [ 672 | "1364\n" 673 | ] 674 | }, 675 | { 676 | "data": { 677 | "image/png": 
"[base64-encoded PNG omitted: matplotlib 3.0.0 histogram produced by plt.hist(bcv,bins=100)]\n", 678 | "text/plain": [ 679 | "
" 680 | ] 681 | }, 682 | "metadata": {}, 683 | "output_type": "display_data" 684 | } 685 | ], 686 | "source": [ 687 | "bcv = [x for b,x in barcode_hist.items() if x > 600 and x < 12000]\n", 688 | "_ = plt.hist(bcv,bins=100)\n", 689 | "print(len(bcv))" 690 | ] 691 | }, 692 | { 693 | "cell_type": "code", 694 | "execution_count": null, 695 | "metadata": {}, 696 | "outputs": [], 697 | "source": [] 698 | } 699 | ], 700 | "metadata": { 701 | "kernelspec": { 702 | "display_name": "Python 3", 703 | "language": "python", 704 | "name": "python3" 705 | }, 706 | "language_info": { 707 | "codemirror_mode": { 708 | "name": "ipython", 709 | "version": 3 710 | }, 711 | "file_extension": ".py", 712 | "mimetype": "text/x-python", 713 | "name": "python", 714 | "nbconvert_exporter": "python", 715 | "pygments_lexer": "ipython3", 716 | "version": "3.6.3" 717 | } 718 | }, 719 | "nbformat": 4, 720 | "nbformat_minor": 2 721 | } 722 | -------------------------------------------------------------------------------- /utils/get_dicts-from-ref.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import numpy as np\n", 10 | "import pickle\n", 11 | "import mygene\n", 12 | "import gzip" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": null, 18 | "metadata": { 19 | "scrolled": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "#Homo_sapiens\n", 24 | "ref='GRCh38'\n", 25 | "path_to_ref='/home/vasilis/refs/Homo_sapiens.GRCh38.cdna.all.fa.gz' \n", 26 | "ens_='ENS'\n", 27 | "\n", 28 | "# #Mus_musculus\n", 29 | "# ref='Mus_musculus.GRCm38'\n", 30 | "# path_to_ref='/home/vasilis/refs/Mus_musculus.GRCm38.cdna.all.fa.gz' \n", 31 | "# ens_='ENSMUS'" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "#### tr2g and g2tr" 39 | ] 40 | }, 41 | { 42 | "cell_type": 
"code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "def readENS_ids(path_to_ref):\n", 48 | " TX_to_ENSG={}\n", 49 | " ENSG_isoforms={}\n", 50 | " tx_cnt=0;\n", 51 | " with gzip.open(path_to_ref) as f:\n", 52 | " for line in f:\n", 53 | " if line.decode('UTF-8')[0]=='>':\n", 54 | " liner=line.decode('UTF-8')\n", 55 | " ensg=ens_+'G'+liner.split(ens_+'G',1)[1][:11]\n", 56 | " enst=ens_+'T'+liner.split(ens_+'T',1)[1][:11]\n", 57 | " TX_to_ENSG[enst] = ensg\n", 58 | " ENSG_isoforms[ensg] = ENSG_isoforms.get(ensg, [])\n", 59 | " ENSG_isoforms[ensg].append(enst)\n", 60 | " tx_cnt+=1 \n", 61 | " return [TX_to_ENSG,ENSG_isoforms]" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "[TX_to_ENSG,ENSG_isoforms]=readENS_ids(path_to_ref) " 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": {}, 77 | "outputs": [], 78 | "source": [ 79 | "with open(ref+'_tr2g.pickle', 'wb') as handle:\n", 80 | " pickle.dump(TX_to_ENSG, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", 81 | " \n", 82 | "with open(ref+'_g2tr.pickle', 'wb') as handle:\n", 83 | " pickle.dump(ENSG_isoforms, handle, protocol=pickle.HIGHEST_PROTOCOL)" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "ENSGLIST=list(np.unique(list(TX_to_ENSG.values())))" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "print('number of genes: ',len(ENSGLIST))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "#### ginfo and g2n" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": { 115 | "scrolled": true 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "mg = 
mygene.MyGeneInfo()\n", 120 | "ginfo = mg.querymany(ENSGLIST, scopes='ensembl.gene',returnall=True)" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "g2n = {}\n", 130 | "count_exept=0\n", 131 | "for g in ginfo['out']:\n", 132 | " try:\n", 133 | " gene_id=str(g['query'])\n", 134 | " gene_name=str(g['symbol'])\n", 135 | " \n", 136 | " g2n[gene_id] = g2n.get(gene_id, [])\n", 137 | " g2n[gene_id].append(str(g['symbol'])) \n", 138 | " except KeyError:\n", 139 | " count_exept+=1\n", 140 | " g2n[ str(g['query']) ] = [str(g['query'])]\n", 141 | " \n", 142 | "n2g = {}\n", 143 | "count_exept=0\n", 144 | "for g in ginfo['out']:\n", 145 | " try:\n", 146 | " gene_id=str(g['query'])\n", 147 | " gene_name=str(g['symbol'])\n", 148 | " \n", 149 | " n2g[gene_name] = n2g.get(gene_name, [])\n", 150 | " n2g[gene_name].append(gene_id) \n", 151 | " except KeyError:\n", 152 | " count_exept+=1\n", 153 | " n2g[ gene_id ] = [gene_id]" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "with open(ref+'_ginfo.pickle', 'wb') as handle:\n", 163 | " pickle.dump(ginfo, handle, protocol=pickle.HIGHEST_PROTOCOL) \n", 164 | " \n", 165 | "with open(ref+'_g2n.pickle', 'wb') as handle:\n", 166 | " pickle.dump(g2n, handle, protocol=pickle.HIGHEST_PROTOCOL) \n", 167 | " \n", 168 | "with open(ref+'_n2g.pickle', 'wb') as handle:\n", 169 | " pickle.dump(n2g, handle, protocol=pickle.HIGHEST_PROTOCOL) " 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [ 178 | " " 179 | ] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python 3", 185 | "language": "python", 186 | "name": "python3" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | 
}, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.6.5" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 2 203 | } 204 | -------------------------------------------------------------------------------- /utils/transcript2gene.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import sys, argparse 4 | 5 | def create_transcript_list(input, use_name = True, use_version = True): 6 | r = {} 7 | for line in input: 8 | if len(line) == 0 or line[0] == '#': 9 | continue 10 | l = line.strip().split('\t') 11 | if l[2] == 'transcript': 12 | info = l[8] 13 | d = {} 14 | for x in info.split('; '): 15 | x = x.strip() 16 | p = x.find(' ') 17 | if p == -1: 18 | continue 19 | k = x[:p] 20 | p = x.find('"',p) 21 | p2 = x.find('"',p+1) 22 | v = x[p+1:p2] 23 | d[k] = v 24 | 25 | 26 | if 'transcript_id' not in d or 'gene_id' not in d: 27 | continue 28 | 29 | tid = d['transcript_id'] 30 | gid = d['gene_id'] 31 | if use_version: 32 | if 'transcript_version' not in d or 'gene_version' not in d: 33 | continue 34 | 35 | tid += '.' + d['transcript_version'] 36 | gid += '.' 
+ d['gene_version'] 37 | gname = None 38 | if use_name: 39 | if 'gene_name' not in d: 40 | continue 41 | gname = d['gene_name'] 42 | 43 | if tid in r: 44 | continue 45 | 46 | r[tid] = (gid, gname) 47 | return r 48 | 49 | 50 | 51 | def print_output(output, r, use_name = True): 52 | for tid in r: 53 | if use_name: 54 | output.write("%s\t%s\t%s\n"%(tid, r[tid][0], r[tid][1])) 55 | else: 56 | output.write("%s\t%s\n"%(tid, r[tid][0])) 57 | 58 | 59 | if __name__ == "__main__": 60 | 61 | 62 | parser = argparse.ArgumentParser(add_help=True, description='Creates transcript to gene info from GTF files\nreads from standard input and writes to standard output') 63 | parser.add_argument('--use_version', '-v', action='store_true', help='Use version numbers in transcript and gene ids') 64 | parser.add_argument('--skip_gene_names', '-s', action='store_true', help='Do not output gene names') 65 | args = parser.parse_args() 66 | 67 | 68 | 69 | input = sys.stdin 70 | r = create_transcript_list(input, use_name = not args.skip_gene_names, use_version = args.use_version) 71 | output = sys.stdout 72 | print_output(output, r, use_name = not args.skip_gene_names) 73 | --------------------------------------------------------------------------------
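The GTF parsing in `utils/transcript2gene.py` can be tried without a full annotation file. The block below is a minimal sketch, not the repository's code: `parse_gtf_attributes`, `transcript_to_gene`, and `SAMPLE_GTF` are hypothetical names introduced here, and the two-record GTF snippet is illustrative sample data. It assumes Ensembl-style attributes (`key "value";` pairs with no semicolons inside quoted values) and follows the same rules as the script: only `transcript` features are used, records lacking the required IDs are skipped, and the first occurrence of a transcript ID wins.

```python
import io


def parse_gtf_attributes(info):
    """Parse the ninth GTF column (key "value"; pairs) into a dict.

    Assumption: quoted values contain no semicolons.
    """
    attrs = {}
    for field in info.split(';'):
        field = field.strip()
        if not field:
            continue
        key, _, rest = field.partition(' ')
        attrs[key] = rest.strip().strip('"')
    return attrs


def transcript_to_gene(lines, use_version=False):
    """Map transcript ID -> (gene ID, gene name or None) from GTF lines."""
    mapping = {}
    for line in lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith('#'):
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) < 9 or cols[2] != 'transcript':
            continue
        attrs = parse_gtf_attributes(cols[8])
        if 'transcript_id' not in attrs or 'gene_id' not in attrs:
            continue
        tid, gid = attrs['transcript_id'], attrs['gene_id']
        if use_version:
            if 'transcript_version' not in attrs or 'gene_version' not in attrs:
                continue
            tid += '.' + attrs['transcript_version']
            gid += '.' + attrs['gene_version']
        # Keep the first record seen for each transcript ID.
        mapping.setdefault(tid, (gid, attrs.get('gene_name')))
    return mapping


# Illustrative sample: one gene record (ignored) and one transcript record.
SAMPLE_GTF = (
    '#!genome-build GRCh38\n'
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";\n'
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972"; gene_version "5"; '
    'transcript_id "ENST00000456328"; transcript_version "2"; '
    'gene_name "DDX11L1";\n'
)

t2g = transcript_to_gene(io.StringIO(SAMPLE_GTF), use_version=True)
for tid, (gid, gname) in t2g.items():
    print(tid, gid, gname, sep='\t')
```

Unlike the script, this sketch keeps a transcript even when `gene_name` is absent and stores `None` for the name; whether that is preferable depends on how the resulting table is consumed downstream.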