├── LICENSE
├── README.md
├── data_table.csv
├── dataset_example.ipynb
├── datasets
    ├── Dixit_2016.ipynb
    ├── Frangieh_2021.ipynb
    ├── Jaitin_2014.ipynb
    ├── McFarland_2020_curation.ipynb
    ├── Norman_2019.ipynb
    ├── Norman_2019_curation.ipynb
    ├── Schraivogel_2020.ipynb
    ├── Srivatsan_2019_sciplex2_curation.ipynb
    ├── Srivatsan_2019_sciplex3.ipynb
    ├── block0_load.py
    ├── block1_init.py
    ├── block2_process.py
    ├── block3_standardize.py
    ├── template.ipynb
    └── utils.py
├── readme_body.txt
├── resources.py
├── statistics.ipynb
└── update.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Theis Lab
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # sc-pert - Machine learning for perturbational single-cell omics
 2 | 
 3 | *This repository provides a community-maintained summary of models and datasets. It was initially curated for [(Cell Systems, 2021)](https://doi.org/10.1016/j.cels.2021.05.016).*
 4 | 
 5 | ### External annotations
 6 | 
 7 | Various resources are available for evaluating single-cell perturbation models. The publication discusses five tasks, which the following publicly available annotations can support:
 8 | 
 9 | - [GDSC](https://www.cancerrxgene.org/downloads/bulk_download) provides a collection of cell viability measurements for many compounds and cell lines. We provide a [code snippet](https://github.com/theislab/sc-pert/blob/main/resources.py#L4) to conveniently load GDSC-provided z-score compound response rankings per cell line.
10 | - Additional viability data can be obtained from [DepMap's PRISM dataset](https://depmap.org/portal/download/).
11 | - [Therapeutics Data Commons](https://github.com/mims-harvard/TDC) provides access to a number of compound databases as part of their [cheminformatics tasks](https://tdcommons.ai/benchmark/overview/). (In the same vein, [OpenProblems](https://openproblems.bio/) provides a [framework for tasks in single-cell](https://github.com/openproblems-bio/openproblems/tree/main/openproblems/tasks) that can also support perturbation modeling tasks in a longer-term format than the earlier [DREAM challenges](https://dreamchallenges.org/dream-7-nci-dream-drug-sensitivity-and-drug-synergy-challenge/).)
12 | - [PubChem](https://pubchem.ncbi.nlm.nih.gov/) contains a comprehensive record of compounds ranging from experimental entities to non-proprietary small molecules. It is queryable via [PubChemPy](https://github.com/mcs07/PubChemPy).
13 | - [DrugBank](https://go.drugbank.com/releases/latest) provides annotations for a relatively small number of small molecules in a standardized format.
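
To illustrate what "z-score compound response rankings per cell line" means, here is a minimal sketch that uses only the standard library. It is not the snippet in `resources.py`. The `(cell_line, compound, z_score)` record layout, the function name, and the toy values are assumptions made for illustration. It also assumes that lower z-scores indicate a stronger response.

```python
from collections import defaultdict


def rank_compounds_per_cell_line(records):
    """Group (cell_line, compound, z_score) records by cell line and
    return compounds ordered from lowest to highest z-score.

    The record layout is hypothetical; adapt it to the actual GDSC export.
    """
    by_line = defaultdict(list)
    for cell_line, compound, z_score in records:
        by_line[cell_line].append((compound, z_score))
    return {
        line: [compound for compound, _ in sorted(pairs, key=lambda p: p[1])]
        for line, pairs in by_line.items()
    }


# Toy placeholder records, not real GDSC measurements.
toy_records = [
    ("cell_A", "drug_1", 0.5),
    ("cell_A", "drug_2", -1.2),
    ("cell_B", "drug_1", -0.3),
]
rankings = rank_compounds_per_cell_line(toy_records)
print(rankings)
```
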
14 | 
15 | ### Current modeling approaches
16 | 
17 | We [maintain a list of perturbation-related tools at scrna-tools](https://www.scrna-tools.org/tools?sort=name&cats=Perturbations). Please consider further updating and tagging tools [there](https://github.com/scRNA-tools/scRNA-tools).
18 | 
19 | The table in the article is based on this [spreadsheet of a subset of perturbation models](https://docs.google.com/spreadsheets/d/1nqNg0DW1-Om7WtvRS20q-6b28usVRv5czOcxgj83Sgg/), which includes more details.
20 | 
21 | ### Datasets
22 | 
23 | Below is a curated [table](https://raw.githubusercontent.com/theislab/sc-pert/main/data_table.csv) of perturbation datasets based on [Svensson *et al.* (2020)](https://doi.org/10.1093/database/baaa073).
24 | 
25 | We also offer some datasets in a curated `.h5ad` format via the download links in the table below. `raw h5ad` denotes a version of the dataset that has not been filtered, normalized, or standardized.
26 | 
27 | H5ads denoted as `processed` have an accompanying processing notebook, and have been similarly preprocessed. These datasets have the following standardized fields in `.obs`:
28 | * `perturbation_name` -- Human-readable compound names (International non-proprietary naming where possible) for small molecules and gene names for genetic perturbations.
29 | * `perturbation_type` -- `small molecule` or `genetic`
30 | * `perturbation_value` -- A continuous covariate quantity, such as the dosage concentration or the number of hours since treatment.
31 | * `perturbation_unit` -- Describes `perturbation_value`, such as `'ug'` or `'hrs'`.
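
As a sketch of how the standardized fields above could be checked, the following helper inspects an `.obs`-like table. The function name and the toy table are hypothetical and not part of this repository. The helper is duck-typed: it accepts a plain dict of lists, and a pandas `DataFrame` such as `adata.obs` should work as well. The allowed `perturbation_type` values are taken from the list above; a given processed file may contain other labels (for example for controls), so treat the check as a starting point.

```python
REQUIRED_OBS_FIELDS = (
    "perturbation_name",
    "perturbation_type",
    "perturbation_value",
    "perturbation_unit",
)
# Values documented in the README; real files may use additional labels.
ALLOWED_TYPES = {"small molecule", "genetic"}


def check_obs_fields(obs):
    """Return a list of human-readable problems found in an obs-like table.

    `obs` must support `name in obs` and `obs[name]` (column lookup).
    """
    problems = []
    for field in REQUIRED_OBS_FIELDS:
        if field not in obs:
            problems.append(f"missing column: {field}")
    if "perturbation_type" in obs:
        unexpected = sorted(set(obs["perturbation_type"]) - ALLOWED_TYPES, key=str)
        if unexpected:
            problems.append(f"unexpected perturbation_type values: {unexpected}")
    return problems


# Toy table with placeholder values, standing in for `adata.obs`.
toy_obs = {
    "perturbation_name": ["control", "gene_X"],
    "perturbation_type": ["genetic", "genetic"],
    "perturbation_value": [0.0, 24.0],
    "perturbation_unit": ["hrs", "hrs"],
}
print(check_obs_fields(toy_obs))
```

An empty list means all four columns are present and every `perturbation_type` value is one of the two documented labels.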
32 | 
33 | 
34 | | Shorthand                                                                        | Title&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;                                                                                                                                                   | .h5ad availability                                                                                                                                                                                                                                                                                                                                                                                                                                    | Treatment            | # perturbations   | # cell types   | # doses   | # timepoints   | Reported cells total   | Organism     | Tissue               | Technique             | Data location   | Panel size   | Measurement   | Cell source                                                                                                                      |   Disease | Contrasts             |   Developmental stage |   Number of reported cell types or clusters | Cell clustering   | Pseudotime   | RNA Velocity   | PCA   | tSNE   |   H5AD location | Isolation            | BC --> Cell ID _OR_ BC --> Cluster ID                              |   Number individuals |
35 | |----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|-------------------|----------------|-----------|----------------|------------------------|--------------|----------------------|-----------------------|-----------------|--------------|---------------|----------------------------------------------------------------------------------------------------------------------------------|-----------|-----------------------|-----------------------|---------------------------------------------|-------------------|--------------|----------------|-------|--------|-----------------|----------------------|--------------------------------------------------------------------|----------------------|
36 | | [Jaitin *et al.* Science](https://doi.org/10.1126/science.1247651)               | Massively Parallel Single-Cell RNA-Seq for Marker-Free Decomposition of Tissues into Cell Types                                                         |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPR               | 8-22              | 1              | -         | 1              | 4,468                  | Mouse        | Spleen               | MARS-seq              | GSE54006        | nan          | RNA-seq       | CD11c+ enriched splenocytes                                                                                                      |       nan | nan                   |                   nan |                                           9 | Yes               | No           | nan            | No    | No     |             nan | Sorting (FACS)       | nan                                                                |                  nan |
37 | | [Dixit *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.038)                | Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens                                            | [\[raw h5ad\]](https://ndownloader.figshare.com/files/34011689) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34014608) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Dixit_2016.ipynb)                                                                                                                                                                                                   | CRISPR               | 10,24             | 1              | -         | 1-2            | 200,000                | Human, Mouse | Culture              | Perturb-seq           | GSE90063        | nan          | RNA-seq       | BMDCs, K562                                                                                                                      |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | No     |             nan | Nanodroplet dilution | nan                                                                |                  nan |
38 | | [Datlinger *et al.* NMeth](https://doi.org/10.1038/nmeth.4177)                   | Pooled CRISPR screening with single-cell transcriptome readout                                                                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPR               | 29                | 1              | -         | 1              | 5,905                  | Human, Mouse | Culture              | CROP-seq              | GSE92872        | nan          | RNA-seq       | HEK293T, 3T3, Jurkat                                                                                                             |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | No     |             nan | nan                  | nan                                                                |                  nan |
39 | | [Hill *et al.* NMethods](https://doi.org/10.1038/nmeth.4604)                     | On the design of CRISPR-based single-cell molecular screens                                                                                             |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPR               | 32                | 1              | 2         | 1              | 5,879                  | Human        | Culture              | CROP-seq              | GSE108699       | nan          | RNA-seq       | MCF10a cells                                                                                                                     |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | https://github.com/shendurelab/single-cell-ko-screens#result-files |                  nan |
40 | | [Ursu *et al.* bioRxiv](https://doi.org/10.1101/2020.11.16.383307)               | Massively parallel phenotyping of variant impact in cancer with Perturb-seq reveals a shift in the spectrum of cell states induced by somatic mutations |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPR               | 200               | 1              | -         | 1              | 162,314                | Human        | Lung                 | Perturb-seq           | nan             | nan          | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
41 | | [Jin *et al.* Science](https://doi.org/10.1126/science.aaz6063)                  | In vivo Perturb-Seq reveals neuronal and glial abnormalities associated with autism risk genes                                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPR               | 35                | -              | -         | 1              | 46,770                 | Mouse        | Brain                | Perturb-seq           | nan             | nan          | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
42 | | [Frangieh *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00779-1)        | Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion                                                 | [\[raw h5ad\]](https://ndownloader.figshare.com/files/34012565) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34013717) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Frangieh_2021.ipynb)                                                                                                                                                                                                | CRISPR               | 248               | 1              | -         | 1              | 218,331                | Human        | Culture              | Perturb-CITE-seq      | SCP1064         | nan          | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
43 | | [Papalexi *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00778-2)        | Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPR               | 111 (sgRNA)       | 1              | 2         | -              | 28,295                 | Human        | Culture              | CITE-seq & ECCITE-seq | GSE153056       | nan          | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
44 | | [Datlinger *et al.* NMethods](https://doi.org/10.1038/s41592-021-01153-z)        | Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing                                         |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPR KO + antibody | 96                | 1              | 1         | 1              | nan                    | Human, Mouse | nan                  | scifi-RNA-seq         | nan             | nan          | nan           | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
45 | | [Alda-Catalinas *et al.* CSystems](https://doi.org/10.1016/j.cels.2020.06.004)   | A Single-Cell Transcriptomics CRISPR-Activation Screen Identifies Epigenetic Regulators of the Zygotic Genome Activation Program                        |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPRa              | 230               | 1              | -         | -              | 203,894                | Mouse        | Culture              | Chromium              | nan             | nan          | RNA-seq       | mESCs                                                                                                                            |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
46 | | [Norman *et al.* (2019)](https://doi.org/10.1126/science.aax4438)                | nan                                                                                                                                                     | [\[raw h5ad\]](https://ndownloader.figshare.com/files/34002548) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34027562) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019.ipynb)                                                                            | CRISPRa              | 287               | 1              | -         | 1              | nan                    | nan          | nan                  | Perturb-seq           | nan             | nan          | RNA-seq       | induction of gene pair targets+single gene controls in K562 cells after screening 112 genes (2x gRNA per) and their combinations |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
47 | | [Adamson *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.048)              | A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response                                      |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPRi              | 9-93 (sgRNA)      | 1              | -         | 1              | 86,000                 | Human        | Culture              | Perturb-seq           | GSE90546        | nan          | RNA-seq       | K562                                                                                                                             |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | Yes    |             nan | nan                  | nan                                                                |                  nan |
48 | | [Gasperini *et al.* Cell](https://doi.org/10.1016/j.cell.2018.11.029)            | A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens                                                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPRi              | 1119, 5779        | 1              | -         | 1              | 207,324                | Human        | Culture              | CROP-seq              | GSE120861       | nan          | RNA-seq       | K562 Cells                                                                                                                       |       nan | CRISPR Screen         |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
49 | | [Jost *et al.* NBT](https://doi.org/10.1038/s41587-019-0387-5)                   | Titrating gene expression using libraries of systematically attenuated CRISPR guide RNAs                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPRi              | 25                | 2              | -         | 1              | 19,587                 | Human        | Culture              | Perturb-seq           | GSE132080       | nan          | RNA-seq       | K562 cells                                                                                                                       |       nan | 25 gene screen        |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
50 | | [Schraivogel *et al.* NMethods](https://doi.org/10.1038/s41592-020-0837-5)       | Targeted Perturb-seq enables genome-scale genetic screens in single cells                                                                               | [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Schraivogel_2020.ipynb)                                                                                                                                                                                                                                                                                                                                   | CRISPRi              | 1778 (enhancers)  | 1              | -         | 1              | 231,667                | Human, Mouse | Bone marrow, Culture | TAP-seq               | GSE135497       | 1,000        | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | Yes    |             nan | nan                  | nan                                                                |                  nan |
51 | | [Leng *et al.* bioRxiv](https://doi.org/10.1101/2021.08.23.457400)               | CRISPRi screens in human astrocytes elucidate regulators of distinct inflammatory reactive states                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | CRISPRi              | 30                | 1              | 2         | -              | nan                    | nan          | nan                  | nan                   | nan             | nan          | nan           | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
52 | | [Replogle *et al.* (2020)](https://doi.org/nan)                                  | nan                                                                                                                                                     |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | genetic targets      | nan               | nan            | nan       | nan            | nan                    | nan          | nan                  | nan                   | nan             | nan          | nan           | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
53 | | [Replogle *et al.* (2021)](https://doi.org/10.1101/2021.12.16.473013v3)          | nan                                                                                                                                                     |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | genetic targets      | >10000            | 2              | -         | -              | nan                    | nan          | nan                  | Perturb-seq           | nan             | nan          | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
54 | | [Shin *et al.* SAdvances](https://doi.org/10.1126/sciadv.aav2249)                | Multiplexed single-cell RNA-seq via transient barcoding for simultaneous expression profiling of various drug perturbations                             |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | small molecules      | 45                | 2              | 1         | 1              | 3,091                  | Mouse, Human | Culture              | Drop-seq              | PRJNA493658     | nan          | RNA-seq       | HEK293T, NIIH3T3, A375, SW480, K562                                                                                              |       nan | 45 perturbations      |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
55 | | [Srivatsan *et al.* Science](https://doi.org/10.1126/science.aax6234)            | Massively multiplex chemical transcriptomics at single-cell resolution                                                                                  | [\[raw h5ad\]](https://ndownloader.figshare.com/files/33979517) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex2_curation.ipynb) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex3.ipynb) | small molecules      | 188               | 3              | 4         | 2              | 650,000                | Human        | Culture              | sci-Plex              | GSE139944       | nan          | RNA-seq       | Cancer cell lines A549, K562, and MCF7                                                                                           |       nan | 5,000 drug conditions |                   nan |                                           3 | Yes               | Yes          | No             | Yes   | No     |             nan | nan                  | nan                                                                |                  nan |
56 | | [Zhao *et al.* bioRxiv](https://doi.org/10.1101/2020.04.22.056341)               | Deconvolution of Cell Type-Specific Drug Responses in Human Tumor Tissue with Single-Cell RNA-seq                                                       |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | small molecules      | 2,6               | 6,1            | -         | -              | 48,404                 | Human        | Brain, Tumor         | SCRB-seq (microwell)  | GSE148842       | nan          | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                    6 |
57 | | [McFarland *et al.* NCommunications](https://doi.org/10.1038/s41467-020-17440-w) | Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action                         | [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020.ipynb)                                                                                                                                                                                                            | small molecules      | 1-13              | 24-99          | 1         | 1-5            | nan                    | Human        | Culture              | MIX-seq               | nan             | nan          | RNA-seq       | nan                                                                                                                              |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |
58 | | [Chen *et al.* (2020)](https://doi.org/10.1038/s41592-019-0689-z)                | nan                                                                                                                                                     |                                                                                                                                                                                                                                                                                                                                                                                                                                                       | small molecules      | 300               | 1              | 1         | 1              | nan                    | nan          | nan                  | CyTOF                 | nan             | nan          | protein       | breast  cancer  cells  undergoing  TGF-β-induced  EMT                                                                            |       nan | nan                   |                   nan |                                         nan | nan               | nan          | nan            | nan   | nan    |             nan | nan                  | nan                                                                |                  nan |


--------------------------------------------------------------------------------
/data_table.csv:
--------------------------------------------------------------------------------
 1 | ,Shorthand,Title,.h5ad availability,Treatment,# perturbations,# cell types,# doses,# timepoints,Reported cells total,Organism,Tissue,Technique,Data location,Panel size,Measurement,Cell source,Disease,Contrasts,Developmental stage,Number of reported cell types or clusters,Cell clustering,Pseudotime,RNA Velocity,PCA,tSNE,H5AD location,Isolation,BC --> Cell ID _OR_ BC --> Cluster ID,Number individuals
 2 | 0,[Jaitin *et al.* Science](https://doi.org/10.1126/science.1247651),Massively Parallel Single-Cell RNA-Seq for Marker-Free Decomposition of Tissues into Cell Types,,CRISPR,8-22,1,-,1,"4,468",Mouse,Spleen,MARS-seq,GSE54006,,RNA-seq,CD11c+ enriched splenocytes,,,,9.0,Yes,No,,No,No,,Sorting (FACS),,
 3 | 2,[Dixit *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.038),Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens, [\[raw h5ad\]](https://ndownloader.figshare.com/files/34011689) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34014608) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Dixit_2016.ipynb),CRISPR,"10,24",1,-,1-2,"200,000","Human, Mouse",Culture,Perturb-seq,GSE90063,,RNA-seq,"BMDCs, K562",,,,,,,,,No,,Nanodroplet dilution,,
 4 | 3,[Datlinger *et al.* NMeth](https://doi.org/10.1038/nmeth.4177),Pooled CRISPR screening with single-cell transcriptome readout,,CRISPR,29,1,-,1,"5,905","Human, Mouse",Culture,CROP-seq,GSE92872,,RNA-seq,"HEK293T, 3T3, Jurkat",,,,,,,,,No,,,,
 5 | 4,[Hill *et al.* NMethods](https://doi.org/10.1038/nmeth.4604),On the design of CRISPR-based single-cell molecular screens,,CRISPR,32,1,2,1,"5,879",Human,Culture,CROP-seq,GSE108699,,RNA-seq,MCF10a cells,,,,,,,,,,,,https://github.com/shendurelab/single-cell-ko-screens#result-files,
 6 | 13,[Ursu *et al.* bioRxiv](https://doi.org/10.1101/2020.11.16.383307),Massively parallel phenotyping of variant impact in cancer with Perturb-seq reveals a shift in the spectrum of cell states induced by somatic mutations,,CRISPR,200,1,-,1,"162,314",Human,Lung,Perturb-seq,,,RNA-seq,,,,,,,,,,,,,,
 7 | 14,[Jin *et al.* Science](https://doi.org/10.1126/science.aaz6063),In vivo Perturb-Seq reveals neuronal and glial abnormalities associated with autism risk genes,,CRISPR,35,-,-,1,"46,770",Mouse,Brain,Perturb-seq,,,RNA-seq,,,,,,,,,,,,,,
 8 | 15,[Frangieh *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00779-1),Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion, [\[raw h5ad\]](https://ndownloader.figshare.com/files/34012565) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34013717) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Frangieh_2021.ipynb),CRISPR,248,1,-,1,"218,331",Human,Culture,Perturb-CITE-seq,SCP1064,,RNA-seq,,,,,,,,,,,,,,
 9 | 16,[Papalexi *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00778-2),Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens,,CRISPR,111 (sgRNA),1,2,-,"28,295",Human,Culture,CITE-seq & ECCITE-seq,GSE153056,,RNA-seq,,,,,,,,,,,,,,
10 | 17,[Datlinger *et al.* NMethods](https://doi.org/10.1038/s41592-021-01153-z),Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing,,CRISPR KO + antibody,96,1,1,1,,"Human, Mouse",,scifi-RNA-seq,,,,,,,,,,,,,,,,,
11 | 11,[Alda-Catalinas *et al.* CSystems](https://doi.org/10.1016/j.cels.2020.06.004),A Single-Cell Transcriptomics CRISPR-Activation Screen Identifies Epigenetic Regulators of the Zygotic Genome Activation Program,,CRISPRa,230,1,-,-,"203,894",Mouse,Culture,Chromium,,,RNA-seq,mESCs,,,,,,,,,,,,,
12 | 19,[Norman *et al.* (2019)](https://doi.org/10.1126/science.aax4438),, [\[raw h5ad\]](https://ndownloader.figshare.com/files/34002548) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34027562) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019.ipynb),CRISPRa,287,1,-,1,,,,Perturb-seq,,,RNA-seq,induction of gene pair targets+single gene controls in K562 cells after screening 112 genes (2x gRNA per) and their combinations,,,,,,,,,,,,,
13 | 1,[Adamson *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.048),A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response,,CRISPRi,9-93 (sgRNA),1,-,1,"86,000",Human,Culture,Perturb-seq,GSE90546,,RNA-seq,K562,,,,,,,,,Yes,,,,
14 | 5,[Gasperini *et al.* Cell](https://doi.org/10.1016/j.cell.2018.11.029),A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens,,CRISPRi,"1119, 5779",1,-,1,"207,324",Human,Culture,CROP-seq,GSE120861,,RNA-seq,K562 Cells,,CRISPR Screen,,,,,,,,,,,
15 | 8,[Jost *et al.* NBT](https://doi.org/10.1038/s41587-019-0387-5),Titrating gene expression using libraries of systematically attenuated CRISPR guide RNAs,,CRISPRi,25,2,-,1,"19,587",Human,Culture,Perturb-seq,GSE132080,,RNA-seq,K562 cells,,25 gene screen,,,,,,,,,,,
16 | 10,[Schraivogel *et al.* NMethods](https://doi.org/10.1038/s41592-020-0837-5),Targeted Perturb-seq enables genome-scale genetic screens in single cells, [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Schraivogel_2020.ipynb),CRISPRi,1778 (enhancers),1,-,1,"231,667","Human, Mouse","Bone marrow, Culture",TAP-seq,GSE135497,"1,000",RNA-seq,,,,,,,,,,Yes,,,,
17 | 18,[Leng *et al.* bioRxiv](https://doi.org/10.1101/2021.08.23.457400),CRISPRi screens in human astrocytes elucidate regulators of distinct inflammatory reactive states,,CRISPRi,30,1,2,-,,,,,,,,,,,,,,,,,,,,,
18 | 20,[Replogle *et al.* (2020)](https://doi.org/nan),,,genetic targets,,,,,,,,,,,,,,,,,,,,,,,,,
19 | 21,[Replogle *et al.* (2021)](https://doi.org/10.1101/2021.12.16.473013v3),,,genetic targets,>10000,2,-,-,,,,Perturb-seq,,,RNA-seq,,,,,,,,,,,,,,
20 | 6,[Shin *et al.* SAdvances](https://doi.org/10.1126/sciadv.aav2249),Multiplexed single-cell RNA-seq via transient barcoding for simultaneous expression profiling of various drug perturbations,,small molecules,45,2,1,1,"3,091","Mouse, Human",Culture,Drop-seq,PRJNA493658,,RNA-seq,"HEK293T, NIH3T3, A375, SW480, K562",,45 perturbations,,,,,,,,,,,
21 | 7,[Srivatsan *et al.* Science](https://doi.org/10.1126/science.aax6234),Massively multiplex chemical transcriptomics at single-cell resolution, [\[raw h5ad\]](https://ndownloader.figshare.com/files/33979517) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex2_curation.ipynb) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex3.ipynb),small molecules,188,3,4,2,"650,000",Human,Culture,sci-Plex,GSE139944,,RNA-seq,"Cancer cell lines A549, K562, and MCF7",,"5,000 drug conditions",,3.0,Yes,Yes,No,Yes,No,,,,
22 | 9,[Zhao *et al.* bioRxiv](https://doi.org/10.1101/2020.04.22.056341),Deconvolution of Cell Type-Specific Drug Responses in Human Tumor Tissue with Single-Cell RNA-seq,,small molecules,"2,6","6,1",-,-,"48,404",Human,"Brain, Tumor",SCRB-seq (microwell),GSE148842,,RNA-seq,,,,,,,,,,,,,,6.0
23 | 12,[McFarland *et al.* NCommunications](https://doi.org/10.1038/s41467-020-17440-w),Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action, [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020.ipynb),small molecules,1-13,24-99,1,1-5,,Human,Culture,MIX-seq,,,RNA-seq,,,,,,,,,,,,,,
24 | 22,[Chen *et al.* (2020)](https://doi.org/10.1038/s41592-019-0689-z),,,small molecules,300,1,1,1,,,,CyTOF,,,protein,breast cancer cells undergoing TGF-β-induced EMT,,,,,,,,,,,,,
25 | 
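
The table above stores counts as quoted strings with thousands separators (for example `"4,468"`) and packs several organisms into one quoted cell. The following is a minimal sketch of how such rows could be read with Python's standard `csv` module. The sample keeps only five of the columns and shortens the `Shorthand` values for brevity, and the helper names `parse_cell_count` and `load_rows` are hypothetical, not part of this repository.

```python
import csv
import io

# Three rows taken from data_table.csv, trimmed to five columns.
# The real "Shorthand" cells are markdown links; they are shortened here.
SAMPLE = '''Shorthand,Treatment,# perturbations,Reported cells total,Organism
Jaitin,CRISPR,8-22,"4,468",Mouse
Shin,small molecules,45,"3,091","Mouse, Human"
Chen,small molecules,300,,
'''


def parse_cell_count(text):
    """Turn '4,468' into 4468; return None when the cell is empty."""
    text = text.strip()
    return int(text.replace(",", "")) if text else None


def load_rows(handle):
    """Read table rows, normalising the count and organism columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["Reported cells total"] = parse_cell_count(row["Reported cells total"])
        # "Mouse, Human" becomes ["Mouse", "Human"]; an empty cell becomes [].
        row["Organism"] = [o.strip() for o in row["Organism"].split(",") if o.strip()]
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
small_molecule = [r["Shorthand"] for r in rows if r["Treatment"] == "small molecules"]
print(small_molecule)
```

Columns such as `# perturbations` hold ranges like `8-22`, so this sketch leaves them as strings rather than guessing a numeric interpretation.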


--------------------------------------------------------------------------------
/datasets/Jaitin_2014.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "### Jaitin et al. Science"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "The data are downloaded from GEO (accession GSE54006):\n",
 15 |     "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54006"
 16 |    ]
 17 |   },
 18 |   {
 19 |    "cell_type": "code",
 20 |    "execution_count": 1,
 21 |    "metadata": {},
 22 |    "outputs": [],
 23 |    "source": [
 24 |     "# !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54006/suppl/GSE54006_experimental_design.txt.gz\n",
 25 |     "# !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54006/suppl/GSE54006_readme0421.txt\n",
 26 |     "# !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54006/suppl/GSE54006_umitab.txt.gz"
 27 |    ]
 28 |   },
 29 |   {
 30 |    "cell_type": "code",
 31 |    "execution_count": 2,
 32 |    "metadata": {},
 33 |    "outputs": [],
 51 |    "source": [
 52 |     "import pandas as pd\n",
 53 |     "import anndata\n",
 54 |     "import scanpy as sc"
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "code",
 59 |    "execution_count": null,
 60 |    "metadata": {},
 61 |    "outputs": [],
 62 |    "source": [
 63 |     "df = pd.read_csv('GSE54006_experimental_design.txt.gz', skiprows=6, sep='\\t', index_col=0)\n",
 64 |     "df.shape"
 65 |    ]
 66 |   },
 67 |   {
 68 |    "cell_type": "code",
 69 |    "execution_count": null,
 70 |    "metadata": {},
 71 |    "outputs": [],
 72 |    "source": [
 73 |     "X = pd.read_csv('GSE54006_umitab.txt.gz', sep='\\t', index_col=0).T"
 74 |    ]
 75 |   },
 76 |   {
 77 |    "cell_type": "code",
 78 |    "execution_count": null,
 79 |    "metadata": {},
 80 |    "outputs": [],
 81 |    "source": [
 82 |     "\n",
 83 |     "adata = anndata.AnnData(X)\n",
 84 |     "adata.shape"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": null,
 90 |    "metadata": {},
 91 |    "outputs": [],
 92 |    "source": [
 93 |     "obs = df.reindex(adata.obs_names.str.split('_').str[1].astype(int))\n",
 94 |     "adata.obs = obs"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": null,
100 |    "metadata": {},
101 |    "outputs": [],
102 |    "source": [
103 |     "sc.pp.neighbors(adata)\n",
104 |     "sc.tl.umap(adata)"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "code",
109 |    "execution_count": null,
110 |    "metadata": {},
111 |    "outputs": [],
112 |    "source": [
113 |     "adata.obs"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": null,
119 |    "metadata": {},
120 |    "outputs": [],
121 |    "source": [
122 |     "sc.set_figure_params(facecolor='white')"
123 |    ]
124 |   },
125 |   {
126 |    "cell_type": "code",
127 |    "execution_count": null,
128 |    "metadata": {},
129 |    "outputs": [],
130 |    "source": [
131 |     "import numpy as np"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "code",
136 |    "execution_count": null,
137 |    "metadata": {},
138 |    "outputs": [],
139 |    "source": [
140 |     "adata.obs['perturbation.type'] = np.where(adata.obs['group_name'].str.contains('LPS', na=False), 'chemical', np.nan)\n",
141 |     "adata.obs['perturbation.description'] = np.where(adata.obs['group_name'].str.contains('LPS', na=False), 'LPS (2 hours)', np.nan)\n",
142 |     "adata.obs['perturbation.value'] = np.where(adata.obs['group_name'].str.contains('LPS', na=False), True, np.nan)\n",
143 |     "adata.obs['perturbation.unit'] = np.where(adata.obs['group_name'].str.contains('LPS', na=False), 'boolean', np.nan)"
144 |    ]
145 |   },
146 |   {
147 |    "cell_type": "code",
148 |    "execution_count": null,
149 |    "metadata": {},
150 |    "outputs": [],
151 |    "source": [
152 |     "from matplotlib.pyplot import rcParams\n",
153 |     "import matplotlib.pyplot as plt\n",
154 |     "rcParams['figure.figsize'] = 5, 3\n",
155 |     "sc.pl.umap(adata, color=['group_name', 'perturbation.description'], legend_fontsize=8, show=False, frameon=False, ncols=1)\n",
156 |     "plt.tight_layout()\n",
157 |     "plt.show()"
158 |    ]
159 |   },
160 |   {
161 |    "cell_type": "code",
162 |    "execution_count": null,
163 |    "metadata": {},
164 |    "outputs": [],
165 |    "source": [
166 |     "adata.obs"
167 |    ]
168 |   },
169 |   {
170 |    "cell_type": "code",
171 |    "execution_count": null,
172 |    "metadata": {},
173 |    "outputs": [],
174 |    "source": [
175 |     "adata.write('../data/Jaitin_2014.h5ad', compression='lzf')"
176 |    ]
177 |   }
178 |  ],
179 |  "metadata": {
180 |   "kernelspec": {
181 |    "display_name": "Python [conda env:mypython3] *",
182 |    "language": "python",
183 |    "name": "conda-env-mypython3-py"
184 |   },
185 |   "language_info": {
186 |    "codemirror_mode": {
187 |     "name": "ipython",
188 |     "version": 3
189 |    },
190 |    "file_extension": ".py",
191 |    "mimetype": "text/x-python",
192 |    "name": "python",
193 |    "nbconvert_exporter": "python",
194 |    "pygments_lexer": "ipython3",
195 |    "version": "3.7.8"
196 |   }
197 |  },
198 |  "nbformat": 4,
199 |  "nbformat_minor": 4
200 | }
201 | 


--------------------------------------------------------------------------------
/datasets/McFarland_2020_curation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "76588992",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "Curation by Oksana Bilous.\n",
  9 |     "\n",
 10 |     "Accession: https://figshare.com/s/139f64b495dea9d88c70"
 11 |    ]
 12 |   },
 13 |   {
 14 |    "cell_type": "code",
 15 |    "execution_count": 1,
 16 |    "id": "6a4ed240",
 17 |    "metadata": {},
 18 |    "outputs": [
 19 |     {
 20 |      "name": "stdout",
 21 |      "output_type": "stream",
 22 |      "text": [
 23 |       "--2022-03-30 13:30:55--  https://figshare.com/ndownloader/articles/10298696?private_link=139f64b495dea9d88c70\n",
 24 |       "Resolving figshare.com (figshare.com)... 52.30.189.251, 52.48.213.168, 2a05:d018:1f4:d000:9767:1911:7029:1844, ...\n",
 25 |       "Connecting to figshare.com (figshare.com)|52.30.189.251|:443... connected.\n",
 26 |       "HTTP request sent, awaiting response... 200 OK\n",
 27 |       "Length: 2257978361 (2.1G) [application/zip]\n",
 28 |       "Saving to: ‘mcfarland2020/mixseq.zip’\n",
 29 |       "\n",
 30 |       "mcfarland2020/mixse 100%[===================>]   2.10G  23.3MB/s    in 43s     \n",
 31 |       "\n",
 32 |       "2022-03-30 13:31:38 (50.4 MB/s) - ‘mcfarland2020/mixseq.zip’ saved [2257978361/2257978361]\n",
 33 |       "\n",
 34 |       "Archive:  mcfarland2020/mixseq.zip\n",
 35 |       " extracting: mcfarland2020/README.txt  \n",
 36 |       " extracting: mcfarland2020/Supplementary Tables.xlsx  \n",
 37 |       " extracting: mcfarland2020/Trametinib_6hr_expt1.zip  \n",
 38 |       " extracting: mcfarland2020/Trametinib_24hr_expt1.zip  \n",
 39 |       " extracting: mcfarland2020/Bortezomib_6hr_expt1.zip  \n",
 40 |       " extracting: mcfarland2020/Bortezomib_24hr_expt1.zip  \n",
 41 |       " extracting: mcfarland2020/Idasanutlin_6hr_expt1.zip  \n",
 42 |       " extracting: mcfarland2020/Idasanutlin_24hr_expt1.zip  \n",
 43 |       " extracting: mcfarland2020/DMSO_24hr_expt1.zip  \n",
 44 |       " extracting: mcfarland2020/DMSO_6hr_expt1.zip  \n",
 45 |       " extracting: mcfarland2020/Untreated_6hr_expt1.zip  \n",
 46 |       " extracting: mcfarland2020/Dabrafenib_24hr_expt3.zip  \n",
 47 |       " extracting: mcfarland2020/Navitoclax_24hr_expt3.zip  \n",
 48 |       " extracting: mcfarland2020/Trametinib_24hr_expt3.zip  \n",
 49 |       " extracting: mcfarland2020/BRD3379_24hr_expt3.zip  \n",
 50 |       " extracting: mcfarland2020/BRD3379_6hr_expt3.zip  \n",
 51 |       " extracting: mcfarland2020/DMSO_24hr_expt3.zip  \n",
 52 |       " extracting: mcfarland2020/DMSO_6hr_expt3.zip  \n",
 53 |       " extracting: mcfarland2020/sgGPX4_1_expt2.zip  \n",
 54 |       " extracting: mcfarland2020/sgGPX4_2_expt2.zip  \n",
 55 |       " extracting: mcfarland2020/sgOR2J2_expt2.zip  \n",
 56 |       " extracting: mcfarland2020/sgLACZ_expt2.zip  \n",
 57 |       " extracting: mcfarland2020/trametinib_tc_expt5.zip  \n",
 58 |       " extracting: mcfarland2020/Afatinib_expt10.zip  \n",
 59 |       " extracting: mcfarland2020/AZD5591_expt10.zip  \n",
 60 |       " extracting: mcfarland2020/DMSO_expt10.zip  \n",
 61 |       " extracting: mcfarland2020/Everolimus_expt10.zip  \n",
 62 |       " extracting: mcfarland2020/Gemcitabine_expt10.zip  \n",
 63 |       " extracting: mcfarland2020/JQ1_expt10.zip  \n",
 64 |       " extracting: mcfarland2020/Prexasertib_expt10.zip  \n",
 65 |       " extracting: mcfarland2020/Taselisib_expt10.zip  \n",
 66 |       " extracting: mcfarland2020/Trametinib_expt10.zip  \n",
 67 |       " extracting: mcfarland2020/all_CL_features.rds  \n"
 68 |      ]
 69 |     }
 70 |    ],
 71 |    "source": [
 72 |     "!mkdir -p mcfarland2020\n",
 73 |     "!wget 'https://figshare.com/ndownloader/articles/10298696?private_link=139f64b495dea9d88c70' -O mcfarland2020/mixseq.zip\n",
 74 |     "!unzip mcfarland2020/mixseq.zip -d mcfarland2020"
 75 |    ]
 76 |   },
 77 |   {
 78 |    "cell_type": "code",
 79 |    "execution_count": 2,
 80 |    "id": "d2d393f6",
 81 |    "metadata": {},
 82 |    "outputs": [],
 83 |    "source": [
 84 |     "raw_data_path = 'mcfarland2020'"
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "markdown",
 89 |    "id": "67a332a4",
 90 |    "metadata": {},
 91 |    "source": [
 92 |     "Process files."
 93 |    ]
 94 |   },
 95 |   {
 96 |    "cell_type": "code",
 97 |    "execution_count": 3,
 98 |    "id": "f09dc5d2",
 99 |    "metadata": {},
100 |    "outputs": [],
101 |    "source": [
102 |     "import pandas as pd\n",
103 |     "import numpy as np\n",
104 |     "import scanpy as sc\n",
105 |     "from zipfile import ZipFile \n",
106 |     "import os\n",
107 |     "import anndata as ad"
108 |    ]
109 |   },
110 |   {
111 |    "cell_type": "code",
112 |    "execution_count": 4,
113 |    "id": "a518efae",
114 |    "metadata": {},
115 |    "outputs": [
116 |     {
117 |      "name": "stderr",
118 |      "output_type": "stream",
119 |      "text": [
120 |       "2022-03-30 13:32:08.111777: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n",
121 |       "2022-03-30 13:32:08.111812: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n"
122 |      ]
123 |     },
124 |     {
125 |      "name": "stdout",
126 |      "output_type": "stream",
127 |      "text": [
128 |       "scanpy==1.8.2 anndata==0.7.6 umap==0.5.2 numpy==1.20.3 scipy==1.5.3 pandas==1.3.4 scikit-learn==1.0.2 statsmodels==0.11.1 python-igraph==0.8.3 leidenalg==0.8.3 pynndescent==0.5.5\n"
129 |      ]
130 |     }
131 |    ],
132 |    "source": [
133 |     "sc.logging.print_header()"
134 |    ]
135 |   },
136 |   {
137 |    "cell_type": "code",
138 |    "execution_count": 5,
139 |    "id": "efb34103",
140 |    "metadata": {},
141 |    "outputs": [],
142 |    "source": [
143 |     "class MIX_seq_data_loader:\n",
144 |     "    def __init__(self, raw_data_path, expt_no):\n",
145 |     "        self.raw_data_path = raw_data_path\n",
146 |     "        self.expt_no = expt_no\n",
147 |     "#         self.scale = scale\n",
148 |     "#         self.drugs_SMILES_representation = {\n",
149 |     "#             'Idasanutlin': 'O=C(O)C1=CC(OC)=C(NC([C@H]2[C@H](C3=C(F)C(Cl)=CC=C3)[C@](C4=CC=C(Cl)C=C4F)(C#N)[C@H](CC(C)(C)C)N2)=O)C=C1',\n",
150 |     "#             'Bortezomib': 'CC(C)C[C@H](NC(=O)[C@H](CC1=CC=CC=C1)NC(=O)C1=CN=CC=N1)B(O)O',\n",
151 |     "#             'Navitoclax': 'CC1(CCC(=C(C1)CN2CCN(CC2)C3=CC=C(C=C3)C(=O)NS(=O)(=O)C4=CC(=C(C=C4)NC(CCN5CCOCC5)CSC6=CC=CC=C6)S(=O)(=O)C(F)(F)F)C7=CC=C(C=C7)Cl)C',\n",
152 |     "#             'Dabrafenib': 'CC(C)(C)C1=NC(=C(S1)C2=NC(=NC=C2)N)C3=C(C(=CC=C3)NS(=O)(=O)C4=C(C=CC=C4F)F)F',\n",
153 |     "#             'Afatinib': 'CN(C)CC=CC(=O)NC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC(=C(C=C3)F)Cl)OC4CCOC4',\n",
154 |     "#             'Prexasertib': 'COC1=C(C(=CC=C1)OCCCN)C2=CC(=NN2)NC3=NC=C(N=C3)C#N.Cl.Cl',\n",
155 |     "#             'Taselisib' : 'CC1=NN(C(=N1)C2=CN3CCOC4=C(C3=N2)C=CC(=C4)C5=CN(N=C5)C(C)(C)C(=O)N)C(C)C',\n",
156 |     "#             'Gemcitabine': 'C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)O)(F)F',\n",
157 |     "#             'Trametinib': 'CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N(C(=O)N2C4=CC=CC(=C4)NC(=O)C)C5CC5',\n",
158 |     "#             'Everolimus': '[H][C@@]1(C[C@@H](C)[C@]2([H])CC(=O)[C@H](C)\\C=C(C)\\[C@@H](O)[C@@H](OC)C(=O)[C@H](C)C[C@H](C)\\C=C\\C=C\\C=C(C)\\[C@H](C[C@]3([H])CC[C@@H](C)[C@@](O)(O3)C(=O)C(=O)N3CCCC[C@@]3([H])C(=O)O2)OC)CC[C@@H](OCCO)[C@@H](C1)OC',\n",
159 |     "#             'AZD5591': 'CN1N=C2CSCC3=NN(C)C(CSC4=CC5=C(C=CC=C5)C(OCCCC5=C(N(C)C6=C5C=CC(Cl)=C6C2=C1C)C(O)=O)=C4)=C3',\n",
160 |     "#             'JQ1': 'ClC1=CC=C(C(C2=C(N3C4=NN=C3C)SC(C)=C2C)=N[C@H]4CC(O)=O)C=C1',\n",
161 |     "#             'BRD3379': 'CC(=O)NC1=CC=C(C=C1)C(=O)NC2=C(C=C(C=C2)F)N',\n",
162 |     "#             'DMSO': 'CS(=O)C',\n",
163 |     "#         }\n",
164 |     "        self.adata = self.load_adata_obj()\n",
165 |     "#         self.adata_preprocessing()\n",
166 |     "        self.adata_add_additional_information()\n",
167 |     "        \n",
168 |     "#         self.perturbations = list(self.adata.obs['condition'].unique())\n",
169 |     "#         self.timetpoints = list(self.adata.obs['timepoint'].unique())\n",
170 |     "#         self.tissue_types = list(self.adata.obs['tissue_type'].unique())\n",
171 |     "#         self.drugs_without_SMILES_encoding = list(self.adata.obs[(self.adata.obs['drug_SMILES']=='') & \n",
172 |     "#                        (self.adata.obs['condition']!='Untreated')]['condition'].unique())\n",
173 |     "#         if print_summary:\n",
174 |     "#             self.print_summary()\n",
175 |     "    \n",
176 |     "#     def print_summary(self):\n",
177 |     "#         print(\"Experiment\", self.expt_no, '\\n')\n",
178 |     "#         print(\"Perturbations:\", self.perturbations, '\\n')\n",
179 |     "#         print(\"Timepoints:\", self.timetpoints, '\\n')\n",
180 |     "#         print(\"Tissue types:\", self.tissue_types, '\\n')\n",
181 |     "#         print(\"Number of cells:\", len(self.adata.obs), '\\n')\n",
182 |     "#         print(\"Drugs for which no SMILES encoding was found:\", self.drugs_without_SMILES_encoding, '\\n')\n",
183 |     "        \n",
184 |     "    def load_adata_obj_from_single_file(self, filename):\n",
185 |     "        zf = ZipFile(self.raw_data_path + '/' + filename)\n",
186 |     "        zf.extractall(path=self.raw_data_path) \n",
187 |     "\n",
188 |     "        df_classifications = pd.read_csv(\n",
189 |     "            zf.open(zf.namelist()[0] + 'classifications.csv'),\n",
190 |     "            sep=\",\",\n",
191 |     "            index_col = 'barcode'\n",
192 |     "        )\n",
193 |     "\n",
194 |     "        adata = sc.read_10x_mtx(self.raw_data_path + '/' + zf.namelist()[0], var_names='gene_symbols')# cache=True)\n",
195 |     "#         adata.var_names_make_unique()\n",
196 |     "        adata.obs = df_classifications\n",
197 |     "        \n",
198 |     "        if 'hash_tag' in adata.obs.columns:\n",
199 |     "            adata = adata[~adata.obs[\"hash_tag\"].isin(['multiplet', 'unknown']), :].copy()\n",
200 |     "            adata.obs['condition'] = adata.obs.apply(lambda x: x['hash_tag'][0:x['hash_tag'].find('_')], axis = 1)\n",
201 |     "            adata.obs['condition'] = adata.obs.apply(lambda x: 'Trametinib' if x['condition'] == 'Tram' else x['condition'], axis = 1)\n",
202 |     "            adata.obs['timepoint'] = adata.obs.apply(lambda x: x['hash_tag'][x['hash_tag'].find('_')+1:x['hash_tag'].find('hr')], axis = 1)\n",
203 |     "        else:\n",
204 |     "            adata.obs['condition'] = filename[0:filename.find('_')]\n",
205 |     "            if 'hr' in filename:\n",
206 |     "                adata.obs['timepoint'] = filename[filename.find('_')+1:filename.find('hr')]\n",
207 |     "            else:\n",
208 |     "                adata.obs['timepoint'] = ''\n",
209 |     "\n",
210 |     "        adata.obs['control'] = adata.obs.apply(lambda x: 'ctrl' if x['condition'] in ['Untreated', 'DMSO']\n",
211 |     "                                      else 'drug', axis=1)\n",
212 |     "        adata.obs['cell_line'] = adata.obs.apply(lambda x: x['singlet_ID'][0:x['singlet_ID'].find('_')], axis = 1)\n",
213 |     "        adata.obs['tissue_type'] = adata.obs.apply(lambda x: x['singlet_ID'][x['singlet_ID'].find('_')+1:], axis = 1)\n",
214 |     "        \n",
215 |     "        return adata\n",
216 |     "    \n",
217 |     "    def load_adata_obj(self):\n",
218 |     "        files_to_load = []\n",
219 |     "        for filename in os.listdir(self.raw_data_path):\n",
220 |     "            if ('.zip' in filename) and ('expt'+str(self.expt_no)+'.' in filename):\n",
221 |     "                files_to_load.append(filename)\n",
222 |     "\n",
223 |     "        adata = ad.AnnData()\n",
224 |     "        for file in files_to_load:\n",
225 |     "            adata_temp = self.load_adata_obj_from_single_file(file)\n",
226 |     "            adata_temp.obs['experiment'] = self.expt_no\n",
227 |     "            if len(adata) == 0:\n",
228 |     "                adata = adata_temp\n",
229 |     "            else:\n",
230 |     "                adata = adata.concatenate(adata_temp)\n",
231 |     "        if 'batch' in adata.obs.columns:\n",
232 |     "            adata.obs = adata.obs.drop(columns=['batch'])\n",
233 |     "            \n",
234 |     "        return adata\n",
235 |     "    \n",
236 |     "#     def adata_preprocessing(self):\n",
237 |     "#         # Filtering out the low-quality cells before doing the analysis\n",
238 |     "#         self.adata = self.adata[self.adata.obs[\"cell_quality\"] == \"normal\", :] \n",
239 |     "\n",
240 |     "#         # Normalization\n",
241 |     "#         sc.pp.normalize_total(self.adata, target_sum=1e4)\n",
242 |     "#         sc.pp.log1p(self.adata)\n",
243 |     "\n",
244 |     "#         # # Identifying top 5000 highly variable genes\n",
245 |     "#         # sc.pp.highly_variable_genes(self.adata, n_top_genes = 5000)\n",
246 |     "\n",
247 |     "#         # Saving the raw data object\n",
248 |     "#         self.adata.raw = self.adata\n",
249 |     "        \n",
250 |     "#         # Scaling\n",
251 |     "#         if self.scale:\n",
252 |     "#             sc.pp.scale(self.adata, max_value=10)\n",
253 |     "\n",
254 |     "#         n_comps = len(self.adata.obs['singlet_ID'].unique()) * 2 \n",
255 |     "#         # PCA\n",
256 |     "#         sc.tl.pca(self.adata, svd_solver='arpack', n_comps = n_comps)\n",
257 |     "\n",
258 |     "#         # UMAP embedding\n",
259 |     "#         sc.pp.neighbors(self.adata, n_neighbors=10, n_pcs=n_comps)\n",
260 |     "#         sc.tl.umap(self.adata)\n",
261 |     "        \n",
262 |     "    def adata_add_additional_information(self):\n",
263 |     "        mutant_cls = ['LNCAPCLONEFGC_PROSTATE','DKMG_CENTRAL_NERVOUS_SYSTEM', 'NCIH226_LUNG', 'RCC10RGB_KIDNEY', 'SNU1079_BILIARY_TRACT', 'CCFSTTG1_CENTRAL_NERVOUS_SYSTEM','COV434_OVARY']\n",
264 |     "        self.adata.obs['TP53_status'] = ['TP53_MUT' if x in mutant_cls else 'TP53_WT' for x in self.adata.obs.singlet_ID.values]\n",
265 |     "        if self.adata.obs['timepoint'].iloc[1] != '':\n",
266 |     "            self.adata.obs['condition_time'] = self.adata.obs.apply(lambda x: x['condition'] + '_' + x['timepoint'] + 'hr', axis=1)\n",
267 |     "            \n",
268 |     "        # Cell Cycle Analysis\n",
269 |     "        # Loading additional file with lists of s genes and g2m genes\n",
270 |     "#         cell_phases_df = pd.read_excel(self.raw_data_path+'/Cell_phases.xlsx', sheet_name = 'Cell_phases')\n",
271 |     "#         s_genes = [x.replace(' ', '') for x in list(cell_phases_df[pd.notna(cell_phases_df.iloc[:,0])].iloc[:,0])]\n",
272 |     "#         g2m_genes = [x.replace(' ', '') for x in list(cell_phases_df[pd.notna(cell_phases_df.iloc[:,1])].iloc[:,1])]\n",
273 |     "\n",
274 |     "        # Defining the cell cycles\n",
275 |     "#         sc.tl.score_genes_cell_cycle(self.adata, s_genes = s_genes, g2m_genes = g2m_genes)\n",
276 |     "\n",
277 |     "#         self.adata.obs['phase'] = self.adata.obs.apply(lambda x: 'G2/M' if x['phase'] == 'G2M' else ('G0/G1' if x['phase']=='G1' else x['phase']\n",
278 |     "#         ), axis = 1)\n",
279 |     "        \n",
280 |     "#         self.adata.obs['drug_SMILES'] = self.adata.obs.apply(lambda x: self.drugs_SMILES_representation[x['condition']] \n",
281 |     "#                                            if x['condition'] in self.drugs_SMILES_representation.keys() else '', axis=1)"
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "code",
286 |    "execution_count": 7,
287 |    "id": "ada87798",
288 |    "metadata": {},
289 |    "outputs": [
290 |     {
291 |      "data": {
292 |       "text/plain": [
293 |        "[AnnData object with n_obs × n_vars = 24500 × 32738\n",
294 |        "     obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status', 'condition_time'\n",
295 |        "     var: 'gene_ids',\n",
296 |        " AnnData object with n_obs × n_vars = 28165 × 32738\n",
297 |        "     obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status'\n",
298 |        "     var: 'gene_ids',\n",
299 |        " AnnData object with n_obs × n_vars = 72326 × 32738\n",
300 |        "     obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status', 'condition_time'\n",
301 |        "     var: 'gene_ids',\n",
302 |        " AnnData object with n_obs × n_vars = 17553 × 32738\n",
303 |        "     obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'hash_assignment', 'hash_tag', 'channel', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status', 'condition_time'\n",
304 |        "     var: 'gene_ids',\n",
305 |        " AnnData object with n_obs × n_vars = 37856 × 32738\n",
306 |        "     obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'channel', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status'\n",
307 |        "     var: 'gene_ids']"
308 |       ]
309 |      },
310 |      "execution_count": 7,
311 |      "metadata": {},
312 |      "output_type": "execute_result"
313 |     }
314 |    ],
315 |    "source": [
316 |     "%%time\n",
317 |     "adatas = []\n",
318 |     "for expt in [1,2,3,5,10]:\n",
319 |     "    print(expt)\n",
320 |     "    adatas.append(MIX_seq_data_loader(\n",
321 |     "        raw_data_path=raw_data_path, \n",
322 |     "        expt_no=expt\n",
323 |     "    ).adata)\n",
324 |     "adatas"
325 |    ]
326 |   },
327 |   {
328 |    "cell_type": "code",
329 |    "execution_count": 8,
330 |    "id": "841bdcf1",
331 |    "metadata": {},
332 |    "outputs": [
333 |     {
334 |      "data": {
335 |       "text/plain": [
336 |        "AnnData object with n_obs × n_vars = 180400 × 32738\n",
337 |        "    obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status'\n",
338 |        "    var: 'gene_ids'"
339 |       ]
340 |      },
341 |      "execution_count": 8,
342 |      "metadata": {},
343 |      "output_type": "execute_result"
344 |     }
345 |    ],
346 |    "source": [
347 |     "adata = ad.concat(adatas, keys=[1, 2, 3, 5, 10], label='experiment', index_unique='--', merge='same')\n",
348 |     "adata"
349 |    ]
350 |   },
351 |   {
352 |    "cell_type": "code",
353 |    "execution_count": 9,
354 |    "id": "88eeba27",
355 |    "metadata": {},
356 |    "outputs": [],
357 |    "source": [
358 |     "!rm -r mcfarland2020"
359 |    ]
360 |   },
361 |   {
362 |    "cell_type": "code",
363 |    "execution_count": 10,
364 |    "id": "15d2b3d5",
365 |    "metadata": {},
366 |    "outputs": [
367 |     {
368 |      "name": "stderr",
369 |      "output_type": "stream",
370 |      "text": [
371 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
372 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
373 |       "... storing 'singlet_ID' as categorical\n",
374 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
375 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
376 |       "... storing 'doublet_CL1' as categorical\n",
377 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
378 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
379 |       "... storing 'doublet_CL2' as categorical\n",
380 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
381 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
382 |       "... storing 'cell_quality' as categorical\n",
383 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
384 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
385 |       "... storing 'DepMap_ID' as categorical\n",
386 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
387 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
388 |       "... storing 'condition' as categorical\n",
389 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
390 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
391 |       "... storing 'timepoint' as categorical\n",
392 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
393 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
394 |       "... storing 'control' as categorical\n",
395 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
396 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
397 |       "... storing 'cell_line' as categorical\n",
398 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
399 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
400 |       "... storing 'tissue_type' as categorical\n",
401 |       "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n",
402 |       "  c.reorder_categories(natsorted(c.categories), inplace=True)\n",
403 |       "... storing 'TP53_status' as categorical\n"
404 |      ]
405 |     }
406 |    ],
407 |    "source": [
408 |     "adata.write('Mcfarland_2020_raw.h5ad')"
409 |    ]
410 |   }
411 |  ],
412 |  "metadata": {
413 |   "kernelspec": {
414 |    "display_name": "Python [conda env:py37] *",
415 |    "language": "python",
416 |    "name": "conda-env-py37-py"
417 |   },
418 |   "language_info": {
419 |    "codemirror_mode": {
420 |     "name": "ipython",
421 |     "version": 3
422 |    },
423 |    "file_extension": ".py",
424 |    "mimetype": "text/x-python",
425 |    "name": "python",
426 |    "nbconvert_exporter": "python",
427 |    "pygments_lexer": "ipython3",
428 |    "version": "3.7.0"
429 |   }
430 |  },
431 |  "nbformat": 4,
432 |  "nbformat_minor": 5
433 | }
434 | 
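
The loader in the McFarland notebook above derives `condition` and `timepoint` from each `hash_tag` with row-wise `str.find` slicing. A minimal sketch of the same split written as one regular expression follows. It assumes tags have the layout `<condition>_<hours>hr`; the helper name `split_hash_tag`, the pattern, and the sample tags are all hypothetical and not taken from the dataset.

```python
import re
from typing import Optional, Tuple

# Assumed tag layout: "<condition>_<hours>hr", optionally followed by more text.
_TAG_PATTERN = re.compile(r"^(?P<condition>[^_]+)_(?P<hours>[^_]*?)hr")

# Abbreviation that the notebook's loader expands to the full drug name.
_ALIASES = {"Tram": "Trametinib"}


def split_hash_tag(tag: str) -> Optional[Tuple[str, str]]:
    """Return (condition, timepoint) for a tag, or None if it does not fit the layout."""
    match = _TAG_PATTERN.match(tag)
    if match is None:
        return None
    condition = match.group("condition")
    return _ALIASES.get(condition, condition), match.group("hours")


# Hypothetical tag values, for illustration only.
for tag in ["Tram_24hr", "DMSO_6hr", "multiplet"]:
    print(tag, "->", split_hash_tag(tag))
```

Unlike the slicing approach, a tag that does not fit the assumed layout yields `None` rather than a silently wrong substring, so such tags can be filtered explicitly.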


--------------------------------------------------------------------------------
/datasets/Norman_2019_curation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "d12cb9dc",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "Accession: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "id": "ca04f335-6926-4764-82ec-374d7c6f94b4",
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import gzip\n",
 19 |     "import os\n",
 20 |     "import re\n",
 21 |     "\n",
 22 |     "import pandas as pd\n",
 23 |     "import numpy as np\n",
 24 |     "from anndata import AnnData\n",
 25 |     "from scipy.io import mmread\n",
 26 |     "from scipy.sparse import coo_matrix\n",
 27 |     "\n",
 28 |     "from utils import download_binary_file\n",
 29 |     "\n",
 30 |     "# Gene program lists obtained by cross-referencing the heatmap here\n",
 31 |     "# https://github.com/thomasmaxwellnorman/Perturbseq_GI/blob/master/GI_optimal_umap.ipynb\n",
 32 |     "# with Figure 2b in Norman 2019\n",
 33 |     "G1_CYCLE = [\n",
 34 |     "    \"CDKN1C+CDKN1B\",\n",
 35 |     "    \"CDKN1B+ctrl\",\n",
 36 |     "    \"CDKN1B+CDKN1A\",\n",
 37 |     "    \"CDKN1C+ctrl\",\n",
 38 |     "    \"ctrl+CDKN1A\",\n",
 39 |     "    \"CDKN1C+CDKN1A\",\n",
 40 |     "    \"CDKN1A+ctrl\",\n",
 41 |     "]\n",
 42 |     "\n",
 43 |     "ERYTHROID = [\n",
 44 |     "    \"BPGM+SAMD1\",\n",
 45 |     "    \"ATL1+ctrl\",\n",
 46 |     "    \"UBASH3B+ZBTB25\",\n",
 47 |     "    \"PTPN12+PTPN9\",\n",
 48 |     "    \"PTPN12+UBASH3A\",\n",
 49 |     "    \"CBL+CNN1\",\n",
 50 |     "    \"UBASH3B+CNN1\",\n",
 51 |     "    \"CBL+UBASH3B\",\n",
 52 |     "    \"UBASH3B+PTPN9\",\n",
 53 |     "    \"PTPN1+ctrl\",\n",
 54 |     "    \"CBL+PTPN9\",\n",
 55 |     "    \"CNN1+UBASH3A\",\n",
 56 |     "    \"CBL+PTPN12\",\n",
 57 |     "    \"PTPN12+ZBTB25\",\n",
 58 |     "    \"UBASH3B+PTPN12\",\n",
 59 |     "    \"SAMD1+PTPN12\",\n",
 60 |     "    \"SAMD1+UBASH3B\",\n",
 61 |     "    \"UBASH3B+UBASH3A\",\n",
 62 |     "]\n",
 63 |     "\n",
 64 |     "PIONEER_FACTORS = [\n",
 65 |     "    \"ZBTB10+SNAI1\",\n",
 66 |     "    \"FOXL2+MEIS1\",\n",
 67 |     "    \"POU3F2+CBFA2T3\",\n",
 68 |     "    \"DUSP9+SNAI1\",\n",
 69 |     "    \"FOXA3+FOXA1\",\n",
 70 |     "    \"FOXA3+ctrl\",\n",
 71 |     "    \"LYL1+IER5L\",\n",
 72 |     "    \"FOXA1+FOXF1\",\n",
 73 |     "    \"FOXF1+HOXB9\",\n",
 74 |     "    \"FOXA1+HOXB9\",\n",
 75 |     "    \"FOXA3+HOXB9\",\n",
 76 |     "    \"FOXA3+FOXA1\",\n",
 77 |     "    \"FOXA3+FOXL2\",\n",
 78 |     "    \"POU3F2+FOXL2\",\n",
 79 |     "    \"FOXF1+FOXL2\",\n",
 80 |     "    \"FOXA1+FOXL2\",\n",
 81 |     "    \"HOXA13+ctrl\",\n",
 82 |     "    \"ctrl+HOXC13\",\n",
 83 |     "    \"HOXC13+ctrl\",\n",
 84 |     "    \"MIDN+ctrl\",\n",
 85 |     "    \"TP73+ctrl\",\n",
 86 |     "]\n",
 87 |     "\n",
 88 |     "GRANULOCYTE_APOPTOSIS = [\n",
 89 |     "    \"SPI1+ctrl\",\n",
 90 |     "    \"ctrl+SPI1\",\n",
 91 |     "    \"ctrl+CEBPB\",\n",
 92 |     "    \"CEBPB+ctrl\",\n",
 93 |     "    \"JUN+CEBPA\",\n",
 94 |     "    \"CEBPB+CEBPA\",\n",
 95 |     "    \"FOSB+CEBPE\",\n",
 96 |     "    \"ZC3HAV1+CEBPA\",\n",
 97 |     "    \"KLF1+CEBPA\",\n",
 98 |     "    \"ctrl+CEBPA\",\n",
 99 |     "    \"CEBPA+ctrl\",\n",
100 |     "    \"CEBPE+CEBPA\",\n",
101 |     "    \"CEBPE+SPI1\",\n",
102 |     "    \"CEBPE+ctrl\",\n",
103 |     "    \"ctrl+CEBPE\",\n",
104 |     "    \"CEBPE+RUNX1T1\",\n",
105 |     "    \"CEBPE+CEBPB\",\n",
106 |     "    \"FOSB+CEBPB\",\n",
107 |     "    \"ETS2+CEBPE\",\n",
108 |     "]\n",
109 |     "\n",
110 |     "MEGAKARYOCYTE = [\n",
111 |     "    \"ctrl+ETS2\",\n",
112 |     "    \"MAPK1+ctrl\",\n",
113 |     "    \"ctrl+MAPK1\",\n",
114 |     "    \"ETS2+MAPK1\",\n",
115 |     "    \"CEBPB+MAPK1\",\n",
116 |     "    \"MAPK1+TGFBR2\",\n",
117 |     "]\n",
118 |     "\n",
119 |     "PRO_GROWTH = [\n",
120 |     "    \"CEBPE+KLF1\",\n",
121 |     "    \"KLF1+MAP2K6\",\n",
122 |     "    \"AHR+KLF1\",\n",
123 |     "    \"ctrl+KLF1\",\n",
124 |     "    \"KLF1+ctrl\",\n",
125 |     "    \"KLF1+BAK1\",\n",
126 |     "    \"KLF1+TGFBR2\",\n",
127 |     "]\n",
128 |     "\n",
129 |     "\n",
130 |     "def download_norman_2019(output_path: str) -> None:\n",
131 |     "    \"\"\"\n",
132 |     "    Download Norman et al. 2019 data and metadata files from the hosting URLs.\n",
133 |     "\n",
134 |     "    Args:\n",
135 |     "    ----\n",
136 |     "        output_path: Directory in which to store the downloaded\n",
137 |     "        gzip-compressed files.\n",
138 |     "\n",
139 |     "    Returns\n",
140 |     "    -------\n",
141 |     "        None. Files are downloaded to output_path.\n",
142 |     "    \"\"\"\n",
143 |     "\n",
144 |     "    file_urls = (\n",
145 |     "        \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n",
146 |     "        \"/GSE133344_filtered_matrix.mtx.gz\",\n",
147 |     "        \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n",
148 |     "        \"/GSE133344_filtered_genes.tsv.gz\",\n",
149 |     "        \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n",
150 |     "        \"/GSE133344_filtered_barcodes.tsv.gz\",\n",
151 |     "        \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n",
152 |     "        \"/GSE133344_filtered_cell_identities.csv.gz\",\n",
153 |     "    )\n",
154 |     "\n",
155 |     "    for url in file_urls:\n",
156 |     "        output_filename = os.path.join(output_path, url.split(\"/\")[-1])\n",
157 |     "        download_binary_file(url, output_filename)\n",
158 |     "\n",
159 |     "\n",
160 |     "def read_norman_2019(file_directory: str) -> coo_matrix:\n",
161 |     "    \"\"\"\n",
162 |     "    Read the expression data for Norman et al. 2019 in the given directory.\n",
163 |     "\n",
164 |     "    Args:\n",
165 |     "    ----\n",
166 |     "        file_directory: Directory containing Norman et al. 2019 data.\n",
167 |     "\n",
168 |     "    Returns\n",
169 |     "    -------\n",
170 |     "        A sparse matrix containing single-cell gene expression counts, with rows\n",
171 |     "        representing genes and columns representing cells.\n",
172 |     "    \"\"\"\n",
173 |     "\n",
174 |     "    with gzip.open(\n",
175 |     "        os.path.join(file_directory, \"GSE133344_filtered_matrix.mtx.gz\"), \"rb\"\n",
176 |     "    ) as f:\n",
177 |     "        matrix = mmread(f)\n",
178 |     "\n",
179 |     "    return matrix"
180 |    ]
181 |   },
182 |   {
183 |    "cell_type": "code",
184 |    "execution_count": 2,
185 |    "id": "21457d17-ce85-405e-af71-b98f55cd9dfc",
186 |    "metadata": {},
187 |    "outputs": [
188 |     {
189 |      "name": "stdout",
190 |      "output_type": "stream",
191 |      "text": [
192 |       "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_matrix.mtx.gz at ./GSE133344_filtered_matrix.mtx.gz\n",
193 |       "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_genes.tsv.gz at ./GSE133344_filtered_genes.tsv.gz\n",
194 |       "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_barcodes.tsv.gz at ./GSE133344_filtered_barcodes.tsv.gz\n",
195 |       "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_cell_identities.csv.gz at ./GSE133344_filtered_cell_identities.csv.gz\n"
196 |      ]
197 |     },
198 |     {
199 |      "name": "stderr",
200 |      "output_type": "stream",
201 |      "text": [
202 |       "Trying to set attribute `.obs` of view, copying.\n"
203 |      ]
204 |     }
205 |    ],
206 |    "source": [
207 |     "download_path = \"./norman2019/\"\n",
208 |     "\n",
209 |     "download_norman_2019(download_path)\n",
210 |     "\n",
211 |     "matrix = read_norman_2019(download_path)\n",
212 |     "\n",
213 |     "# List of cell barcodes. The barcodes in this list are stored in the same order\n",
214 |     "# as cells are in the count matrix.\n",
215 |     "cell_barcodes = pd.read_csv(\n",
216 |     "    os.path.join(download_path, \"GSE133344_filtered_barcodes.tsv.gz\"),\n",
217 |     "    sep=\"\\t\",\n",
218 |     "    header=None,\n",
219 |     "    names=[\"cell_barcode\"],\n",
220 |     ")\n",
221 |     "\n",
222 |     "# IDs/names of the gene features.\n",
223 |     "gene_list = pd.read_csv(\n",
224 |     "    os.path.join(download_path, \"GSE133344_filtered_genes.tsv.gz\"),\n",
225 |     "    sep=\"\\t\",\n",
226 |     "    header=None,\n",
227 |     "    names=[\"gene_id\", \"gene_name\"],\n",
228 |     ")\n",
229 |     "\n",
230 |     "# Dataframe where each row corresponds to a cell, and each column corresponds\n",
231 |     "# to a gene feature.\n",
232 |     "matrix = pd.DataFrame(\n",
233 |     "    matrix.transpose().todense(),\n",
234 |     "    columns=gene_list[\"gene_id\"],\n",
235 |     "    index=cell_barcodes[\"cell_barcode\"],\n",
236 |     "    dtype=\"int32\",\n",
237 |     ")\n",
238 |     "\n",
239 |     "# Dataframe mapping cell barcodes to metadata about that cell (e.g. which CRISPR\n",
240 |     "# guides were applied to that cell). Unfortunately, this list has a different\n",
241 |     "# ordering from the count matrix, so we have to be careful combining the metadata\n",
242 |     "# and count data.\n",
243 |     "cell_identities = pd.read_csv(\n",
244 |     "    os.path.join(download_path, \"GSE133344_filtered_cell_identities.csv.gz\")\n",
245 |     ").set_index(\"cell_barcode\")\n",
246 |     "\n",
247 |     "# This merge call reorders our metadata dataframe to match the ordering in the\n",
248 |     "# count matrix. Some cells in `cell_barcodes` do not have metadata associated with\n",
249 |     "# them, and their metadata values will be filled in as NaN.\n",
250 |     "aligned_metadata = pd.merge(\n",
251 |     "    cell_barcodes,\n",
252 |     "    cell_identities,\n",
253 |     "    left_on=\"cell_barcode\",\n",
254 |     "    right_index=True,\n",
255 |     "    how=\"left\",\n",
256 |     ").set_index(\"cell_barcode\")\n",
257 |     "\n",
258 |     "adata = AnnData(matrix)\n",
259 |     "adata.obs = aligned_metadata\n",
260 |     "\n",
261 |     "# Filter out any cells that don't have metadata values.\n",
262 |     "# Boolean mask over cells: True where every metadata column is populated.\n",
263 |     "# (Vectorized; avoids iterating over rows with `iterrows`.)\n",
264 |     "has_metadata = adata.obs.notnull().all(axis=1).to_numpy()\n",
265 |     "adata = adata[has_metadata, :]\n",
266 |     "\n",
267 |     "# Remove cells with this guide identity, as suggested by the authors. See lines\n",
268 |     "# referring to NegCtrl1_NegCtrl0 in GI_generate_populations.ipynb in the Norman\n",
269 |     "# 2019 paper's GitHub repo https://github.com/thomasmaxwellnorman/Perturbseq_GI/\n",
270 |     "adata = adata[adata.obs[\"guide_identity\"] != \"NegCtrl1_NegCtrl0__NegCtrl1_NegCtrl0\"]\n",
271 |     "\n",
272 |     "# We create a new metadata column with cleaner representations of CRISPR guide\n",
273 |     "# identities. The original format is <Guide1>_<Guide2>__<Guide1>_<Guide2>_<number>\n",
274 |     "adata.obs[\"guide_merged\"] = adata.obs[\"guide_identity\"]\n",
275 |     "\n",
276 |     "control_regex = re.compile(r\"NegCtrl(.*)_NegCtrl(.*)__NegCtrl(.*)_NegCtrl(.*)\")\n",
277 |     "for i in adata.obs[\"guide_merged\"].unique():\n",
278 |     "    if control_regex.match(i):\n",
279 |     "        # For any cells that only had control guides, we don't care about the\n",
280 |     "        # specific IDs of the guides. Here we relabel them just as \"ctrl\".\n",
281 |     "        adata.obs[\"guide_merged\"] = adata.obs[\"guide_merged\"].replace(i, \"ctrl\")\n",
282 |     "    else:\n",
283 |     "        # Otherwise, we reformat the guide label to be <Guide1>+<Guide2>. If Guide1\n",
284 |     "        # or Guide2 was a control, we replace it with \"ctrl\".\n",
285 |     "        split = i.split(\"__\")[0]\n",
286 |     "        split = split.split(\"_\")\n",
287 |     "        for j, guide in enumerate(split):\n",
288 |     "            if \"NegCtrl\" in guide:\n",
289 |     "                split[j] = \"ctrl\"\n",
290 |     "        adata.obs[\"guide_merged\"] = adata.obs[\"guide_merged\"].replace(i, f\"{split[0]}+{split[1]}\")\n",
291 |     "\n",
292 |     "guides_to_programs = {}\n",
293 |     "guides_to_programs.update(dict.fromkeys(G1_CYCLE, \"G1 cell cycle arrest\"))\n",
294 |     "guides_to_programs.update(dict.fromkeys(ERYTHROID, \"Erythroid\"))\n",
295 |     "guides_to_programs.update(dict.fromkeys(PIONEER_FACTORS, \"Pioneer factors\"))\n",
296 |     "guides_to_programs.update(\n",
297 |     "    dict.fromkeys(GRANULOCYTE_APOPTOSIS, \"Granulocyte/apoptosis\")\n",
298 |     ")\n",
299 |     "guides_to_programs.update(dict.fromkeys(PRO_GROWTH, \"Pro-growth\"))\n",
300 |     "guides_to_programs.update(dict.fromkeys(MEGAKARYOCYTE, \"Megakaryocyte\"))\n",
301 |     "guides_to_programs.update(dict.fromkeys([\"ctrl\"], \"Ctrl\"))\n",
302 |     "\n",
303 |     "adata.obs[\"gene_program\"] = [guides_to_programs.get(x, \"N/A\") for x in adata.obs[\"guide_merged\"]]\n",
304 |     "adata.obs[\"good_coverage\"] = adata.obs[\"good_coverage\"].astype(bool)"
305 |    ]
306 |   },
307 |   {
308 |    "cell_type": "code",
309 |    "execution_count": 5,
310 |    "id": "72c5c54f",
311 |    "metadata": {},
312 |    "outputs": [],
313 |    "source": [
314 |     "adata.write('Norman_2019_raw.h5ad')"
315 |    ]
316 |   }
317 |  ],
318 |  "metadata": {
319 |   "kernelspec": {
320 |    "display_name": "Python 3 (ipykernel)",
321 |    "language": "python",
322 |    "name": "python3"
323 |   },
324 |   "language_info": {
325 |    "codemirror_mode": {
326 |     "name": "ipython",
327 |     "version": 3
328 |    },
329 |    "file_extension": ".py",
330 |    "mimetype": "text/x-python",
331 |    "name": "python",
332 |    "nbconvert_exporter": "python",
333 |    "pygments_lexer": "ipython3",
334 |    "version": "3.7.0"
335 |   }
336 |  },
337 |  "nbformat": 4,
338 |  "nbformat_minor": 5
339 | }
340 | 


--------------------------------------------------------------------------------
/datasets/Schraivogel_2020.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "code",
   5 |    "execution_count": 4,
   6 |    "id": "78978bc6",
   7 |    "metadata": {},
   8 |    "outputs": [
   9 |     {
  10 |      "name": "stdout",
  11 |      "output_type": "stream",
  12 |      "text": [
  13 |       "--2022-01-12 17:43:26--  https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE135497&format=file\n",
  14 |       "Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110\n",
  15 |       "Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.\n",
  16 |       "HTTP request sent, awaiting response... 200 OK\n",
  17 |       "Length: 183797760 (175M) [application/x-tar]\n",
  18 |       "Saving to: ‘GSE135497_RAW.tar’\n",
  19 |       "\n",
  20 |       "GSE135497_RAW.tar   100%[===================>] 175.28M  2.08MB/s    in 2m 6s   \n",
  21 |       "\n",
  22 |       "2022-01-12 17:45:33 (1.39 MB/s) - ‘GSE135497_RAW.tar’ saved [183797760/183797760]\n",
  23 |       "\n",
  24 |       "GSM4012674_TASC_TAR_p1r1.counts.csv.gz\n",
  25 |       "GSM4012675_TASC_TAR_p1r2.counts.csv.gz\n",
  26 |       "GSM4012676_TASC_TAR_p2r1.counts.csv.gz\n",
  27 |       "GSM4012677_TASC_TAR_p2r2.counts.csv.gz\n",
  28 |       "GSM4012678_TASC_TAR_p12.counts.csv.gz\n",
  29 |       "GSM4012679_TASC_TAR_p123.counts.csv.gz\n",
  30 |       "GSM4012680_TASC_TAR_DropSeq.counts.csv.gz\n",
  31 |       "GSM4012682_WTX_REF.counts.txt.gz\n",
  32 |       "GSM4012683_WTX_REF_DropSeq.counts.csv.gz\n",
  33 |       "GSM4012684_TASC_CT_totalBM_sample1.counts.csv.gz\n",
  34 |       "GSM4012685_TASC_CT_totalBM_sample2.counts.csv.gz\n",
  35 |       "GSM4012686_TASC_CT_kitBM_sample1.counts.csv.gz\n",
  36 |       "GSM4012687_TASC_CT_kitBM_sample2.counts.csv.gz\n",
  37 |       "GSM4012688_TASC_DIFFEX_sample1.counts.csv.gz\n",
  38 |       "GSM4012688_TASC_DIFFEX_sample1.pertStatus.csv.gz\n",
  39 |       "GSM4012689_TASC_DIFFEX_sample2.counts.csv.gz\n",
  40 |       "GSM4012689_TASC_DIFFEX_sample2.pertStatus.csv.gz\n",
  41 |       "GSM4012690_WTX_DIFFEX_sample1.counts.csv.gz\n",
  42 |       "GSM4012690_WTX_DIFFEX_sample1.pertStatus.csv.gz\n",
  43 |       "GSM4012691_WTX_DIFFEX_sample2.counts.csv.gz\n",
  44 |       "GSM4012691_WTX_DIFFEX_sample2.pertStatus.csv.gz\n",
  45 |       "GSM4012692_WTX_DIFFEX_sample3.counts.csv.gz\n",
  46 |       "GSM4012692_WTX_DIFFEX_sample3.pertStatus.csv.gz\n",
  47 |       "GSM4012693_WTX_DIFFEX_sample4.counts.csv.gz\n",
  48 |       "GSM4012693_WTX_DIFFEX_sample4.pertStatus.csv.gz\n",
  49 |       "GSM4012694_TASC_SCREEN_chr8_sample1.counts.csv.gz\n",
  50 |       "GSM4012694_TASC_SCREEN_chr8_sample1.pertStatus.csv.gz\n",
  51 |       "GSM4012695_TASC_SCREEN_chr8_sample2.counts.csv.gz\n",
  52 |       "GSM4012695_TASC_SCREEN_chr8_sample2.pertStatus.csv.gz\n",
  53 |       "GSM4012696_TASC_SCREEN_chr8_sample3.counts.csv.gz\n",
  54 |       "GSM4012696_TASC_SCREEN_chr8_sample3.pertStatus.csv.gz\n",
  55 |       "GSM4012697_TASC_SCREEN_chr8_sample4.counts.csv.gz\n",
  56 |       "GSM4012697_TASC_SCREEN_chr8_sample4.pertStatus.csv.gz\n",
  57 |       "GSM4012698_TASC_SCREEN_chr8_sample5.counts.csv.gz\n",
  58 |       "GSM4012698_TASC_SCREEN_chr8_sample5.pertStatus.csv.gz\n",
  59 |       "GSM4012699_TASC_SCREEN_chr8_sample6.counts.csv.gz\n",
  60 |       "GSM4012699_TASC_SCREEN_chr8_sample6.pertStatus.csv.gz\n",
  61 |       "GSM4012700_TASC_SCREEN_chr8_sample7.counts.csv.gz\n",
  62 |       "GSM4012700_TASC_SCREEN_chr8_sample7.pertStatus.csv.gz\n",
  63 |       "GSM4012701_TASC_SCREEN_chr8_sample8.counts.csv.gz\n",
  64 |       "GSM4012701_TASC_SCREEN_chr8_sample8.pertStatus.csv.gz\n",
  65 |       "GSM4012702_TASC_SCREEN_chr8_sample9.counts.csv.gz\n",
  66 |       "GSM4012702_TASC_SCREEN_chr8_sample9.pertStatus.csv.gz\n",
  67 |       "GSM4012703_TASC_SCREEN_chr8_sample10.counts.csv.gz\n",
  68 |       "GSM4012703_TASC_SCREEN_chr8_sample10.pertStatus.csv.gz\n",
  69 |       "GSM4012704_TASC_SCREEN_chr8_sample11.counts.csv.gz\n",
  70 |       "GSM4012704_TASC_SCREEN_chr8_sample11.pertStatus.csv.gz\n",
  71 |       "GSM4012705_TASC_SCREEN_chr8_sample12.counts.csv.gz\n",
  72 |       "GSM4012705_TASC_SCREEN_chr8_sample12.pertStatus.csv.gz\n",
  73 |       "GSM4012706_TASC_SCREEN_chr8_sample13.counts.csv.gz\n",
  74 |       "GSM4012706_TASC_SCREEN_chr8_sample13.pertStatus.csv.gz\n",
  75 |       "GSM4012707_TASC_SCREEN_chr8_sample14.counts.csv.gz\n",
  76 |       "GSM4012707_TASC_SCREEN_chr8_sample14.pertStatus.csv.gz\n",
  77 |       "GSM4012708_TASC_SCREEN_chr11_sample1.counts.csv.gz\n",
  78 |       "GSM4012708_TASC_SCREEN_chr11_sample1.pertStatus.csv.gz\n",
  79 |       "GSM4012709_TASC_SCREEN_chr11_sample2.counts.csv.gz\n",
  80 |       "GSM4012709_TASC_SCREEN_chr11_sample2.pertStatus.csv.gz\n",
  81 |       "GSM4012710_TASC_SCREEN_chr11_sample3.counts.csv.gz\n",
  82 |       "GSM4012710_TASC_SCREEN_chr11_sample3.pertStatus.csv.gz\n",
  83 |       "GSM4012711_TASC_SCREEN_chr11_sample4.counts.csv.gz\n",
  84 |       "GSM4012711_TASC_SCREEN_chr11_sample4.pertStatus.csv.gz\n",
  85 |       "GSM4012712_TASC_SCREEN_chr11_sample5.counts.csv.gz\n",
  86 |       "GSM4012712_TASC_SCREEN_chr11_sample5.pertStatus.csv.gz\n",
  87 |       "GSM4012713_TASC_SCREEN_chr11_sample6.counts.csv.gz\n",
  88 |       "GSM4012713_TASC_SCREEN_chr11_sample6.pertStatus.csv.gz\n",
  89 |       "GSM4012714_TASC_SCREEN_chr11_sample7.counts.csv.gz\n",
  90 |       "GSM4012714_TASC_SCREEN_chr11_sample7.pertStatus.csv.gz\n",
  91 |       "GSM4012715_TASC_SCREEN_chr11_sample8.counts.csv.gz\n",
  92 |       "GSM4012715_TASC_SCREEN_chr11_sample8.pertStatus.csv.gz\n",
  93 |       "GSM4012716_TASC_SCREEN_chr11_sample9.counts.csv.gz\n",
  94 |       "GSM4012716_TASC_SCREEN_chr11_sample9.pertStatus.csv.gz\n",
  95 |       "GSM4012717_TASC_SCREEN_chr11_sample10.counts.csv.gz\n",
  96 |       "GSM4012717_TASC_SCREEN_chr11_sample10.pertStatus.csv.gz\n",
  97 |       "GSM4012718_TASC_SCREEN_chr11_sample11.counts.csv.gz\n",
  98 |       "GSM4012718_TASC_SCREEN_chr11_sample11.pertStatus.csv.gz\n",
  99 |       "GSM4012719_TASC_SCREEN_chr11_sample13.counts.csv.gz\n",
 100 |       "GSM4012719_TASC_SCREEN_chr11_sample13.pertStatus.csv.gz\n",
 101 |       "GSM4012720_TASC_SCREEN_chr11_sample14.counts.csv.gz\n",
 102 |       "GSM4012720_TASC_SCREEN_chr11_sample14.pertStatus.csv.gz\n",
 103 |       "GSM4265995_TASC_DEEP_p1.counts.csv.gz\n",
 104 |       "GSM4265998_TASC_DEEP_p2.counts.csv.gz\n",
 105 |       "GSM4266000_TASC_DEEP_p3.counts.csv.gz\n",
 106 |       "GSM4266002_TASC_DEEP_p3lung.counts.csv.gz\n",
 107 |       "GSM4266004_TASC_TAR_L1000.counts.csv.gz\n",
 108 |       "GSM4266004_TASC_TAR_L1000.pertStatus.csv.gz\n",
 109 |       "GSM4266006_WTX_REF_Deep.counts.csv.gz\n",
 110 |       "GSM4266008_WTX_REF_Mix.counts.csv.gz\n",
 111 |       "GSM4266010_WTX_REF_Lung.counts.csv.gz\n",
 112 |       "GSM4266012_TASC_REDESIGN_control.counts.csv.gz\n",
 113 |       "GSM4266012_TASC_REDESIGN_control.pertStatus.csv.gz\n",
 114 |       "GSM4266014_TASC_REDESIGN_v1.counts.csv.gz\n",
 115 |       "GSM4266014_TASC_REDESIGN_v1.pertStatus.csv.gz\n",
 116 |       "GSM4266017_TASC_REDESIGN_v2.counts.csv.gz\n",
 117 |       "GSM4266017_TASC_REDESIGN_v2.pertStatus.csv.gz\n"
 118 |      ]
 119 |     }
 120 |    ],
 121 |    "source": [
 122 |     "!wget -O GSE135497_RAW.tar 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE135497&format=file'\n",
 123 |     "!tar -xvf GSE135497_RAW.tar"
 124 |    ]
 125 |   },
 126 |   {
 127 |    "cell_type": "code",
 128 |    "execution_count": 8,
 129 |    "id": "772331e2",
 130 |    "metadata": {},
 131 |    "outputs": [],
 132 |    "source": [
 133 |     "import pandas as pd\n",
 134 |     "meta = pd.read_csv('GSM4012703_TASC_SCREEN_chr8_sample10.pertStatus.csv.gz', index_col=0)"
 135 |    ]
 136 |   },
 137 |   {
 138 |    "cell_type": "code",
 139 |    "execution_count": 11,
 140 |    "id": "0e6911f4",
 141 |    "metadata": {},
 142 |    "outputs": [
 143 |     {
 144 |      "data": {
 145 |       "text/html": [
 146 |        "<div>\n",
 147 |        "<style scoped>\n",
 148 |        "    .dataframe tbody tr th:only-of-type {\n",
 149 |        "        vertical-align: middle;\n",
 150 |        "    }\n",
 151 |        "\n",
 152 |        "    .dataframe tbody tr th {\n",
 153 |        "        vertical-align: top;\n",
 154 |        "    }\n",
 155 |        "\n",
 156 |        "    .dataframe thead th {\n",
 157 |        "        text-align: right;\n",
 158 |        "    }\n",
 159 |        "</style>\n",
 160 |        "<table border=\"1\" class=\"dataframe\">\n",
 161 |        "  <thead>\n",
 162 |        "    <tr style=\"text-align: right;\">\n",
 163 |        "      <th></th>\n",
 164 |        "      <th>AACCAACGACGAGGACAAGCGATG</th>\n",
 165 |        "      <th>AACCAACGACTTTCAAGCGATAGC</th>\n",
 166 |        "      <th>AACCAACGAGAGCGACAAGCGATG</th>\n",
 167 |        "      <th>AACCAACGCCTAGCTCAAGCGATG</th>\n",
 168 |        "      <th>AACCAACGTGGGCGTCAAGCGATG</th>\n",
 169 |        "      <th>AGGCAGAACCGTTCAAGCGATCCC</th>\n",
 170 |        "      <th>AGGCAGAACCGTTCAAGCGATTCT</th>\n",
 171 |        "      <th>AGGCAGAACTGATAGCAAGCGATG</th>\n",
 172 |        "      <th>AGGCAGAAGCTGCTTCAAGCGATG</th>\n",
 173 |        "      <th>AGGCAGAAGTCCTCAAGCGATTCT</th>\n",
 174 |        "      <th>...</th>\n",
 175 |        "      <th>TAGGCATGGGAGCAACAAGCGATG</th>\n",
 176 |        "      <th>TAGGCATGGTCCTCAAGCGATATA</th>\n",
 177 |        "      <th>TAGGCATGTTTGCGCCAAGCGATG</th>\n",
 178 |        "      <th>TCCTGAGCAGCAGCCCAAGCGATG</th>\n",
 179 |        "      <th>TCCTGAGCATGGGAGCAAGCGATG</th>\n",
 180 |        "      <th>TCCTGAGCGCATACAAGCGATATA</th>\n",
 181 |        "      <th>TCCTGAGCGTGTTAGCAAGCGATG</th>\n",
 182 |        "      <th>TGATTGACACGGGCTCAAGCGATG</th>\n",
 183 |        "      <th>TGATTGACGACTACAAGCGATCCC</th>\n",
 184 |        "      <th>TGATTGACTTCTACAAGCGATGAC</th>\n",
 185 |        "    </tr>\n",
 186 |        "  </thead>\n",
 187 |        "  <tbody>\n",
 188 |        "    <tr>\n",
 189 |        "      <th>CCNE2_+_95907328.23-P1P2</th>\n",
 190 |        "      <td>0</td>\n",
 191 |        "      <td>0</td>\n",
 192 |        "      <td>0</td>\n",
 193 |        "      <td>0</td>\n",
 194 |        "      <td>0</td>\n",
 195 |        "      <td>0</td>\n",
 196 |        "      <td>0</td>\n",
 197 |        "      <td>0</td>\n",
 198 |        "      <td>0</td>\n",
 199 |        "      <td>0</td>\n",
 200 |        "      <td>...</td>\n",
 201 |        "      <td>0</td>\n",
 202 |        "      <td>0</td>\n",
 203 |        "      <td>0</td>\n",
 204 |        "      <td>0</td>\n",
 205 |        "      <td>0</td>\n",
 206 |        "      <td>0</td>\n",
 207 |        "      <td>0</td>\n",
 208 |        "      <td>0</td>\n",
 209 |        "      <td>0</td>\n",
 210 |        "      <td>0</td>\n",
 211 |        "    </tr>\n",
 212 |        "    <tr>\n",
 213 |        "      <th>CCNE2_+_95907382.23-P1P2</th>\n",
 214 |        "      <td>0</td>\n",
 215 |        "      <td>0</td>\n",
 216 |        "      <td>0</td>\n",
 217 |        "      <td>0</td>\n",
 218 |        "      <td>0</td>\n",
 219 |        "      <td>0</td>\n",
 220 |        "      <td>0</td>\n",
 221 |        "      <td>0</td>\n",
 222 |        "      <td>0</td>\n",
 223 |        "      <td>0</td>\n",
 224 |        "      <td>...</td>\n",
 225 |        "      <td>0</td>\n",
 226 |        "      <td>0</td>\n",
 227 |        "      <td>0</td>\n",
 228 |        "      <td>0</td>\n",
 229 |        "      <td>0</td>\n",
 230 |        "      <td>0</td>\n",
 231 |        "      <td>0</td>\n",
 232 |        "      <td>0</td>\n",
 233 |        "      <td>0</td>\n",
 234 |        "      <td>0</td>\n",
 235 |        "    </tr>\n",
 236 |        "    <tr>\n",
 237 |        "      <th>CCNE2_+_95907406.23-P1P2</th>\n",
 238 |        "      <td>0</td>\n",
 239 |        "      <td>0</td>\n",
 240 |        "      <td>0</td>\n",
 241 |        "      <td>0</td>\n",
 242 |        "      <td>0</td>\n",
 243 |        "      <td>0</td>\n",
 244 |        "      <td>0</td>\n",
 245 |        "      <td>0</td>\n",
 246 |        "      <td>0</td>\n",
 247 |        "      <td>0</td>\n",
 248 |        "      <td>...</td>\n",
 249 |        "      <td>0</td>\n",
 250 |        "      <td>0</td>\n",
 251 |        "      <td>0</td>\n",
 252 |        "      <td>0</td>\n",
 253 |        "      <td>0</td>\n",
 254 |        "      <td>0</td>\n",
 255 |        "      <td>0</td>\n",
 256 |        "      <td>0</td>\n",
 257 |        "      <td>0</td>\n",
 258 |        "      <td>0</td>\n",
 259 |        "    </tr>\n",
 260 |        "    <tr>\n",
 261 |        "      <th>CCNE2_-_95907017.23-P1P2</th>\n",
 262 |        "      <td>0</td>\n",
 263 |        "      <td>0</td>\n",
 264 |        "      <td>0</td>\n",
 265 |        "      <td>0</td>\n",
 266 |        "      <td>0</td>\n",
 267 |        "      <td>0</td>\n",
 268 |        "      <td>0</td>\n",
 269 |        "      <td>0</td>\n",
 270 |        "      <td>0</td>\n",
 271 |        "      <td>0</td>\n",
 272 |        "      <td>...</td>\n",
 273 |        "      <td>0</td>\n",
 274 |        "      <td>0</td>\n",
 275 |        "      <td>0</td>\n",
 276 |        "      <td>0</td>\n",
 277 |        "      <td>0</td>\n",
 278 |        "      <td>0</td>\n",
 279 |        "      <td>0</td>\n",
 280 |        "      <td>0</td>\n",
 281 |        "      <td>0</td>\n",
 282 |        "      <td>0</td>\n",
 283 |        "    </tr>\n",
 284 |        "    <tr>\n",
 285 |        "      <th>CPQ_+_97657557.23-P1P2</th>\n",
 286 |        "      <td>0</td>\n",
 287 |        "      <td>0</td>\n",
 288 |        "      <td>0</td>\n",
 289 |        "      <td>0</td>\n",
 290 |        "      <td>0</td>\n",
 291 |        "      <td>0</td>\n",
 292 |        "      <td>0</td>\n",
 293 |        "      <td>0</td>\n",
 294 |        "      <td>0</td>\n",
 295 |        "      <td>0</td>\n",
 296 |        "      <td>...</td>\n",
 297 |        "      <td>0</td>\n",
 298 |        "      <td>0</td>\n",
 299 |        "      <td>0</td>\n",
 300 |        "      <td>0</td>\n",
 301 |        "      <td>0</td>\n",
 302 |        "      <td>0</td>\n",
 303 |        "      <td>0</td>\n",
 304 |        "      <td>0</td>\n",
 305 |        "      <td>0</td>\n",
 306 |        "      <td>0</td>\n",
 307 |        "    </tr>\n",
 308 |        "    <tr>\n",
 309 |        "      <th>...</th>\n",
 310 |        "      <td>...</td>\n",
 311 |        "      <td>...</td>\n",
 312 |        "      <td>...</td>\n",
 313 |        "      <td>...</td>\n",
 314 |        "      <td>...</td>\n",
 315 |        "      <td>...</td>\n",
 316 |        "      <td>...</td>\n",
 317 |        "      <td>...</td>\n",
 318 |        "      <td>...</td>\n",
 319 |        "      <td>...</td>\n",
 320 |        "      <td>...</td>\n",
 321 |        "      <td>...</td>\n",
 322 |        "      <td>...</td>\n",
 323 |        "      <td>...</td>\n",
 324 |        "      <td>...</td>\n",
 325 |        "      <td>...</td>\n",
 326 |        "      <td>...</td>\n",
 327 |        "      <td>...</td>\n",
 328 |        "      <td>...</td>\n",
 329 |        "      <td>...</td>\n",
 330 |        "      <td>...</td>\n",
 331 |        "    </tr>\n",
 332 |        "    <tr>\n",
 333 |        "      <th>non-targeting_00025</th>\n",
 334 |        "      <td>0</td>\n",
 335 |        "      <td>0</td>\n",
 336 |        "      <td>0</td>\n",
 337 |        "      <td>0</td>\n",
 338 |        "      <td>0</td>\n",
 339 |        "      <td>0</td>\n",
 340 |        "      <td>0</td>\n",
 341 |        "      <td>0</td>\n",
 342 |        "      <td>0</td>\n",
 343 |        "      <td>0</td>\n",
 344 |        "      <td>...</td>\n",
 345 |        "      <td>0</td>\n",
 346 |        "      <td>0</td>\n",
 347 |        "      <td>0</td>\n",
 348 |        "      <td>0</td>\n",
 349 |        "      <td>0</td>\n",
 350 |        "      <td>0</td>\n",
 351 |        "      <td>0</td>\n",
 352 |        "      <td>0</td>\n",
 353 |        "      <td>0</td>\n",
 354 |        "      <td>0</td>\n",
 355 |        "    </tr>\n",
 356 |        "    <tr>\n",
 357 |        "      <th>non-targeting_00026</th>\n",
 358 |        "      <td>0</td>\n",
 359 |        "      <td>0</td>\n",
 360 |        "      <td>0</td>\n",
 361 |        "      <td>0</td>\n",
 362 |        "      <td>0</td>\n",
 363 |        "      <td>0</td>\n",
 364 |        "      <td>0</td>\n",
 365 |        "      <td>0</td>\n",
 366 |        "      <td>0</td>\n",
 367 |        "      <td>0</td>\n",
 368 |        "      <td>...</td>\n",
 369 |        "      <td>0</td>\n",
 370 |        "      <td>0</td>\n",
 371 |        "      <td>0</td>\n",
 372 |        "      <td>0</td>\n",
 373 |        "      <td>0</td>\n",
 374 |        "      <td>0</td>\n",
 375 |        "      <td>0</td>\n",
 376 |        "      <td>0</td>\n",
 377 |        "      <td>0</td>\n",
 378 |        "      <td>0</td>\n",
 379 |        "    </tr>\n",
 380 |        "    <tr>\n",
 381 |        "      <th>non-targeting_00027</th>\n",
 382 |        "      <td>0</td>\n",
 383 |        "      <td>0</td>\n",
 384 |        "      <td>0</td>\n",
 385 |        "      <td>0</td>\n",
 386 |        "      <td>0</td>\n",
 387 |        "      <td>0</td>\n",
 388 |        "      <td>0</td>\n",
 389 |        "      <td>0</td>\n",
 390 |        "      <td>0</td>\n",
 391 |        "      <td>0</td>\n",
 392 |        "      <td>...</td>\n",
 393 |        "      <td>0</td>\n",
 394 |        "      <td>0</td>\n",
 395 |        "      <td>0</td>\n",
 396 |        "      <td>0</td>\n",
 397 |        "      <td>0</td>\n",
 398 |        "      <td>0</td>\n",
 399 |        "      <td>0</td>\n",
 400 |        "      <td>0</td>\n",
 401 |        "      <td>0</td>\n",
 402 |        "      <td>0</td>\n",
 403 |        "    </tr>\n",
 404 |        "    <tr>\n",
 405 |        "      <th>non-targeting_00028</th>\n",
 406 |        "      <td>0</td>\n",
 407 |        "      <td>0</td>\n",
 408 |        "      <td>0</td>\n",
 409 |        "      <td>0</td>\n",
 410 |        "      <td>0</td>\n",
 411 |        "      <td>0</td>\n",
 412 |        "      <td>0</td>\n",
 413 |        "      <td>0</td>\n",
 414 |        "      <td>0</td>\n",
 415 |        "      <td>0</td>\n",
 416 |        "      <td>...</td>\n",
 417 |        "      <td>0</td>\n",
 418 |        "      <td>0</td>\n",
 419 |        "      <td>0</td>\n",
 420 |        "      <td>0</td>\n",
 421 |        "      <td>0</td>\n",
 422 |        "      <td>0</td>\n",
 423 |        "      <td>0</td>\n",
 424 |        "      <td>0</td>\n",
 425 |        "      <td>0</td>\n",
 426 |        "      <td>0</td>\n",
 427 |        "    </tr>\n",
 428 |        "    <tr>\n",
 429 |        "      <th>non-targeting_00029</th>\n",
 430 |        "      <td>0</td>\n",
 431 |        "      <td>0</td>\n",
 432 |        "      <td>0</td>\n",
 433 |        "      <td>0</td>\n",
 434 |        "      <td>0</td>\n",
 435 |        "      <td>0</td>\n",
 436 |        "      <td>0</td>\n",
 437 |        "      <td>0</td>\n",
 438 |        "      <td>0</td>\n",
 439 |        "      <td>0</td>\n",
 440 |        "      <td>...</td>\n",
 441 |        "      <td>0</td>\n",
 442 |        "      <td>0</td>\n",
 443 |        "      <td>0</td>\n",
 444 |        "      <td>0</td>\n",
 445 |        "      <td>0</td>\n",
 446 |        "      <td>0</td>\n",
 447 |        "      <td>0</td>\n",
 448 |        "      <td>0</td>\n",
 449 |        "      <td>0</td>\n",
 450 |        "      <td>0</td>\n",
 451 |        "    </tr>\n",
 452 |        "  </tbody>\n",
 453 |        "</table>\n",
 454 |        "<p>4119 rows × 9517 columns</p>\n",
 455 |        "</div>"
 456 |       ],
 457 |       "text/plain": [
 458 |        "                          AACCAACGACGAGGACAAGCGATG  AACCAACGACTTTCAAGCGATAGC  \\\n",
 459 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 460 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 461 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 462 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 463 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 464 |        "...                                            ...                       ...   \n",
 465 |        "non-targeting_00025                              0                         0   \n",
 466 |        "non-targeting_00026                              0                         0   \n",
 467 |        "non-targeting_00027                              0                         0   \n",
 468 |        "non-targeting_00028                              0                         0   \n",
 469 |        "non-targeting_00029                              0                         0   \n",
 470 |        "\n",
 471 |        "                          AACCAACGAGAGCGACAAGCGATG  AACCAACGCCTAGCTCAAGCGATG  \\\n",
 472 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 473 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 474 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 475 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 476 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 477 |        "...                                            ...                       ...   \n",
 478 |        "non-targeting_00025                              0                         0   \n",
 479 |        "non-targeting_00026                              0                         0   \n",
 480 |        "non-targeting_00027                              0                         0   \n",
 481 |        "non-targeting_00028                              0                         0   \n",
 482 |        "non-targeting_00029                              0                         0   \n",
 483 |        "\n",
 484 |        "                          AACCAACGTGGGCGTCAAGCGATG  AGGCAGAACCGTTCAAGCGATCCC  \\\n",
 485 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 486 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 487 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 488 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 489 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 490 |        "...                                            ...                       ...   \n",
 491 |        "non-targeting_00025                              0                         0   \n",
 492 |        "non-targeting_00026                              0                         0   \n",
 493 |        "non-targeting_00027                              0                         0   \n",
 494 |        "non-targeting_00028                              0                         0   \n",
 495 |        "non-targeting_00029                              0                         0   \n",
 496 |        "\n",
 497 |        "                          AGGCAGAACCGTTCAAGCGATTCT  AGGCAGAACTGATAGCAAGCGATG  \\\n",
 498 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 499 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 500 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 501 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 502 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 503 |        "...                                            ...                       ...   \n",
 504 |        "non-targeting_00025                              0                         0   \n",
 505 |        "non-targeting_00026                              0                         0   \n",
 506 |        "non-targeting_00027                              0                         0   \n",
 507 |        "non-targeting_00028                              0                         0   \n",
 508 |        "non-targeting_00029                              0                         0   \n",
 509 |        "\n",
 510 |        "                          AGGCAGAAGCTGCTTCAAGCGATG  AGGCAGAAGTCCTCAAGCGATTCT  \\\n",
 511 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 512 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 513 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 514 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 515 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 516 |        "...                                            ...                       ...   \n",
 517 |        "non-targeting_00025                              0                         0   \n",
 518 |        "non-targeting_00026                              0                         0   \n",
 519 |        "non-targeting_00027                              0                         0   \n",
 520 |        "non-targeting_00028                              0                         0   \n",
 521 |        "non-targeting_00029                              0                         0   \n",
 522 |        "\n",
 523 |        "                          ...  TAGGCATGGGAGCAACAAGCGATG  \\\n",
 524 |        "CCNE2_+_95907328.23-P1P2  ...                         0   \n",
 525 |        "CCNE2_+_95907382.23-P1P2  ...                         0   \n",
 526 |        "CCNE2_+_95907406.23-P1P2  ...                         0   \n",
 527 |        "CCNE2_-_95907017.23-P1P2  ...                         0   \n",
 528 |        "CPQ_+_97657557.23-P1P2    ...                         0   \n",
 529 |        "...                       ...                       ...   \n",
 530 |        "non-targeting_00025       ...                         0   \n",
 531 |        "non-targeting_00026       ...                         0   \n",
 532 |        "non-targeting_00027       ...                         0   \n",
 533 |        "non-targeting_00028       ...                         0   \n",
 534 |        "non-targeting_00029       ...                         0   \n",
 535 |        "\n",
 536 |        "                          TAGGCATGGTCCTCAAGCGATATA  TAGGCATGTTTGCGCCAAGCGATG  \\\n",
 537 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 538 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 539 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 540 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 541 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 542 |        "...                                            ...                       ...   \n",
 543 |        "non-targeting_00025                              0                         0   \n",
 544 |        "non-targeting_00026                              0                         0   \n",
 545 |        "non-targeting_00027                              0                         0   \n",
 546 |        "non-targeting_00028                              0                         0   \n",
 547 |        "non-targeting_00029                              0                         0   \n",
 548 |        "\n",
 549 |        "                          TCCTGAGCAGCAGCCCAAGCGATG  TCCTGAGCATGGGAGCAAGCGATG  \\\n",
 550 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 551 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 552 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 553 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 554 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 555 |        "...                                            ...                       ...   \n",
 556 |        "non-targeting_00025                              0                         0   \n",
 557 |        "non-targeting_00026                              0                         0   \n",
 558 |        "non-targeting_00027                              0                         0   \n",
 559 |        "non-targeting_00028                              0                         0   \n",
 560 |        "non-targeting_00029                              0                         0   \n",
 561 |        "\n",
 562 |        "                          TCCTGAGCGCATACAAGCGATATA  TCCTGAGCGTGTTAGCAAGCGATG  \\\n",
 563 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 564 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 565 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 566 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 567 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 568 |        "...                                            ...                       ...   \n",
 569 |        "non-targeting_00025                              0                         0   \n",
 570 |        "non-targeting_00026                              0                         0   \n",
 571 |        "non-targeting_00027                              0                         0   \n",
 572 |        "non-targeting_00028                              0                         0   \n",
 573 |        "non-targeting_00029                              0                         0   \n",
 574 |        "\n",
 575 |        "                          TGATTGACACGGGCTCAAGCGATG  TGATTGACGACTACAAGCGATCCC  \\\n",
 576 |        "CCNE2_+_95907328.23-P1P2                         0                         0   \n",
 577 |        "CCNE2_+_95907382.23-P1P2                         0                         0   \n",
 578 |        "CCNE2_+_95907406.23-P1P2                         0                         0   \n",
 579 |        "CCNE2_-_95907017.23-P1P2                         0                         0   \n",
 580 |        "CPQ_+_97657557.23-P1P2                           0                         0   \n",
 581 |        "...                                            ...                       ...   \n",
 582 |        "non-targeting_00025                              0                         0   \n",
 583 |        "non-targeting_00026                              0                         0   \n",
 584 |        "non-targeting_00027                              0                         0   \n",
 585 |        "non-targeting_00028                              0                         0   \n",
 586 |        "non-targeting_00029                              0                         0   \n",
 587 |        "\n",
 588 |        "                          TGATTGACTTCTACAAGCGATGAC  \n",
 589 |        "CCNE2_+_95907328.23-P1P2                         0  \n",
 590 |        "CCNE2_+_95907382.23-P1P2                         0  \n",
 591 |        "CCNE2_+_95907406.23-P1P2                         0  \n",
 592 |        "CCNE2_-_95907017.23-P1P2                         0  \n",
 593 |        "CPQ_+_97657557.23-P1P2                           0  \n",
 594 |        "...                                            ...  \n",
 595 |        "non-targeting_00025                              0  \n",
 596 |        "non-targeting_00026                              0  \n",
 597 |        "non-targeting_00027                              0  \n",
 598 |        "non-targeting_00028                              0  \n",
 599 |        "non-targeting_00029                              0  \n",
 600 |        "\n",
 601 |        "[4119 rows x 9517 columns]"
 602 |       ]
 603 |      },
 604 |      "execution_count": 11,
 605 |      "metadata": {},
 606 |      "output_type": "execute_result"
 607 |     }
 608 |    ],
 609 |    "source": [
 610 |     "meta"
 611 |    ]
 612 |   },
 613 |   {
 614 |    "cell_type": "code",
 615 |    "execution_count": 16,
 616 |    "id": "76a212c0",
 617 |    "metadata": {},
 618 |    "outputs": [
 619 |     {
 620 |      "data": {
 621 |       "text/plain": [
 622 |        "CCNE2_+_95907328.23-P1P2    0\n",
 623 |        "CCNE2_+_95907382.23-P1P2    0\n",
 624 |        "CCNE2_+_95907406.23-P1P2    0\n",
 625 |        "CCNE2_-_95907017.23-P1P2    0\n",
 626 |        "CPQ_+_97657557.23-P1P2      0\n",
 627 |        "                           ..\n",
 628 |        "non-targeting_00025         0\n",
 629 |        "non-targeting_00026         0\n",
 630 |        "non-targeting_00027         0\n",
 631 |        "non-targeting_00028         0\n",
 632 |        "non-targeting_00029         2\n",
 633 |        "Length: 4119, dtype: int64"
 634 |       ]
 635 |      },
 636 |      "execution_count": 16,
 637 |      "metadata": {},
 638 |      "output_type": "execute_result"
 639 |     }
 640 |    ],
 641 |    "source": [
 642 |     "meta.sum(axis=1)"
 643 |    ]
 644 |   },
 645 |   {
 646 |    "cell_type": "code",
 647 |    "execution_count": 15,
 648 |    "id": "3cfcb0df",
 649 |    "metadata": {},
 650 |    "outputs": [
 651 |     {
 652 |      "data": {
 653 |       "text/plain": [
 654 |        "[19,\n",
 655 |        " 19,\n",
 656 |        " 18,\n",
 657 |        " 17,\n",
 658 |        " 16,\n",
 659 |        " 14,\n",
 660 |        " 14,\n",
 661 |        " 14,\n",
 662 |        " 14,\n",
 663 |        " 14,\n",
 664 |        " 13,\n",
 665 |        " 13,\n",
 666 |        " 13,\n",
 667 |        " 13,\n",
 668 |        " 13,\n",
 669 |        " 13,\n",
 670 |        " 13,\n",
 671 |        " 13,\n",
 672 |        " 12,\n",
 673 |        " 12,\n",
 674 |        " 12,\n",
 675 |        " 12,\n",
 676 |        " 12,\n",
 677 |        " 12,\n",
 678 |        " 12,\n",
 679 |        " 12,\n",
 680 |        " 12,\n",
 681 |        " 12,\n",
 682 |        " 12,\n",
 683 |        " 11,\n",
 684 |        " 11,\n",
 685 |        " 11,\n",
 686 |        " 11,\n",
 687 |        " 11,\n",
 688 |        " 11,\n",
 689 |        " 11,\n",
 690 |        " 11,\n",
 691 |        " 11,\n",
 692 |        " 11,\n",
 693 |        " 11,\n",
 694 |        " 11,\n",
 695 |        " 11,\n",
 696 |        " 11,\n",
 697 |        " 11,\n",
 698 |        " 11,\n",
 699 |        " 11,\n",
 700 |        " 11,\n",
 701 |        " 11,\n",
 702 |        " 11,\n",
 703 |        " 10,\n",
 704 |        " 10,\n",
 705 |        " 10,\n",
 706 |        " 10,\n",
 707 |        " 10,\n",
 708 |        " 10,\n",
 709 |        " 10,\n",
 710 |        " 10,\n",
 711 |        " 10,\n",
 712 |        " 10,\n",
 713 |        " 10,\n",
 714 |        " 10,\n",
 715 |        " 10,\n",
 716 |        " 10,\n",
 717 |        " 10,\n",
 718 |        " 10,\n",
 719 |        " 10,\n",
 720 |        " 10,\n",
 721 |        " 10,\n",
 722 |        " 10,\n",
 723 |        " 10,\n",
 724 |        " 10,\n",
 725 |        " 9,\n",
 726 |        " 9,\n",
 727 |        " 9,\n",
 728 |        " 9,\n",
 729 |        " 9,\n",
 730 |        " 9,\n",
 731 |        " 9,\n",
 732 |        " 9,\n",
 733 |        " 9,\n",
 734 |        " 9,\n",
 735 |        " 9,\n",
 736 |        " 9,\n",
 737 |        " 9,\n",
 738 |        " 9,\n",
 739 |        " 9,\n",
 740 |        " 9,\n",
 741 |        " 9,\n",
 742 |        " 9,\n",
 743 |        " 9,\n",
 744 |        " 9,\n",
 745 |        " 9,\n",
 746 |        " 9,\n",
 747 |        " 9,\n",
 748 |        " 9,\n",
 749 |        " 9,\n",
 750 |        " 9,\n",
 751 |        " 9,\n",
 752 |        " 9,\n",
 753 |        " 9,\n",
 754 |        " 9,\n",
 755 |        " 9,\n",
 756 |        " 9,\n",
 757 |        " 9,\n",
 758 |        " 9,\n",
 759 |        " 9,\n",
 760 |        " 9,\n",
 761 |        " 9,\n",
 762 |        " 9,\n",
 763 |        " 9,\n",
 764 |        " 9,\n",
 765 |        " 9,\n",
 766 |        " 8,\n",
 767 |        " 8,\n",
 768 |        " 8,\n",
 769 |        " 8,\n",
 770 |        " 8,\n",
 771 |        " 8,\n",
 772 |        " 8,\n",
 773 |        " 8,\n",
 774 |        " 8,\n",
 775 |        " 8,\n",
 776 |        " 8,\n",
 777 |        " 8,\n",
 778 |        " 8,\n",
 779 |        " 8,\n",
 780 |        " 8,\n",
 781 |        " 8,\n",
 782 |        " 8,\n",
 783 |        " 8,\n",
 784 |        " 8,\n",
 785 |        " 8,\n",
 786 |        " 8,\n",
 787 |        " 8,\n",
 788 |        " 8,\n",
 789 |        " 8,\n",
 790 |        " 8,\n",
 791 |        " 8,\n",
 792 |        " 8,\n",
 793 |        " 8,\n",
 794 |        " 8,\n",
 795 |        " 8,\n",
 796 |        " 8,\n",
 797 |        " 8,\n",
 798 |        " 8,\n",
 799 |        " 8,\n",
 800 |        " 8,\n",
 801 |        " 8,\n",
 802 |        " 8,\n",
 803 |        " 8,\n",
 804 |        " 8,\n",
 805 |        " 8,\n",
 806 |        " 8,\n",
 807 |        " 8,\n",
 808 |        " 8,\n",
 809 |        " 8,\n",
 810 |        " 8,\n",
 811 |        " 8,\n",
 812 |        " 8,\n",
 813 |        " 8,\n",
 814 |        " 8,\n",
 815 |        " 8,\n",
 816 |        " 8,\n",
 817 |        " 8,\n",
 818 |        " 8,\n",
 819 |        " 8,\n",
 820 |        " 8,\n",
 821 |        " 8,\n",
 822 |        " 8,\n",
 823 |        " 8,\n",
 824 |        " 8,\n",
 825 |        " 8,\n",
 826 |        " 8,\n",
 827 |        " 8,\n",
 828 |        " 8,\n",
 829 |        " 8,\n",
 830 |        " 8,\n",
 831 |        " 8,\n",
 832 |        " 8,\n",
 833 |        " 8,\n",
 834 |        " 8,\n",
 835 |        " 7,\n",
 836 |        " 7,\n",
 837 |        " 7,\n",
 838 |        " 7,\n",
 839 |        " 7,\n",
 840 |        " 7,\n",
 841 |        " 7,\n",
 842 |        " 7,\n",
 843 |        " 7,\n",
 844 |        " 7,\n",
 845 |        " 7,\n",
 846 |        " 7,\n",
 847 |        " 7,\n",
 848 |        " 7,\n",
 849 |        " 7,\n",
 850 |        " 7,\n",
 851 |        " 7,\n",
 852 |        " 7,\n",
 853 |        " 7,\n",
 854 |        " 7,\n",
 855 |        " 7,\n",
 856 |        " 7,\n",
 857 |        " 7,\n",
 858 |        " 7,\n",
 859 |        " 7,\n",
 860 |        " 7,\n",
 861 |        " 7,\n",
 862 |        " 7,\n",
 863 |        " 7,\n",
 864 |        " 7,\n",
 865 |        " 7,\n",
 866 |        " 7,\n",
 867 |        " 7,\n",
 868 |        " 7,\n",
 869 |        " 7,\n",
 870 |        " 7,\n",
 871 |        " 7,\n",
 872 |        " 7,\n",
 873 |        " 7,\n",
 874 |        " 7,\n",
 875 |        " 7,\n",
 876 |        " 7,\n",
 877 |        " 7,\n",
 878 |        " 7,\n",
 879 |        " 7,\n",
 880 |        " 7,\n",
 881 |        " 7,\n",
 882 |        " 7,\n",
 883 |        " 7,\n",
 884 |        " 7,\n",
 885 |        " 7,\n",
 886 |        " 7,\n",
 887 |        " 7,\n",
 888 |        " 7,\n",
 889 |        " 7,\n",
 890 |        " 7,\n",
 891 |        " 7,\n",
 892 |        " 7,\n",
 893 |        " 7,\n",
 894 |        " 7,\n",
 895 |        " 7,\n",
 896 |        " 7,\n",
 897 |        " 7,\n",
 898 |        " 7,\n",
 899 |        " 7,\n",
 900 |        " 7,\n",
 901 |        " 7,\n",
 902 |        " 7,\n",
 903 |        " 7,\n",
 904 |        " 7,\n",
 905 |        " 7,\n",
 906 |        " 7,\n",
 907 |        " 7,\n",
 908 |        " 7,\n",
 909 |        " 7,\n",
 910 |        " 7,\n",
 911 |        " 7,\n",
 912 |        " 7,\n",
 913 |        " 7,\n",
 914 |        " 7,\n",
 915 |        " 7,\n",
 916 |        " 7,\n",
 917 |        " 7,\n",
 918 |        " 7,\n",
 919 |        " 7,\n",
 920 |        " 7,\n",
 921 |        " 7,\n",
 922 |        " 7,\n",
 923 |        " 7,\n",
 924 |        " 7,\n",
 925 |        " 7,\n",
 926 |        " 7,\n",
 927 |        " 7,\n",
 928 |        " 7,\n",
 929 |        " 7,\n",
 930 |        " 7,\n",
 931 |        " 7,\n",
 932 |        " 7,\n",
 933 |        " 7,\n",
 934 |        " 7,\n",
 935 |        " 7,\n",
 936 |        " 7,\n",
 937 |        " 7,\n",
 938 |        " 7,\n",
 939 |        " 7,\n",
 940 |        " 7,\n",
 941 |        " 7,\n",
 942 |        " 7,\n",
 943 |        " 7,\n",
 944 |        " 7,\n",
 945 |        " 7,\n",
 946 |        " 7,\n",
 947 |        " 7,\n",
 948 |        " 7,\n",
 949 |        " 7,\n",
 950 |        " 7,\n",
 951 |        " 7,\n",
 952 |        " 7,\n",
 953 |        " 7,\n",
 954 |        " 7,\n",
 955 |        " 7,\n",
 956 |        " 7,\n",
 957 |        " 7,\n",
 958 |        " 7,\n",
 959 |        " 7,\n",
 960 |        " 7,\n",
 961 |        " 7,\n",
 962 |        " 7,\n",
 963 |        " 7,\n",
 964 |        " 7,\n",
 965 |        " 7,\n",
 966 |        " 6,\n",
 967 |        " 6,\n",
 968 |        " 6,\n",
 969 |        " 6,\n",
 970 |        " 6,\n",
 971 |        " 6,\n",
 972 |        " 6,\n",
 973 |        " 6,\n",
 974 |        " 6,\n",
 975 |        " 6,\n",
 976 |        " 6,\n",
 977 |        " 6,\n",
 978 |        " 6,\n",
 979 |        " 6,\n",
 980 |        " 6,\n",
 981 |        " 6,\n",
 982 |        " 6,\n",
 983 |        " 6,\n",
 984 |        " 6,\n",
 985 |        " 6,\n",
 986 |        " 6,\n",
 987 |        " 6,\n",
 988 |        " 6,\n",
 989 |        " 6,\n",
 990 |        " 6,\n",
 991 |        " 6,\n",
 992 |        " 6,\n",
 993 |        " 6,\n",
 994 |        " 6,\n",
 995 |        " 6,\n",
 996 |        " 6,\n",
 997 |        " 6,\n",
 998 |        " 6,\n",
 999 |        " 6,\n",
1000 |        " 6,\n",
1001 |        " 6,\n",
1002 |        " 6,\n",
1003 |        " 6,\n",
1004 |        " 6,\n",
1005 |        " 6,\n",
1006 |        " 6,\n",
1007 |        " 6,\n",
1008 |        " 6,\n",
1009 |        " 6,\n",
1010 |        " 6,\n",
1011 |        " 6,\n",
1012 |        " 6,\n",
1013 |        " 6,\n",
1014 |        " 6,\n",
1015 |        " 6,\n",
1016 |        " 6,\n",
1017 |        " 6,\n",
1018 |        " 6,\n",
1019 |        " 6,\n",
1020 |        " 6,\n",
1021 |        " 6,\n",
1022 |        " 6,\n",
1023 |        " 6,\n",
1024 |        " 6,\n",
1025 |        " 6,\n",
1026 |        " 6,\n",
1027 |        " 6,\n",
1028 |        " 6,\n",
1029 |        " 6,\n",
1030 |        " 6,\n",
1031 |        " 6,\n",
1032 |        " 6,\n",
1033 |        " 6,\n",
1034 |        " 6,\n",
1035 |        " 6,\n",
1036 |        " 6,\n",
1037 |        " 6,\n",
1038 |        " 6,\n",
1039 |        " 6,\n",
1040 |        " 6,\n",
1041 |        " 6,\n",
1042 |        " 6,\n",
1043 |        " 6,\n",
1044 |        " 6,\n",
1045 |        " 6,\n",
1046 |        " 6,\n",
1047 |        " 6,\n",
1048 |        " 6,\n",
1049 |        " 6,\n",
1050 |        " 6,\n",
1051 |        " 6,\n",
1052 |        " 6,\n",
1053 |        " 6,\n",
1054 |        " 6,\n",
1055 |        " 6,\n",
1056 |        " 6,\n",
1057 |        " 6,\n",
1058 |        " 6,\n",
1059 |        " 6,\n",
1060 |        " 6,\n",
1061 |        " 6,\n",
1062 |        " 6,\n",
1063 |        " 6,\n",
1064 |        " 6,\n",
1065 |        " 6,\n",
1066 |        " 6,\n",
1067 |        " 6,\n",
1068 |        " 6,\n",
1069 |        " 6,\n",
1070 |        " 6,\n",
1071 |        " 6,\n",
1072 |        " 6,\n",
1073 |        " 6,\n",
1074 |        " 6,\n",
1075 |        " 6,\n",
1076 |        " 6,\n",
1077 |        " 6,\n",
1078 |        " 6,\n",
1079 |        " 6,\n",
1080 |        " 6,\n",
1081 |        " 6,\n",
1082 |        " 6,\n",
1083 |        " 6,\n",
1084 |        " 6,\n",
1085 |        " 6,\n",
1086 |        " 6,\n",
1087 |        " 6,\n",
1088 |        " 6,\n",
1089 |        " 6,\n",
1090 |        " 6,\n",
1091 |        " 6,\n",
1092 |        " 6,\n",
1093 |        " 6,\n",
1094 |        " 6,\n",
1095 |        " 6,\n",
1096 |        " 6,\n",
1097 |        " 6,\n",
1098 |        " 6,\n",
1099 |        " 6,\n",
1100 |        " 6,\n",
1101 |        " 6,\n",
1102 |        " 6,\n",
1103 |        " 6,\n",
1104 |        " 6,\n",
1105 |        " 6,\n",
1106 |        " 6,\n",
1107 |        " 6,\n",
1108 |        " 6,\n",
1109 |        " 6,\n",
1110 |        " 6,\n",
1111 |        " 6,\n",
1112 |        " 6,\n",
1113 |        " 6,\n",
1114 |        " 6,\n",
1115 |        " 6,\n",
1116 |        " 6,\n",
1117 |        " 6,\n",
1118 |        " 6,\n",
1119 |        " 6,\n",
1120 |        " 6,\n",
1121 |        " 6,\n",
1122 |        " 6,\n",
1123 |        " 6,\n",
1124 |        " 6,\n",
1125 |        " 6,\n",
1126 |        " 6,\n",
1127 |        " 6,\n",
1128 |        " 6,\n",
1129 |        " 6,\n",
1130 |        " 6,\n",
1131 |        " 6,\n",
1132 |        " 6,\n",
1133 |        " 6,\n",
1134 |        " 6,\n",
1135 |        " 6,\n",
1136 |        " 6,\n",
1137 |        " 6,\n",
1138 |        " 6,\n",
1139 |        " 6,\n",
1140 |        " 6,\n",
1141 |        " 6,\n",
1142 |        " 6,\n",
1143 |        " 6,\n",
1144 |        " 6,\n",
1145 |        " 6,\n",
1146 |        " 6,\n",
1147 |        " 6,\n",
1148 |        " 6,\n",
1149 |        " 6,\n",
1150 |        " 6,\n",
1151 |        " 6,\n",
1152 |        " 6,\n",
1153 |        " 6,\n",
1154 |        " 6,\n",
1155 |        " 6,\n",
1156 |        " 6,\n",
1157 |        " 6,\n",
1158 |        " 6,\n",
1159 |        " 6,\n",
1160 |        " 6,\n",
1161 |        " 6,\n",
1162 |        " 6,\n",
1163 |        " 6,\n",
1164 |        " 6,\n",
1165 |        " 6,\n",
1166 |        " 6,\n",
1167 |        " 6,\n",
1168 |        " 6,\n",
1169 |        " 6,\n",
1170 |        " 6,\n",
1171 |        " 6,\n",
1172 |        " 6,\n",
1173 |        " 6,\n",
1174 |        " 6,\n",
1175 |        " 6,\n",
1176 |        " 6,\n",
1177 |        " 6,\n",
1178 |        " 6,\n",
1179 |        " 6,\n",
1180 |        " 6,\n",
1181 |        " 6,\n",
1182 |        " 6,\n",
1183 |        " 6,\n",
1184 |        " 6,\n",
1185 |        " 5,\n",
1186 |        " 5,\n",
1187 |        " 5,\n",
1188 |        " 5,\n",
1189 |        " 5,\n",
1190 |        " 5,\n",
1191 |        " 5,\n",
1192 |        " 5,\n",
1193 |        " 5,\n",
1194 |        " 5,\n",
1195 |        " 5,\n",
1196 |        " 5,\n",
1197 |        " 5,\n",
1198 |        " 5,\n",
1199 |        " 5,\n",
1200 |        " 5,\n",
1201 |        " 5,\n",
1202 |        " 5,\n",
1203 |        " 5,\n",
1204 |        " 5,\n",
1205 |        " 5,\n",
1206 |        " 5,\n",
1207 |        " 5,\n",
1208 |        " 5,\n",
1209 |        " 5,\n",
1210 |        " 5,\n",
1211 |        " 5,\n",
1212 |        " 5,\n",
1213 |        " 5,\n",
1214 |        " 5,\n",
1215 |        " 5,\n",
1216 |        " 5,\n",
1217 |        " 5,\n",
1218 |        " 5,\n",
1219 |        " 5,\n",
1220 |        " 5,\n",
1221 |        " 5,\n",
1222 |        " 5,\n",
1223 |        " 5,\n",
1224 |        " 5,\n",
1225 |        " 5,\n",
1226 |        " 5,\n",
1227 |        " 5,\n",
1228 |        " 5,\n",
1229 |        " 5,\n",
1230 |        " 5,\n",
1231 |        " 5,\n",
1232 |        " 5,\n",
1233 |        " 5,\n",
1234 |        " 5,\n",
1235 |        " 5,\n",
1236 |        " 5,\n",
1237 |        " 5,\n",
1238 |        " 5,\n",
1239 |        " 5,\n",
1240 |        " 5,\n",
1241 |        " 5,\n",
1242 |        " 5,\n",
1243 |        " 5,\n",
1244 |        " 5,\n",
1245 |        " 5,\n",
1246 |        " 5,\n",
1247 |        " 5,\n",
1248 |        " 5,\n",
1249 |        " 5,\n",
1250 |        " 5,\n",
1251 |        " 5,\n",
1252 |        " 5,\n",
1253 |        " 5,\n",
1254 |        " 5,\n",
1255 |        " 5,\n",
1256 |        " 5,\n",
1257 |        " 5,\n",
1258 |        " 5,\n",
1259 |        " 5,\n",
1260 |        " 5,\n",
1261 |        " 5,\n",
1262 |        " 5,\n",
1263 |        " 5,\n",
1264 |        " 5,\n",
1265 |        " 5,\n",
1266 |        " 5,\n",
1267 |        " 5,\n",
1268 |        " 5,\n",
1269 |        " 5,\n",
1270 |        " 5,\n",
1271 |        " 5,\n",
1272 |        " 5,\n",
1273 |        " 5,\n",
1274 |        " 5,\n",
1275 |        " 5,\n",
1276 |        " 5,\n",
1277 |        " 5,\n",
1278 |        " 5,\n",
1279 |        " 5,\n",
1280 |        " 5,\n",
1281 |        " 5,\n",
1282 |        " 5,\n",
1283 |        " 5,\n",
1284 |        " 5,\n",
1285 |        " 5,\n",
1286 |        " 5,\n",
1287 |        " 5,\n",
1288 |        " 5,\n",
1289 |        " 5,\n",
1290 |        " 5,\n",
1291 |        " 5,\n",
1292 |        " 5,\n",
1293 |        " 5,\n",
1294 |        " 5,\n",
1295 |        " 5,\n",
1296 |        " 5,\n",
1297 |        " 5,\n",
1298 |        " 5,\n",
1299 |        " 5,\n",
1300 |        " 5,\n",
1301 |        " 5,\n",
1302 |        " 5,\n",
1303 |        " 5,\n",
1304 |        " 5,\n",
1305 |        " 5,\n",
1306 |        " 5,\n",
1307 |        " 5,\n",
1308 |        " 5,\n",
1309 |        " 5,\n",
1310 |        " 5,\n",
1311 |        " 5,\n",
1312 |        " 5,\n",
1313 |        " 5,\n",
1314 |        " 5,\n",
1315 |        " 5,\n",
1316 |        " 5,\n",
1317 |        " 5,\n",
1318 |        " 5,\n",
1319 |        " 5,\n",
1320 |        " 5,\n",
1321 |        " 5,\n",
1322 |        " 5,\n",
1323 |        " 5,\n",
1324 |        " 5,\n",
1325 |        " 5,\n",
1326 |        " 5,\n",
1327 |        " 5,\n",
1328 |        " 5,\n",
1329 |        " 5,\n",
1330 |        " 5,\n",
1331 |        " 5,\n",
1332 |        " 5,\n",
1333 |        " 5,\n",
1334 |        " 5,\n",
1335 |        " 5,\n",
1336 |        " 5,\n",
1337 |        " 5,\n",
1338 |        " 5,\n",
1339 |        " 5,\n",
1340 |        " 5,\n",
1341 |        " 5,\n",
1342 |        " 5,\n",
1343 |        " 5,\n",
1344 |        " 5,\n",
1345 |        " 5,\n",
1346 |        " 5,\n",
1347 |        " 5,\n",
1348 |        " 5,\n",
1349 |        " 5,\n",
1350 |        " 5,\n",
1351 |        " 5,\n",
1352 |        " 5,\n",
1353 |        " 5,\n",
1354 |        " 5,\n",
1355 |        " 5,\n",
1356 |        " 5,\n",
1357 |        " 5,\n",
1358 |        " 5,\n",
1359 |        " 5,\n",
1360 |        " 5,\n",
1361 |        " 5,\n",
1362 |        " 5,\n",
1363 |        " 5,\n",
1364 |        " 5,\n",
1365 |        " 5,\n",
1366 |        " 5,\n",
1367 |        " 5,\n",
1368 |        " 5,\n",
1369 |        " 5,\n",
1370 |        " 5,\n",
1371 |        " 5,\n",
1372 |        " 5,\n",
1373 |        " 5,\n",
1374 |        " 5,\n",
1375 |        " 5,\n",
1376 |        " 5,\n",
1377 |        " 5,\n",
1378 |        " 5,\n",
1379 |        " 5,\n",
1380 |        " 5,\n",
1381 |        " 5,\n",
1382 |        " 5,\n",
1383 |        " 5,\n",
1384 |        " 5,\n",
1385 |        " 5,\n",
1386 |        " 5,\n",
1387 |        " 5,\n",
1388 |        " 5,\n",
1389 |        " 5,\n",
1390 |        " 5,\n",
1391 |        " 5,\n",
1392 |        " 5,\n",
1393 |        " 5,\n",
1394 |        " 5,\n",
1395 |        " 5,\n",
1396 |        " 5,\n",
1397 |        " 5,\n",
1398 |        " 5,\n",
1399 |        " 5,\n",
1400 |        " 5,\n",
1401 |        " 5,\n",
1402 |        " 5,\n",
1403 |        " 5,\n",
1404 |        " 5,\n",
1405 |        " 5,\n",
1406 |        " 5,\n",
1407 |        " 5,\n",
1408 |        " 5,\n",
1409 |        " 5,\n",
1410 |        " 5,\n",
1411 |        " 5,\n",
1412 |        " 5,\n",
1413 |        " 5,\n",
1414 |        " 5,\n",
1415 |        " 5,\n",
1416 |        " 5,\n",
1417 |        " 5,\n",
1418 |        " 5,\n",
1419 |        " 5,\n",
1420 |        " 5,\n",
1421 |        " 5,\n",
1422 |        " 5,\n",
1423 |        " 5,\n",
1424 |        " 5,\n",
1425 |        " 5,\n",
1426 |        " 5,\n",
1427 |        " 5,\n",
1428 |        " 5,\n",
1429 |        " 5,\n",
1430 |        " 5,\n",
1431 |        " 5,\n",
1432 |        " 5,\n",
1433 |        " 5,\n",
1434 |        " 5,\n",
1435 |        " 5,\n",
1436 |        " 5,\n",
1437 |        " 5,\n",
1438 |        " 5,\n",
1439 |        " 5,\n",
1440 |        " 5,\n",
1441 |        " 5,\n",
1442 |        " 5,\n",
1443 |        " 5,\n",
1444 |        " 5,\n",
1445 |        " 5,\n",
1446 |        " 5,\n",
1447 |        " 5,\n",
1448 |        " 5,\n",
1449 |        " 5,\n",
1450 |        " 5,\n",
1451 |        " 5,\n",
1452 |        " 5,\n",
1453 |        " 5,\n",
1454 |        " 5,\n",
1455 |        " 5,\n",
1456 |        " 5,\n",
1457 |        " 5,\n",
1458 |        " 5,\n",
1459 |        " 5,\n",
1460 |        " 5,\n",
1461 |        " 5,\n",
1462 |        " 5,\n",
1463 |        " 5,\n",
1464 |        " 5,\n",
1465 |        " 5,\n",
1466 |        " 5,\n",
1467 |        " 5,\n",
1468 |        " 5,\n",
1469 |        " 5,\n",
1470 |        " 5,\n",
1471 |        " 5,\n",
1472 |        " 5,\n",
1473 |        " 5,\n",
1474 |        " 5,\n",
1475 |        " 5,\n",
1476 |        " 5,\n",
1477 |        " 5,\n",
1478 |        " 4,\n",
1479 |        " 4,\n",
1480 |        " 4,\n",
1481 |        " 4,\n",
1482 |        " 4,\n",
1483 |        " 4,\n",
1484 |        " 4,\n",
1485 |        " 4,\n",
1486 |        " 4,\n",
1487 |        " 4,\n",
1488 |        " 4,\n",
1489 |        " 4,\n",
1490 |        " 4,\n",
1491 |        " 4,\n",
1492 |        " 4,\n",
1493 |        " 4,\n",
1494 |        " 4,\n",
1495 |        " 4,\n",
1496 |        " 4,\n",
1497 |        " 4,\n",
1498 |        " 4,\n",
1499 |        " 4,\n",
1500 |        " 4,\n",
1501 |        " 4,\n",
1502 |        " 4,\n",
1503 |        " 4,\n",
1504 |        " 4,\n",
1505 |        " 4,\n",
1506 |        " 4,\n",
1507 |        " 4,\n",
1508 |        " 4,\n",
1509 |        " 4,\n",
1510 |        " 4,\n",
1511 |        " 4,\n",
1512 |        " 4,\n",
1513 |        " 4,\n",
1514 |        " 4,\n",
1515 |        " 4,\n",
1516 |        " 4,\n",
1517 |        " 4,\n",
1518 |        " 4,\n",
1519 |        " 4,\n",
1520 |        " 4,\n",
1521 |        " 4,\n",
1522 |        " 4,\n",
1523 |        " 4,\n",
1524 |        " 4,\n",
1525 |        " 4,\n",
1526 |        " 4,\n",
1527 |        " 4,\n",
1528 |        " 4,\n",
1529 |        " 4,\n",
1530 |        " 4,\n",
1531 |        " 4,\n",
1532 |        " 4,\n",
1533 |        " 4,\n",
1534 |        " 4,\n",
1535 |        " 4,\n",
1536 |        " 4,\n",
1537 |        " 4,\n",
1538 |        " 4,\n",
1539 |        " 4,\n",
1540 |        " 4,\n",
1541 |        " 4,\n",
1542 |        " 4,\n",
1543 |        " 4,\n",
1544 |        " 4,\n",
1545 |        " 4,\n",
1546 |        " 4,\n",
1547 |        " 4,\n",
1548 |        " 4,\n",
1549 |        " 4,\n",
1550 |        " 4,\n",
1551 |        " 4,\n",
1552 |        " 4,\n",
1553 |        " 4,\n",
1554 |        " 4,\n",
1555 |        " 4,\n",
1556 |        " 4,\n",
1557 |        " 4,\n",
1558 |        " 4,\n",
1559 |        " 4,\n",
1560 |        " 4,\n",
1561 |        " 4,\n",
1562 |        " 4,\n",
1563 |        " 4,\n",
1564 |        " 4,\n",
1565 |        " 4,\n",
1566 |        " 4,\n",
1567 |        " 4,\n",
1568 |        " 4,\n",
1569 |        " 4,\n",
1570 |        " 4,\n",
1571 |        " 4,\n",
1572 |        " 4,\n",
1573 |        " 4,\n",
1574 |        " 4,\n",
1575 |        " 4,\n",
1576 |        " 4,\n",
1577 |        " 4,\n",
1578 |        " 4,\n",
1579 |        " 4,\n",
1580 |        " 4,\n",
1581 |        " 4,\n",
1582 |        " 4,\n",
1583 |        " 4,\n",
1584 |        " 4,\n",
1585 |        " 4,\n",
1586 |        " 4,\n",
1587 |        " 4,\n",
1588 |        " 4,\n",
1589 |        " 4,\n",
1590 |        " 4,\n",
1591 |        " 4,\n",
1592 |        " 4,\n",
1593 |        " 4,\n",
1594 |        " 4,\n",
1595 |        " 4,\n",
1596 |        " 4,\n",
1597 |        " 4,\n",
1598 |        " 4,\n",
1599 |        " 4,\n",
1600 |        " 4,\n",
1601 |        " 4,\n",
1602 |        " 4,\n",
1603 |        " 4,\n",
1604 |        " 4,\n",
1605 |        " 4,\n",
1606 |        " 4,\n",
1607 |        " 4,\n",
1608 |        " 4,\n",
1609 |        " 4,\n",
1610 |        " 4,\n",
1611 |        " 4,\n",
1612 |        " 4,\n",
1613 |        " 4,\n",
1614 |        " 4,\n",
1615 |        " 4,\n",
1616 |        " 4,\n",
1617 |        " 4,\n",
1618 |        " 4,\n",
1619 |        " 4,\n",
1620 |        " 4,\n",
1621 |        " 4,\n",
1622 |        " 4,\n",
1623 |        " 4,\n",
1624 |        " 4,\n",
1625 |        " 4,\n",
1626 |        " 4,\n",
1627 |        " 4,\n",
1628 |        " 4,\n",
1629 |        " 4,\n",
1630 |        " 4,\n",
1631 |        " 4,\n",
1632 |        " 4,\n",
1633 |        " 4,\n",
1634 |        " 4,\n",
1635 |        " 4,\n",
1636 |        " 4,\n",
1637 |        " 4,\n",
1638 |        " 4,\n",
1639 |        " 4,\n",
1640 |        " 4,\n",
1641 |        " 4,\n",
1642 |        " 4,\n",
1643 |        " 4,\n",
1644 |        " 4,\n",
1645 |        " 4,\n",
1646 |        " 4,\n",
1647 |        " 4,\n",
1648 |        " 4,\n",
1649 |        " 4,\n",
1650 |        " 4,\n",
1651 |        " 4,\n",
1652 |        " 4,\n",
1653 |        " 4,\n",
1654 |        " ...]"
1655 |       ]
1656 |      },
1657 |      "execution_count": 15,
1658 |      "metadata": {},
1659 |      "output_type": "execute_result"
1660 |     }
1661 |    ],
1662 |    "source": [
1663 |     "sorted(meta.sum(axis=1), reverse=True)"
1664 |    ]
1665 |   },
1666 |   {
1667 |    "cell_type": "code",
1668 |    "execution_count": 7,
1669 |    "id": "cfabb9ee",
1670 |    "metadata": {},
1671 |    "outputs": [
1672 |     {
1673 |      "data": {
1674 |       "text/html": [
1675 |        "<div>\n",
1676 |        "<style scoped>\n",
1677 |        "    .dataframe tbody tr th:only-of-type {\n",
1678 |        "        vertical-align: middle;\n",
1679 |        "    }\n",
1680 |        "\n",
1681 |        "    .dataframe tbody tr th {\n",
1682 |        "        vertical-align: top;\n",
1683 |        "    }\n",
1684 |        "\n",
1685 |        "    .dataframe thead th {\n",
1686 |        "        text-align: right;\n",
1687 |        "    }\n",
1688 |        "</style>\n",
1689 |        "<table border=\"1\" class=\"dataframe\">\n",
1690 |        "  <thead>\n",
1691 |        "    <tr style=\"text-align: right;\">\n",
1692 |        "      <th></th>\n",
1693 |        "      <th>AACCAACGACGAGGACAAGCGATG</th>\n",
1694 |        "      <th>AACCAACGACTTTCAAGCGATAGC</th>\n",
1695 |        "      <th>AACCAACGAGAGCGACAAGCGATG</th>\n",
1696 |        "      <th>AACCAACGCCTAGCTCAAGCGATG</th>\n",
1697 |        "      <th>AACCAACGTGGGCGTCAAGCGATG</th>\n",
1698 |        "      <th>AGGCAGAACCGTTCAAGCGATCCC</th>\n",
1699 |        "      <th>AGGCAGAACCGTTCAAGCGATTCT</th>\n",
1700 |        "      <th>AGGCAGAACTGATAGCAAGCGATG</th>\n",
1701 |        "      <th>AGGCAGAAGCTGCTTCAAGCGATG</th>\n",
1702 |        "      <th>AGGCAGAAGTCCTCAAGCGATTCT</th>\n",
1703 |        "      <th>...</th>\n",
1704 |        "      <th>TAGGCATGGGAGCAACAAGCGATG</th>\n",
1705 |        "      <th>TAGGCATGGTCCTCAAGCGATATA</th>\n",
1706 |        "      <th>TAGGCATGTTTGCGCCAAGCGATG</th>\n",
1707 |        "      <th>TCCTGAGCAGCAGCCCAAGCGATG</th>\n",
1708 |        "      <th>TCCTGAGCATGGGAGCAAGCGATG</th>\n",
1709 |        "      <th>TCCTGAGCGCATACAAGCGATATA</th>\n",
1710 |        "      <th>TCCTGAGCGTGTTAGCAAGCGATG</th>\n",
1711 |        "      <th>TGATTGACACGGGCTCAAGCGATG</th>\n",
1712 |        "      <th>TGATTGACGACTACAAGCGATCCC</th>\n",
1713 |        "      <th>TGATTGACTTCTACAAGCGATGAC</th>\n",
1714 |        "    </tr>\n",
1715 |        "  </thead>\n",
1716 |        "  <tbody>\n",
1717 |        "    <tr>\n",
1718 |        "      <th>ANGPT1</th>\n",
1719 |        "      <td>6</td>\n",
1720 |        "      <td>0</td>\n",
1721 |        "      <td>3</td>\n",
1722 |        "      <td>2</td>\n",
1723 |        "      <td>3</td>\n",
1724 |        "      <td>5</td>\n",
1725 |        "      <td>3</td>\n",
1726 |        "      <td>4</td>\n",
1727 |        "      <td>2</td>\n",
1728 |        "      <td>0</td>\n",
1729 |        "      <td>...</td>\n",
1730 |        "      <td>2</td>\n",
1731 |        "      <td>5</td>\n",
1732 |        "      <td>5</td>\n",
1733 |        "      <td>2</td>\n",
1734 |        "      <td>0</td>\n",
1735 |        "      <td>2</td>\n",
1736 |        "      <td>0</td>\n",
1737 |        "      <td>4</td>\n",
1738 |        "      <td>0</td>\n",
1739 |        "      <td>3</td>\n",
1740 |        "    </tr>\n",
1741 |        "    <tr>\n",
1742 |        "      <th>ANKRD46</th>\n",
1743 |        "      <td>6</td>\n",
1744 |        "      <td>1</td>\n",
1745 |        "      <td>1</td>\n",
1746 |        "      <td>1</td>\n",
1747 |        "      <td>2</td>\n",
1748 |        "      <td>4</td>\n",
1749 |        "      <td>4</td>\n",
1750 |        "      <td>1</td>\n",
1751 |        "      <td>1</td>\n",
1752 |        "      <td>3</td>\n",
1753 |        "      <td>...</td>\n",
1754 |        "      <td>1</td>\n",
1755 |        "      <td>3</td>\n",
1756 |        "      <td>5</td>\n",
1757 |        "      <td>2</td>\n",
1758 |        "      <td>1</td>\n",
1759 |        "      <td>1</td>\n",
1760 |        "      <td>0</td>\n",
1761 |        "      <td>3</td>\n",
1762 |        "      <td>1</td>\n",
1763 |        "      <td>1</td>\n",
1764 |        "    </tr>\n",
1765 |        "    <tr>\n",
1766 |        "      <th>ASAP1</th>\n",
1767 |        "      <td>1</td>\n",
1768 |        "      <td>0</td>\n",
1769 |        "      <td>3</td>\n",
1770 |        "      <td>1</td>\n",
1771 |        "      <td>0</td>\n",
1772 |        "      <td>1</td>\n",
1773 |        "      <td>1</td>\n",
1774 |        "      <td>1</td>\n",
1775 |        "      <td>1</td>\n",
1776 |        "      <td>2</td>\n",
1777 |        "      <td>...</td>\n",
1778 |        "      <td>1</td>\n",
1779 |        "      <td>1</td>\n",
1780 |        "      <td>0</td>\n",
1781 |        "      <td>0</td>\n",
1782 |        "      <td>0</td>\n",
1783 |        "      <td>1</td>\n",
1784 |        "      <td>2</td>\n",
1785 |        "      <td>1</td>\n",
1786 |        "      <td>0</td>\n",
1787 |        "      <td>0</td>\n",
1788 |        "    </tr>\n",
1789 |        "    <tr>\n",
1790 |        "      <th>ATAD2</th>\n",
1791 |        "      <td>0</td>\n",
1792 |        "      <td>0</td>\n",
1793 |        "      <td>0</td>\n",
1794 |        "      <td>0</td>\n",
1795 |        "      <td>1</td>\n",
1796 |        "      <td>1</td>\n",
1797 |        "      <td>0</td>\n",
1798 |        "      <td>2</td>\n",
1799 |        "      <td>0</td>\n",
1800 |        "      <td>0</td>\n",
1801 |        "      <td>...</td>\n",
1802 |        "      <td>0</td>\n",
1803 |        "      <td>1</td>\n",
1804 |        "      <td>0</td>\n",
1805 |        "      <td>1</td>\n",
1806 |        "      <td>0</td>\n",
1807 |        "      <td>1</td>\n",
1808 |        "      <td>1</td>\n",
1809 |        "      <td>1</td>\n",
1810 |        "      <td>0</td>\n",
1811 |        "      <td>0</td>\n",
1812 |        "    </tr>\n",
1813 |        "    <tr>\n",
1814 |        "      <th>ATP6V1C1</th>\n",
1815 |        "      <td>5</td>\n",
1816 |        "      <td>1</td>\n",
1817 |        "      <td>4</td>\n",
1818 |        "      <td>0</td>\n",
1819 |        "      <td>4</td>\n",
1820 |        "      <td>6</td>\n",
1821 |        "      <td>5</td>\n",
1822 |        "      <td>8</td>\n",
1823 |        "      <td>5</td>\n",
1824 |        "      <td>4</td>\n",
1825 |        "      <td>...</td>\n",
1826 |        "      <td>3</td>\n",
1827 |        "      <td>2</td>\n",
1828 |        "      <td>7</td>\n",
1829 |        "      <td>3</td>\n",
1830 |        "      <td>0</td>\n",
1831 |        "      <td>3</td>\n",
1832 |        "      <td>3</td>\n",
1833 |        "      <td>4</td>\n",
1834 |        "      <td>1</td>\n",
1835 |        "      <td>2</td>\n",
1836 |        "    </tr>\n",
1837 |        "    <tr>\n",
1838 |        "      <th>...</th>\n",
1839 |        "      <td>...</td>\n",
1840 |        "      <td>...</td>\n",
1841 |        "      <td>...</td>\n",
1842 |        "      <td>...</td>\n",
1843 |        "      <td>...</td>\n",
1844 |        "      <td>...</td>\n",
1845 |        "      <td>...</td>\n",
1846 |        "      <td>...</td>\n",
1847 |        "      <td>...</td>\n",
1848 |        "      <td>...</td>\n",
1849 |        "      <td>...</td>\n",
1850 |        "      <td>...</td>\n",
1851 |        "      <td>...</td>\n",
1852 |        "      <td>...</td>\n",
1853 |        "      <td>...</td>\n",
1854 |        "      <td>...</td>\n",
1855 |        "      <td>...</td>\n",
1856 |        "      <td>...</td>\n",
1857 |        "      <td>...</td>\n",
1858 |        "      <td>...</td>\n",
1859 |        "      <td>...</td>\n",
1860 |        "    </tr>\n",
1861 |        "    <tr>\n",
1862 |        "      <th>WDYHV1</th>\n",
1863 |        "      <td>2</td>\n",
1864 |        "      <td>1</td>\n",
1865 |        "      <td>1</td>\n",
1866 |        "      <td>0</td>\n",
1867 |        "      <td>2</td>\n",
1868 |        "      <td>1</td>\n",
1869 |        "      <td>1</td>\n",
1870 |        "      <td>4</td>\n",
1871 |        "      <td>0</td>\n",
1872 |        "      <td>3</td>\n",
1873 |        "      <td>...</td>\n",
1874 |        "      <td>1</td>\n",
1875 |        "      <td>6</td>\n",
1876 |        "      <td>5</td>\n",
1877 |        "      <td>4</td>\n",
1878 |        "      <td>2</td>\n",
1879 |        "      <td>2</td>\n",
1880 |        "      <td>0</td>\n",
1881 |        "      <td>4</td>\n",
1882 |        "      <td>0</td>\n",
1883 |        "      <td>1</td>\n",
1884 |        "    </tr>\n",
1885 |        "    <tr>\n",
1886 |        "      <th>YWHAZ</th>\n",
1887 |        "      <td>129</td>\n",
1888 |        "      <td>34</td>\n",
1889 |        "      <td>107</td>\n",
1890 |        "      <td>19</td>\n",
1891 |        "      <td>112</td>\n",
1892 |        "      <td>98</td>\n",
1893 |        "      <td>101</td>\n",
1894 |        "      <td>165</td>\n",
1895 |        "      <td>68</td>\n",
1896 |        "      <td>86</td>\n",
1897 |        "      <td>...</td>\n",
1898 |        "      <td>93</td>\n",
1899 |        "      <td>119</td>\n",
1900 |        "      <td>128</td>\n",
1901 |        "      <td>95</td>\n",
1902 |        "      <td>69</td>\n",
1903 |        "      <td>115</td>\n",
1904 |        "      <td>92</td>\n",
1905 |        "      <td>94</td>\n",
1906 |        "      <td>37</td>\n",
1907 |        "      <td>39</td>\n",
1908 |        "    </tr>\n",
1909 |        "    <tr>\n",
1910 |        "      <th>ZFPM2</th>\n",
1911 |        "      <td>0</td>\n",
1912 |        "      <td>0</td>\n",
1913 |        "      <td>1</td>\n",
1914 |        "      <td>0</td>\n",
1915 |        "      <td>0</td>\n",
1916 |        "      <td>3</td>\n",
1917 |        "      <td>3</td>\n",
1918 |        "      <td>3</td>\n",
1919 |        "      <td>0</td>\n",
1920 |        "      <td>2</td>\n",
1921 |        "      <td>...</td>\n",
1922 |        "      <td>0</td>\n",
1923 |        "      <td>0</td>\n",
1924 |        "      <td>4</td>\n",
1925 |        "      <td>4</td>\n",
1926 |        "      <td>0</td>\n",
1927 |        "      <td>0</td>\n",
1928 |        "      <td>0</td>\n",
1929 |        "      <td>4</td>\n",
1930 |        "      <td>0</td>\n",
1931 |        "      <td>0</td>\n",
1932 |        "    </tr>\n",
1933 |        "    <tr>\n",
1934 |        "      <th>ZHX1</th>\n",
1935 |        "      <td>8</td>\n",
1936 |        "      <td>1</td>\n",
1937 |        "      <td>14</td>\n",
1938 |        "      <td>5</td>\n",
1939 |        "      <td>6</td>\n",
1940 |        "      <td>6</td>\n",
1941 |        "      <td>8</td>\n",
1942 |        "      <td>14</td>\n",
1943 |        "      <td>3</td>\n",
1944 |        "      <td>5</td>\n",
1945 |        "      <td>...</td>\n",
1946 |        "      <td>10</td>\n",
1947 |        "      <td>8</td>\n",
1948 |        "      <td>10</td>\n",
1949 |        "      <td>2</td>\n",
1950 |        "      <td>3</td>\n",
1951 |        "      <td>9</td>\n",
1952 |        "      <td>4</td>\n",
1953 |        "      <td>10</td>\n",
1954 |        "      <td>5</td>\n",
1955 |        "      <td>3</td>\n",
1956 |        "    </tr>\n",
1957 |        "    <tr>\n",
1958 |        "      <th>ZNF706</th>\n",
1959 |        "      <td>74</td>\n",
1960 |        "      <td>11</td>\n",
1961 |        "      <td>38</td>\n",
1962 |        "      <td>23</td>\n",
1963 |        "      <td>62</td>\n",
1964 |        "      <td>26</td>\n",
1965 |        "      <td>37</td>\n",
1966 |        "      <td>60</td>\n",
1967 |        "      <td>44</td>\n",
1968 |        "      <td>52</td>\n",
1969 |        "      <td>...</td>\n",
1970 |        "      <td>52</td>\n",
1971 |        "      <td>66</td>\n",
1972 |        "      <td>53</td>\n",
1973 |        "      <td>13</td>\n",
1974 |        "      <td>33</td>\n",
1975 |        "      <td>45</td>\n",
1976 |        "      <td>38</td>\n",
1977 |        "      <td>33</td>\n",
1978 |        "      <td>15</td>\n",
1979 |        "      <td>4</td>\n",
1980 |        "    </tr>\n",
1981 |        "  </tbody>\n",
1982 |        "</table>\n",
1983 |        "<p>4191 rows × 9517 columns</p>\n",
1984 |        "</div>"
1985 |       ],
1986 |       "text/plain": [
1987 |        "          AACCAACGACGAGGACAAGCGATG  AACCAACGACTTTCAAGCGATAGC  \\\n",
1988 |        "ANGPT1                           6                         0   \n",
1989 |        "ANKRD46                          6                         1   \n",
1990 |        "ASAP1                            1                         0   \n",
1991 |        "ATAD2                            0                         0   \n",
1992 |        "ATP6V1C1                         5                         1   \n",
1993 |        "...                            ...                       ...   \n",
1994 |        "WDYHV1                           2                         1   \n",
1995 |        "YWHAZ                          129                        34   \n",
1996 |        "ZFPM2                            0                         0   \n",
1997 |        "ZHX1                             8                         1   \n",
1998 |        "ZNF706                          74                        11   \n",
1999 |        "\n",
2000 |        "          AACCAACGAGAGCGACAAGCGATG  AACCAACGCCTAGCTCAAGCGATG  \\\n",
2001 |        "ANGPT1                           3                         2   \n",
2002 |        "ANKRD46                          1                         1   \n",
2003 |        "ASAP1                            3                         1   \n",
2004 |        "ATAD2                            0                         0   \n",
2005 |        "ATP6V1C1                         4                         0   \n",
2006 |        "...                            ...                       ...   \n",
2007 |        "WDYHV1                           1                         0   \n",
2008 |        "YWHAZ                          107                        19   \n",
2009 |        "ZFPM2                            1                         0   \n",
2010 |        "ZHX1                            14                         5   \n",
2011 |        "ZNF706                          38                        23   \n",
2012 |        "\n",
2013 |        "          AACCAACGTGGGCGTCAAGCGATG  AGGCAGAACCGTTCAAGCGATCCC  \\\n",
2014 |        "ANGPT1                           3                         5   \n",
2015 |        "ANKRD46                          2                         4   \n",
2016 |        "ASAP1                            0                         1   \n",
2017 |        "ATAD2                            1                         1   \n",
2018 |        "ATP6V1C1                         4                         6   \n",
2019 |        "...                            ...                       ...   \n",
2020 |        "WDYHV1                           2                         1   \n",
2021 |        "YWHAZ                          112                        98   \n",
2022 |        "ZFPM2                            0                         3   \n",
2023 |        "ZHX1                             6                         6   \n",
2024 |        "ZNF706                          62                        26   \n",
2025 |        "\n",
2026 |        "          AGGCAGAACCGTTCAAGCGATTCT  AGGCAGAACTGATAGCAAGCGATG  \\\n",
2027 |        "ANGPT1                           3                         4   \n",
2028 |        "ANKRD46                          4                         1   \n",
2029 |        "ASAP1                            1                         1   \n",
2030 |        "ATAD2                            0                         2   \n",
2031 |        "ATP6V1C1                         5                         8   \n",
2032 |        "...                            ...                       ...   \n",
2033 |        "WDYHV1                           1                         4   \n",
2034 |        "YWHAZ                          101                       165   \n",
2035 |        "ZFPM2                            3                         3   \n",
2036 |        "ZHX1                             8                        14   \n",
2037 |        "ZNF706                          37                        60   \n",
2038 |        "\n",
2039 |        "          AGGCAGAAGCTGCTTCAAGCGATG  AGGCAGAAGTCCTCAAGCGATTCT  ...  \\\n",
2040 |        "ANGPT1                           2                         0  ...   \n",
2041 |        "ANKRD46                          1                         3  ...   \n",
2042 |        "ASAP1                            1                         2  ...   \n",
2043 |        "ATAD2                            0                         0  ...   \n",
2044 |        "ATP6V1C1                         5                         4  ...   \n",
2045 |        "...                            ...                       ...  ...   \n",
2046 |        "WDYHV1                           0                         3  ...   \n",
2047 |        "YWHAZ                           68                        86  ...   \n",
2048 |        "ZFPM2                            0                         2  ...   \n",
2049 |        "ZHX1                             3                         5  ...   \n",
2050 |        "ZNF706                          44                        52  ...   \n",
2051 |        "\n",
2052 |        "          TAGGCATGGGAGCAACAAGCGATG  TAGGCATGGTCCTCAAGCGATATA  \\\n",
2053 |        "ANGPT1                           2                         5   \n",
2054 |        "ANKRD46                          1                         3   \n",
2055 |        "ASAP1                            1                         1   \n",
2056 |        "ATAD2                            0                         1   \n",
2057 |        "ATP6V1C1                         3                         2   \n",
2058 |        "...                            ...                       ...   \n",
2059 |        "WDYHV1                           1                         6   \n",
2060 |        "YWHAZ                           93                       119   \n",
2061 |        "ZFPM2                            0                         0   \n",
2062 |        "ZHX1                            10                         8   \n",
2063 |        "ZNF706                          52                        66   \n",
2064 |        "\n",
2065 |        "          TAGGCATGTTTGCGCCAAGCGATG  TCCTGAGCAGCAGCCCAAGCGATG  \\\n",
2066 |        "ANGPT1                           5                         2   \n",
2067 |        "ANKRD46                          5                         2   \n",
2068 |        "ASAP1                            0                         0   \n",
2069 |        "ATAD2                            0                         1   \n",
2070 |        "ATP6V1C1                         7                         3   \n",
2071 |        "...                            ...                       ...   \n",
2072 |        "WDYHV1                           5                         4   \n",
2073 |        "YWHAZ                          128                        95   \n",
2074 |        "ZFPM2                            4                         4   \n",
2075 |        "ZHX1                            10                         2   \n",
2076 |        "ZNF706                          53                        13   \n",
2077 |        "\n",
2078 |        "          TCCTGAGCATGGGAGCAAGCGATG  TCCTGAGCGCATACAAGCGATATA  \\\n",
2079 |        "ANGPT1                           0                         2   \n",
2080 |        "ANKRD46                          1                         1   \n",
2081 |        "ASAP1                            0                         1   \n",
2082 |        "ATAD2                            0                         1   \n",
2083 |        "ATP6V1C1                         0                         3   \n",
2084 |        "...                            ...                       ...   \n",
2085 |        "WDYHV1                           2                         2   \n",
2086 |        "YWHAZ                           69                       115   \n",
2087 |        "ZFPM2                            0                         0   \n",
2088 |        "ZHX1                             3                         9   \n",
2089 |        "ZNF706                          33                        45   \n",
2090 |        "\n",
2091 |        "          TCCTGAGCGTGTTAGCAAGCGATG  TGATTGACACGGGCTCAAGCGATG  \\\n",
2092 |        "ANGPT1                           0                         4   \n",
2093 |        "ANKRD46                          0                         3   \n",
2094 |        "ASAP1                            2                         1   \n",
2095 |        "ATAD2                            1                         1   \n",
2096 |        "ATP6V1C1                         3                         4   \n",
2097 |        "...                            ...                       ...   \n",
2098 |        "WDYHV1                           0                         4   \n",
2099 |        "YWHAZ                           92                        94   \n",
2100 |        "ZFPM2                            0                         4   \n",
2101 |        "ZHX1                             4                        10   \n",
2102 |        "ZNF706                          38                        33   \n",
2103 |        "\n",
2104 |        "          TGATTGACGACTACAAGCGATCCC  TGATTGACTTCTACAAGCGATGAC  \n",
2105 |        "ANGPT1                           0                         3  \n",
2106 |        "ANKRD46                          1                         1  \n",
2107 |        "ASAP1                            0                         0  \n",
2108 |        "ATAD2                            0                         0  \n",
2109 |        "ATP6V1C1                         1                         2  \n",
2110 |        "...                            ...                       ...  \n",
2111 |        "WDYHV1                           0                         1  \n",
2112 |        "YWHAZ                           37                        39  \n",
2113 |        "ZFPM2                            0                         0  \n",
2114 |        "ZHX1                             5                         3  \n",
2115 |        "ZNF706                          15                         4  \n",
2116 |        "\n",
2117 |        "[4191 rows x 9517 columns]"
2118 |       ]
2119 |      },
2120 |      "execution_count": 7,
2121 |      "metadata": {},
2122 |      "output_type": "execute_result"
2123 |     }
2124 |    ],
2125 |    "source": [
2126 |     "pd.read_csv('GSM4012703_TASC_SCREEN_chr8_sample10.counts.csv.gz', index_col=0)"
2127 |    ]
2128 |   },
2129 |   {
2130 |    "cell_type": "code",
2131 |    "execution_count": null,
2132 |    "id": "396a165f",
2133 |    "metadata": {},
2134 |    "outputs": [],
2135 |    "source": [
2136 |     "import scanpy as sc\n",
2137 |     "sc.set_figure_params(dpi=100, frameon=False)"
2138 |    ]
2139 |   },
2140 |   {
2141 |    "cell_type": "code",
2142 |    "execution_count": null,
2143 |    "id": "864ca45b",
2144 |    "metadata": {},
2145 |    "outputs": [],
2146 |    "source": [
2147 |     "sc.pp.neighbors(adata)\n",
2148 |     "sc.tl.umap(adata)"
2149 |    ]
2150 |   }
2151 |  ],
2152 |  "metadata": {
2153 |   "kernelspec": {
2154 |    "display_name": "Python 3 (ipykernel)",
2155 |    "language": "python",
2156 |    "name": "python3"
2157 |   },
2158 |   "language_info": {
2159 |    "codemirror_mode": {
2160 |     "name": "ipython",
2161 |     "version": 3
2162 |    },
2163 |    "file_extension": ".py",
2164 |    "mimetype": "text/x-python",
2165 |    "name": "python",
2166 |    "nbconvert_exporter": "python",
2167 |    "pygments_lexer": "ipython3",
2168 |    "version": "3.7.0"
2169 |   }
2170 |  },
2171 |  "nbformat": 4,
2172 |  "nbformat_minor": 5
2173 | }
2174 | 


--------------------------------------------------------------------------------
/datasets/Srivatsan_2019_sciplex2_curation.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "498416da",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "Accession: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4150377"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": 1,
 14 |    "id": "4abe7004",
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "import gzip\n",
 19 |     "import os\n",
 20 |     "\n",
 21 |     "import numpy as np\n",
 22 |     "import pandas as pd\n",
 23 |     "from anndata import AnnData\n",
 24 |     "\n",
 25 |     "from utils import download_binary_file"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 2,
 31 |    "id": "b5747389",
 32 |    "metadata": {},
 33 |    "outputs": [],
 34 |    "source": [
 35 |     "def download_srivatsan_2019_sciplex2(output_path: str) -> None:\n",
 36 |     "    \"\"\"\n",
 37 |     "    Download Srivatsan et al. 2019 sciplex-2 data from the hosting URLs.\n",
 38 |     "\n",
 39 |     "    Args:\n",
 40 |     "    ----\n",
 41 |     "        output_path: Directory in which to store the downloaded\n",
 42 |     "        gzip-compressed files.\n",
 43 |     "\n",
 44 |     "    Returns\n",
 45 |     "    -------\n",
 46 |     "        None. The files are downloaded to output_path.\n",
 47 |     "    \"\"\"\n",
 48 |     "\n",
 49 |     "    count_matrix_url = (\n",
 50 |     "        \"https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4150377&format=file&file=\"\n",
 51 |     "        \"GSM4150377_sciPlex2_A549_Transcription_Modulators_UMI.count.matrix.gz\"\n",
 52 |     "    )\n",
 53 |     "    count_matrix_filename = os.path.join(output_path, count_matrix_url.split(\"=\")[-1])\n",
 54 |     "    download_binary_file(count_matrix_url, count_matrix_filename)\n",
 55 |     "\n",
 56 |     "    cell_metadata_url = (\n",
 57 |     "        \"https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4150377&format=file\"\n",
 58 |     "        \"&file=GSM4150377_sciPlex2_pData.txt.gz\"\n",
 59 |     "    )\n",
 60 |     "    cell_metadata_filename = os.path.join(output_path, cell_metadata_url.split(\"=\")[-1])\n",
 61 |     "    download_binary_file(cell_metadata_url, cell_metadata_filename)\n",
 62 |     "\n",
 63 |     "    gene_metadata_url = (\n",
 64 |     "        \"https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4150377&format=file&file=\"\n",
 65 |     "        \"GSM4150377_sciPlex2_A549_Transcription_Modulators_gene.annotations.txt.gz\"\n",
 66 |     "    )\n",
 67 |     "    gene_metadata_filename = os.path.join(output_path, gene_metadata_url.split(\"=\")[-1])\n",
 68 |     "    download_binary_file(gene_metadata_url, gene_metadata_filename)\n",
 69 |     "\n",
 70 |     "\n",
 71 |     "def read_srivatsan_2019_sciplex2(file_directory: str) -> pd.DataFrame:\n",
 72 |     "    \"\"\"\n",
 73 |     "    Read the sciplex-2 expression data for Srivatsan et al. 2019 in the given directory.\n",
 74 |     "\n",
 75 |     "    Args:\n",
 76 |     "    ----\n",
 77 |     "        file_directory: Directory containing Srivatsan et al. 2019 data.\n",
 78 |     "\n",
 79 |     "    Returns\n",
 80 |     "    -------\n",
 81 |     "        A data frame containing single-cell gene expression counts. The count\n",
 82 |     "        matrix is stored in triplet format. I.e., each row of the data frame\n",
 83 |     "        has the format (row, column, count) stored in columns (i, j, x) respectively.\n",
 84 |     "    \"\"\"\n",
 85 |     "\n",
 86 |     "    with gzip.open(\n",
 87 |     "        os.path.join(\n",
 88 |     "            file_directory,\n",
 89 |     "            \"GSM4150377_sciPlex2_A549_Transcription_Modulators_UMI.count.matrix.gz\",\n",
 90 |     "        ),\n",
 91 |     "        \"rb\",\n",
 92 |     "    ) as f:\n",
 93 |     "        df = pd.read_csv(f, sep=\"\\t\", header=None, names=[\"i\", \"j\", \"x\"])\n",
 94 |     "\n",
 95 |     "    return df"
 96 |    ]
 97 |   },
 98 |   {
 99 |    "cell_type": "code",
100 |    "execution_count": 3,
101 |    "id": "d19b7de5",
102 |    "metadata": {},
103 |    "outputs": [
104 |     {
105 |      "name": "stdout",
106 |      "output_type": "stream",
107 |      "text": [
108 |       "File ./srivatsan_2019_sciplex2/GSM4150377_sciPlex2_A549_Transcription_Modulators_UMI.count.matrix.gz already exists. No files downloaded to overwrite the existing file.\n",
109 |       "File ./srivatsan_2019_sciplex2/GSM4150377_sciPlex2_pData.txt.gz already exists. No files downloaded to overwrite the existing file.\n",
110 |       "File ./srivatsan_2019_sciplex2/GSM4150377_sciPlex2_A549_Transcription_Modulators_gene.annotations.txt.gz already exists. No files downloaded to overwrite the existing file.\n"
111 |      ]
112 |     },
113 |     {
114 |      "name": "stderr",
115 |      "output_type": "stream",
116 |      "text": [
117 |       "/tmp/ipykernel_22607/873333527.py:41: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n",
118 |       "  adata = AnnData(\n",
119 |       "/tmp/ipykernel_22607/873333527.py:51: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n",
120 |       "  adata.obs[\"drug\"] = [\n",
121 |       "/tmp/ipykernel_22607/873333527.py:60: SettingWithCopyWarning: \n",
122 |       "A value is trying to be set on a copy of a slice from a DataFrame\n",
123 |       "\n",
124 |       "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
125 |       "  adata.obs[\"drug\"][adata.obs[\"dose\"] == 0.0] = \"Vehicle\"\n"
126 |      ]
127 |     }
128 |    ],
129 |    "source": [
130 |     "download_path = \"./srivatsan_2019_sciplex2\"\n",
131 |     "\n",
132 |     "os.makedirs(download_path, exist_ok=True)\n",
133 |     "download_srivatsan_2019_sciplex2(download_path)\n",
134 |     "df = read_srivatsan_2019_sciplex2(download_path)\n",
135 |     "\n",
136 |     "# The Srivatsan count data is in a sparse triplet format represented\n",
137 |     "# by three columns 'i', 'j', and 'x'. 'i' refers to a row number, 'j' refers to\n",
138 |     "# a column number, and 'x' refers to a count value.\n",
139 |     "counts = df[\"x\"]\n",
140 |     "rows = (\n",
141 |     "    df[\"i\"] - 1\n",
142 |     ")  # Indices were originally 1-base-indexed --> switch to 0-base-indexing\n",
143 |     "cols = df[\"j\"] - 1\n",
144 |     "\n",
145 |     "# Convert the triplets into a numpy array\n",
146 |     "count_matrix = np.zeros([max(rows) + 1, max(cols) + 1])\n",
147 |     "count_matrix[rows, cols] = counts\n",
148 |     "\n",
149 |     "# Switch matrix from gene rows and cell columns to cell rows and gene columns\n",
150 |     "count_matrix = count_matrix.T\n",
151 |     "\n",
152 |     "cell_metadata = pd.read_csv(\n",
153 |     "    os.path.join(\n",
154 |     "        download_path,\n",
155 |     "        \"GSM4150377_sciPlex2_pData.txt.gz\",\n",
156 |     "    ),\n",
157 |     "    sep=\" \",\n",
158 |     ")\n",
159 |     "\n",
160 |     "gene_metadata = pd.read_csv(\n",
161 |     "    os.path.join(\n",
162 |     "        download_path,\n",
163 |     "        \"GSM4150377_sciPlex2_A549_Transcription_Modulators_gene.annotations.txt.gz\",\n",
164 |     "    ),\n",
165 |     "    sep=\"\\t\",\n",
166 |     "    header=None,\n",
167 |     "    index_col=0,\n",
168 |     ")\n",
169 |     "\n",
170 |     "adata = AnnData(\n",
171 |     "    X=count_matrix, obs=cell_metadata, var=pd.DataFrame(index=gene_metadata.index)\n",
172 |     ")\n",
173 |     "\n",
174 |     "# Index needs string names or else the write_h5ad call will throw an error\n",
175 |     "adata.var.index.name = \"gene_id\"\n",
176 |     "\n",
177 |     "# Treatment information is contained in the `top_oligo` column\n",
178 |     "# with the format <drug>_<dose>.\n",
179 |     "adata = adata[adata.obs[\"top_oligo\"].notna()].copy()\n",
180 |     "adata.obs[\"drug\"] = [\n",
181 |     "    treatment.split(\"_\")[0] for treatment in adata.obs[\"top_oligo\"]\n",
182 |     "]\n",
183 |     "adata.obs[\"dose\"] = [\n",
184 |     "    treatment.split(\"_\")[1] for treatment in adata.obs[\"top_oligo\"]\n",
185 |     "]\n",
186 |     "adata.obs[\"dose\"] = pd.to_numeric(adata.obs[\"dose\"], errors=\"coerce\")\n",
187 |     "\n",
188 |     "# If a drug is listed with dosage of 0, the cell was only exposed to vehicle control\n",
189 |     "adata.obs.loc[adata.obs[\"dose\"] == 0.0, \"drug\"] = \"Vehicle\""
190 |    ]
191 |   },
192 |   {
193 |    "cell_type": "code",
194 |    "execution_count": null,
195 |    "id": "13f04aba",
196 |    "metadata": {},
197 |    "outputs": [],
198 |    "source": [
199 |     "adata.write_h5ad(\"Srivatsan_2019_sciplex2_raw.h5ad\")"
200 |    ]
201 |   },
202 |   {
203 |    "cell_type": "code",
204 |    "execution_count": null,
205 |    "id": "f87d3ed4-e87d-4608-bec9-a9e745d483c1",
206 |    "metadata": {},
207 |    "outputs": [],
208 |    "source": []
209 |   }
210 |  ],
211 |  "metadata": {
212 |   "kernelspec": {
213 |    "display_name": "Python 3",
214 |    "language": "python",
215 |    "name": "python3"
216 |   },
217 |   "language_info": {
218 |    "codemirror_mode": {
219 |     "name": "ipython",
220 |     "version": 3
221 |    },
222 |    "file_extension": ".py",
223 |    "mimetype": "text/x-python",
224 |    "name": "python",
225 |    "nbconvert_exporter": "python",
226 |    "pygments_lexer": "ipython3",
227 |    "version": "3.9.12"
228 |   }
229 |  },
230 |  "nbformat": 4,
231 |  "nbformat_minor": 5
232 | }
233 | 


--------------------------------------------------------------------------------
/datasets/block0_load.py:
--------------------------------------------------------------------------------
 1 | author_year = #e.g. You_2022
 2 | is_counts = # True/False
 3 | var_genes = # field in .var or None
 4 | doi = # of paper source
 5 | 
 6 | import numpy as np
 7 | import pandas as pd
 8 | import scanpy as sc
 9 | sc.set_figure_params(dpi=100, frameon=False)
10 | sc.logging.print_header()
11 | 
12 | # verify
13 | assert doi in pd.read_csv('../personal.csv').DOI.values
14 | 
15 | adata = sc.read(f'{author_year}_raw.h5ad')
16 | adata
17 | 


--------------------------------------------------------------------------------
/datasets/block1_init.py:
--------------------------------------------------------------------------------
 1 | # metalabels
 2 | adata.uns['preprocessing_nb_link'] = f'https://nbviewer.org/github/theislab/sc-pert/blob/main/datasets/{author_year}_curation.ipynb'
 3 | adata.uns['doi'] = doi
 4 | display(adata.obs.describe(include='all').T.head(20))
 5 | 
 6 | # use gene symbols as gene names
 7 | if var_genes:
 8 |     adata.var[var_genes] = adata.var[var_genes].astype(str)
 9 |     adata.var = adata.var.reset_index().set_index(var_genes)
10 |     print(adata.var_names)
11 | 
12 | # filtering and processing
13 | sc.pp.filter_cells(adata, min_genes=200)
14 | sc.pp.filter_genes(adata, min_cells=20)
15 | adata.var_names_make_unique()
16 | 
17 | if is_counts:
18 |     adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
19 |     sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
20 | 
21 |     sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True)
22 |     sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
23 |     sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
24 | 


--------------------------------------------------------------------------------
/datasets/block2_process.py:
--------------------------------------------------------------------------------
 1 | if is_counts:
 2 |     adata.layers['counts'] = adata.X.copy()  # copy so the in-place steps below do not alter the stored counts
 3 |     sc.pp.normalize_total(adata, target_sum=1e4)
 4 |     sc.pp.log1p(adata)
 5 | 
 6 | sc.pl.highest_expr_genes(adata, n_top=20)
 7 | sc.pp.highly_variable_genes(adata, n_top_genes=5000, subset=False)
 8 | sc.pl.highly_variable_genes(adata)
 9 | 
10 | # pre-compute plots
11 | sc.tl.pca(adata, use_highly_variable=True)
12 | sc.pp.neighbors(adata)
13 | sc.tl.leiden(adata)
14 | sc.tl.umap(adata)
15 | 


--------------------------------------------------------------------------------
/datasets/block3_standardize.py:
--------------------------------------------------------------------------------
 1 | # the following fields are meant to serve as a template
 2 | control = ???
 3 | replace_dict = {
 4 |     control: 'control',
 5 | }
 6 | adata.obs['perturbation_name'] = adata.obs[???].replace(replace_dict)
 7 | # remember to use + for combinations
 8 | adata.obs['perturbation_type'] = # 'small molecule' or 'genetic'
 9 | adata.obs['perturbation_value'] = adata.obs[???]
10 | adata.obs['perturbation_unit'] = # 'ug', 'mg', 'hrs', etc.
11 | 


--------------------------------------------------------------------------------
/datasets/template.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "id": "0f8d8db0",
  6 |    "metadata": {},
  7 |    "source": [
  8 |     "Accession: [link]"
  9 |    ]
 10 |   },
 11 |   {
 12 |    "cell_type": "code",
 13 |    "execution_count": null,
 14 |    "id": "e9234341",
 15 |    "metadata": {},
 16 |    "outputs": [],
 17 |    "source": [
 18 |     "%load block0_load.py"
 19 |    ]
 20 |   },
 21 |   {
 22 |    "cell_type": "code",
 23 |    "execution_count": null,
 24 |    "id": "904d339f",
 25 |    "metadata": {},
 26 |    "outputs": [],
 27 |    "source": [
 28 |     "%load block1_init.py"
 29 |    ]
 30 |   },
 31 |   {
 32 |    "cell_type": "code",
 33 |    "execution_count": 5,
 34 |    "id": "72112f55",
 35 |    "metadata": {},
 36 |    "outputs": [],
 37 |    "source": [
 38 |     "adata = adata[adata.obs.n_genes_by_counts < 8000, :]  # edit\n",
 39 |     "adata = adata[adata.obs.pct_counts_mt < 50]"
 40 |    ]
 41 |   },
 42 |   {
 43 |    "cell_type": "markdown",
 44 |    "id": "2eaff683",
 45 |    "metadata": {},
 46 |    "source": [
 47 |     "Add additional labels, normalize and add plotting variables."
 48 |    ]
 49 |   },
 50 |   {
 51 |    "cell_type": "code",
 52 |    "execution_count": null,
 53 |    "id": "aff9cff1",
 54 |    "metadata": {},
 55 |    "outputs": [],
 56 |    "source": [
 57 |     "%load block2_process.py"
 58 |    ]
 59 |   },
 60 |   {
 61 |    "cell_type": "markdown",
 62 |    "id": "df9dc5f9",
 63 |    "metadata": {},
 64 |    "source": [
 65 |     "Save."
 66 |    ]
 67 |   },
 68 |   {
 69 |    "cell_type": "code",
 70 |    "execution_count": null,
 71 |    "id": "269f5603",
 72 |    "metadata": {},
 73 |    "outputs": [],
 74 |    "source": [
 75 |     "sc.write(f'{author_year}.h5ad', adata, compression='gzip')\n",
 76 |     "adata"
 77 |    ]
 78 |   },
 79 |   {
 80 |    "cell_type": "markdown",
 81 |    "id": "3706b6a6",
 82 |    "metadata": {},
 83 |    "source": [
 84 |     "Standardize perturbation labels and save again."
 85 |    ]
 86 |   },
 87 |   {
 88 |    "cell_type": "code",
 89 |    "execution_count": null,
 90 |    "id": "1d741e81",
 91 |    "metadata": {},
 92 |    "outputs": [],
 93 |    "source": [
 94 |     "adata = sc.read(f'{author_year}.h5ad')"
 95 |    ]
 96 |   },
 97 |   {
 98 |    "cell_type": "code",
 99 |    "execution_count": null,
100 |    "id": "0d070b9b",
101 |    "metadata": {},
102 |    "outputs": [],
103 |    "source": [
104 |     "%load block3_standardize.py"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "code",
109 |    "execution_count": null,
110 |    "id": "fef168c4",
111 |    "metadata": {},
112 |    "outputs": [],
113 |    "source": [
114 |     "sc.write(f'{author_year}.h5ad', adata, compression='gzip')\n",
115 |     "adata"
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "markdown",
120 |    "id": "afe8077b",
121 |    "metadata": {},
122 |    "source": [
123 |     "View."
124 |    ]
125 |   },
126 |   {
127 |    "cell_type": "code",
128 |    "execution_count": null,
129 |    "id": "618de1da",
130 |    "metadata": {},
131 |    "outputs": [],
132 |    "source": [
133 |     "plot_cols = [c for c in adata.obs.columns if len(adata.obs[c].unique()) < 30 or not hasattr(adata.obs[c], 'cat')]\n",
134 |     "for c in plot_cols:\n",
135 |     "    if adata.obs[c].dtype == 'bool':\n",
136 |     "        adata.obs[c] = adata.obs[c].astype('category')\n",
137 |     "sc.pl.umap(adata, color=plot_cols, wspace=.4)"
138 |    ]
139 |   },
140 |   {
141 |    "cell_type": "markdown",
142 |    "id": "a6059ff1",
143 |    "metadata": {},
144 |    "source": [
145 |     "Plot the top 30 most common perturbations. We do this separately because many datasets have too many perturbations for `scanpy`'s color palettes."
146 |    ]
147 |   },
148 |   {
149 |    "cell_type": "code",
150 |    "execution_count": null,
151 |    "id": "9d85afb7",
152 |    "metadata": {},
153 |    "outputs": [],
154 |    "source": [
155 |     "sc.pl.umap(\n",
156 |     "    adata[adata.obs.perturbation_name.isin((adata.obs.perturbation_name.value_counts().index[:30]))],\n",
157 |     "    color='perturbation_name')"
158 |    ]
159 |   }
160 |  ],
161 |  "metadata": {
162 |   "kernelspec": {
163 |    "display_name": "Python 3 (ipykernel)",
164 |    "language": "python",
165 |    "name": "python3"
166 |   },
167 |   "language_info": {
168 |    "codemirror_mode": {
169 |     "name": "ipython",
170 |     "version": 3
171 |    },
172 |    "file_extension": ".py",
173 |    "mimetype": "text/x-python",
174 |    "name": "python",
175 |    "nbconvert_exporter": "python",
176 |    "pygments_lexer": "ipython3",
177 |    "version": "3.7.0"
178 |   }
179 |  },
180 |  "nbformat": 4,
181 |  "nbformat_minor": 5
182 | }
183 | 


--------------------------------------------------------------------------------
/datasets/utils.py:
--------------------------------------------------------------------------------
 1 | import requests
 2 | import os
 3 | 
 4 | def download_binary_file(
 5 |     file_url: str, output_path: str, overwrite: bool = False
 6 | ) -> None:
 7 |     """
 8 |     Download binary data file from a URL.
 9 | 
10 |     Args:
11 |     ----
12 |         file_url: URL where the file is hosted.
13 |         output_path: Output path for the downloaded file.
14 |         overwrite: Whether to overwrite existing downloaded file.
15 | 
16 |     Returns
17 |     -------
18 |         None.
19 |     """
20 |     file_exists = os.path.exists(output_path)
21 |     if not file_exists or overwrite:
22 |         request = requests.get(file_url)
23 |         with open(output_path, "wb") as f:
24 |             f.write(request.content)
25 |         print(f"Downloaded data from {file_url} at {output_path}")
26 |     else:
27 |         print(
28 |             f"File {output_path} already exists. "
29 |             "No files downloaded to overwrite the existing file."
30 |         )


--------------------------------------------------------------------------------
/readme_body.txt:
--------------------------------------------------------------------------------
 1 | # sc-pert - Machine learning for perturbational single-cell omics
 2 | 
 3 | *This repository provides a community-maintained summary of models and datasets. It was initially curated for [(Cell Systems, 2021)](https://doi.org/10.1016/j.cels.2021.05.016).*
 4 | 
 5 | ### External annotations
 6 | 
 7 | Various resources exist for evaluating single-cell perturbation models. The publication discusses five tasks, which can be supported by the following publicly available annotations:
 8 | 
 9 | - [GDSC](https://www.cancerrxgene.org/downloads/bulk_download) provides a collection of cell viability measurements for many compounds and cell lines. We provide a [code snippet](https://github.com/theislab/sc-pert/blob/main/resources.py#L4) to conveniently load GDSC-provided z-score compound response rankings per cell line.
10 | - Additional viability data can be obtained from [DepMap's PRISM dataset](https://depmap.org/portal/download/).
11 | - [Therapeutics Data Commons](https://github.com/mims-harvard/TDC) provides access to a number of compound databases as part of its [cheminformatics tasks](https://tdcommons.ai/benchmark/overview/). (In the same vein, [OpenProblems](https://openproblems.bio/) provides a [framework for single-cell tasks](https://github.com/openproblems-bio/openproblems/tree/main/openproblems/tasks) that can also support perturbation modeling tasks in a longer-term format than the [DREAM challenges](https://dreamchallenges.org/dream-7-nci-dream-drug-sensitivity-and-drug-synergy-challenge/).)
12 | - [PubChem](https://pubchem.ncbi.nlm.nih.gov/) contains a comprehensive record of compounds ranging from experimental entities to non-proprietary small molecules. It is queryable via [PubChemPy](https://github.com/mcs07/PubChemPy).
13 | - [DrugBank](https://go.drugbank.com/releases/latest) provides annotations for a relatively small number of small molecules in a standardized format.
14 | 
15 | ### Current modeling approaches
16 | 
17 | We [maintain a list of perturbation-related tools at scrna-tools](https://www.scrna-tools.org/tools?sort=name&cats=Perturbations). Please consider further updating and tagging tools [there](https://github.com/scRNA-tools/scRNA-tools).
18 | 
19 | For the basis of the table in the article, see this [spreadsheet of a subset of perturbation models](https://docs.google.com/spreadsheets/d/1nqNg0DW1-Om7WtvRS20q-6b28usVRv5czOcxgj83Sgg/) which includes more details.
20 | 
21 | ### Datasets
22 | 
23 | Below, we curate a [table](https://raw.githubusercontent.com/theislab/sc-pert/main/data_table.csv) of perturbation datasets based on [Svensson *et al.* (2020)](https://doi.org/10.1093/database/baaa073).
24 | 
25 | We also offer some datasets in a curated `.h5ad` format via the download links in the table below. `raw h5ad` denotes a version of the dataset that has not been filtered, normalized, or standardized.
26 | 
27 | H5ads denoted as `processed` have an accompanying processing notebook, and have been similarly preprocessed. These datasets have the following standardized fields in `.obs`:
28 | * `perturbation_name` -- Human-readable compound names (International non-proprietary naming where possible) for small molecules and gene names for genetic perturbations.
29 | * `perturbation_type` -- `small molecule` or `genetic`
30 | * `perturbation_value` -- A continuous covariate quantity, such as the dosage concentration or the number of hours since treatment.
31 | * `perturbation_unit` -- Describes `perturbation_value`, such as `'ug'` or `'hrs'`.
32 | 
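As a minimal sketch (not part of this repository), the standardized fields listed above can be checked before a dataset is used. The helper names `missing_standard_fields` and `unexpected_perturbation_types` are hypothetical; only the four field names and the two perturbation types are taken from the list. The sketch assumes the type values are plain strings without missing entries.

```python
# Hypothetical helpers for illustration; only the field names and the two
# allowed perturbation types come from the list of standardized fields.
REQUIRED_OBS_FIELDS = (
    "perturbation_name",
    "perturbation_type",
    "perturbation_value",
    "perturbation_unit",
)
ALLOWED_PERTURBATION_TYPES = {"small molecule", "genetic"}


def missing_standard_fields(columns):
    """Return the standardized field names that are absent from `columns`."""
    present = set(columns)
    return [field for field in REQUIRED_OBS_FIELDS if field not in present]


def unexpected_perturbation_types(values):
    """Return the sorted distinct values that are not an allowed type."""
    return sorted(set(values) - ALLOWED_PERTURBATION_TYPES)


# Plain lists stand in for `adata.obs.columns` and
# `adata.obs["perturbation_type"]` so the sketch runs without any data file.
example_columns = ["perturbation_name", "perturbation_type", "dose"]
example_types = ["genetic", "small molecule", "drug"]

print(missing_standard_fields(example_columns))
print(unexpected_perturbation_types(example_types))
```

With a processed file loaded through `scanpy` (for example `adata = sc.read(...)`), the same helpers could be called as `missing_standard_fields(adata.obs.columns)` and `unexpected_perturbation_types(adata.obs["perturbation_type"])`. Both returning empty lists would suggest the file follows the convention described here.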
33 | 
34 | 


--------------------------------------------------------------------------------
/resources.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import pandas as pd
 3 | 
 4 | def GDSC_response(cell_line, filename=None):
 5 |     """Downloads and returns the responsiveness ranking of compounds
 6 |     for a specific cell line from GDSC1 and GDSC2 as dataframes.
 7 | 
 8 |     Descriptions of the column names can be found
 9 |     here: ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC_Fitted_Data_Description.pdf
10 | 
11 |     Params
12 |     ------
13 |     cell_line : str
14 |         Cell line identifier, e.g. A549.
15 |     filename : str (default: None)
16 |         The downloaded files will be saved as filename_GDSC1/2.csv.
17 |     """
18 |     if not os.path.exists('Cell_Lines_Details.xlsx'):
19 |         os.system('wget ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/Cell_Lines_Details.xlsx')
20 |     annot = pd.read_excel('Cell_Lines_Details.xlsx')
21 |     if cell_line not in annot['Sample Name'].values:
22 |         raise ValueError(f'{cell_line} not present in GDSC')
23 |     cell_id = annot[annot['Sample Name'] == cell_line]['COSMIC identifier'].values[0].astype('int').astype('str')
24 |     if filename is None:
25 |         filename = cell_line
26 |     os.system(f"wget -O {filename}_GDSC1.csv 'https://www.cancerrxgene.org/api/cellline/download_zscore?id={cell_id}&screening_set=GDSC1&export=csv'")
27 |     os.system(f"wget -O {filename}_GDSC2.csv 'https://www.cancerrxgene.org/api/cellline/download_zscore?id={cell_id}&screening_set=GDSC2&export=csv'")
28 |     return pd.read_csv(f'{filename}_GDSC1.csv'), pd.read_csv(f'{filename}_GDSC2.csv')
29 | 


--------------------------------------------------------------------------------
/update.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | import re
 3 | import glob
 4 | import pandas as pd
 5 | 
 6 | ### data updates ###
 7 | 
 8 | # single-cell database
 9 | os.system('rm data.tsv')
10 | os.system('rm personal.csv')
11 | os.system('wget http://www.nxn.se/single-cell-studies/data.tsv')
12 | os.system("wget --no-check-certificate -O personal.csv 'https://docs.google.com/spreadsheets/d/14awt-bCOnj4ca2uoKzuTNuKtUKXcoN82_-oGg2f1Ros/export?gid=1438063781&format=csv'")
13 | 
14 | # GDSC
15 | #os.system('wget -P /gdsc/ ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_25Feb20.xlsx')
16 | 
17 | path = 'datasets/'
18 | processed_datasets = [file for file in glob.glob(f'{path}*.ipynb') if 'curation' not in file]
19 | curated_datasets = [file for file in glob.glob(f'{path}*_curation.ipynb')]
20 | 
21 | personal_rec = pd.read_csv('personal.csv')
22 | personal_rec['author_year'] = personal_rec['Author'] + '_' + personal_rec['Year'].astype(str)
23 | dois = personal_rec.DOI.values
24 | 
25 | df = pd.read_csv('data.tsv', sep='\t')
26 | df = df[df.DOI.isin(dois)]
27 | df['Date'] = df['Date'].astype('object')  # ensure proper display
28 | not_in_db = list(set(dois) - set(df.DOI.values))
29 | 
30 | # create placeholders for dois not in the sc studies DB
31 | pdf = personal_rec[personal_rec.DOI.isin(not_in_db)]
32 | df = pd.concat([df, pdf[pdf.columns.intersection(df.columns)]])  # DataFrame.append was removed in pandas 2.0
33 | 
34 | # add additional columns of info
35 | add_cols = personal_rec[['DOI', 'Treatment', '# perturbations', '# cell types', '# doses', '# timepoints', 'Author', 'Year']]
36 | n_cols = 1 - add_cols.shape[1]  # default to negative
37 | df = df.merge(
38 |     add_cols,
39 |     left_on='DOI',
40 |     right_on='DOI'
41 | )
42 | df = df[list(df.columns[:5]) + \
43 |     list(df.columns[n_cols:]) + list(df.columns[5:n_cols])]  # up to `Title`
44 | 
45 | # convert DOIs to links in markdown
46 | links = []
47 | for shorthand, link in df[['Shorthand', 'DOI']].values:
48 |     s = f'[{shorthand}](https://doi.org/{link})'
49 |     s = s.replace('et al', '*et al.*')
50 |     links.append(s)
51 | df['Shorthand'] = links
52 | 
53 | # add availability column with download links for curated datasets
54 | links = []
55 | for shorthand, author, year, link in df[['Shorthand', 'Author', 'Year', 'DOI']].values:
56 |     s = ''
57 |     base_nb_path = 'https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/'
58 |     row = personal_rec[personal_rec.author_year == f'{author}_{year}']
59 |     if not pd.isnull(row.Raw.values[0]):  # raw .h5ad
60 |         s += f' [\\[raw h5ad\\]]({row.Raw.values[0]})'
61 |     if not pd.isnull(row.Processed.values[0]):  # processed .h5ad
62 |         s += f' [\\[processed h5ad\\]]({row.Processed.values[0]})'
63 |     ## adding notebook paths
64 |     r = re.compile(f'{path}{author}_{year}.*')
65 |     for nb in filter(r.match, curated_datasets):
66 |         s += f' [\\[curation nb\\]]({base_nb_path}{nb})'
67 |     for nb in filter(r.match, processed_datasets):
68 |         s += f' [\\[processing nb\\]]({base_nb_path}{nb})'
69 | 
70 |     links.append(s)
71 | df['.h5ad availability'] = links
72 | 
73 | # clean up
74 | df = df.sort_values(by=['Treatment', 'Date'])
75 | df = df.drop(['Authors', 'Journal', 'DOI', 'bioRxiv DOI', 'Author', 'Year', 'Date'], axis=1)
76 | 
77 | # rearrange columns
78 | primary_cols = ['Shorthand', 'Title', '.h5ad availability', 'Treatment', '# perturbations', '# cell types', '# doses', '# timepoints']
79 | df = df[primary_cols + \
80 |     [c for c in df if c not in primary_cols]]
81 | 
82 | # write README
83 | filenames = []
84 | with open('README.md', 'w') as outfile:
85 |     with open('readme_body.txt') as infile:
86 |         outfile.write(infile.read())
87 |         md = df.to_markdown(index=False, tablefmt='github', floatfmt='.8g')
88 |         md = md.replace('| Title', '| Title'+'&nbsp;'*70)
89 |         outfile.write(md)
90 |         infile.close()
91 |     outfile.close()
92 | 
93 | df.to_csv('data_table.csv')
94 | 


--------------------------------------------------------------------------------