├── LICENSE ├── README.md ├── data_table.csv ├── dataset_example.ipynb ├── datasets ├── Dixit_2016.ipynb ├── Frangieh_2021.ipynb ├── Jaitin_2014.ipynb ├── McFarland_2020_curation.ipynb ├── Norman_2019.ipynb ├── Norman_2019_curation.ipynb ├── Schraivogel_2020.ipynb ├── Srivatsan_2019_sciplex2_curation.ipynb ├── Srivatsan_2019_sciplex3.ipynb ├── block0_load.py ├── block1_init.py ├── block2_process.py ├── block3_standardize.py ├── template.ipynb └── utils.py ├── readme_body.txt ├── resources.py ├── statistics.ipynb └── update.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Theis Lab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # sc-pert - Machine learning for perturbational single-cell omics 2 | 3 | *This repository provides a community-maintained summary of models and datasets. It was initially curated for [(Cell Systems, 2021)](https://doi.org/10.1016/j.cels.2021.05.016).* 4 | 5 | ### External annotations 6 | 7 | There are various resources for the evaluation of single-cell perturbation models. We discuss five tasks in the publication, which can be supported by the following publicly available annotations: 8 | 9 | - [GDSC](https://www.cancerrxgene.org/downloads/bulk_download) provides a collection of cell viability measurements for many compounds and cell lines. We provide a [code snippet](https://github.com/theislab/sc-pert/blob/main/resources.py#L4) to conveniently load GDSC-provided z-score compound response rankings per cell line. 10 | - Additional viability data can be obtained from [DepMap's PRISM dataset](https://depmap.org/portal/download/). 11 | - [Therapeutics Data Commons](https://github.com/mims-harvard/TDC) provides access to a number of compound databases as part of their [cheminformatics tasks](https://tdcommons.ai/benchmark/overview/). (In the same vein, [OpenProblems](https://openproblems.bio/) provides a [framework for tasks in single-cell](https://github.com/openproblems-bio/openproblems/tree/main/openproblems/tasks) which can also support perturbation modeling tasks in a longer-term format than was previously seen in the [DREAM challenges](https://dreamchallenges.org/dream-7-nci-dream-drug-sensitivity-and-drug-synergy-challenge/).) 12 | - [PubChem](https://pubchem.ncbi.nlm.nih.gov/) contains a comprehensive record of compounds ranging from experimental entities to non-proprietary small molecules. It is queryable via [PubChemPy](https://github.com/mcs07/PubChemPy). 
13 | - [DrugBank](https://go.drugbank.com/releases/latest) provides annotations for a relatively small number of small molecules in a standardized format. 14 | 15 | ### Current modeling approaches 16 | 17 | We [maintain a list of perturbation-related tools at scrna-tools](https://www.scrna-tools.org/tools?sort=name&cats=Perturbations). Please consider further updating and tagging tools [there](https://github.com/scRNA-tools/scRNA-tools). 18 | 19 | For the basis of the table in the article, see this [spreadsheet of a subset of perturbation models](https://docs.google.com/spreadsheets/d/1nqNg0DW1-Om7WtvRS20q-6b28usVRv5czOcxgj83Sgg/) which includes more details. 20 | 21 | ### Datasets 22 | 23 | Below, we curate a [table](https://raw.githubusercontent.com/theislab/sc-pert/main/data_table.csv) of perturbation datasets based on [Svensson *et al.* (2020)](https://doi.org/10.1093/database/baaa073). 24 | 25 | We also offer some datasets in a curated `.h5ad` format via the download links in the table below. `raw h5ad` denotes a version of the dataset that has not been filtered, normalized, or standardized. 26 | 27 | H5ads denoted as `processed` have an accompanying processing notebook and have been preprocessed in a similar way. These datasets have the following standardized fields in `.obs`: 28 | * `perturbation_name` -- Human-readable compound names (International Nonproprietary Names where possible) for small molecules and gene names for genetic perturbations. 29 | * `perturbation_type` -- `small molecule` or `genetic` 30 | * `perturbation_value` -- A continuous covariate quantity, such as the dosage concentration or the number of hours since treatment. 31 | * `perturbation_unit` -- Describes `perturbation_value`, such as `'ug'` or `'hrs'`. 
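As a minimal sketch of how the standardized fields listed above could be checked after downloading a `processed` h5ad: the helper names `missing_standard_fields` and `unexpected_perturbation_types` are hypothetical and not part of this repository, and the toy table stands in for a real `.obs`. Only the four field names and the two `perturbation_type` values come from the list above.

```python
# Standardized `.obs` fields listed above for `processed` h5ads.
STANDARD_OBS_FIELDS = (
    "perturbation_name",
    "perturbation_type",
    "perturbation_value",
    "perturbation_unit",
)
# The two values listed above for `perturbation_type`.
ALLOWED_PERTURBATION_TYPES = {"small molecule", "genetic"}


def missing_standard_fields(obs_columns):
    """Return the standardized fields absent from an iterable of column names."""
    present = set(obs_columns)
    return [field for field in STANDARD_OBS_FIELDS if field not in present]


def unexpected_perturbation_types(values):
    """Return the sorted values that are not one of the two documented types.

    Missing entries (e.g. NaN for unperturbed cells) are reported too, as
    their string form, so that they can be reviewed by hand.
    """
    return sorted({str(value) for value in values} - ALLOWED_PERTURBATION_TYPES)


# Toy stand-in for `adata.obs` (hypothetical values, not taken from any dataset).
toy_obs = {
    "perturbation_name": ["control", "GeneA"],
    "perturbation_type": ["genetic", "genetic"],
    "perturbation_value": [0.0, 1.0],
}

print(missing_standard_fields(toy_obs))  # ['perturbation_unit']
print(unexpected_perturbation_types(toy_obs["perturbation_type"]))  # []
```

With `anndata` installed, the same helpers should accept `adata.obs.columns` and `adata.obs["perturbation_type"]` from an object loaded with `anndata.read_h5ad(...)`, since both are plain iterables.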
32 | 33 | 34 | | Shorthand | Title                                                                       | .h5ad availability | Treatment | # perturbations | # cell types | # doses | # timepoints | Reported cells total | Organism | Tissue | Technique | Data location | Panel size | Measurement | Cell source | Disease | Contrasts | Developmental stage | Number of reported cell types or clusters | Cell clustering | Pseudotime | RNA Velocity | PCA | tSNE | H5AD location | Isolation | BC --> Cell ID _OR_ BC --> Cluster ID | Number individuals | 35 | |----------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|-------------------|----------------|-----------|----------------|------------------------|--------------|----------------------|-----------------------|-----------------|--------------|---------------|----------------------------------------------------------------------------------------------------------------------------------|-----------|-----------------------|-----------------------|---------------------------------------------|-------------------|--------------|----------------|-------|--------|-----------------|----------------------|--------------------------------------------------------------------|----------------------| 36 | | [Jaitin *et al.* Science](https://doi.org/10.1126/science.1247651) | Massively Parallel 
Single-Cell RNA-Seq for Marker-Free Decomposition of Tissues into Cell Types | | CRISPR | 8-22 | 1 | - | 1 | 4,468 | Mouse | Spleen | MARS-seq | GSE54006 | nan | RNA-seq | CD11c+ enriched splenocytes | nan | nan | nan | 9 | Yes | No | nan | No | No | nan | Sorting (FACS) | nan | nan | 37 | | [Dixit *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.038) | Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens | [\[raw h5ad\]](https://ndownloader.figshare.com/files/34011689) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34014608) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Dixit_2016.ipynb) | CRISPR | 10,24 | 1 | - | 1-2 | 200,000 | Human, Mouse | Culture | Perturb-seq | GSE90063 | nan | RNA-seq | BMDCs, K562 | nan | nan | nan | nan | nan | nan | nan | nan | No | nan | Nanodroplet dilution | nan | nan | 38 | | [Datlinger *et al.* NMeth](https://doi.org/10.1038/nmeth.4177) | Pooled CRISPR screening with single-cell transcriptome readout | | CRISPR | 29 | 1 | - | 1 | 5,905 | Human, Mouse | Culture | CROP-seq | GSE92872 | nan | RNA-seq | HEK293T, 3T3, Jurkat | nan | nan | nan | nan | nan | nan | nan | nan | No | nan | nan | nan | nan | 39 | | [Hill *et al.* NMethods](https://doi.org/10.1038/nmeth.4604) | On the design of CRISPR-based single-cell molecular screens | | CRISPR | 32 | 1 | 2 | 1 | 5,879 | Human | Culture | CROP-seq | GSE108699 | nan | RNA-seq | MCF10a cells | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | https://github.com/shendurelab/single-cell-ko-screens#result-files | nan | 40 | | [Ursu *et al.* bioRxiv](https://doi.org/10.1101/2020.11.16.383307) | Massively parallel phenotyping of variant impact in cancer with Perturb-seq reveals a shift in the spectrum of cell states induced by somatic mutations | | CRISPR | 200 | 1 | - | 1 | 162,314 | Human | Lung | Perturb-seq | nan | nan | RNA-seq | nan | nan | nan | nan | nan | 
nan | nan | nan | nan | nan | nan | nan | nan | nan | 41 | | [Jin *et al.* Science](https://doi.org/10.1126/science.aaz6063) | In vivo Perturb-Seq reveals neuronal and glial abnormalities associated with autism risk genes | | CRISPR | 35 | - | - | 1 | 46,770 | Mouse | Brain | Perturb-seq | nan | nan | RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 42 | | [Frangieh *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00779-1) | Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion | [\[raw h5ad\]](https://ndownloader.figshare.com/files/34012565) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34013717) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Frangieh_2021.ipynb) | CRISPR | 248 | 1 | - | 1 | 218,331 | Human | Culture | Perturb-CITE-seq | SCP1064 | nan | RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 43 | | [Papalexi *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00778-2) | Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens | | CRISPR | 111 (sgRNA) | 1 | 2 | - | 28,295 | Human | Culture | CITE-seq & ECCITE-seq | GSE153056 | nan | RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 44 | | [Datlinger *et al.* NMethods](https://doi.org/10.1038/s41592-021-01153-z) | Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing | | CRISPR KO + antibody | 96 | 1 | 1 | 1 | nan | Human, Mouse | nan | scifi-RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 45 | | [Alda-Catalinas *et al.* CSystems](https://doi.org/10.1016/j.cels.2020.06.004) | A Single-Cell Transcriptomics CRISPR-Activation Screen Identifies Epigenetic Regulators of the Zygotic Genome 
Activation Program | | CRISPRa | 230 | 1 | - | - | 203,894 | Mouse | Culture | Chromium | nan | nan | RNA-seq | mESCs | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 46 | | [Norman *et al.* (2019)](https://doi.org/10.1126/science.aax4438) | nan | [\[raw h5ad\]](https://ndownloader.figshare.com/files/34002548) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34027562) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019.ipynb) | CRISPRa | 287 | 1 | - | 1 | nan | nan | nan | Perturb-seq | nan | nan | RNA-seq | induction of gene pair targets+single gene controls in K562 cells after screening 112 genes (2x gRNA per) and their combinations | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 47 | | [Adamson *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.048) | A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response | | CRISPRi | 9-93 (sgRNA) | 1 | - | 1 | 86,000 | Human | Culture | Perturb-seq | GSE90546 | nan | RNA-seq | K562 | nan | nan | nan | nan | nan | nan | nan | nan | Yes | nan | nan | nan | nan | 48 | | [Gasperini *et al.* Cell](https://doi.org/10.1016/j.cell.2018.11.029) | A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens | | CRISPRi | 1119, 5779 | 1 | - | 1 | 207,324 | Human | Culture | CROP-seq | GSE120861 | nan | RNA-seq | K562 Cells | nan | CRISPR Screen | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 49 | | [Jost *et al.* NBT](https://doi.org/10.1038/s41587-019-0387-5) | Titrating gene expression using libraries of systematically attenuated CRISPR guide RNAs | | CRISPRi | 25 | 2 | - | 1 | 19,587 | Human | Culture | Perturb-seq | GSE132080 | nan | RNA-seq | K562 cells | nan | 25 gene screen | nan | nan | 
nan | nan | nan | nan | nan | nan | nan | nan | nan | 50 | | [Schraivogel *et al.* NMethods](https://doi.org/10.1038/s41592-020-0837-5) | Targeted Perturb-seq enables genome-scale genetic screens in single cells | [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Schraivogel_2020.ipynb) | CRISPRi | 1778 (enhancers) | 1 | - | 1 | 231,667 | Human, Mouse | Bone marrow, Culture | TAP-seq | GSE135497 | 1,000 | RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | Yes | nan | nan | nan | nan | 51 | | [Leng *et al.* bioRxiv](https://doi.org/10.1101/2021.08.23.457400) | CRISPRi screens in human astrocytes elucidate regulators of distinct inflammatory reactive states | | CRISPRi | 30 | 1 | 2 | - | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 52 | | [Replogle *et al.* (2020)](https://doi.org/nan) | nan | | genetic targets | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 53 | | [Replogle *et al.* (2021)](https://doi.org/10.1101/2021.12.16.473013v3) | nan | | genetic targets | >10000 | 2 | - | - | nan | nan | nan | Perturb-seq | nan | nan | RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 54 | | [Shin *et al.* SAdvances](https://doi.org/10.1126/sciadv.aav2249) | Multiplexed single-cell RNA-seq via transient barcoding for simultaneous expression profiling of various drug perturbations | | small molecules | 45 | 2 | 1 | 1 | 3,091 | Mouse, Human | Culture | Drop-seq | PRJNA493658 | nan | RNA-seq | HEK293T, NIH3T3, A375, SW480, K562 | nan | 45 perturbations | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 55 | | [Srivatsan *et al.* Science](https://doi.org/10.1126/science.aax6234) | Massively multiplex chemical transcriptomics at single-cell resolution | [\[raw 
h5ad\]](https://ndownloader.figshare.com/files/33979517) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex2_curation.ipynb) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex3.ipynb) | small molecules | 188 | 3 | 4 | 2 | 650,000 | Human | Culture | sci-Plex | GSE139944 | nan | RNA-seq | Cancer cell lines A549, K562, and MCF7 | nan | 5,000 drug conditions | nan | 3 | Yes | Yes | No | Yes | No | nan | nan | nan | nan | 56 | | [Zhao *et al.* bioRxiv](https://doi.org/10.1101/2020.04.22.056341) | Deconvolution of Cell Type-Specific Drug Responses in Human Tumor Tissue with Single-Cell RNA-seq | | small molecules | 2,6 | 6,1 | - | - | 48,404 | Human | Brain, Tumor | SCRB-seq (microwell) | GSE148842 | nan | RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 6 | 57 | | [McFarland *et al.* NCommunications](https://doi.org/10.1038/s41467-020-17440-w) | Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action | [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020.ipynb) | small molecules | 1-13 | 24-99 | 1 | 1-5 | nan | Human | Culture | MIX-seq | nan | nan | RNA-seq | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 58 | | [Chen *et al.* (2020)](https://doi.org/10.1038/s41592-019-0689-z) | nan | | small molecules | 300 | 1 | 1 | 1 | nan | nan | nan | CyTOF | nan | nan | protein | breast cancer cells undergoing TGF-β-induced EMT | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 
-------------------------------------------------------------------------------- /data_table.csv: -------------------------------------------------------------------------------- 1 | ,Shorthand,Title,.h5ad availability,Treatment,# perturbations,# cell types,# doses,# timepoints,Reported cells total,Organism,Tissue,Technique,Data location,Panel size,Measurement,Cell source,Disease,Contrasts,Developmental stage,Number of reported cell types or clusters,Cell clustering,Pseudotime,RNA Velocity,PCA,tSNE,H5AD location,Isolation,BC --> Cell ID _OR_ BC --> Cluster ID,Number individuals 2 | 0,[Jaitin *et al.* Science](https://doi.org/10.1126/science.1247651),Massively Parallel Single-Cell RNA-Seq for Marker-Free Decomposition of Tissues into Cell Types,,CRISPR,8-22,1,-,1,"4,468",Mouse,Spleen,MARS-seq,GSE54006,,RNA-seq,CD11c+ enriched splenocytes,,,,9.0,Yes,No,,No,No,,Sorting (FACS),, 3 | 2,[Dixit *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.038),Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens, [\[raw h5ad\]](https://ndownloader.figshare.com/files/34011689) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34014608) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Dixit_2016.ipynb),CRISPR,"10,24",1,-,1-2,"200,000","Human, Mouse",Culture,Perturb-seq,GSE90063,,RNA-seq,"BMDCs, K562",,,,,,,,,No,,Nanodroplet dilution,, 4 | 3,[Datlinger *et al.* NMeth](https://doi.org/10.1038/nmeth.4177),Pooled CRISPR screening with single-cell transcriptome readout,,CRISPR,29,1,-,1,"5,905","Human, Mouse",Culture,CROP-seq,GSE92872,,RNA-seq,"HEK293T, 3T3, Jurkat",,,,,,,,,No,,,, 5 | 4,[Hill *et al.* NMethods](https://doi.org/10.1038/nmeth.4604),On the design of CRISPR-based single-cell molecular screens,,CRISPR,32,1,2,1,"5,879",Human,Culture,CROP-seq,GSE108699,,RNA-seq,MCF10a cells,,,,,,,,,,,,https://github.com/shendurelab/single-cell-ko-screens#result-files, 6 | 13,[Ursu *et 
al.* bioRxiv](https://doi.org/10.1101/2020.11.16.383307),Massively parallel phenotyping of variant impact in cancer with Perturb-seq reveals a shift in the spectrum of cell states induced by somatic mutations,,CRISPR,200,1,-,1,"162,314",Human,Lung,Perturb-seq,,,RNA-seq,,,,,,,,,,,,,, 7 | 14,[Jin *et al.* Science](https://doi.org/10.1126/science.aaz6063),In vivo Perturb-Seq reveals neuronal and glial abnormalities associated with autism risk genes,,CRISPR,35,-,-,1,"46,770",Mouse,Brain,Perturb-seq,,,RNA-seq,,,,,,,,,,,,,, 8 | 15,[Frangieh *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00779-1),Multimodal pooled Perturb-CITE-seq screens in patient models define mechanisms of cancer immune evasion, [\[raw h5ad\]](https://ndownloader.figshare.com/files/34012565) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34013717) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Frangieh_2021.ipynb),CRISPR,248,1,-,1,"218,331",Human,Culture,Perturb-CITE-seq,SCP1064,,RNA-seq,,,,,,,,,,,,,, 9 | 16,[Papalexi *et al.* NGenetics](https://doi.org/10.1038/s41588-021-00778-2),Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens,,CRISPR,111 (sgRNA),1,2,-,"28,295",Human,Culture,CITE-seq & ECCITE-seq,GSE153056,,RNA-seq,,,,,,,,,,,,,, 10 | 17,[Datlinger *et al.* NMethods](https://doi.org/10.1038/s41592-021-01153-z),Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing,,CRISPR KO + antibody,96,1,1,1,,"Human, Mouse",,scifi-RNA-seq,,,,,,,,,,,,,,,,, 11 | 11,[Alda-Catalinas *et al.* CSystems](https://doi.org/10.1016/j.cels.2020.06.004),A Single-Cell Transcriptomics CRISPR-Activation Screen Identifies Epigenetic Regulators of the Zygotic Genome Activation Program,,CRISPRa,230,1,-,-,"203,894",Mouse,Culture,Chromium,,,RNA-seq,mESCs,,,,,,,,,,,,, 12 | 19,[Norman *et al.* (2019)](https://doi.org/10.1126/science.aax4438),, [\[raw 
h5ad\]](https://ndownloader.figshare.com/files/34002548) [\[processed h5ad\]](https://ndownloader.figshare.com/files/34027562) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Norman_2019.ipynb),CRISPRa,287,1,-,1,,,,Perturb-seq,,,RNA-seq,induction of gene pair targets+single gene controls in K562 cells after screening 112 genes (2x gRNA per) and their combinations,,,,,,,,,,,,, 13 | 1,[Adamson *et al.* Cell](https://doi.org/10.1016/j.cell.2016.11.048),A Multiplexed Single-Cell CRISPR Screening Platform Enables Systematic Dissection of the Unfolded Protein Response,,CRISPRi,9-93 (sgRNA),1,-,1,"86,000",Human,Culture,Perturb-seq,GSE90546,,RNA-seq,K562,,,,,,,,,Yes,,,, 14 | 5,[Gasperini *et al.* Cell](https://doi.org/10.1016/j.cell.2018.11.029),A Genome-wide Framework for Mapping Gene Regulation via Cellular Genetic Screens,,CRISPRi,"1119, 5779",1,-,1,"207,324",Human,Culture,CROP-seq,GSE120861,,RNA-seq,K562 Cells,,CRISPR Screen,,,,,,,,,,, 15 | 8,[Jost *et al.* NBT](https://doi.org/10.1038/s41587-019-0387-5),Titrating gene expression using libraries of systematically attenuated CRISPR guide RNAs,,CRISPRi,25,2,-,1,"19,587",Human,Culture,Perturb-seq,GSE132080,,RNA-seq,K562 cells,,25 gene screen,,,,,,,,,,, 16 | 10,[Schraivogel *et al.* NMethods](https://doi.org/10.1038/s41592-020-0837-5),Targeted Perturb-seq enables genome-scale genetic screens in single cells, [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Schraivogel_2020.ipynb),CRISPRi,1778 (enhancers),1,-,1,"231,667","Human, Mouse","Bone marrow, Culture",TAP-seq,GSE135497,"1,000",RNA-seq,,,,,,,,,,Yes,,,, 17 | 18,[Leng *et al.* bioRxiv](https://doi.org/10.1101/2021.08.23.457400),CRISPRi screens in human astrocytes elucidate regulators of distinct inflammatory reactive states,,CRISPRi,30,1,2,-,,,,,,,,,,,,,,,,,,,,, 18 | 
20,[Replogle *et al.* (2020)](https://doi.org/nan),,,genetic targets,,,,,,,,,,,,,,,,,,,,,,,,, 19 | 21,[Replogle *et al.* (2021)](https://doi.org/10.1101/2021.12.16.473013v3),,,genetic targets,>10000,2,-,-,,,,Perturb-seq,,,RNA-seq,,,,,,,,,,,,,, 20 | 6,[Shin *et al.* SAdvances](https://doi.org/10.1126/sciadv.aav2249),Multiplexed single-cell RNA-seq via transient barcoding for simultaneous expression profiling of various drug perturbations,,small molecules,45,2,1,1,"3,091","Mouse, Human",Culture,Drop-seq,PRJNA493658,,RNA-seq,"HEK293T, NIH3T3, A375, SW480, K562",,45 perturbations,,,,,,,,,,, 21 | 7,[Srivatsan *et al.* Science](https://doi.org/10.1126/science.aax6234),Massively multiplex chemical transcriptomics at single-cell resolution, [\[raw h5ad\]](https://ndownloader.figshare.com/files/33979517) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex2_curation.ipynb) [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_curation.ipynb) [\[processing nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/Srivatsan_2019_sciplex3.ipynb),small molecules,188,3,4,2,"650,000",Human,Culture,sci-Plex,GSE139944,,RNA-seq,"Cancer cell lines A549, K562, and MCF7",,"5,000 drug conditions",,3.0,Yes,Yes,No,Yes,No,,,, 22 | 9,[Zhao *et al.* bioRxiv](https://doi.org/10.1101/2020.04.22.056341),Deconvolution of Cell Type-Specific Drug Responses in Human Tumor Tissue with Single-Cell RNA-seq,,small molecules,"2,6","6,1",-,-,"48,404",Human,"Brain, Tumor",SCRB-seq (microwell),GSE148842,,RNA-seq,,,,,,,,,,,,,,6.0 23 | 12,[McFarland *et al.* NCommunications](https://doi.org/10.1038/s41467-020-17440-w),Multiplexed single-cell transcriptional response profiling to define cancer vulnerabilities and therapeutic mechanism of action, [\[curation nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020_curation.ipynb) [\[processing 
nb\]](https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/datasets/McFarland_2020.ipynb),small molecules,1-13,24-99,1,1-5,,Human,Culture,MIX-seq,,,RNA-seq,,,,,,,,,,,,,, 24 | 22,[Chen *et al.* (2020)](https://doi.org/10.1038/s41592-019-0689-z),,,small molecules,300,1,1,1,,,,CyTOF,,,protein,breast cancer cells undergoing TGF-β-induced EMT ,,,,,,,,,,,,, 25 | -------------------------------------------------------------------------------- /datasets/Jaitin_2014.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Jaitin et al. Science" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "Data is downloaded from this source\n", 15 | "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54006" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "# !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54006/suppl/GSE54006_experimental_design.txt.gz\n", 25 | "# !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54006/suppl/GSE54006_readme0421.txt\n", 26 | "# !wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE54nnn/GSE54006/suppl/GSE54006_umitab.txt.gz" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 51 | "source": [ 52 | "import pandas as pd\n", 53 | "import anndata\n", 54 | "import scanpy as sc" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "df = pd.read_csv('GSE54006_experimental_design.txt.gz', skiprows=6, sep='\\t', index_col=0)\n", 64 | "df.shape" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "X = pd.read_csv('GSE54006_umitab.txt.gz', sep='\\t', index_col=0).T" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "\n", 83 | "adata = anndata.AnnData(X)\n", 84 | "adata.shape" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "obs = df.reindex(adata.obs_names.str.split('_').str[1].astype(int))\n", 94 | "adata.obs = obs" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "sc.pp.neighbors(adata)\n", 104 | "sc.tl.umap(adata)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "adata.obs" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "sc.set_figure_params(facecolor='white')" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": null, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "import numpy as np" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | 
"metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "adata.obs['perturbation.type'] = np.where(adata.obs['group_name'].str.contains('LPS'), 'chemical', np.nan)\n", 141 | "adata.obs['perturbation.description'] = np.where(adata.obs['group_name'].str.contains('LPS'), 'LPS (2 hours)', np.nan)\n", 142 | "adata.obs['perturbation.value'] = np.where(adata.obs['group_name'].str.contains('LPS'), True, np.nan)\n", 143 | "adata.obs['perturbation.unit'] = np.where(adata.obs['group_name'].str.contains('LPS'), 'boolean', np.nan)" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "from matplotlib.pyplot import rcParams\n", 153 | "import matplotlib.pyplot as plt\n", 154 | "rcParams['figure.figsize'] = 5, 3\n", 155 | "sc.pl.umap(adata, color=['group_name', 'perturbation.description'], legend_fontsize=8, show=False, frameon=False, ncols=1)\n", 156 | "plt.tight_layout()\n", 157 | "plt.show()" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "adata.obs" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "adata.write('../data/Jaitin_2014.h5ad', compression='lzf')" 176 | ] 177 | } 178 | ], 179 | "metadata": { 180 | "kernelspec": { 181 | "display_name": "Python [conda env:mypython3] *", 182 | "language": "python", 183 | "name": "conda-env-mypython3-py" 184 | }, 185 | "language_info": { 186 | "codemirror_mode": { 187 | "name": "ipython", 188 | "version": 3 189 | }, 190 | "file_extension": ".py", 191 | "mimetype": "text/x-python", 192 | "name": "python", 193 | "nbconvert_exporter": "python", 194 | "pygments_lexer": "ipython3", 195 | "version": "3.7.8" 196 | } 197 | }, 198 | "nbformat": 4, 199 | "nbformat_minor": 4 200 | } 201 | 
-------------------------------------------------------------------------------- /datasets/McFarland_2020_curation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "76588992", 6 | "metadata": {}, 7 | "source": [ 8 | "Curation by Oksana Bilous.\n", 9 | "\n", 10 | "Accession: https://figshare.com/s/139f64b495dea9d88c70" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "id": "6a4ed240", 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "name": "stdout", 21 | "output_type": "stream", 22 | "text": [ 23 | "--2022-03-30 13:30:55-- https://figshare.com/ndownloader/articles/10298696?private_link=139f64b495dea9d88c70\n", 24 | "Resolving figshare.com (figshare.com)... 52.30.189.251, 52.48.213.168, 2a05:d018:1f4:d000:9767:1911:7029:1844, ...\n", 25 | "Connecting to figshare.com (figshare.com)|52.30.189.251|:443... connected.\n", 26 | "HTTP request sent, awaiting response... 
200 OK\n", 27 | "Length: 2257978361 (2.1G) [application/zip]\n", 28 | "Saving to: ‘mcfarland2020/mixseq.zip’\n", 29 | "\n", 30 | "mcfarland2020/mixse 100%[===================>] 2.10G 23.3MB/s in 43s \n", 31 | "\n", 32 | "2022-03-30 13:31:38 (50.4 MB/s) - ‘mcfarland2020/mixseq.zip’ saved [2257978361/2257978361]\n", 33 | "\n", 34 | "Archive: mcfarland2020/mixseq.zip\n", 35 | " extracting: mcfarland2020/README.txt \n", 36 | " extracting: mcfarland2020/Supplementary Tables.xlsx \n", 37 | " extracting: mcfarland2020/Trametinib_6hr_expt1.zip \n", 38 | " extracting: mcfarland2020/Trametinib_24hr_expt1.zip \n", 39 | " extracting: mcfarland2020/Bortezomib_6hr_expt1.zip \n", 40 | " extracting: mcfarland2020/Bortezomib_24hr_expt1.zip \n", 41 | " extracting: mcfarland2020/Idasanutlin_6hr_expt1.zip \n", 42 | " extracting: mcfarland2020/Idasanutlin_24hr_expt1.zip \n", 43 | " extracting: mcfarland2020/DMSO_24hr_expt1.zip \n", 44 | " extracting: mcfarland2020/DMSO_6hr_expt1.zip \n", 45 | " extracting: mcfarland2020/Untreated_6hr_expt1.zip \n", 46 | " extracting: mcfarland2020/Dabrafenib_24hr_expt3.zip \n", 47 | " extracting: mcfarland2020/Navitoclax_24hr_expt3.zip \n", 48 | " extracting: mcfarland2020/Trametinib_24hr_expt3.zip \n", 49 | " extracting: mcfarland2020/BRD3379_24hr_expt3.zip \n", 50 | " extracting: mcfarland2020/BRD3379_6hr_expt3.zip \n", 51 | " extracting: mcfarland2020/DMSO_24hr_expt3.zip \n", 52 | " extracting: mcfarland2020/DMSO_6hr_expt3.zip \n", 53 | " extracting: mcfarland2020/sgGPX4_1_expt2.zip \n", 54 | " extracting: mcfarland2020/sgGPX4_2_expt2.zip \n", 55 | " extracting: mcfarland2020/sgOR2J2_expt2.zip \n", 56 | " extracting: mcfarland2020/sgLACZ_expt2.zip \n", 57 | " extracting: mcfarland2020/trametinib_tc_expt5.zip \n", 58 | " extracting: mcfarland2020/Afatinib_expt10.zip \n", 59 | " extracting: mcfarland2020/AZD5591_expt10.zip \n", 60 | " extracting: mcfarland2020/DMSO_expt10.zip \n", 61 | " extracting: mcfarland2020/Everolimus_expt10.zip \n", 62 | " 
extracting: mcfarland2020/Gemcitabine_expt10.zip \n", 63 | " extracting: mcfarland2020/JQ1_expt10.zip \n", 64 | " extracting: mcfarland2020/Prexasertib_expt10.zip \n", 65 | " extracting: mcfarland2020/Taselisib_expt10.zip \n", 66 | " extracting: mcfarland2020/Trametinib_expt10.zip \n", 67 | " extracting: mcfarland2020/all_CL_features.rds \n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "!mkdir mcfarland2020\n", 73 | "!wget https://figshare.com/ndownloader/articles/10298696?private_link=139f64b495dea9d88c70 -O mcfarland2020/mixseq.zip\n", 74 | "!unzip mcfarland2020/mixseq.zip -d mcfarland2020" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 2, 80 | "id": "d2d393f6", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "raw_data_path = 'mcfarland2020'" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "id": "67a332a4", 90 | "metadata": {}, 91 | "source": [ 92 | "Process files." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 3, 98 | "id": "f09dc5d2", 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [ 102 | "import pandas as pd\n", 103 | "import numpy as np\n", 104 | "import scanpy as sc\n", 105 | "from zipfile import ZipFile \n", 106 | "import os\n", 107 | "import anndata as ad" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 4, 113 | "id": "a518efae", 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "name": "stderr", 118 | "output_type": "stream", 119 | "text": [ 120 | "2022-03-30 13:32:08.111777: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory\n", 121 | "2022-03-30 13:32:08.111812: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.\n" 122 | ] 123 | }, 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 
128 | "scanpy==1.8.2 anndata==0.7.6 umap==0.5.2 numpy==1.20.3 scipy==1.5.3 pandas==1.3.4 scikit-learn==1.0.2 statsmodels==0.11.1 python-igraph==0.8.3 leidenalg==0.8.3 pynndescent==0.5.5\n" 129 | ] 130 | } 131 | ], 132 | "source": [ 133 | "sc.logging.print_header()" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 5, 139 | "id": "efb34103", 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "class MIX_seq_data_loader:\n", 144 | " def __init__(self, raw_data_path, expt_no):\n", 145 | " self.raw_data_path = raw_data_path\n", 146 | " self.expt_no = expt_no\n", 147 | "# self.scale = scale\n", 148 | "# self.drugs_SMILES_representation = {\n", 149 | "# 'Idasanutlin': 'O=C(O)C1=CC(OC)=C(NC([C@H]2[C@H](C3=C(F)C(Cl)=CC=C3)[C@](C4=CC=C(Cl)C=C4F)(C#N)[C@H](CC(C)(C)C)N2)=O)C=C1',\n", 150 | "# 'Bortezomib': 'CC(C)C[C@H](NC(=O)[C@H](CC1=CC=CC=C1)NC(=O)C1=CN=CC=N1)B(O)O',\n", 151 | "# 'Navitoclax': 'CC1(CCC(=C(C1)CN2CCN(CC2)C3=CC=C(C=C3)C(=O)NS(=O)(=O)C4=CC(=C(C=C4)NC(CCN5CCOCC5)CSC6=CC=CC=C6)S(=O)(=O)C(F)(F)F)C7=CC=C(C=C7)Cl)C',\n", 152 | "# 'Dabrafenib': 'CC(C)(C)C1=NC(=C(S1)C2=NC(=NC=C2)N)C3=C(C(=CC=C3)NS(=O)(=O)C4=C(C=CC=C4F)F)F',\n", 153 | "# 'Afatinib': 'CN(C)CC=CC(=O)NC1=C(C=C2C(=C1)C(=NC=N2)NC3=CC(=C(C=C3)F)Cl)OC4CCOC4',\n", 154 | "# 'Prexasertib': 'COC1=C(C(=CC=C1)OCCCN)C2=CC(=NN2)NC3=NC=C(N=C3)C#N.Cl.Cl',\n", 155 | "# 'Taselisib' : 'CC1=NN(C(=N1)C2=CN3CCOC4=C(C3=N2)C=CC(=C4)C5=CN(N=C5)C(C)(C)C(=O)N)C(C)C',\n", 156 | "# 'Gemcitabine': 'C1=CN(C(=O)N=C1N)C2C(C(C(O2)CO)O)(F)F',\n", 157 | "# 'Trametinib': 'CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N(C(=O)N2C4=CC=CC(=C4)NC(=O)C)C5CC5',\n", 158 | "# 'Everolimus': '[H][C@@]1(C[C@@H](C)[C@]2([H])CC(=O)[C@H](C)\\C=C(C)\\[C@@H](O)[C@@H](OC)C(=O)[C@H](C)C[C@H](C)\\C=C\\C=C\\C=C(C)\\[C@H](C[C@]3([H])CC[C@@H](C)[C@@](O)(O3)C(=O)C(=O)N3CCCC[C@@]3([H])C(=O)O2)OC)CC[C@@H](OCCO)[C@@H](C1)OC',\n", 159 | "# 'AZD5591': 
'CN1N=C2CSCC3=NN(C)C(CSC4=CC5=C(C=CC=C5)C(OCCCC5=C(N(C)C6=C5C=CC(Cl)=C6C2=C1C)C(O)=O)=C4)=C3',\n", 160 | "# 'JQ1': 'ClC1=CC=C(C(C2=C(N3C4=NN=C3C)SC(C)=C2C)=N[C@H]4CC(O)=O)C=C1',\n", 161 | "# 'BRD3379': 'CC(=O)NC1=CC=C(C=C1)C(=O)NC2=C(C=C(C=C2)F)N',\n", 162 | "# 'DMSO': 'CS(=O)C',\n", 163 | "# }\n", 164 | " self.adata = self.load_adata_obj()\n", 165 | "# self.adata_preprocessing()\n", 166 | " self.adata_add_additional_information()\n", 167 | " \n", 168 | "# self.perturbations = list(self.adata.obs['condition'].unique())\n", 169 | "# self.timetpoints = list(self.adata.obs['timepoint'].unique())\n", 170 | "# self.tissue_types = list(self.adata.obs['tissue_type'].unique())\n", 171 | "# self.drugs_without_SMILES_encoding = list(self.adata.obs[(self.adata.obs['drug_SMILES']=='') & \n", 172 | "# (self.adata.obs['condition']!='Untreated')]['condition'].unique())\n", 173 | "# if print_summary:\n", 174 | "# self.print_summary()\n", 175 | " \n", 176 | "# def print_summary(self):\n", 177 | "# print(\"Experiment\", self.expt_no, '\\n')\n", 178 | "# print(\"Perturbations:\", self.perturbations, '\\n')\n", 179 | "# print(\"Timetpoints:\", self.timetpoints, '\\n')\n", 180 | "# print(\"Tissue types:\", self.tissue_types, '\\n')\n", 181 | "# print(\"Number of cells:\", len(self.adata.obs), '\\n')\n", 182 | "# print(\"Drugs for which no SMILES encoding was found:\", self.drugs_without_SMILES_encoding, '\\n')\n", 183 | " \n", 184 | " def load_adata_obj_from_single_file(self, filename):\n", 185 | " zf = ZipFile(self.raw_data_path + '/' + filename)\n", 186 | " zf.extractall(path=self.raw_data_path) \n", 187 | "\n", 188 | " df_classifications = pd.read_csv(\n", 189 | " zf.open(zf.namelist()[0] + 'classifications.csv'),\n", 190 | " sep=\",\",\n", 191 | " index_col = 'barcode'\n", 192 | " )\n", 193 | "\n", 194 | " adata = sc.read_10x_mtx(self.raw_data_path + '/' + zf.namelist()[0], var_names='gene_symbols')# cache=True)\n", 195 | "# adata.var_names_make_unique()\n", 196 | " adata.obs = 
df_classifications\n", 197 | " \n", 198 | " if 'hash_tag' in adata.obs.columns:\n", 199 | " adata = adata[~adata.obs[\"hash_tag\"].isin(['multiplet', 'unknown']), :].copy()\n", 200 | " adata.obs['condition'] = adata.obs.apply(lambda x: x['hash_tag'][0:x['hash_tag'].find('_')], axis = 1)\n", 201 | " adata.obs['condition'] = adata.obs.apply(lambda x: 'Trametinib' if x['condition'] == 'Tram' else x['condition'], axis = 1)\n", 202 | " adata.obs['timepoint'] = adata.obs.apply(lambda x: x['hash_tag'][x['hash_tag'].find('_')+1:x['hash_tag'].find('hr')], axis = 1)\n", 203 | " else:\n", 204 | " adata.obs['condition'] = filename[0:filename.find('_')]\n", 205 | " if 'hr' in filename:\n", 206 | " adata.obs['timepoint'] = filename[filename.find('_')+1:filename.find('hr')]\n", 207 | " else:\n", 208 | " adata.obs['timepoint'] = ''\n", 209 | "\n", 210 | " adata.obs['control'] = adata.obs.apply(lambda x: 'ctrl' if x['condition'] in ['Untreated', 'DMSO']\n", 211 | " else 'drug', axis=1)\n", 212 | " adata.obs['cell_line'] = adata.obs.apply(lambda x: x['singlet_ID'][0:x['singlet_ID'].find('_')], axis = 1)\n", 213 | " adata.obs['tissue_type'] = adata.obs.apply(lambda x: x['singlet_ID'][x['singlet_ID'].find('_')+1:], axis = 1)\n", 214 | " \n", 215 | " return adata\n", 216 | " \n", 217 | " def load_adata_obj(self):\n", 218 | " files_to_load = []\n", 219 | " for filename in os.listdir(self.raw_data_path):\n", 220 | " if ('.zip' in filename) and ('expt'+str(self.expt_no)+'.' 
in filename):\n", 221 | " files_to_load.append(filename)\n", 222 | "\n", 223 | " adata = ad.AnnData()\n", 224 | " for file in files_to_load:\n", 225 | " adata_temp = self.load_adata_obj_from_single_file(file)\n", 226 | " adata_temp.obs['experiment'] = self.expt_no\n", 227 | " if len(adata) == 0:\n", 228 | " adata = adata_temp\n", 229 | " else:\n", 230 | " adata = adata.concatenate(adata_temp)\n", 231 | " if 'batch' in adata.obs.columns:\n", 232 | " adata.obs = adata.obs.drop(columns=['batch'])\n", 233 | " \n", 234 | " return(adata)\n", 235 | " \n", 236 | "# def adata_preprocessing(self):\n", 237 | "# # Filtering out the low-quality cells before doing the analysis\n", 238 | "# self.adata = self.adata[self.adata.obs[\"cell_quality\"] == \"normal\", :] \n", 239 | "\n", 240 | "# # Normalization\n", 241 | "# sc.pp.normalize_total(self.adata, target_sum=1e4)\n", 242 | "# sc.pp.log1p(self.adata)\n", 243 | "\n", 244 | "# # # Identifying top 5000 highly variable genes\n", 245 | "# # sc.pp.highly_variable_genes(self.adata, n_top_genes = 5000)\n", 246 | "\n", 247 | "# # Saving the raw data object\n", 248 | "# self.adata.raw = self.adata\n", 249 | " \n", 250 | "# # Scaling\n", 251 | "# if self.scale:\n", 252 | "# sc.pp.scale(self.adata, max_value=10)\n", 253 | "\n", 254 | "# n_comps = len(self.adata.obs['singlet_ID'].unique()) * 2 \n", 255 | "# # PCA\n", 256 | "# sc.tl.pca(self.adata, svd_solver='arpack', n_comps = n_comps)\n", 257 | "\n", 258 | "# # UMAP embedding\n", 259 | "# sc.pp.neighbors(self.adata, n_neighbors=10, n_pcs=n_comps)\n", 260 | "# sc.tl.umap(self.adata)\n", 261 | " \n", 262 | " def adata_add_additional_information(self):\n", 263 | " mutant_cls = ['LNCAPCLONEFGC_PROSTATE','DKMG_CENTRAL_NERVOUS_SYSTEM', 'NCIH226_LUNG', 'RCC10RGB_KIDNEY', 'SNU1079_BILIARY_TRACT', 'CCFSTTG1_CENTRAL_NERVOUS_SYSTEM','COV434_OVARY']\n", 264 | " self.adata.obs['TP53_status'] = ['TP53_MUT' if x in mutant_cls else 'TP53_WT' for x in self.adata.obs.singlet_ID.values]\n", 265 | " if 
self.adata.obs['timepoint'].iloc[1] != '':\n", 266 | " self.adata.obs['condition_time'] = self.adata.obs.apply(lambda x: x['condition'] + '_' + x['timepoint'] + 'hr', axis=1)\n", 267 | " \n", 268 | " # Cell Cycle Analysis\n", 269 | " # Loading additional file with lists of s genes and g2m genes\n", 270 | "# cell_phases_df = pd.read_excel(self.raw_data_path+'/Cell_phases.xlsx', sheet_name = 'Cell_phases')\n", 271 | "# s_genes = [x.replace(' ', '') for x in list(cell_phases_df[pd.notna(cell_phases_df.iloc[:,0])].iloc[:,0])]\n", 272 | "# g2m_genes = [x.replace(' ', '') for x in list(cell_phases_df[pd.notna(cell_phases_df.iloc[:,1])].iloc[:,1])]\n", 273 | "\n", 274 | " # Defining the cell cycles\n", 275 | "# sc.tl.score_genes_cell_cycle(self.adata, s_genes = s_genes, g2m_genes = g2m_genes)\n", 276 | "\n", 277 | "# self.adata.obs['phase'] = self.adata.obs.apply(lambda x: 'G2/M' if x['phase'] == 'G2M' else ('G0/G1' if x['phase']=='G1' else x['phase']\n", 278 | "# ), axis = 1)\n", 279 | " \n", 280 | "# self.adata.obs['drug_SMILES'] = self.adata.obs.apply(lambda x: self.drugs_SMILES_representation[x['condition']] \n", 281 | "# if x['condition'] in self.drugs_SMILES_representation.keys() else '', axis=1)" 282 | ] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "execution_count": 7, 287 | "id": "ada87798", 288 | "metadata": {}, 289 | "outputs": [ 290 | { 291 | "data": { 292 | "text/plain": [ 293 | "[AnnData object with n_obs × n_vars = 24500 × 32738\n", 294 | " obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status', 'condition_time'\n", 295 | " var: 'gene_ids',\n", 296 | " AnnData object with n_obs × n_vars = 28165 × 32738\n", 297 | " obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 
'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status'\n", 298 | " var: 'gene_ids',\n", 299 | " AnnData object with n_obs × n_vars = 72326 × 32738\n", 300 | " obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status', 'condition_time'\n", 301 | " var: 'gene_ids',\n", 302 | " AnnData object with n_obs × n_vars = 17553 × 32738\n", 303 | " obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'hash_assignment', 'hash_tag', 'channel', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status', 'condition_time'\n", 304 | " var: 'gene_ids',\n", 305 | " AnnData object with n_obs × n_vars = 37856 × 32738\n", 306 | " obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'channel', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status'\n", 307 | " var: 'gene_ids']" 308 | ] 309 | }, 310 | "execution_count": 7, 311 | "metadata": {}, 312 | "output_type": "execute_result" 313 | } 314 | ], 315 | "source": [ 316 | "%%time\n", 317 | "adatas = []\n", 
318 | "for expt in [1,2,3,5,10]:\n", 319 | " print(expt)\n", 320 | " adatas.append(MIX_seq_data_loader(\n", 321 | " raw_data_path=raw_data_path, \n", 322 | " expt_no=expt\n", 323 | " ).adata)\n", 324 | "adatas" 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 8, 330 | "id": "841bdcf1", 331 | "metadata": {}, 332 | "outputs": [ 333 | { 334 | "data": { 335 | "text/plain": [ 336 | "AnnData object with n_obs × n_vars = 180400 × 32738\n", 337 | " obs: 'singlet_ID', 'num_SNPs', 'singlet_dev', 'singlet_dev_z', 'singlet_margin', 'singlet_z_margin', 'doublet_z_margin', 'tot_reads', 'doublet_dev_imp', 'doublet_CL1', 'doublet_CL2', 'percent.mito', 'cell_det_rate', 'cell_quality', 'doublet_GMM_prob', 'DepMap_ID', 'condition', 'timepoint', 'control', 'cell_line', 'tissue_type', 'experiment', 'TP53_status'\n", 338 | " var: 'gene_ids'" 339 | ] 340 | }, 341 | "execution_count": 8, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "adata = ad.concat(adatas, keys=[n for n in [1,2,3,5,10]], label='experiment', index_unique='--', merge='same')\n", 348 | "adata" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 9, 354 | "id": "88eeba27", 355 | "metadata": {}, 356 | "outputs": [], 357 | "source": [ 358 | "!rm -r mcfarland2020" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 10, 364 | "id": "15d2b3d5", 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "name": "stderr", 369 | "output_type": "stream", 370 | "text": [ 371 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 372 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 373 | "... 
storing 'singlet_ID' as categorical\n", 374 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 375 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 376 | "... storing 'doublet_CL1' as categorical\n", 377 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 378 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 379 | "... storing 'doublet_CL2' as categorical\n", 380 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 381 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 382 | "... storing 'cell_quality' as categorical\n", 383 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 384 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 385 | "... 
storing 'DepMap_ID' as categorical\n", 386 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 387 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 388 | "... storing 'condition' as categorical\n", 389 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 390 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 391 | "... storing 'timepoint' as categorical\n", 392 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 393 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 394 | "... storing 'control' as categorical\n", 395 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 396 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 397 | "... 
storing 'cell_line' as categorical\n", 398 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 399 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 400 | "... storing 'tissue_type' as categorical\n", 401 | "/home/icb/yuge.ji/miniconda3/envs/py37/lib/python3.7/site-packages/anndata/_core/anndata.py:1220: FutureWarning: The `inplace` parameter in pandas.Categorical.reorder_categories is deprecated and will be removed in a future version. Reordering categories will always return a new Categorical object.\n", 402 | " c.reorder_categories(natsorted(c.categories), inplace=True)\n", 403 | "... storing 'TP53_status' as categorical\n" 404 | ] 405 | } 406 | ], 407 | "source": [ 408 | "adata.write(f'Mcfarland_2020_raw.h5ad')" 409 | ] 410 | } 411 | ], 412 | "metadata": { 413 | "kernelspec": { 414 | "display_name": "Python [conda env:py37] *", 415 | "language": "python", 416 | "name": "conda-env-py37-py" 417 | }, 418 | "language_info": { 419 | "codemirror_mode": { 420 | "name": "ipython", 421 | "version": 3 422 | }, 423 | "file_extension": ".py", 424 | "mimetype": "text/x-python", 425 | "name": "python", 426 | "nbconvert_exporter": "python", 427 | "pygments_lexer": "ipython3", 428 | "version": "3.7.0" 429 | } 430 | }, 431 | "nbformat": 4, 432 | "nbformat_minor": 5 433 | } 434 | -------------------------------------------------------------------------------- /datasets/Norman_2019_curation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "d12cb9dc", 6 | "metadata": {}, 7 | "source": [ 8 | "Accession: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE133344" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 
13 | "execution_count": 1, 14 | "id": "ca04f335-6926-4764-82ec-374d7c6f94b4", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import gzip\n", 19 | "import os\n", 20 | "import re\n", 21 | "\n", 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "from anndata import AnnData\n", 25 | "from scipy.io import mmread\n", 26 | "from scipy.sparse import coo_matrix\n", 27 | "\n", 28 | "from utils import download_binary_file\n", 29 | "\n", 30 | "# Gene program lists obtained by cross-referencing the heatmap here\n", 31 | "# https://github.com/thomasmaxwellnorman/Perturbseq_GI/blob/master/GI_optimal_umap.ipynb\n", 32 | "# with Figure 2b in Norman 2019\n", 33 | "G1_CYCLE = [\n", 34 | " \"CDKN1C+CDKN1B\",\n", 35 | " \"CDKN1B+ctrl\",\n", 36 | " \"CDKN1B+CDKN1A\",\n", 37 | " \"CDKN1C+ctrl\",\n", 38 | " \"ctrl+CDKN1A\",\n", 39 | " \"CDKN1C+CDKN1A\",\n", 40 | " \"CDKN1A+ctrl\",\n", 41 | "]\n", 42 | "\n", 43 | "ERYTHROID = [\n", 44 | " \"BPGM+SAMD1\",\n", 45 | " \"ATL1+ctrl\",\n", 46 | " \"UBASH3B+ZBTB25\",\n", 47 | " \"PTPN12+PTPN9\",\n", 48 | " \"PTPN12+UBASH3A\",\n", 49 | " \"CBL+CNN1\",\n", 50 | " \"UBASH3B+CNN1\",\n", 51 | " \"CBL+UBASH3B\",\n", 52 | " \"UBASH3B+PTPN9\",\n", 53 | " \"PTPN1+ctrl\",\n", 54 | " \"CBL+PTPN9\",\n", 55 | " \"CNN1+UBASH3A\",\n", 56 | " \"CBL+PTPN12\",\n", 57 | " \"PTPN12+ZBTB25\",\n", 58 | " \"UBASH3B+PTPN12\",\n", 59 | " \"SAMD1+PTPN12\",\n", 60 | " \"SAMD1+UBASH3B\",\n", 61 | " \"UBASH3B+UBASH3A\",\n", 62 | "]\n", 63 | "\n", 64 | "PIONEER_FACTORS = [\n", 65 | " \"ZBTB10+SNAI1\",\n", 66 | " \"FOXL2+MEIS1\",\n", 67 | " \"POU3F2+CBFA2T3\",\n", 68 | " \"DUSP9+SNAI1\",\n", 69 | " \"FOXA3+FOXA1\",\n", 70 | " \"FOXA3+ctrl\",\n", 71 | " \"LYL1+IER5L\",\n", 72 | " \"FOXA1+FOXF1\",\n", 73 | " \"FOXF1+HOXB9\",\n", 74 | " \"FOXA1+HOXB9\",\n", 75 | " \"FOXA3+HOXB9\",\n", 76 | " \"FOXA3+FOXA1\",\n", 77 | " \"FOXA3+FOXL2\",\n", 78 | " \"POU3F2+FOXL2\",\n", 79 | " \"FOXF1+FOXL2\",\n", 80 | " \"FOXA1+FOXL2\",\n", 81 | " 
\"HOXA13+ctrl\",\n", 82 | " \"ctrl+HOXC13\",\n", 83 | " \"HOXC13+ctrl\",\n", 84 | " \"MIDN+ctrl\",\n", 85 | " \"TP73+ctrl\",\n", 86 | "]\n", 87 | "\n", 88 | "GRANULOCYTE_APOPTOSIS = [\n", 89 | " \"SPI1+ctrl\",\n", 90 | " \"ctrl+SPI1\",\n", 91 | " \"ctrl+CEBPB\",\n", 92 | " \"CEBPB+ctrl\",\n", 93 | " \"JUN+CEBPA\",\n", 94 | " \"CEBPB+CEBPA\",\n", 95 | " \"FOSB+CEBPE\",\n", 96 | " \"ZC3HAV1+CEBPA\",\n", 97 | " \"KLF1+CEBPA\",\n", 98 | " \"ctrl+CEBPA\",\n", 99 | " \"CEBPA+ctrl\",\n", 100 | " \"CEBPE+CEBPA\",\n", 101 | " \"CEBPE+SPI1\",\n", 102 | " \"CEBPE+ctrl\",\n", 103 | " \"ctrl+CEBPE\",\n", 104 | " \"CEBPE+RUNX1T1\",\n", 105 | " \"CEBPE+CEBPB\",\n", 106 | " \"FOSB+CEBPB\",\n", 107 | " \"ETS2+CEBPE\",\n", 108 | "]\n", 109 | "\n", 110 | "MEGAKARYOCYTE = [\n", 111 | " \"ctrl+ETS2\",\n", 112 | " \"MAPK1+ctrl\",\n", 113 | " \"ctrl+MAPK1\",\n", 114 | " \"ETS2+MAPK1\",\n", 115 | " \"CEBPB+MAPK1\",\n", 116 | " \"MAPK1+TGFBR2\",\n", 117 | "]\n", 118 | "\n", 119 | "PRO_GROWTH = [\n", 120 | " \"CEBPE+KLF1\",\n", 121 | " \"KLF1+MAP2K6\",\n", 122 | " \"AHR+KLF1\",\n", 123 | " \"ctrl+KLF1\",\n", 124 | " \"KLF1+ctrl\",\n", 125 | " \"KLF1+BAK1\",\n", 126 | " \"KLF1+TGFBR2\",\n", 127 | "]\n", 128 | "\n", 129 | "\n", 130 | "def download_norman_2019(output_path: str) -> None:\n", 131 | " \"\"\"\n", 132 | " Download Norman et al. 2019 data and metadata files from the hosting URLs.\n", 133 | "\n", 134 | " Args:\n", 135 | " ----\n", 136 | " output_path: Output path to store the downloaded and unzipped\n", 137 | " directories.\n", 138 | "\n", 139 | " Returns\n", 140 | " -------\n", 141 | " None. 
File directories are downloaded to output_path.\n", 142 | " \"\"\"\n", 143 | "\n", 144 | " file_urls = (\n", 145 | " \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n", 146 | " \"/GSE133344_filtered_matrix.mtx.gz\",\n", 147 | " \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n", 148 | " \"/GSE133344_filtered_genes.tsv.gz\",\n", 149 | " \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n", 150 | " \"/GSE133344_filtered_barcodes.tsv.gz\",\n", 151 | " \"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl\"\n", 152 | " \"/GSE133344_filtered_cell_identities.csv.gz\",\n", 153 | " )\n", 154 | "\n", 155 | " for url in file_urls:\n", 156 | " output_filename = os.path.join(output_path, url.split(\"/\")[-1])\n", 157 | " download_binary_file(url, output_filename)\n", 158 | "\n", 159 | "\n", 160 | "def read_norman_2019(file_directory: str) -> coo_matrix:\n", 161 | " \"\"\"\n", 162 | " Read the expression data for Norman et al. 2019 in the given directory.\n", 163 | "\n", 164 | " Args:\n", 165 | " ----\n", 166 | " file_directory: Directory containing Norman et al. 
2019 data.\n", 167 | "\n", 168 | " Returns\n", 169 | " -------\n", 170 | " A sparse matrix containing single-cell gene expression count, with rows\n", 171 | " representing genes and columns representing cells.\n", 172 | " \"\"\"\n", 173 | "\n", 174 | " with gzip.open(\n", 175 | " os.path.join(file_directory, \"GSE133344_filtered_matrix.mtx.gz\"), \"rb\"\n", 176 | " ) as f:\n", 177 | " matrix = mmread(f)\n", 178 | "\n", 179 | " return matrix" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 2, 185 | "id": "21457d17-ce85-405e-af71-b98f55cd9dfc", 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "name": "stdout", 190 | "output_type": "stream", 191 | "text": [ 192 | "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_matrix.mtx.gz at ./GSE133344_filtered_matrix.mtx.gz\n", 193 | "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_genes.tsv.gz at ./GSE133344_filtered_genes.tsv.gz\n", 194 | "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_barcodes.tsv.gz at ./GSE133344_filtered_barcodes.tsv.gz\n", 195 | "Downloaded data from https://ftp.ncbi.nlm.nih.gov/geo/series/GSE133nnn/GSE133344/suppl/GSE133344_filtered_cell_identities.csv.gz at ./GSE133344_filtered_cell_identities.csv.gz\n" 196 | ] 197 | }, 198 | { 199 | "name": "stderr", 200 | "output_type": "stream", 201 | "text": [ 202 | "Trying to set attribute `.obs` of view, copying.\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "download_path = \"./norman2019/\"\n", 208 | "\n", 209 | "download_norman_2019(download_path)\n", 210 | "\n", 211 | "matrix = read_norman_2019(download_path)\n", 212 | "\n", 213 | "# List of cell barcodes. 
The barcodes in this list are stored in the same order\n", 214 | "# as cells are in the count matrix.\n", 215 | "cell_barcodes = pd.read_csv(\n", 216 | " os.path.join(download_path, \"GSE133344_filtered_barcodes.tsv.gz\"),\n", 217 | " sep=\"\\t\",\n", 218 | " header=None,\n", 219 | " names=[\"cell_barcode\"],\n", 220 | ")\n", 221 | "\n", 222 | "# IDs/names of the gene features.\n", 223 | "gene_list = pd.read_csv(\n", 224 | " os.path.join(download_path, \"GSE133344_filtered_genes.tsv.gz\"),\n", 225 | " sep=\"\\t\",\n", 226 | " header=None,\n", 227 | " names=[\"gene_id\", \"gene_name\"],\n", 228 | ")\n", 229 | "\n", 230 | "# Dataframe where each row corresponds to a cell, and each column corresponds\n", 231 | "# to a gene feature.\n", 232 | "matrix = pd.DataFrame(\n", 233 | " matrix.transpose().todense(),\n", 234 | " columns=gene_list[\"gene_id\"],\n", 235 | " index=cell_barcodes[\"cell_barcode\"],\n", 236 | " dtype=\"int32\",\n", 237 | ")\n", 238 | "\n", 239 | "# Dataframe mapping cell barcodes to metadata about that cell (e.g. which CRISPR\n", 240 | "# guides were applied to that cell). Unfortunately, this list has a different\n", 241 | "# ordering from the count matrix, so we have to be careful combining the metadata\n", 242 | "# and count data.\n", 243 | "cell_identities = pd.read_csv(\n", 244 | " os.path.join(download_path, \"GSE133344_filtered_cell_identities.csv.gz\")\n", 245 | ").set_index(\"cell_barcode\")\n", 246 | "\n", 247 | "# This merge call reorders our metadata dataframe to match the ordering in the\n", 248 | "# count matrix. 
Some cells in `cell_barcodes` do not have metadata associated with\n", 249 | "# them, and their metadata values will be filled in as NaN.\n", 250 | "aligned_metadata = pd.merge(\n", 251 | " cell_barcodes,\n", 252 | " cell_identities,\n", 253 | " left_on=\"cell_barcode\",\n", 254 | " right_index=True,\n", 255 | " how=\"left\",\n", 256 | ").set_index(\"cell_barcode\")\n", 257 | "\n", 258 | "adata = AnnData(matrix)\n", 259 | "adata.obs = aligned_metadata\n", 260 | "\n", 261 | "# Filter out any cells that don't have metadata values.\n", 262 | "rows_without_nans = [\n", 263 | " index for index, row in adata.obs.iterrows() if not row.isnull().any()\n", 264 | "]\n", 265 | "adata = adata[rows_without_nans, :]\n", 266 | "\n", 267 | "# Remove these as suggested by the authors. See lines referring to\n", 268 | "# NegCtrl1_NegCtrl0 in GI_generate_populations.ipynb in the Norman 2019 paper's\n", 269 | "# GitHub repo https://github.com/thomasmaxwellnorman/Perturbseq_GI/\n", 270 | "adata = adata[adata.obs[\"guide_identity\"] != \"NegCtrl1_NegCtrl0__NegCtrl1_NegCtrl0\"]\n", 271 | "\n", 272 | "# We create a new metadata column with cleaner representations of CRISPR guide\n", 273 | "# identities. The original format is <Guide1>_<Guide2>__<Guide1>_<Guide2>_<number>\n", 274 | "adata.obs[\"guide_merged\"] = adata.obs[\"guide_identity\"]\n", 275 | "\n", 276 | "control_regex = re.compile(r\"NegCtrl(.*)_NegCtrl(.*)+NegCtrl(.*)_NegCtrl(.*)\")\n", 277 | "for i in adata.obs[\"guide_merged\"].unique():\n", 278 | " if control_regex.match(i):\n", 279 | " # For any cells that only had control guides, we don't care about the\n", 280 | " # specific IDs of the guides. Here we relabel them just as \"ctrl\".\n", 281 | " adata.obs[\"guide_merged\"].replace(i, \"ctrl\", inplace=True)\n", 282 | " else:\n", 283 | " # Otherwise, we reformat the guide label to be <Guide1>+<Guide2>.
If Guide1\n", 284 | " # or Guide2 was a control, we replace it with \"ctrl\".\n", 285 | " split = i.split(\"__\")[0]\n", 286 | " split = split.split(\"_\")\n", 287 | " for j, string in enumerate(split):\n", 288 | " if \"NegCtrl\" in split[j]:\n", 289 | " split[j] = \"ctrl\"\n", 290 | " adata.obs[\"guide_merged\"].replace(i, f\"{split[0]}+{split[1]}\", inplace=True)\n", 291 | "\n", 292 | "guides_to_programs = {}\n", 293 | "guides_to_programs.update(dict.fromkeys(G1_CYCLE, \"G1 cell cycle arrest\"))\n", 294 | "guides_to_programs.update(dict.fromkeys(ERYTHROID, \"Erythroid\"))\n", 295 | "guides_to_programs.update(dict.fromkeys(PIONEER_FACTORS, \"Pioneer factors\"))\n", 296 | "guides_to_programs.update(\n", 297 | " dict.fromkeys(GRANULOCYTE_APOPTOSIS, \"Granulocyte/apoptosis\")\n", 298 | ")\n", 299 | "guides_to_programs.update(dict.fromkeys(PRO_GROWTH, \"Pro-growth\"))\n", 300 | "guides_to_programs.update(dict.fromkeys(MEGAKARYOCYTE, \"Megakaryocyte\"))\n", 301 | "guides_to_programs.update(dict.fromkeys([\"ctrl\"], \"Ctrl\"))\n", 302 | "\n", 303 | "adata.obs[\"gene_program\"] = [guides_to_programs[x] if x in guides_to_programs else \"N/A\" for x in adata.obs[\"guide_merged\"]]\n", 304 | "adata.obs[\"good_coverage\"] = adata.obs[\"good_coverage\"].astype(bool)" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 5, 310 | "id": "72c5c54f", 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "adata.write('Norman_2019_raw.h5ad')" 315 | ] 316 | } 317 | ], 318 | "metadata": { 319 | "kernelspec": { 320 | "display_name": "Python 3 (ipykernel)", 321 | "language": "python", 322 | "name": "python3" 323 | }, 324 | "language_info": { 325 | "codemirror_mode": { 326 | "name": "ipython", 327 | "version": 3 328 | }, 329 | "file_extension": ".py", 330 | "mimetype": "text/x-python", 331 | "name": "python", 332 | "nbconvert_exporter": "python", 333 | "pygments_lexer": "ipython3", 334 | "version": "3.7.0" 335 | } 336 | }, 337 | "nbformat": 4, 338 | 
"nbformat_minor": 5 339 | } 340 | -------------------------------------------------------------------------------- /datasets/Schraivogel_2020.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 4, 6 | "id": "78978bc6", 7 | "metadata": {}, 8 | "outputs": [ 9 | { 10 | "name": "stdout", 11 | "output_type": "stream", 12 | "text": [ 13 | "--2022-01-12 17:43:26-- https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE135497&format=file\n", 14 | "Resolving www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)... 130.14.29.110, 2607:f220:41e:4290::110\n", 15 | "Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.\n", 16 | "HTTP request sent, awaiting response... 200 OK\n", 17 | "Length: 183797760 (175M) [application/x-tar]\n", 18 | "Saving to: ‘GSE135497_RAW.tar’\n", 19 | "\n", 20 | "GSE135497_RAW.tar 100%[===================>] 175.28M 2.08MB/s in 2m 6s \n", 21 | "\n", 22 | "2022-01-12 17:45:33 (1.39 MB/s) - ‘GSE135497_RAW.tar’ saved [183797760/183797760]\n", 23 | "\n", 24 | "GSM4012674_TASC_TAR_p1r1.counts.csv.gz\n", 25 | "GSM4012675_TASC_TAR_p1r2.counts.csv.gz\n", 26 | "GSM4012676_TASC_TAR_p2r1.counts.csv.gz\n", 27 | "GSM4012677_TASC_TAR_p2r2.counts.csv.gz\n", 28 | "GSM4012678_TASC_TAR_p12.counts.csv.gz\n", 29 | "GSM4012679_TASC_TAR_p123.counts.csv.gz\n", 30 | "GSM4012680_TASC_TAR_DropSeq.counts.csv.gz\n", 31 | "GSM4012682_WTX_REF.counts.txt.gz\n", 32 | "GSM4012683_WTX_REF_DropSeq.counts.csv.gz\n", 33 | "GSM4012684_TASC_CT_totalBM_sample1.counts.csv.gz\n", 34 | "GSM4012685_TASC_CT_totalBM_sample2.counts.csv.gz\n", 35 | "GSM4012686_TASC_CT_kitBM_sample1.counts.csv.gz\n", 36 | "GSM4012687_TASC_CT_kitBM_sample2.counts.csv.gz\n", 37 | "GSM4012688_TASC_DIFFEX_sample1.counts.csv.gz\n", 38 | "GSM4012688_TASC_DIFFEX_sample1.pertStatus.csv.gz\n", 39 | "GSM4012689_TASC_DIFFEX_sample2.counts.csv.gz\n", 40 | 
"GSM4012689_TASC_DIFFEX_sample2.pertStatus.csv.gz\n", 41 | "GSM4012690_WTX_DIFFEX_sample1.counts.csv.gz\n", 42 | "GSM4012690_WTX_DIFFEX_sample1.pertStatus.csv.gz\n", 43 | "GSM4012691_WTX_DIFFEX_sample2.counts.csv.gz\n", 44 | "GSM4012691_WTX_DIFFEX_sample2.pertStatus.csv.gz\n", 45 | "GSM4012692_WTX_DIFFEX_sample3.counts.csv.gz\n", 46 | "GSM4012692_WTX_DIFFEX_sample3.pertStatus.csv.gz\n", 47 | "GSM4012693_WTX_DIFFEX_sample4.counts.csv.gz\n", 48 | "GSM4012693_WTX_DIFFEX_sample4.pertStatus.csv.gz\n", 49 | "GSM4012694_TASC_SCREEN_chr8_sample1.counts.csv.gz\n", 50 | "GSM4012694_TASC_SCREEN_chr8_sample1.pertStatus.csv.gz\n", 51 | "GSM4012695_TASC_SCREEN_chr8_sample2.counts.csv.gz\n", 52 | "GSM4012695_TASC_SCREEN_chr8_sample2.pertStatus.csv.gz\n", 53 | "GSM4012696_TASC_SCREEN_chr8_sample3.counts.csv.gz\n", 54 | "GSM4012696_TASC_SCREEN_chr8_sample3.pertStatus.csv.gz\n", 55 | "GSM4012697_TASC_SCREEN_chr8_sample4.counts.csv.gz\n", 56 | "GSM4012697_TASC_SCREEN_chr8_sample4.pertStatus.csv.gz\n", 57 | "GSM4012698_TASC_SCREEN_chr8_sample5.counts.csv.gz\n", 58 | "GSM4012698_TASC_SCREEN_chr8_sample5.pertStatus.csv.gz\n", 59 | "GSM4012699_TASC_SCREEN_chr8_sample6.counts.csv.gz\n", 60 | "GSM4012699_TASC_SCREEN_chr8_sample6.pertStatus.csv.gz\n", 61 | "GSM4012700_TASC_SCREEN_chr8_sample7.counts.csv.gz\n", 62 | "GSM4012700_TASC_SCREEN_chr8_sample7.pertStatus.csv.gz\n", 63 | "GSM4012701_TASC_SCREEN_chr8_sample8.counts.csv.gz\n", 64 | "GSM4012701_TASC_SCREEN_chr8_sample8.pertStatus.csv.gz\n", 65 | "GSM4012702_TASC_SCREEN_chr8_sample9.counts.csv.gz\n", 66 | "GSM4012702_TASC_SCREEN_chr8_sample9.pertStatus.csv.gz\n", 67 | "GSM4012703_TASC_SCREEN_chr8_sample10.counts.csv.gz\n", 68 | "GSM4012703_TASC_SCREEN_chr8_sample10.pertStatus.csv.gz\n", 69 | "GSM4012704_TASC_SCREEN_chr8_sample11.counts.csv.gz\n", 70 | "GSM4012704_TASC_SCREEN_chr8_sample11.pertStatus.csv.gz\n", 71 | "GSM4012705_TASC_SCREEN_chr8_sample12.counts.csv.gz\n", 72 | "GSM4012705_TASC_SCREEN_chr8_sample12.pertStatus.csv.gz\n", 73 
| "GSM4012706_TASC_SCREEN_chr8_sample13.counts.csv.gz\n", 74 | "GSM4012706_TASC_SCREEN_chr8_sample13.pertStatus.csv.gz\n", 75 | "GSM4012707_TASC_SCREEN_chr8_sample14.counts.csv.gz\n", 76 | "GSM4012707_TASC_SCREEN_chr8_sample14.pertStatus.csv.gz\n", 77 | "GSM4012708_TASC_SCREEN_chr11_sample1.counts.csv.gz\n", 78 | "GSM4012708_TASC_SCREEN_chr11_sample1.pertStatus.csv.gz\n", 79 | "GSM4012709_TASC_SCREEN_chr11_sample2.counts.csv.gz\n", 80 | "GSM4012709_TASC_SCREEN_chr11_sample2.pertStatus.csv.gz\n", 81 | "GSM4012710_TASC_SCREEN_chr11_sample3.counts.csv.gz\n", 82 | "GSM4012710_TASC_SCREEN_chr11_sample3.pertStatus.csv.gz\n", 83 | "GSM4012711_TASC_SCREEN_chr11_sample4.counts.csv.gz\n", 84 | "GSM4012711_TASC_SCREEN_chr11_sample4.pertStatus.csv.gz\n", 85 | "GSM4012712_TASC_SCREEN_chr11_sample5.counts.csv.gz\n", 86 | "GSM4012712_TASC_SCREEN_chr11_sample5.pertStatus.csv.gz\n", 87 | "GSM4012713_TASC_SCREEN_chr11_sample6.counts.csv.gz\n", 88 | "GSM4012713_TASC_SCREEN_chr11_sample6.pertStatus.csv.gz\n", 89 | "GSM4012714_TASC_SCREEN_chr11_sample7.counts.csv.gz\n", 90 | "GSM4012714_TASC_SCREEN_chr11_sample7.pertStatus.csv.gz\n", 91 | "GSM4012715_TASC_SCREEN_chr11_sample8.counts.csv.gz\n", 92 | "GSM4012715_TASC_SCREEN_chr11_sample8.pertStatus.csv.gz\n", 93 | "GSM4012716_TASC_SCREEN_chr11_sample9.counts.csv.gz\n", 94 | "GSM4012716_TASC_SCREEN_chr11_sample9.pertStatus.csv.gz\n", 95 | "GSM4012717_TASC_SCREEN_chr11_sample10.counts.csv.gz\n", 96 | "GSM4012717_TASC_SCREEN_chr11_sample10.pertStatus.csv.gz\n", 97 | "GSM4012718_TASC_SCREEN_chr11_sample11.counts.csv.gz\n", 98 | "GSM4012718_TASC_SCREEN_chr11_sample11.pertStatus.csv.gz\n", 99 | "GSM4012719_TASC_SCREEN_chr11_sample13.counts.csv.gz\n", 100 | "GSM4012719_TASC_SCREEN_chr11_sample13.pertStatus.csv.gz\n", 101 | "GSM4012720_TASC_SCREEN_chr11_sample14.counts.csv.gz\n", 102 | "GSM4012720_TASC_SCREEN_chr11_sample14.pertStatus.csv.gz\n", 103 | "GSM4265995_TASC_DEEP_p1.counts.csv.gz\n", 104 | "GSM4265998_TASC_DEEP_p2.counts.csv.gz\n", 105 
| "GSM4266000_TASC_DEEP_p3.counts.csv.gz\n", 106 | "GSM4266002_TASC_DEEP_p3lung.counts.csv.gz\n", 107 | "GSM4266004_TASC_TAR_L1000.counts.csv.gz\n", 108 | "GSM4266004_TASC_TAR_L1000.pertStatus.csv.gz\n", 109 | "GSM4266006_WTX_REF_Deep.counts.csv.gz\n", 110 | "GSM4266008_WTX_REF_Mix.counts.csv.gz\n", 111 | "GSM4266010_WTX_REF_Lung.counts.csv.gz\n", 112 | "GSM4266012_TASC_REDESIGN_control.counts.csv.gz\n", 113 | "GSM4266012_TASC_REDESIGN_control.pertStatus.csv.gz\n", 114 | "GSM4266014_TASC_REDESIGN_v1.counts.csv.gz\n", 115 | "GSM4266014_TASC_REDESIGN_v1.pertStatus.csv.gz\n", 116 | "GSM4266017_TASC_REDESIGN_v2.counts.csv.gz\n", 117 | "GSM4266017_TASC_REDESIGN_v2.pertStatus.csv.gz\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "!wget -O GSE135497_RAW.tar 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE135497&format=file'\n", 123 | "!tar -xvf GSE135497_RAW.tar" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 8, 129 | "id": "772331e2", 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "import pandas as pd\n", 134 | "meta = pd.read_csv('GSM4012703_TASC_SCREEN_chr8_sample10.pertStatus.csv.gz', index_col=0)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": 11, 140 | "id": "0e6911f4", 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/html": [ 146 | "
<p>4119 rows × 9517 columns</p>
" 456 | ], 457 | "text/plain": [ 458 | " AACCAACGACGAGGACAAGCGATG AACCAACGACTTTCAAGCGATAGC \\\n", 459 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 460 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 461 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 462 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 463 | "CPQ_+_97657557.23-P1P2 0 0 \n", 464 | "... ... ... \n", 465 | "non-targeting_00025 0 0 \n", 466 | "non-targeting_00026 0 0 \n", 467 | "non-targeting_00027 0 0 \n", 468 | "non-targeting_00028 0 0 \n", 469 | "non-targeting_00029 0 0 \n", 470 | "\n", 471 | " AACCAACGAGAGCGACAAGCGATG AACCAACGCCTAGCTCAAGCGATG \\\n", 472 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 473 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 474 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 475 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 476 | "CPQ_+_97657557.23-P1P2 0 0 \n", 477 | "... ... ... \n", 478 | "non-targeting_00025 0 0 \n", 479 | "non-targeting_00026 0 0 \n", 480 | "non-targeting_00027 0 0 \n", 481 | "non-targeting_00028 0 0 \n", 482 | "non-targeting_00029 0 0 \n", 483 | "\n", 484 | " AACCAACGTGGGCGTCAAGCGATG AGGCAGAACCGTTCAAGCGATCCC \\\n", 485 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 486 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 487 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 488 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 489 | "CPQ_+_97657557.23-P1P2 0 0 \n", 490 | "... ... ... \n", 491 | "non-targeting_00025 0 0 \n", 492 | "non-targeting_00026 0 0 \n", 493 | "non-targeting_00027 0 0 \n", 494 | "non-targeting_00028 0 0 \n", 495 | "non-targeting_00029 0 0 \n", 496 | "\n", 497 | " AGGCAGAACCGTTCAAGCGATTCT AGGCAGAACTGATAGCAAGCGATG \\\n", 498 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 499 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 500 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 501 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 502 | "CPQ_+_97657557.23-P1P2 0 0 \n", 503 | "... ... ... 
\n", 504 | "non-targeting_00025 0 0 \n", 505 | "non-targeting_00026 0 0 \n", 506 | "non-targeting_00027 0 0 \n", 507 | "non-targeting_00028 0 0 \n", 508 | "non-targeting_00029 0 0 \n", 509 | "\n", 510 | " AGGCAGAAGCTGCTTCAAGCGATG AGGCAGAAGTCCTCAAGCGATTCT \\\n", 511 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 512 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 513 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 514 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 515 | "CPQ_+_97657557.23-P1P2 0 0 \n", 516 | "... ... ... \n", 517 | "non-targeting_00025 0 0 \n", 518 | "non-targeting_00026 0 0 \n", 519 | "non-targeting_00027 0 0 \n", 520 | "non-targeting_00028 0 0 \n", 521 | "non-targeting_00029 0 0 \n", 522 | "\n", 523 | " ... TAGGCATGGGAGCAACAAGCGATG \\\n", 524 | "CCNE2_+_95907328.23-P1P2 ... 0 \n", 525 | "CCNE2_+_95907382.23-P1P2 ... 0 \n", 526 | "CCNE2_+_95907406.23-P1P2 ... 0 \n", 527 | "CCNE2_-_95907017.23-P1P2 ... 0 \n", 528 | "CPQ_+_97657557.23-P1P2 ... 0 \n", 529 | "... ... ... \n", 530 | "non-targeting_00025 ... 0 \n", 531 | "non-targeting_00026 ... 0 \n", 532 | "non-targeting_00027 ... 0 \n", 533 | "non-targeting_00028 ... 0 \n", 534 | "non-targeting_00029 ... 0 \n", 535 | "\n", 536 | " TAGGCATGGTCCTCAAGCGATATA TAGGCATGTTTGCGCCAAGCGATG \\\n", 537 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 538 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 539 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 540 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 541 | "CPQ_+_97657557.23-P1P2 0 0 \n", 542 | "... ... ... \n", 543 | "non-targeting_00025 0 0 \n", 544 | "non-targeting_00026 0 0 \n", 545 | "non-targeting_00027 0 0 \n", 546 | "non-targeting_00028 0 0 \n", 547 | "non-targeting_00029 0 0 \n", 548 | "\n", 549 | " TCCTGAGCAGCAGCCCAAGCGATG TCCTGAGCATGGGAGCAAGCGATG \\\n", 550 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 551 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 552 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 553 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 554 | "CPQ_+_97657557.23-P1P2 0 0 \n", 555 | "... ... ... 
\n", 556 | "non-targeting_00025 0 0 \n", 557 | "non-targeting_00026 0 0 \n", 558 | "non-targeting_00027 0 0 \n", 559 | "non-targeting_00028 0 0 \n", 560 | "non-targeting_00029 0 0 \n", 561 | "\n", 562 | " TCCTGAGCGCATACAAGCGATATA TCCTGAGCGTGTTAGCAAGCGATG \\\n", 563 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 564 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 565 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 566 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 567 | "CPQ_+_97657557.23-P1P2 0 0 \n", 568 | "... ... ... \n", 569 | "non-targeting_00025 0 0 \n", 570 | "non-targeting_00026 0 0 \n", 571 | "non-targeting_00027 0 0 \n", 572 | "non-targeting_00028 0 0 \n", 573 | "non-targeting_00029 0 0 \n", 574 | "\n", 575 | " TGATTGACACGGGCTCAAGCGATG TGATTGACGACTACAAGCGATCCC \\\n", 576 | "CCNE2_+_95907328.23-P1P2 0 0 \n", 577 | "CCNE2_+_95907382.23-P1P2 0 0 \n", 578 | "CCNE2_+_95907406.23-P1P2 0 0 \n", 579 | "CCNE2_-_95907017.23-P1P2 0 0 \n", 580 | "CPQ_+_97657557.23-P1P2 0 0 \n", 581 | "... ... ... \n", 582 | "non-targeting_00025 0 0 \n", 583 | "non-targeting_00026 0 0 \n", 584 | "non-targeting_00027 0 0 \n", 585 | "non-targeting_00028 0 0 \n", 586 | "non-targeting_00029 0 0 \n", 587 | "\n", 588 | " TGATTGACTTCTACAAGCGATGAC \n", 589 | "CCNE2_+_95907328.23-P1P2 0 \n", 590 | "CCNE2_+_95907382.23-P1P2 0 \n", 591 | "CCNE2_+_95907406.23-P1P2 0 \n", 592 | "CCNE2_-_95907017.23-P1P2 0 \n", 593 | "CPQ_+_97657557.23-P1P2 0 \n", 594 | "... ... 
\n", 595 | "non-targeting_00025 0 \n", 596 | "non-targeting_00026 0 \n", 597 | "non-targeting_00027 0 \n", 598 | "non-targeting_00028 0 \n", 599 | "non-targeting_00029 0 \n", 600 | "\n", 601 | "[4119 rows x 9517 columns]" 602 | ] 603 | }, 604 | "execution_count": 11, 605 | "metadata": {}, 606 | "output_type": "execute_result" 607 | } 608 | ], 609 | "source": [ 610 | "meta" 611 | ] 612 | }, 613 | { 614 | "cell_type": "code", 615 | "execution_count": 16, 616 | "id": "76a212c0", 617 | "metadata": {}, 618 | "outputs": [ 619 | { 620 | "data": { 621 | "text/plain": [ 622 | "CCNE2_+_95907328.23-P1P2 0\n", 623 | "CCNE2_+_95907382.23-P1P2 0\n", 624 | "CCNE2_+_95907406.23-P1P2 0\n", 625 | "CCNE2_-_95907017.23-P1P2 0\n", 626 | "CPQ_+_97657557.23-P1P2 0\n", 627 | " ..\n", 628 | "non-targeting_00025 0\n", 629 | "non-targeting_00026 0\n", 630 | "non-targeting_00027 0\n", 631 | "non-targeting_00028 0\n", 632 | "non-targeting_00029 2\n", 633 | "Length: 4119, dtype: int64" 634 | ] 635 | }, 636 | "execution_count": 16, 637 | "metadata": {}, 638 | "output_type": "execute_result" 639 | } 640 | ], 641 | "source": [ 642 | "meta.sum(1)" 643 | ] 644 | }, 645 | { 646 | "cell_type": "code", 647 | "execution_count": 15, 648 | "id": "3cfcb0df", 649 | "metadata": {}, 650 | "outputs": [ 651 | { 652 | "data": { 653 | "text/plain": [ 654 | "[19,\n", 655 | " 19,\n", 656 | " 18,\n", 657 | " 17,\n", 658 | " 16,\n", 659 | " 14,\n", 660 | " 14,\n", 661 | " 14,\n", 662 | " 14,\n", 663 | " 14,\n", 664 | " 13,\n", 665 | " 13,\n", 666 | " 13,\n", 667 | " 13,\n", 668 | " 13,\n", 669 | " 13,\n", 670 | " 13,\n", 671 | " 13,\n", 672 | " 12,\n", 673 | " 12,\n", 674 | " 12,\n", 675 | " 12,\n", 676 | " 12,\n", 677 | " 12,\n", 678 | " 12,\n", 679 | " 12,\n", 680 | " 12,\n", 681 | " 12,\n", 682 | " 12,\n", 683 | " 11,\n", 684 | " 11,\n", 685 | " 11,\n", 686 | " 11,\n", 687 | " 11,\n", 688 | " 11,\n", 689 | " 11,\n", 690 | " 11,\n", 691 | " 11,\n", 692 | " 11,\n", 693 | " 11,\n", 694 | " 11,\n", 695 | " 11,\n", 696 
| " 11,\n", 697 | " 11,\n", 698 | " 11,\n", 699 | " 11,\n", 700 | " 11,\n", 701 | " 11,\n", 702 | " 11,\n", 703 | " 10,\n", 704 | " 10,\n", 705 | " 10,\n", 706 | " 10,\n", 707 | " 10,\n", 708 | " 10,\n", 709 | " 10,\n", 710 | " 10,\n", 711 | " 10,\n", 712 | " 10,\n", 713 | " 10,\n", 714 | " 10,\n", 715 | " 10,\n", 716 | " 10,\n", 717 | " 10,\n", 718 | " 10,\n", 719 | " 10,\n", 720 | " 10,\n", 721 | " 10,\n", 722 | " 10,\n", 723 | " 10,\n", 724 | " 10,\n", 725 | " 9,\n", 726 | " 9,\n", 727 | " 9,\n", 728 | " 9,\n", 729 | " 9,\n", 730 | " 9,\n", 731 | " 9,\n", 732 | " 9,\n", 733 | " 9,\n", 734 | " 9,\n", 735 | " 9,\n", 736 | " 9,\n", 737 | " 9,\n", 738 | " 9,\n", 739 | " 9,\n", 740 | " 9,\n", 741 | " 9,\n", 742 | " 9,\n", 743 | " 9,\n", 744 | " 9,\n", 745 | " 9,\n", 746 | " 9,\n", 747 | " 9,\n", 748 | " 9,\n", 749 | " 9,\n", 750 | " 9,\n", 751 | " 9,\n", 752 | " 9,\n", 753 | " 9,\n", 754 | " 9,\n", 755 | " 9,\n", 756 | " 9,\n", 757 | " 9,\n", 758 | " 9,\n", 759 | " 9,\n", 760 | " 9,\n", 761 | " 9,\n", 762 | " 9,\n", 763 | " 9,\n", 764 | " 9,\n", 765 | " 9,\n", 766 | " 8,\n", 767 | " 8,\n", 768 | " 8,\n", 769 | " 8,\n", 770 | " 8,\n", 771 | " 8,\n", 772 | " 8,\n", 773 | " 8,\n", 774 | " 8,\n", 775 | " 8,\n", 776 | " 8,\n", 777 | " 8,\n", 778 | " 8,\n", 779 | " 8,\n", 780 | " 8,\n", 781 | " 8,\n", 782 | " 8,\n", 783 | " 8,\n", 784 | " 8,\n", 785 | " 8,\n", 786 | " 8,\n", 787 | " 8,\n", 788 | " 8,\n", 789 | " 8,\n", 790 | " 8,\n", 791 | " 8,\n", 792 | " 8,\n", 793 | " 8,\n", 794 | " 8,\n", 795 | " 8,\n", 796 | " 8,\n", 797 | " 8,\n", 798 | " 8,\n", 799 | " 8,\n", 800 | " 8,\n", 801 | " 8,\n", 802 | " 8,\n", 803 | " 8,\n", 804 | " 8,\n", 805 | " 8,\n", 806 | " 8,\n", 807 | " 8,\n", 808 | " 8,\n", 809 | " 8,\n", 810 | " 8,\n", 811 | " 8,\n", 812 | " 8,\n", 813 | " 8,\n", 814 | " 8,\n", 815 | " 8,\n", 816 | " 8,\n", 817 | " 8,\n", 818 | " 8,\n", 819 | " 8,\n", 820 | " 8,\n", 821 | " 8,\n", 822 | " 8,\n", 823 | " 8,\n", 824 | " 8,\n", 825 | " 8,\n", 826 | " 8,\n", 827 | " 
8,\n", 828 | " 8,\n", 829 | " 8,\n", 830 | " 8,\n", 831 | " 8,\n", 832 | " 8,\n", 833 | " 8,\n", 834 | " 8,\n", 835 | " 7,\n", 836 | " 7,\n", 837 | " 7,\n", 838 | " 7,\n", 839 | " 7,\n", 840 | " 7,\n", 841 | " 7,\n", 842 | " 7,\n", 843 | " 7,\n", 844 | " 7,\n", 845 | " 7,\n", 846 | " 7,\n", 847 | " 7,\n", 848 | " 7,\n", 849 | " 7,\n", 850 | " 7,\n", 851 | " 7,\n", 852 | " 7,\n", 853 | " 7,\n", 854 | " 7,\n", 855 | " 7,\n", 856 | " 7,\n", 857 | " 7,\n", 858 | " 7,\n", 859 | " 7,\n", 860 | " 7,\n", 861 | " 7,\n", 862 | " 7,\n", 863 | " 7,\n", 864 | " 7,\n", 865 | " 7,\n", 866 | " 7,\n", 867 | " 7,\n", 868 | " 7,\n", 869 | " 7,\n", 870 | " 7,\n", 871 | " 7,\n", 872 | " 7,\n", 873 | " 7,\n", 874 | " 7,\n", 875 | " 7,\n", 876 | " 7,\n", 877 | " 7,\n", 878 | " 7,\n", 879 | " 7,\n", 880 | " 7,\n", 881 | " 7,\n", 882 | " 7,\n", 883 | " 7,\n", 884 | " 7,\n", 885 | " 7,\n", 886 | " 7,\n", 887 | " 7,\n", 888 | " 7,\n", 889 | " 7,\n", 890 | " 7,\n", 891 | " 7,\n", 892 | " 7,\n", 893 | " 7,\n", 894 | " 7,\n", 895 | " 7,\n", 896 | " 7,\n", 897 | " 7,\n", 898 | " 7,\n", 899 | " 7,\n", 900 | " 7,\n", 901 | " 7,\n", 902 | " 7,\n", 903 | " 7,\n", 904 | " 7,\n", 905 | " 7,\n", 906 | " 7,\n", 907 | " 7,\n", 908 | " 7,\n", 909 | " 7,\n", 910 | " 7,\n", 911 | " 7,\n", 912 | " 7,\n", 913 | " 7,\n", 914 | " 7,\n", 915 | " 7,\n", 916 | " 7,\n", 917 | " 7,\n", 918 | " 7,\n", 919 | " 7,\n", 920 | " 7,\n", 921 | " 7,\n", 922 | " 7,\n", 923 | " 7,\n", 924 | " 7,\n", 925 | " 7,\n", 926 | " 7,\n", 927 | " 7,\n", 928 | " 7,\n", 929 | " 7,\n", 930 | " 7,\n", 931 | " 7,\n", 932 | " 7,\n", 933 | " 7,\n", 934 | " 7,\n", 935 | " 7,\n", 936 | " 7,\n", 937 | " 7,\n", 938 | " 7,\n", 939 | " 7,\n", 940 | " 7,\n", 941 | " 7,\n", 942 | " 7,\n", 943 | " 7,\n", 944 | " 7,\n", 945 | " 7,\n", 946 | " 7,\n", 947 | " 7,\n", 948 | " 7,\n", 949 | " 7,\n", 950 | " 7,\n", 951 | " 7,\n", 952 | " 7,\n", 953 | " 7,\n", 954 | " 7,\n", 955 | " 7,\n", 956 | " 7,\n", 957 | " 7,\n", 958 | " 7,\n", 959 | " 7,\n", 960 | " 
7,\n", 961 | " 7,\n", 962 | " 7,\n", 963 | " 7,\n", 964 | " 7,\n", 965 | " 7,\n", 966 | " 6,\n", 967 | " 6,\n", 968 | " 6,\n", 969 | " 6,\n", 970 | " 6,\n", 971 | " 6,\n", 972 | " 6,\n", 973 | " 6,\n", 974 | " 6,\n", 975 | " 6,\n", 976 | " 6,\n", 977 | " 6,\n", 978 | " 6,\n", 979 | " 6,\n", 980 | " 6,\n", 981 | " 6,\n", 982 | " 6,\n", 983 | " 6,\n", 984 | " 6,\n", 985 | " 6,\n", 986 | " 6,\n", 987 | " 6,\n", 988 | " 6,\n", 989 | " 6,\n", 990 | " 6,\n", 991 | " 6,\n", 992 | " 6,\n", 993 | " 6,\n", 994 | " 6,\n", 995 | " 6,\n", 996 | " 6,\n", 997 | " 6,\n", 998 | " 6,\n", 999 | " 6,\n", 1000 | " 6,\n", 1001 | " 6,\n", 1002 | " 6,\n", 1003 | " 6,\n", 1004 | " 6,\n", 1005 | " 6,\n", 1006 | " 6,\n", 1007 | " 6,\n", 1008 | " 6,\n", 1009 | " 6,\n", 1010 | " 6,\n", 1011 | " 6,\n", 1012 | " 6,\n", 1013 | " 6,\n", 1014 | " 6,\n", 1015 | " 6,\n", 1016 | " 6,\n", 1017 | " 6,\n", 1018 | " 6,\n", 1019 | " 6,\n", 1020 | " 6,\n", 1021 | " 6,\n", 1022 | " 6,\n", 1023 | " 6,\n", 1024 | " 6,\n", 1025 | " 6,\n", 1026 | " 6,\n", 1027 | " 6,\n", 1028 | " 6,\n", 1029 | " 6,\n", 1030 | " 6,\n", 1031 | " 6,\n", 1032 | " 6,\n", 1033 | " 6,\n", 1034 | " 6,\n", 1035 | " 6,\n", 1036 | " 6,\n", 1037 | " 6,\n", 1038 | " 6,\n", 1039 | " 6,\n", 1040 | " 6,\n", 1041 | " 6,\n", 1042 | " 6,\n", 1043 | " 6,\n", 1044 | " 6,\n", 1045 | " 6,\n", 1046 | " 6,\n", 1047 | " 6,\n", 1048 | " 6,\n", 1049 | " 6,\n", 1050 | " 6,\n", 1051 | " 6,\n", 1052 | " 6,\n", 1053 | " 6,\n", 1054 | " 6,\n", 1055 | " 6,\n", 1056 | " 6,\n", 1057 | " 6,\n", 1058 | " 6,\n", 1059 | " 6,\n", 1060 | " 6,\n", 1061 | " 6,\n", 1062 | " 6,\n", 1063 | " 6,\n", 1064 | " 6,\n", 1065 | " 6,\n", 1066 | " 6,\n", 1067 | " 6,\n", 1068 | " 6,\n", 1069 | " 6,\n", 1070 | " 6,\n", 1071 | " 6,\n", 1072 | " 6,\n", 1073 | " 6,\n", 1074 | " 6,\n", 1075 | " 6,\n", 1076 | " 6,\n", 1077 | " 6,\n", 1078 | " 6,\n", 1079 | " 6,\n", 1080 | " 6,\n", 1081 | " 6,\n", 1082 | " 6,\n", 1083 | " 6,\n", 1084 | " 6,\n", 1085 | " 6,\n", 1086 | " 6,\n", 1087 | " 6,\n", 
1088 | " 6,\n", 1089 | " 6,\n", 1090 | " 6,\n", 1091 | " 6,\n", 1092 | " 6,\n", 1093 | " 6,\n", 1094 | " 6,\n", 1095 | " 6,\n", 1096 | " 6,\n", 1097 | " 6,\n", 1098 | " 6,\n", 1099 | " 6,\n", 1100 | " 6,\n", 1101 | " 6,\n", 1102 | " 6,\n", 1103 | " 6,\n", 1104 | " 6,\n", 1105 | " 6,\n", 1106 | " 6,\n", 1107 | " 6,\n", 1108 | " 6,\n", 1109 | " 6,\n", 1110 | " 6,\n", 1111 | " 6,\n", 1112 | " 6,\n", 1113 | " 6,\n", 1114 | " 6,\n", 1115 | " 6,\n", 1116 | " 6,\n", 1117 | " 6,\n", 1118 | " 6,\n", 1119 | " 6,\n", 1120 | " 6,\n", 1121 | " 6,\n", 1122 | " 6,\n", 1123 | " 6,\n", 1124 | " 6,\n", 1125 | " 6,\n", 1126 | " 6,\n", 1127 | " 6,\n", 1128 | " 6,\n", 1129 | " 6,\n", 1130 | " 6,\n", 1131 | " 6,\n", 1132 | " 6,\n", 1133 | " 6,\n", 1134 | " 6,\n", 1135 | " 6,\n", 1136 | " 6,\n", 1137 | " 6,\n", 1138 | " 6,\n", 1139 | " 6,\n", 1140 | " 6,\n", 1141 | " 6,\n", 1142 | " 6,\n", 1143 | " 6,\n", 1144 | " 6,\n", 1145 | " 6,\n", 1146 | " 6,\n", 1147 | " 6,\n", 1148 | " 6,\n", 1149 | " 6,\n", 1150 | " 6,\n", 1151 | " 6,\n", 1152 | " 6,\n", 1153 | " 6,\n", 1154 | " 6,\n", 1155 | " 6,\n", 1156 | " 6,\n", 1157 | " 6,\n", 1158 | " 6,\n", 1159 | " 6,\n", 1160 | " 6,\n", 1161 | " 6,\n", 1162 | " 6,\n", 1163 | " 6,\n", 1164 | " 6,\n", 1165 | " 6,\n", 1166 | " 6,\n", 1167 | " 6,\n", 1168 | " 6,\n", 1169 | " 6,\n", 1170 | " 6,\n", 1171 | " 6,\n", 1172 | " 6,\n", 1173 | " 6,\n", 1174 | " 6,\n", 1175 | " 6,\n", 1176 | " 6,\n", 1177 | " 6,\n", 1178 | " 6,\n", 1179 | " 6,\n", 1180 | " 6,\n", 1181 | " 6,\n", 1182 | " 6,\n", 1183 | " 6,\n", 1184 | " 6,\n", 1185 | " 5,\n", 1186 | " 5,\n", 1187 | " 5,\n", 1188 | " 5,\n", 1189 | " 5,\n", 1190 | " 5,\n", 1191 | " 5,\n", 1192 | " 5,\n", 1193 | " 5,\n", 1194 | " 5,\n", 1195 | " 5,\n", 1196 | " 5,\n", 1197 | " 5,\n", 1198 | " 5,\n", 1199 | " 5,\n", 1200 | " 5,\n", 1201 | " 5,\n", 1202 | " 5,\n", 1203 | " 5,\n", 1204 | " 5,\n", 1205 | " 5,\n", 1206 | " 5,\n", 1207 | " 5,\n", 1208 | " 5,\n", 1209 | " 5,\n", 1210 | " 5,\n", 1211 | " 5,\n", 1212 | " 5,\n", 
1213 | " 5,\n", 1214 | " 5,\n", 1215 | " 5,\n", 1216 | " 5,\n", 1217 | " 5,\n", 1218 | " 5,\n", 1219 | " 5,\n", 1220 | " 5,\n", 1221 | " 5,\n", 1222 | " 5,\n", 1223 | " 5,\n", 1224 | " 5,\n", 1225 | " 5,\n", 1226 | " 5,\n", 1227 | " 5,\n", 1228 | " 5,\n", 1229 | " 5,\n", 1230 | " 5,\n", 1231 | " 5,\n", 1232 | " 5,\n", 1233 | " 5,\n", 1234 | " 5,\n", 1235 | " 5,\n", 1236 | " 5,\n", 1237 | " 5,\n", 1238 | " 5,\n", 1239 | " 5,\n", 1240 | " 5,\n", 1241 | " 5,\n", 1242 | " 5,\n", 1243 | " 5,\n", 1244 | " 5,\n", 1245 | " 5,\n", 1246 | " 5,\n", 1247 | " 5,\n", 1248 | " 5,\n", 1249 | " 5,\n", 1250 | " 5,\n", 1251 | " 5,\n", 1252 | " 5,\n", 1253 | " 5,\n", 1254 | " 5,\n", 1255 | " 5,\n", 1256 | " 5,\n", 1257 | " 5,\n", 1258 | " 5,\n", 1259 | " 5,\n", 1260 | " 5,\n", 1261 | " 5,\n", 1262 | " 5,\n", 1263 | " 5,\n", 1264 | " 5,\n", 1265 | " 5,\n", 1266 | " 5,\n", 1267 | " 5,\n", 1268 | " 5,\n", 1269 | " 5,\n", 1270 | " 5,\n", 1271 | " 5,\n", 1272 | " 5,\n", 1273 | " 5,\n", 1274 | " 5,\n", 1275 | " 5,\n", 1276 | " 5,\n", 1277 | " 5,\n", 1278 | " 5,\n", 1279 | " 5,\n", 1280 | " 5,\n", 1281 | " 5,\n", 1282 | " 5,\n", 1283 | " 5,\n", 1284 | " 5,\n", 1285 | " 5,\n", 1286 | " 5,\n", 1287 | " 5,\n", 1288 | " 5,\n", 1289 | " 5,\n", 1290 | " 5,\n", 1291 | " 5,\n", 1292 | " 5,\n", 1293 | " 5,\n", 1294 | " 5,\n", 1295 | " 5,\n", 1296 | " 5,\n", 1297 | " 5,\n", 1298 | " 5,\n", 1299 | " 5,\n", 1300 | " 5,\n", 1301 | " 5,\n", 1302 | " 5,\n", 1303 | " 5,\n", 1304 | " 5,\n", 1305 | " 5,\n", 1306 | " 5,\n", 1307 | " 5,\n", 1308 | " 5,\n", 1309 | " 5,\n", 1310 | " 5,\n", 1311 | " 5,\n", 1312 | " 5,\n", 1313 | " 5,\n", 1314 | " 5,\n", 1315 | " 5,\n", 1316 | " 5,\n", 1317 | " 5,\n", 1318 | " 5,\n", 1319 | " 5,\n", 1320 | " 5,\n", 1321 | " 5,\n", 1322 | " 5,\n", 1323 | " 5,\n", 1324 | " 5,\n", 1325 | " 5,\n", 1326 | " 5,\n", 1327 | " 5,\n", 1328 | " 5,\n", 1329 | " 5,\n", 1330 | " 5,\n", 1331 | " 5,\n", 1332 | " 5,\n", 1333 | " 5,\n", 1334 | " 5,\n", 1335 | " 5,\n", 1336 | " 5,\n", 1337 | " 5,\n", 
1338 | " 5,\n", 1339 | " 5,\n", 1340 | " 5,\n", 1341 | " 5,\n", 1342 | " 5,\n", 1343 | " 5,\n", 1344 | " 5,\n", 1345 | " 5,\n", 1346 | " 5,\n", 1347 | " 5,\n", 1348 | " 5,\n", 1349 | " 5,\n", 1350 | " 5,\n", 1351 | " 5,\n", 1352 | " 5,\n", 1353 | " 5,\n", 1354 | " 5,\n", 1355 | " 5,\n", 1356 | " 5,\n", 1357 | " 5,\n", 1358 | " 5,\n", 1359 | " 5,\n", 1360 | " 5,\n", 1361 | " 5,\n", 1362 | " 5,\n", 1363 | " 5,\n", 1364 | " 5,\n", 1365 | " 5,\n", 1366 | " 5,\n", 1367 | " 5,\n", 1368 | " 5,\n", 1369 | " 5,\n", 1370 | " 5,\n", 1371 | " 5,\n", 1372 | " 5,\n", 1373 | " 5,\n", 1374 | " 5,\n", 1375 | " 5,\n", 1376 | " 5,\n", 1377 | " 5,\n", 1378 | " 5,\n", 1379 | " 5,\n", 1380 | " 5,\n", 1381 | " 5,\n", 1382 | " 5,\n", 1383 | " 5,\n", 1384 | " 5,\n", 1385 | " 5,\n", 1386 | " 5,\n", 1387 | " 5,\n", 1388 | " 5,\n", 1389 | " 5,\n", 1390 | " 5,\n", 1391 | " 5,\n", 1392 | " 5,\n", 1393 | " 5,\n", 1394 | " 5,\n", 1395 | " 5,\n", 1396 | " 5,\n", 1397 | " 5,\n", 1398 | " 5,\n", 1399 | " 5,\n", 1400 | " 5,\n", 1401 | " 5,\n", 1402 | " 5,\n", 1403 | " 5,\n", 1404 | " 5,\n", 1405 | " 5,\n", 1406 | " 5,\n", 1407 | " 5,\n", 1408 | " 5,\n", 1409 | " 5,\n", 1410 | " 5,\n", 1411 | " 5,\n", 1412 | " 5,\n", 1413 | " 5,\n", 1414 | " 5,\n", 1415 | " 5,\n", 1416 | " 5,\n", 1417 | " 5,\n", 1418 | " 5,\n", 1419 | " 5,\n", 1420 | " 5,\n", 1421 | " 5,\n", 1422 | " 5,\n", 1423 | " 5,\n", 1424 | " 5,\n", 1425 | " 5,\n", 1426 | " 5,\n", 1427 | " 5,\n", 1428 | " 5,\n", 1429 | " 5,\n", 1430 | " 5,\n", 1431 | " 5,\n", 1432 | " 5,\n", 1433 | " 5,\n", 1434 | " 5,\n", 1435 | " 5,\n", 1436 | " 5,\n", 1437 | " 5,\n", 1438 | " 5,\n", 1439 | " 5,\n", 1440 | " 5,\n", 1441 | " 5,\n", 1442 | " 5,\n", 1443 | " 5,\n", 1444 | " 5,\n", 1445 | " 5,\n", 1446 | " 5,\n", 1447 | " 5,\n", 1448 | " 5,\n", 1449 | " 5,\n", 1450 | " 5,\n", 1451 | " 5,\n", 1452 | " 5,\n", 1453 | " 5,\n", 1454 | " 5,\n", 1455 | " 5,\n", 1456 | " 5,\n", 1457 | " 5,\n", 1458 | " 5,\n", 1459 | " 5,\n", 1460 | " 5,\n", 1461 | " 5,\n", 1462 | " 5,\n", 
1463 | " 5,\n", 1464 | " 5,\n", 1465 | " 5,\n", 1466 | " 5,\n", 1467 | " 5,\n", 1468 | " 5,\n", 1469 | " 5,\n", 1470 | " 5,\n", 1471 | " 5,\n", 1472 | " 5,\n", 1473 | " 5,\n", 1474 | " 5,\n", 1475 | " 5,\n", 1476 | " 5,\n", 1477 | " 5,\n", 1478 | " 4,\n", 1479 | " 4,\n", 1480 | " 4,\n", 1481 | " 4,\n", 1482 | " 4,\n", 1483 | " 4,\n", 1484 | " 4,\n", 1485 | " 4,\n", 1486 | " 4,\n", 1487 | " 4,\n", 1488 | " 4,\n", 1489 | " 4,\n", 1490 | " 4,\n", 1491 | " 4,\n", 1492 | " 4,\n", 1493 | " 4,\n", 1494 | " 4,\n", 1495 | " 4,\n", 1496 | " 4,\n", 1497 | " 4,\n", 1498 | " 4,\n", 1499 | " 4,\n", 1500 | " 4,\n", 1501 | " 4,\n", 1502 | " 4,\n", 1503 | " 4,\n", 1504 | " 4,\n", 1505 | " 4,\n", 1506 | " 4,\n", 1507 | " 4,\n", 1508 | " 4,\n", 1509 | " 4,\n", 1510 | " 4,\n", 1511 | " 4,\n", 1512 | " 4,\n", 1513 | " 4,\n", 1514 | " 4,\n", 1515 | " 4,\n", 1516 | " 4,\n", 1517 | " 4,\n", 1518 | " 4,\n", 1519 | " 4,\n", 1520 | " 4,\n", 1521 | " 4,\n", 1522 | " 4,\n", 1523 | " 4,\n", 1524 | " 4,\n", 1525 | " 4,\n", 1526 | " 4,\n", 1527 | " 4,\n", 1528 | " 4,\n", 1529 | " 4,\n", 1530 | " 4,\n", 1531 | " 4,\n", 1532 | " 4,\n", 1533 | " 4,\n", 1534 | " 4,\n", 1535 | " 4,\n", 1536 | " 4,\n", 1537 | " 4,\n", 1538 | " 4,\n", 1539 | " 4,\n", 1540 | " 4,\n", 1541 | " 4,\n", 1542 | " 4,\n", 1543 | " 4,\n", 1544 | " 4,\n", 1545 | " 4,\n", 1546 | " 4,\n", 1547 | " 4,\n", 1548 | " 4,\n", 1549 | " 4,\n", 1550 | " 4,\n", 1551 | " 4,\n", 1552 | " 4,\n", 1553 | " 4,\n", 1554 | " 4,\n", 1555 | " 4,\n", 1556 | " 4,\n", 1557 | " 4,\n", 1558 | " 4,\n", 1559 | " 4,\n", 1560 | " 4,\n", 1561 | " 4,\n", 1562 | " 4,\n", 1563 | " 4,\n", 1564 | " 4,\n", 1565 | " 4,\n", 1566 | " 4,\n", 1567 | " 4,\n", 1568 | " 4,\n", 1569 | " 4,\n", 1570 | " 4,\n", 1571 | " 4,\n", 1572 | " 4,\n", 1573 | " 4,\n", 1574 | " 4,\n", 1575 | " 4,\n", 1576 | " 4,\n", 1577 | " 4,\n", 1578 | " 4,\n", 1579 | " 4,\n", 1580 | " 4,\n", 1581 | " 4,\n", 1582 | " 4,\n", 1583 | " 4,\n", 1584 | " 4,\n", 1585 | " 4,\n", 1586 | " 4,\n", 1587 | " 4,\n", 
1588 | " 4,\n", 1589 | " 4,\n", 1590 | " 4,\n", 1591 | " 4,\n", 1592 | " 4,\n", 1593 | " 4,\n", 1594 | " 4,\n", 1595 | " 4,\n", 1596 | " 4,\n", 1597 | " 4,\n", 1598 | " 4,\n", 1599 | " 4,\n", 1600 | " 4,\n", 1601 | " 4,\n", 1602 | " 4,\n", 1603 | " 4,\n", 1604 | " 4,\n", 1605 | " 4,\n", 1606 | " 4,\n", 1607 | " 4,\n", 1608 | " 4,\n", 1609 | " 4,\n", 1610 | " 4,\n", 1611 | " 4,\n", 1612 | " 4,\n", 1613 | " 4,\n", 1614 | " 4,\n", 1615 | " 4,\n", 1616 | " 4,\n", 1617 | " 4,\n", 1618 | " 4,\n", 1619 | " 4,\n", 1620 | " 4,\n", 1621 | " 4,\n", 1622 | " 4,\n", 1623 | " 4,\n", 1624 | " 4,\n", 1625 | " 4,\n", 1626 | " 4,\n", 1627 | " 4,\n", 1628 | " 4,\n", 1629 | " 4,\n", 1630 | " 4,\n", 1631 | " 4,\n", 1632 | " 4,\n", 1633 | " 4,\n", 1634 | " 4,\n", 1635 | " 4,\n", 1636 | " 4,\n", 1637 | " 4,\n", 1638 | " 4,\n", 1639 | " 4,\n", 1640 | " 4,\n", 1641 | " 4,\n", 1642 | " 4,\n", 1643 | " 4,\n", 1644 | " 4,\n", 1645 | " 4,\n", 1646 | " 4,\n", 1647 | " 4,\n", 1648 | " 4,\n", 1649 | " 4,\n", 1650 | " 4,\n", 1651 | " 4,\n", 1652 | " 4,\n", 1653 | " 4,\n", 1654 | " ...]" 1655 | ] 1656 | }, 1657 | "execution_count": 15, 1658 | "metadata": {}, 1659 | "output_type": "execute_result" 1660 | } 1661 | ], 1662 | "source": [ 1663 | "sorted(meta.sum(1), reverse=True)" 1664 | ] 1665 | }, 1666 | { 1667 | "cell_type": "code", 1668 | "execution_count": 7, 1669 | "id": "cfabb9ee", 1670 | "metadata": {}, 1671 | "outputs": [ 1672 | { 1673 | "data": { 1674 | "text/html": [ 1675 | "
[HTML rendering of the DataFrame: 4191 rows × 9517 columns; same table as the text/plain output]
" 1985 | ], 1986 | "text/plain": [ 1987 | " AACCAACGACGAGGACAAGCGATG AACCAACGACTTTCAAGCGATAGC \\\n", 1988 | "ANGPT1 6 0 \n", 1989 | "ANKRD46 6 1 \n", 1990 | "ASAP1 1 0 \n", 1991 | "ATAD2 0 0 \n", 1992 | "ATP6V1C1 5 1 \n", 1993 | "... ... ... \n", 1994 | "WDYHV1 2 1 \n", 1995 | "YWHAZ 129 34 \n", 1996 | "ZFPM2 0 0 \n", 1997 | "ZHX1 8 1 \n", 1998 | "ZNF706 74 11 \n", 1999 | "\n", 2000 | " AACCAACGAGAGCGACAAGCGATG AACCAACGCCTAGCTCAAGCGATG \\\n", 2001 | "ANGPT1 3 2 \n", 2002 | "ANKRD46 1 1 \n", 2003 | "ASAP1 3 1 \n", 2004 | "ATAD2 0 0 \n", 2005 | "ATP6V1C1 4 0 \n", 2006 | "... ... ... \n", 2007 | "WDYHV1 1 0 \n", 2008 | "YWHAZ 107 19 \n", 2009 | "ZFPM2 1 0 \n", 2010 | "ZHX1 14 5 \n", 2011 | "ZNF706 38 23 \n", 2012 | "\n", 2013 | " AACCAACGTGGGCGTCAAGCGATG AGGCAGAACCGTTCAAGCGATCCC \\\n", 2014 | "ANGPT1 3 5 \n", 2015 | "ANKRD46 2 4 \n", 2016 | "ASAP1 0 1 \n", 2017 | "ATAD2 1 1 \n", 2018 | "ATP6V1C1 4 6 \n", 2019 | "... ... ... \n", 2020 | "WDYHV1 2 1 \n", 2021 | "YWHAZ 112 98 \n", 2022 | "ZFPM2 0 3 \n", 2023 | "ZHX1 6 6 \n", 2024 | "ZNF706 62 26 \n", 2025 | "\n", 2026 | " AGGCAGAACCGTTCAAGCGATTCT AGGCAGAACTGATAGCAAGCGATG \\\n", 2027 | "ANGPT1 3 4 \n", 2028 | "ANKRD46 4 1 \n", 2029 | "ASAP1 1 1 \n", 2030 | "ATAD2 0 2 \n", 2031 | "ATP6V1C1 5 8 \n", 2032 | "... ... ... \n", 2033 | "WDYHV1 1 4 \n", 2034 | "YWHAZ 101 165 \n", 2035 | "ZFPM2 3 3 \n", 2036 | "ZHX1 8 14 \n", 2037 | "ZNF706 37 60 \n", 2038 | "\n", 2039 | " AGGCAGAAGCTGCTTCAAGCGATG AGGCAGAAGTCCTCAAGCGATTCT ... \\\n", 2040 | "ANGPT1 2 0 ... \n", 2041 | "ANKRD46 1 3 ... \n", 2042 | "ASAP1 1 2 ... \n", 2043 | "ATAD2 0 0 ... \n", 2044 | "ATP6V1C1 5 4 ... \n", 2045 | "... ... ... ... \n", 2046 | "WDYHV1 0 3 ... \n", 2047 | "YWHAZ 68 86 ... \n", 2048 | "ZFPM2 0 2 ... \n", 2049 | "ZHX1 3 5 ... \n", 2050 | "ZNF706 44 52 ... 
\n", 2051 | "\n", 2052 | " TAGGCATGGGAGCAACAAGCGATG TAGGCATGGTCCTCAAGCGATATA \\\n", 2053 | "ANGPT1 2 5 \n", 2054 | "ANKRD46 1 3 \n", 2055 | "ASAP1 1 1 \n", 2056 | "ATAD2 0 1 \n", 2057 | "ATP6V1C1 3 2 \n", 2058 | "... ... ... \n", 2059 | "WDYHV1 1 6 \n", 2060 | "YWHAZ 93 119 \n", 2061 | "ZFPM2 0 0 \n", 2062 | "ZHX1 10 8 \n", 2063 | "ZNF706 52 66 \n", 2064 | "\n", 2065 | " TAGGCATGTTTGCGCCAAGCGATG TCCTGAGCAGCAGCCCAAGCGATG \\\n", 2066 | "ANGPT1 5 2 \n", 2067 | "ANKRD46 5 2 \n", 2068 | "ASAP1 0 0 \n", 2069 | "ATAD2 0 1 \n", 2070 | "ATP6V1C1 7 3 \n", 2071 | "... ... ... \n", 2072 | "WDYHV1 5 4 \n", 2073 | "YWHAZ 128 95 \n", 2074 | "ZFPM2 4 4 \n", 2075 | "ZHX1 10 2 \n", 2076 | "ZNF706 53 13 \n", 2077 | "\n", 2078 | " TCCTGAGCATGGGAGCAAGCGATG TCCTGAGCGCATACAAGCGATATA \\\n", 2079 | "ANGPT1 0 2 \n", 2080 | "ANKRD46 1 1 \n", 2081 | "ASAP1 0 1 \n", 2082 | "ATAD2 0 1 \n", 2083 | "ATP6V1C1 0 3 \n", 2084 | "... ... ... \n", 2085 | "WDYHV1 2 2 \n", 2086 | "YWHAZ 69 115 \n", 2087 | "ZFPM2 0 0 \n", 2088 | "ZHX1 3 9 \n", 2089 | "ZNF706 33 45 \n", 2090 | "\n", 2091 | " TCCTGAGCGTGTTAGCAAGCGATG TGATTGACACGGGCTCAAGCGATG \\\n", 2092 | "ANGPT1 0 4 \n", 2093 | "ANKRD46 0 3 \n", 2094 | "ASAP1 2 1 \n", 2095 | "ATAD2 1 1 \n", 2096 | "ATP6V1C1 3 4 \n", 2097 | "... ... ... \n", 2098 | "WDYHV1 0 4 \n", 2099 | "YWHAZ 92 94 \n", 2100 | "ZFPM2 0 4 \n", 2101 | "ZHX1 4 10 \n", 2102 | "ZNF706 38 33 \n", 2103 | "\n", 2104 | " TGATTGACGACTACAAGCGATCCC TGATTGACTTCTACAAGCGATGAC \n", 2105 | "ANGPT1 0 3 \n", 2106 | "ANKRD46 1 1 \n", 2107 | "ASAP1 0 0 \n", 2108 | "ATAD2 0 0 \n", 2109 | "ATP6V1C1 1 2 \n", 2110 | "... ... ... 
\n", 2111 | "WDYHV1 0 1 \n", 2112 | "YWHAZ 37 39 \n", 2113 | "ZFPM2 0 0 \n", 2114 | "ZHX1 5 3 \n", 2115 | "ZNF706 15 4 \n", 2116 | "\n", 2117 | "[4191 rows x 9517 columns]" 2118 | ] 2119 | }, 2120 | "execution_count": 7, 2121 | "metadata": {}, 2122 | "output_type": "execute_result" 2123 | } 2124 | ], 2125 | "source": [ 2126 | "pd.read_csv('GSM4012703_TASC_SCREEN_chr8_sample10.counts.csv.gz', index_col=0)" 2127 | ] 2128 | }, 2129 | { 2130 | "cell_type": "code", 2131 | "execution_count": null, 2132 | "id": "396a165f", 2133 | "metadata": {}, 2134 | "outputs": [], 2135 | "source": [ 2136 | "import scanpy as sc\n", 2137 | "sc.set_figure_params(dpi=100, frameon=False)" 2138 | ] 2139 | }, 2140 | { 2141 | "cell_type": "code", 2142 | "execution_count": null, 2143 | "id": "864ca45b", 2144 | "metadata": {}, 2145 | "outputs": [], 2146 | "source": [ 2147 | "sc.pp.neighbors(adata)\n", 2148 | "sc.tl.umap(adata)" 2149 | ] 2150 | } 2151 | ], 2152 | "metadata": { 2153 | "kernelspec": { 2154 | "display_name": "Python 3 (ipykernel)", 2155 | "language": "python", 2156 | "name": "python3" 2157 | }, 2158 | "language_info": { 2159 | "codemirror_mode": { 2160 | "name": "ipython", 2161 | "version": 3 2162 | }, 2163 | "file_extension": ".py", 2164 | "mimetype": "text/x-python", 2165 | "name": "python", 2166 | "nbconvert_exporter": "python", 2167 | "pygments_lexer": "ipython3", 2168 | "version": "3.7.0" 2169 | } 2170 | }, 2171 | "nbformat": 4, 2172 | "nbformat_minor": 5 2173 | } 2174 | -------------------------------------------------------------------------------- /datasets/Srivatsan_2019_sciplex2_curation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "498416da", 6 | "metadata": {}, 7 | "source": [ 8 | "Accession: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4150377" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "4abe7004", 15 | 
"metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "import gzip\n", 19 | "import os\n", 20 | "\n", 21 | "import numpy as np\n", 22 | "import pandas as pd\n", 23 | "from anndata import AnnData\n", 24 | "\n", 25 | "from utils import download_binary_file" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "id": "b5747389", 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "def download_srivatsan_2019_sciplex2(output_path: str) -> None:\n", 36 | " \"\"\"\n", 37 | " Download Srivatsan et al. 2019 sciplex-2 data from the hosting URLs.\n", 38 | "\n", 39 | " Args:\n", 40 | " ----\n", 41 | " output_path: Output path to store the downloaded and unzipped\n", 42 | " directories.\n", 43 | "\n", 44 | " Returns\n", 45 | " -------\n", 46 | " None. File directories are downloaded to output_path.\n", 47 | " \"\"\"\n", 48 | "\n", 49 | " count_matrix_url = (\n", 50 | " \"https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4150377&format=file&file=\"\n", 51 | " \"GSM4150377_sciPlex2_A549_Transcription_Modulators_UMI.count.matrix.gz\"\n", 52 | " )\n", 53 | " count_matrix_filename = os.path.join(output_path, count_matrix_url.split(\"=\")[-1])\n", 54 | " download_binary_file(count_matrix_url, count_matrix_filename)\n", 55 | "\n", 56 | " cell_metadata_url = (\n", 57 | " \"https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4150377&format=file\"\n", 58 | " \"&file=GSM4150377_sciPlex2_pData.txt.gz\"\n", 59 | " )\n", 60 | " cell_metadata_filename = os.path.join(output_path, cell_metadata_url.split(\"=\")[-1])\n", 61 | " download_binary_file(cell_metadata_url, cell_metadata_filename)\n", 62 | "\n", 63 | " gene_metadata_url = (\n", 64 | " \"https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM4150377&format=file&file=\"\n", 65 | " \"GSM4150377_sciPlex2_A549_Transcription_Modulators_gene.annotations.txt.gz\"\n", 66 | " )\n", 67 | " cell_metadata_filename = os.path.join(output_path, gene_metadata_url.split(\"=\")[-1])\n", 68 | " 
download_binary_file(gene_metadata_url, cell_metadata_filename)\n", 69 | "\n", 70 | "\n", 71 | "def read_srivatsan_2019_sciplex2(file_directory: str) -> pd.DataFrame:\n", 72 | " \"\"\"\n", 73 | " Read the sciplex-2 expression data for Srivatsan et al. 2019 in the given directory.\n", 74 | "\n", 75 | " Args:\n", 76 | " ----\n", 77 | " file_directory: Directory containing Srivatsan et al. 2019 data.\n", 78 | "\n", 79 | " Returns\n", 80 | " -------\n", 81 | " A data frame containing single-cell gene expression counts. The count\n", 82 | " matrix is stored in triplet format. I.e., each row of the data frame\n", 83 | " has the format (row, column, count) stored in columns (i, j, x) respectively.\n", 84 | " \"\"\"\n", 85 | "\n", 86 | " with gzip.open(\n", 87 | " os.path.join(\n", 88 | " file_directory,\n", 89 | " \"GSM4150377_sciPlex2_A549_Transcription_Modulators_UMI.count.matrix.gz\",\n", 90 | " ),\n", 91 | " \"rb\",\n", 92 | " ) as f:\n", 93 | " df = pd.read_csv(f, sep=\"\\t\", header=None, names=[\"i\", \"j\", \"x\"])\n", 94 | "\n", 95 | " return df" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 3, 101 | "id": "d19b7de5", 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "name": "stdout", 106 | "output_type": "stream", 107 | "text": [ 108 | "File ./srivatsan_2019_sciplex2/GSM4150377_sciPlex2_A549_Transcription_Modulators_UMI.count.matrix.gz already exists. No files downloaded to overwrite the existing file.\n", 109 | "File ./srivatsan_2019_sciplex2/GSM4150377_sciPlex2_pData.txt.gz already exists. No files downloaded to overwrite the existing file.\n", 110 | "File ./srivatsan_2019_sciplex2/GSM4150377_sciPlex2_A549_Transcription_Modulators_gene.annotations.txt.gz already exists. No files downloaded to overwrite the existing file.\n" 111 | ] 112 | }, 113 | { 114 | "name": "stderr", 115 | "output_type": "stream", 116 | "text": [ 117 | "/tmp/ipykernel_22607/873333527.py:41: FutureWarning: X.dtype being converted to np.float32 from float64. 
In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.\n", 118 | " adata = AnnData(\n", 119 | "/tmp/ipykernel_22607/873333527.py:51: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.\n", 120 | " adata.obs[\"drug\"] = [\n", 121 | "/tmp/ipykernel_22607/873333527.py:60: SettingWithCopyWarning: \n", 122 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 123 | "\n", 124 | "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", 125 | " adata.obs[\"drug\"][adata.obs[\"dose\"] == 0.0] = \"Vehicle\"\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "download_path = \"./srivatsan_2019_sciplex2\"\n", 131 | "\n", 132 | "os.makedirs(download_path, exist_ok=True)\n", 133 | "download_srivatsan_2019_sciplex2(download_path)\n", 134 | "df = read_srivatsan_2019_sciplex2(download_path)\n", 135 | "\n", 136 | "# The Srivatsan count data is in a sparse triplet format represented\n", 137 | "# by three columns 'i', 'j', and 'x'. 
'i' refers to a row number, 'j' refers to\n", 138 | "# a column number, and 'x' refers to a count value.\n", 139 | "counts = df[\"x\"]\n", 140 | "rows = (\n", 141 | " df[\"i\"] - 1\n", 142 | ") # Indices are 1-based in the file --> switch to 0-based indexing\n", 143 | "cols = df[\"j\"] - 1\n", 144 | "\n", 145 | "# Convert the triplets into a dense numpy array\n", 146 | "count_matrix = np.zeros([max(rows) + 1, max(cols) + 1])\n", 147 | "count_matrix[rows, cols] = counts\n", 148 | "\n", 149 | "# Switch matrix from gene rows and cell columns to cell rows and gene columns\n", 150 | "count_matrix = count_matrix.T\n", 151 | "\n", 152 | "cell_metadata = pd.read_csv(\n", 153 | " os.path.join(\n", 154 | " download_path,\n", 155 | " \"GSM4150377_sciPlex2_pData.txt.gz\",\n", 156 | " ),\n", 157 | " sep=\" \",\n", 158 | ")\n", 159 | "\n", 160 | "gene_metadata = pd.read_csv(\n", 161 | " os.path.join(\n", 162 | " download_path,\n", 163 | " \"GSM4150377_sciPlex2_A549_Transcription_Modulators_gene.annotations.txt.gz\",\n", 164 | " ),\n", 165 | " sep=\"\\t\",\n", 166 | " header=None,\n", 167 | " index_col=0,\n", 168 | ")\n", 169 | "\n", 170 | "adata = AnnData(\n", 171 | " X=count_matrix, obs=cell_metadata, var=pd.DataFrame(index=gene_metadata.index)\n", 172 | ")\n", 173 | "\n", 174 | "# Index needs string names or else the write_h5ad call will throw an error\n", 175 | "adata.var.index.name = \"gene_id\"\n", 176 | "\n", 177 | "# Treatment information is contained in the `top_oligo` column\n", 178 | "# with the format <drug>_<dose>. Copy the subset so that it is not a view.\n", 179 | "adata = adata[adata.obs[\"top_oligo\"].notna()].copy()\n", 180 | "adata.obs[\"drug\"] = [\n", 181 | " treatment.split(\"_\")[0] for treatment in adata.obs[\"top_oligo\"]\n", 182 | "]\n", 183 | "adata.obs[\"dose\"] = [\n", 184 | " treatment.split(\"_\")[1] for treatment in adata.obs[\"top_oligo\"]\n", 185 | "]\n", 186 | "adata.obs[\"dose\"] = pd.to_numeric(adata.obs[\"dose\"], errors=\"coerce\")\n", 187 | "\n", 188 | "# If a drug is listed with dosage of 0, 
the cell was only exposed to vehicle control\n", 189 | "adata.obs.loc[adata.obs[\"dose\"] == 0.0, \"drug\"] = \"Vehicle\"" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "id": "13f04aba", 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "adata.write_h5ad(\"Srivatsan_2019_sciplex2_raw.h5ad\")" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": null, 205 | "id": "f87d3ed4-e87d-4608-bec9-a9e745d483c1", 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [] 209 | } 210 | ], 211 | "metadata": { 212 | "kernelspec": { 213 | "display_name": "Python 3", 214 | "language": "python", 215 | "name": "python3" 216 | }, 217 | "language_info": { 218 | "codemirror_mode": { 219 | "name": "ipython", 220 | "version": 3 221 | }, 222 | "file_extension": ".py", 223 | "mimetype": "text/x-python", 224 | "name": "python", 225 | "nbconvert_exporter": "python", 226 | "pygments_lexer": "ipython3", 227 | "version": "3.9.12" 228 | } 229 | }, 230 | "nbformat": 4, 231 | "nbformat_minor": 5 232 | } 233 | -------------------------------------------------------------------------------- /datasets/block0_load.py: -------------------------------------------------------------------------------- 1 | author_year = #e.g. 
You_2022 2 | is_counts = # True/False 3 | var_genes = # field in .var or None 4 | doi = # of paper source 5 | 6 | import numpy as np 7 | import pandas as pd 8 | import scanpy as sc 9 | sc.set_figure_params(dpi=100, frameon=False) 10 | sc.logging.print_header() 11 | 12 | # verify 13 | assert doi in pd.read_csv('../personal.csv').DOI.values 14 | 15 | adata = sc.read(f'{author_year}_raw.h5ad') 16 | adata 17 | -------------------------------------------------------------------------------- /datasets/block1_init.py: -------------------------------------------------------------------------------- 1 | # metalabels 2 | adata.uns['preprocessing_nb_link'] = f'https://nbviewer.org/github/theislab/sc-pert/blob/main/datasets/{author_year}_curation.ipynb' 3 | adata.uns['doi'] = doi 4 | display(adata.obs.describe(include='all').T.head(20)) 5 | 6 | # use gene symbols as gene names 7 | if var_genes: 8 | adata.var[var_genes] = adata.var[var_genes].astype(str) 9 | adata.var = adata.var.reset_index().set_index(var_genes) 10 | print(adata.var_names) 11 | 12 | # filtering and processing 13 | sc.pp.filter_cells(adata, min_genes=200) 14 | sc.pp.filter_genes(adata, min_cells=20) 15 | adata.var_names_make_unique() 16 | 17 | if is_counts: 18 | adata.var['mt'] = adata.var_names.str.startswith('MT-') # annotate the group of mitochondrial genes as 'mt' 19 | sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True) 20 | 21 | sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'], jitter=0.4, multi_panel=True) 22 | sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt') 23 | sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts') 24 | -------------------------------------------------------------------------------- /datasets/block2_process.py: -------------------------------------------------------------------------------- 1 | if is_counts: 2 | adata.layers['counts'] = adata.X.copy() # copy so that in-place normalization cannot alter the stored counts 3 | sc.pp.normalize_total(adata, target_sum=1e4) 4 | 
sc.pp.log1p(adata) 5 | 6 | sc.pl.highest_expr_genes(adata, n_top=20) 7 | sc.pp.highly_variable_genes(adata, n_top_genes=5000, subset=False) 8 | sc.pl.highly_variable_genes(adata) 9 | 10 | # pre-compute plots 11 | sc.tl.pca(adata, use_highly_variable=True) 12 | sc.pp.neighbors(adata) 13 | sc.tl.leiden(adata) 14 | sc.tl.umap(adata) 15 | -------------------------------------------------------------------------------- /datasets/block3_standardize.py: -------------------------------------------------------------------------------- 1 | # the following fields are meant to serve as a template 2 | control = ??? 3 | replace_dict = { 4 | control: 'control', 5 | } 6 | adata.obs['perturbation_name'] = adata.obs[???].replace(replace_dict) 7 | # remember to use + for combinations 8 | adata.obs['perturbation_type'] = # 'small molecule' or 'genetic' 9 | adata.obs['perturbation_value'] = adata.obs[???] 10 | adata.obs['perturbation_unit'] = # 'ug', 'mg', 'hrs', etc. 11 | -------------------------------------------------------------------------------- /datasets/template.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "0f8d8db0", 6 | "metadata": {}, 7 | "source": [ 8 | "Accession: [link]" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": null, 14 | "id": "e9234341", 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "%load block0_load.py" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": null, 24 | "id": "904d339f", 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "%load block1_init.py" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 5, 34 | "id": "72112f55", 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "adata = adata[adata.obs.n_genes_by_counts < 8000, :] # edit\n", 39 | "adata = adata[adata.obs.pct_counts_mt < 50]" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | 
"id": "2eaff683", 45 | "metadata": {}, 46 | "source": [ 47 | "Add additional labels, normalize and add plotting variables." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "id": "aff9cff1", 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "%load block2_process.py" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "id": "df9dc5f9", 63 | "metadata": {}, 64 | "source": [ 65 | "Save." 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "id": "269f5603", 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "sc.write(f'{author_year}.h5ad', adata, compression='gzip')\n", 76 | "adata" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "id": "3706b6a6", 82 | "metadata": {}, 83 | "source": [ 84 | "Standardize perturbation labels and save again." 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "id": "1d741e81", 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "adata = sc.read(f'{author_year}.h5ad')" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "id": "0d070b9b", 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "%load block3_standardize.py" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "id": "fef168c4", 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "sc.write(f'{author_year}.h5ad', adata, compression='gzip')\n", 115 | "adata" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "id": "afe8077b", 121 | "metadata": {}, 122 | "source": [ 123 | "View." 
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "618de1da",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_cols = [c for c in adata.obs.columns if len(adata.obs[c].unique()) < 30 or not hasattr(adata.obs[c], 'cat')]\n",
    "for c in plot_cols:\n",
    "    if adata.obs[c].dtype == 'bool':\n",
    "        adata.obs[c] = adata.obs[c].astype('category')\n",
    "sc.pl.umap(adata, color=plot_cols, wspace=.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6059ff1",
   "metadata": {},
   "source": [
    "Plot the top 30 most common perturbations. We do this separately because many datasets have too many perturbations for `scanpy`'s color palettes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d85afb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.umap(\n",
    "    adata[adata.obs.perturbation_name.isin((adata.obs.perturbation_name.value_counts().index[:30]))],\n",
    "    color='perturbation_name')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

--------------------------------------------------------------------------------
/datasets/utils.py:
--------------------------------------------------------------------------------
import requests
import os

def download_binary_file(
    file_url: str, output_path: str, overwrite: bool = False
) -> None:
    """
    Download binary data
    file from a URL.

    Args:
    ----
        file_url: URL where the file is hosted.
        output_path: Output path for the downloaded file.
        overwrite: Whether to overwrite existing downloaded file.

    Returns
    -------
        None.
    """
    file_exists = os.path.exists(output_path)
    if (not file_exists) or (file_exists and overwrite):
        request = requests.get(file_url)
        # fail loudly on HTTP errors instead of saving an error page as data
        request.raise_for_status()
        with open(output_path, "wb") as f:
            f.write(request.content)
        print(f"Downloaded data from {file_url} at {output_path}")
    else:
        print(
            f"File {output_path} already exists. "
            "No files downloaded to overwrite the existing file."
        )
--------------------------------------------------------------------------------
/readme_body.txt:
--------------------------------------------------------------------------------
# sc-pert - Machine learning for perturbational single-cell omics

*This repository provides a community-maintained summary of models and datasets. It was initially curated for [(Cell Systems, 2021)](https://doi.org/10.1016/j.cels.2021.05.016).*

### External annotations

There are various resources for evaluation of single-cell perturbation models. We discuss five tasks in the publication which can be supported by the following publicly available annotations:

- [GDSC](https://www.cancerrxgene.org/downloads/bulk_download) provides a collection of cell viability measurements for many compounds and cell lines. We provide a [code snippet](https://github.com/theislab/sc-pert/blob/main/resources.py#L4) to conveniently load GDSC-provided z-score compound response rankings per cell line.
- Additional viability data can be obtained from [DepMap's PRISM dataset](https://depmap.org/portal/download/).
- [Therapeutics Data Commons](https://github.com/mims-harvard/TDC) provides access to a number of compound databases as part of their [cheminformatics tasks](https://tdcommons.ai/benchmark/overview/).
(In the same vein, [OpenProblems](https://openproblems.bio/) provides a [framework for tasks in single-cell](https://github.com/openproblems-bio/openproblems/tree/main/openproblems/tasks) which can also support perturbation modeling tasks in a longer-term format than was previously seen in the [DREAM challenges](https://dreamchallenges.org/dream-7-nci-dream-drug-sensitivity-and-drug-synergy-challenge/).)
- [PubChem](https://pubchem.ncbi.nlm.nih.gov/) contains a comprehensive record of compounds ranging from experimental entities to non-proprietary small molecules. It is queryable via [PubChemPy](https://github.com/mcs07/PubChemPy).
- [DrugBank](https://go.drugbank.com/releases/latest) provides annotations for a relatively small number of small molecules in a standardized format.

### Current modeling approaches

We [maintain a list of perturbation-related tools at scrna-tools](https://www.scrna-tools.org/tools?sort=name&cats=Perturbations). Please consider further updating and tagging tools [there](https://github.com/scRNA-tools/scRNA-tools).

For the basis of the table in the article, see this [spreadsheet of a subset of perturbation models](https://docs.google.com/spreadsheets/d/1nqNg0DW1-Om7WtvRS20q-6b28usVRv5czOcxgj83Sgg/) which includes more details.

### Datasets

Below, we curate a [table](https://raw.githubusercontent.com/theislab/sc-pert/main/data_table.csv) of perturbation datasets based on [Svensson *et al.* (2020)](https://doi.org/10.1093/database/baaa073).

We also offer some datasets in a curated `.h5ad` format via the download links in the table below. `raw h5ad` denotes a version of the dataset that has not been filtered, normalized, or standardized.

H5ads denoted as `processed` have an accompanying processing notebook and have all been preprocessed in a similar way.
These datasets have the following standardized fields in `.obs`:
* `perturbation_name` -- Human-readable compound names (International non-proprietary naming where possible) for small molecules and gene names for genetic perturbations.
* `perturbation_type` -- `small molecule` or `genetic`
* `perturbation_value` -- A continuous covariate quantity, such as the dosage concentration or the number of hours since treatment.
* `perturbation_unit` -- Describes `perturbation_value`, such as `'ug'` or `'hrs'`.


--------------------------------------------------------------------------------
/resources.py:
--------------------------------------------------------------------------------
import os
import pandas as pd

def GDSC_response(cell_line, filename=None):
    """Downloads and returns the responsiveness ranking of compounds
    for a specific cell line from GDSC1 and GDSC2 as dataframes.

    Descriptions of the column names can be found
    here: ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC_Fitted_Data_Description.pdf

    Params
    ------
    cell_line : str
        Cell line identifier, e.g. A549.
    filename : str (default: None)
        The downloaded files will be saved as filename_GDSC1/2.csv.
    """
    if not os.path.exists('Cell_Lines_Details.xlsx'):
        os.system('wget ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/Cell_Lines_Details.xlsx')
    annot = pd.read_excel('Cell_Lines_Details.xlsx')
    if cell_line not in annot['Sample Name'].values:
        raise ValueError(f'{cell_line} not present in GDSC')
    cell_id = annot[annot['Sample Name'] == cell_line]['COSMIC identifier'].values[0].astype('int').astype('str')
    if filename is None:
        filename = cell_line
    os.system(f"wget -O {filename}_GDSC1.csv 'https://www.cancerrxgene.org/api/cellline/download_zscore?id={cell_id}&screening_set=GDSC1&export=csv'")
    os.system(f"wget -O {filename}_GDSC2.csv 'https://www.cancerrxgene.org/api/cellline/download_zscore?id={cell_id}&screening_set=GDSC2&export=csv'")
    return pd.read_csv(f'{filename}_GDSC1.csv'), pd.read_csv(f'{filename}_GDSC2.csv')

--------------------------------------------------------------------------------
/update.py:
--------------------------------------------------------------------------------
import os
import re
import glob
import pandas as pd

### data updates ###

# single-cell database
os.system('rm data.tsv')
os.system('rm personal.csv')
os.system('wget http://www.nxn.se/single-cell-studies/data.tsv')
os.system("wget --no-check-certificate -O personal.csv 'https://docs.google.com/spreadsheets/d/14awt-bCOnj4ca2uoKzuTNuKtUKXcoN82_-oGg2f1Ros/export?gid=1438063781&format=csv'")

# GDSC
#os.system('wget -P /gdsc/ ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/current_release/GDSC1_fitted_dose_response_25Feb20.xlsx')

path = 'datasets/'
processed_datasets = [file for file in glob.glob(f'{path}*.ipynb') if 'curation' not in file]
curated_datasets = [file for file in glob.glob(f'{path}*_curation.ipynb')]

personal_rec = pd.read_csv('personal.csv')
personal_rec['author_year'] = \
    personal_rec['Author'] + '_' + personal_rec['Year'].astype(str)
dois = personal_rec.DOI.values

df = pd.read_csv('data.tsv', sep='\t')
df = df[df.DOI.isin(dois)]
df['Date'] = df['Date'].astype('object')  # ensure proper display
not_in_db = list(set(dois) - set(df.DOI.values))

# create placeholders for dois not in the sc studies DB
pdf = personal_rec[personal_rec.DOI.isin(not_in_db)]
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df = pd.concat([df, pdf[pdf.columns.intersection(df.columns)]])

# add additional columns of info
add_cols = personal_rec[['DOI', 'Treatment', '# perturbations', '# cell types', '# doses', '# timepoints', 'Author', 'Year']]
n_cols = 1 - add_cols.shape[1]  # default to negative
df = df.merge(
    add_cols,
    left_on='DOI',
    right_on='DOI'
)
df = df[list(df.columns[:5]) +
        list(df.columns[n_cols:]) + list(df.columns[5:n_cols])]  # up to `Title`

# convert DOIs to links in markdown
links = []
for shorthand, link in df[['Shorthand', 'DOI']].values:
    s = f'[{shorthand}](https://doi.org/{link})'
    s = s.replace('et al', '*et al.*')
    links.append(s)
df['Shorthand'] = links

# add availability column with download links for curated datasets
links = []
for shorthand, author, year, link in df[['Shorthand', 'Author', 'Year', 'DOI']].values:
    s = ''
    base_nb_path = 'https://nbviewer.ipython.org/github/theislab/sc-pert/blob/main/'
    row = personal_rec[personal_rec.author_year == f'{author}_{year}']
    if not pd.isnull(row.Raw.values[0]):  # raw .h5ad
        s += f' [\\[raw h5ad\\]]({row.Raw.values[0]})'
    if not pd.isnull(row.Processed.values[0]):  # processed .h5ad
        s += f' [\\[processed h5ad\\]]({row.Processed.values[0]})'
    ## adding notebook paths
    r = re.compile(f'{path}{author}_{year}.*')
    for nb in filter(r.match, curated_datasets):
        s += f' [\\[curation nb\\]]({base_nb_path}{nb})'
    for nb in filter(r.match,
            processed_datasets):
        s += f' [\\[processing nb\\]]({base_nb_path}{nb})'

    links.append(s)
df['.h5ad availability'] = links

# clean up
df = df.sort_values(by=['Treatment', 'Date'])
df = df.drop(['Authors', 'Journal', 'DOI', 'bioRxiv DOI', 'Author', 'Year', 'Date'], axis=1)

# rearrange columns
primary_cols = ['Shorthand', 'Title', '.h5ad availability', 'Treatment', '# perturbations', '# cell types', '# doses', '# timepoints']
df = df[primary_cols +
        [c for c in df if c not in primary_cols]]

# write README (the `with` blocks close both files, so no explicit close() is needed)
with open('README.md', 'w') as outfile:
    with open('readme_body.txt') as infile:
        outfile.write(infile.read())
    md = df.to_markdown(index=False, tablefmt='github', floatfmt='.8g')
    md = md.replace('| Title', '| Title'+' '*70)
    outfile.write(md)

df.to_csv('data_table.csv')

--------------------------------------------------------------------------------
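The column reordering in `update.py` (the `n_cols = 1 - add_cols.shape[1]` slice applied after the merge on `DOI`) is compact but easy to misread. The following is a minimal sketch of the same slicing idea on a plain Python list. The helper name `move_added_columns` and the column names in the example are hypothetical and are not part of the repository. It assumes the list has at least `prefix_len + n_added` entries; shorter lists would produce duplicated entries.

```python
def move_added_columns(columns, n_added, prefix_len=5):
    """Move the last `n_added` entries to just after the first `prefix_len`.

    Hypothetical helper for illustration only; it mirrors the slicing used
    in update.py but operates on a plain list instead of a DataFrame.
    """
    columns = list(columns)
    if n_added <= 0:
        return columns
    # Negative index marking where the newly merged columns start,
    # analogous to `n_cols = 1 - add_cols.shape[1]` in update.py.
    n_cols = -n_added
    return columns[:prefix_len] + columns[n_cols:] + columns[prefix_len:n_cols]


# Made-up column names: five leading columns, two other original columns,
# and two columns appended at the end by a merge.
cols = ['Shorthand', 'DOI', 'Authors', 'Journal', 'Title',
        'Date', 'Tissue', 'Treatment', '# doses']
reordered = move_added_columns(cols, n_added=2)
print(reordered)
```

In `update.py`, `add_cols` has eight columns including the `DOI` merge key, so seven columns are appended by the merge and `n_cols` evaluates to `-7`.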