├── GDSC_DATASET_S1-S12.zip
├── CCLE_DATASET_S13-S26.zip
├── CTRP_DATASET_S40-S43.zip
├── NCI60_DATASET_S27-S39.zip
└── README.md


/GDSC_DATASET_S1-S12.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:60a0d0d4b206fcc5ef25e6e88779e6077c4ae2d411ef928f22227cca880a26fc
3 | size 94001962
4 |
--------------------------------------------------------------------------------
/CCLE_DATASET_S13-S26.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:caa6be1f10fccbfce2631373ccf0704f01f8839399cfafa94d154adf9f504182
3 | size 30524593
4 |
--------------------------------------------------------------------------------
/CTRP_DATASET_S40-S43.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:d703dd6403896dd09f9aa2d1b78e252985d34aa4a67cd20518c6854ef8e22006
3 | size 20271615
4 |
--------------------------------------------------------------------------------
/NCI60_DATASET_S27-S39.zip:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:13b77641a939b31198335e094d8ec209aa454df423b3b8711e9d42bd9e52b82e
3 | size 5984076
4 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Supplementary-data-for-drug-response-prediction
2 | The Supplementary data in the paper "A Survey and Systematic Assessment of Computational Methods for Drug Response Prediction"
3 |
4 | ############# GDSC_DATASET #############
5 |
6 | Table S1. Gene expression profiles for 734 cancer cell lines and 8046 genes.
The first column shows the COSMIC identifiers of the 734 cancer cell lines, and the first row shows the HGNC symbols of the 8046 genes.
7 |
8 | Table S2. DNA methylation profiles for 734 cancer cell lines and 8473 CpG loci, where each row represents a cancer cell line and each column represents a CpG locus.
9 |
10 | Table S3. Mutation profiles for 734 cancer cell lines and 636 genes, where each row represents a cancer cell line and each column represents a gene appearing in the catalogue of the Cancer Gene Census from the COSMIC database (https://cancer.sanger.ac.uk/cosmic/download, version 88), and 1 indicates that a mutation occurs in the corresponding cell line and gene.
11 |
12 | Table S4. Copy number variation (CNV) profiles for 734 cancer cell lines and 694 genes, where each row represents a cancer cell line and each column represents a gene appearing in the catalogue of the Cancer Gene Census from the COSMIC database (https://cancer.sanger.ac.uk/cosmic/download, version 88), and 1 indicates that a copy number alteration (CNA) occurs in the corresponding gene and cell line.
13 |
14 | Table S5. Annotations of the 734 cancer cell lines. The four columns are, respectively, the cancer cell line names, the COSMIC identifiers of the cell lines, the tissue sites from which the cell lines originate, and the histology categories to which the cell lines belong.
15 |
16 | Table S6. Drug response data (i.e. the common logarithm of the IC50 readout) for 734 cancer cell lines and 201 drugs. The first column shows the COSMIC identifiers of the 734 cancer cell lines, and the first row shows the 201 drugs.
17 |
18 | Table S7. Drug response data (i.e. the area under the drug response curve, AUC readout) for 734 cancer cell lines and 201 drugs. The first column shows the COSMIC identifiers of the 734 cancer cell lines, and the first row shows the 201 drugs.
19 |
20 | Table S8. The protein-protein interaction (PPI) network involving 7136 proteins encoded by the corresponding genes from the 8046 genes appearing in the gene expression profiles.
This network is extracted from the PathwayCommons database (http://www.pathwaycommons.org/, accessed 11 March 2019). In this table, each row represents a link in this PPI network.
21 |
22 | Table S9. Drugs' structural features. This table shows nine types of molecular fingerprints for 175 of the 201 drugs; structural information could not be found for the remaining 26 drugs. Each sheet records a binary matrix where each row represents a drug and each column represents a structural feature, and 1 indicates that the drug has the structural feature.
23 | Features that are zero across all drugs have been removed from the tables.
24 |
25 | The description for each type of fingerprint is provided below:
26 | SHEET NAME, FEATURE TYPE, DESCRIPTION
27 | FP, CDK fingerprint, Fingerprint of length 1024 and search depth of 8
28 | ExtFP, CDK extended fingerprint, Extends the Fingerprinter with additional bits describing ring features
29 | EStateFP, E-State fingerprint, E-State fragments
30 | GraphFP, CDK graph only fingerprint, Specialized version of the Fingerprinter which does not take bond orders into account
31 | MACCSFP, MACCS fingerprint, MACCS keys
32 | PubchemFP, PubChem fingerprint, PubChem fingerprint
33 | SubFP, Substructure fingerprint, Presence of SMARTS Patterns for Functional Group Classification by Christian Laggner
34 | KRFP, Klekota-Roth fingerprint, Presence of chemical substructures
35 | AD2D, 2D atom pairs, Presence of atom pairs at various topological distances
36 |
37 |
38 | Table S10. Drug similarity matrix for 201 drugs used in the DualNets method. It is obtained by computing the Pearson correlation between each pair of drug profiles derived from 1444 one-dimensional and two-dimensional structural descriptors.
39 |
40 | Table S11. 71 groups for the 8046 genes in the gene expression profiles (see Table S1).
The 71 gene sets are selected from the C2:CP subcollection and the C6 collection of the Molecular Signatures Database (MSigDB v6.2, http://software.broadinstitute.org/gsea/msigdb/collections.jsp#C2; C2:CP gene sets are canonical representations of biological processes compiled by domain experts; C6 gene sets represent signatures of cellular pathways that are often dysregulated in cancer); a gene set is selected if it contains one or more primary targets of the 201 drugs. The 8046 genes are then assigned to these 71 gene sets according to whether each gene appears in the gene set. Genes not included in any gene set form a single group named "Others", which is not shown here.
41 |
42 | Table S12. Drug groups. The 201 drugs are classified into 23 groups, each containing drugs that target the same pathway; the drugs in each group are listed in the second column, separated by semicolons.
43 |
44 |
45 | ############# CCLE_DATASET #############
46 |
47 | Tables S13 and S16-S26 are organized similarly to the corresponding tables in GDSC_DATASET.
48 |
49 | Table S14. MicroRNA expression profiles for 385 cancer cell lines and 734 microRNAs. The first column shows the identifiers of the 385 cancer cell lines, and the first row shows the symbols of the 734 microRNAs.
50 |
51 | Table S15. Protein expression profiles (Reverse Phase Protein Array (RPPA) data) for 385 cancer cell lines and 214 proteins. The first column shows the identifiers of the 385 cancer cell lines, and the first row shows the antibody names.
52 |
53 | ############# NCI60_DATASET #############
54 |
55 | Tables S27-S39 are organized similarly to those in CCLE_DATASET.
56 |
57 | Unlike the GDSC and CCLE datasets, the NCI-60 dataset expresses drug activity levels as 50% growth-inhibitory levels (GI50). The entries in Table S33 are z-scores computed from the negative log10[GI50 (molar)] values across the NCI-60 cell lines for each single drug.
58 |
59 | ############# CTRP_DATASET #############
60 |
61 | Table S40. Gene expression profiles for 720 cancer cell lines and 7770 genes.
The first column shows the COSMIC identifiers of the 720 cancer cell lines, and the first row shows the HGNC symbols of the 7770 genes.
62 |
63 | Table S41. Drug response data (i.e. the normalized area under the drug response curve, AUC readout) for 720 cancer cell lines and 63 drugs. The first column shows the COSMIC identifiers of the 720 cancer cell lines, and the first row shows the 63 drugs.
64 |
65 | Table S42. Annotations of the 720 cell lines in the CTRP v2.1 dataset used for independent validation. The three columns are, respectively, the cell line names, the primary tissue sites from which the cell lines originate, and the histology categories to which the cell lines belong.
66 |
67 | Table S43. The common drugs between the training dataset (GDSC) and the testing datasets (CCLE or CTRP v2.1). The "GDSC_CCLE" (or "GDSC_CTRP") sheet provides the selected common drugs and their names in the GDSC and CCLE (or CTRP v2.1) datasets, respectively.
68 |
69 |
70 |
71 | References:
72 | [1] Yang, W., Soares, J., Greninger, P., Edelman, E., Lightfoot, H. et al. (2013) Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(D1), D955–D961.
73 |
74 | [2] Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Sellers, W. et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483, 603–607.
75 |
76 | [3] Ghandi, M., Huang, F., Jané-Valbuena, J., Kryukov, G., Lo, C. et al. (2019) Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature, 569, 503–508.
77 |
78 | [4] Reinhold, W., Sunshine, M., Liu, H., Varma, S., Kohn, K. et al. (2012) CellMiner: a web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the NCI-60 cell line set. Cancer Research, 72, 3499–3511.
79 |
80 | [5] Rees, M., Seashore-Ludlow, B., Cheah, J., Adams, D., Price, E. et al.
(2016) Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nature Chemical Biology, 12, 109.
81 |
82 |
83 |
--------------------------------------------------------------------------------
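The NCI60_DATASET note in the README describes the entries of Table S33 as z-scores of negative log10[GI50 (molar)] values, computed per drug across the NCI-60 cell lines. The following is a minimal sketch of that transformation, not the authors' code. The function name `gi50_to_zscores` and the example values are hypothetical, and the use of the sample standard deviation is an assumption, since the README does not say which estimator was used.

```python
import math
import statistics


def gi50_to_zscores(gi50_molar):
    """Turn one drug's GI50 values (in molar) across cell lines into z-scores.

    Each value is first converted to -log10(GI50), so that a higher number
    means a more sensitive cell line; the converted values are then centred
    and scaled across the cell lines for this single drug.
    """
    activity = [-math.log10(value) for value in gi50_molar]
    mean = statistics.mean(activity)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(activity)
    return [(a - mean) / sd for a in activity]


# Hypothetical GI50 values (molar) for one drug in three cell lines.
example_gi50 = [1e-6, 1e-7, 1e-5]
z_scores = gi50_to_zscores(example_gi50)
print(z_scores)
```

A positive z-score marks a cell line that is more sensitive to the drug than the average across the panel, and a negative z-score marks a less sensitive one. Missing measurements are not handled in this sketch and would need to be filtered out before the call.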