└── README.md

/README.md:
--------------------------------------------------------------------------------
# Computational Biology Datasets Suitable For Machine Learning
This is a curated list of computational biology datasets that have been pre-processed for machine learning.
This list is a work in progress. Please submit a pull request for any dataset you would like to advertise!

## Genotyping
|Name | Description | Comments |
|:-:|---|---|
|[The Cancer Genome Atlas](https://cancergenome.nih.gov/)| Variety of cancer data | Most cancer types have 100-1000 samples |
|[NIH GDC](https://gdc-portal.nci.nih.gov/)| Cancer, many types of genomic data | |
|[UK Biobank](http://www.ukbiobank.ac.uk/about-biobank-uk/) | | |
|[European Genome-Phenome Archive](https://www.ebi.ac.uk/ega/datasets)| | |
|[METABRIC](http://www.cbioportal.org/study?id=brca_metabric#summary)| The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
|[HapMap](https://www.genome.gov/hapmap/)| | |
|[23andMe](http://www.biorxiv.org/content/early/2017/04/19/127241)| 2280 public domain curated genotypes | |
|[Mice](http://wp.cs.ucl.ac.uk/outbredmice/heterogeneous-stock-mice/) | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure from the data. |
|[Arabidopsis](https://www.arabidopsis.org/download/) | SNPs, 100+ phenotypes | |

## Promoter-Enhancer Pairs
|Name | Description | Comments |
|:-:|---|---|
|[TargetFinder](https://github.com/shwhalen/targetfinder)| ~100,000 DNA-DNA interaction pairs | |

## Gene/Protein Expression
|Name | Description | Comments |
|:-:|---|---|
|[GEO](http://www.ncbi.nlm.nih.gov/geo/) | Main place for NCBI data | |
|[ENCODE](http://www.encodeproject.org/) | Variety of assays to identify functional elements | |
|[ArrayExpress](http://www.ebi.ac.uk/arrayexpress/) | DNA sequencing, gene/protein expression, epigenetics | |
|[Cytometry Continuous](http://science.sciencemag.org/content/308/5721/523) | Flow cytometry data of 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
|[Transcription factor binding](http://www.pnas.org/content/106/51/21521.abstract?tab=ds) | ChIP-Seq data on 12 TFs | |
|[GTEx](http://www.gtexportal.org/home/) | Landmark study for eQTL analysis | |
|[PharmacoGenomics DB](https://www.pharmgkb.org/) | | |
|[ProteomeXChange](http://www.proteomexchange.org/)| | |
|[BeatAML](https://www.nature.com/articles/s41586-018-0623-z)| Whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |

## Single-cell Data
|Name | Description | Comments |
|:-:|---|---|
|[Single-cell expression atlas](https://www.ebi.ac.uk/gxa/sc/) | | |
|[scPerturb](https://www.nature.com/articles/s41592-023-02144-y) | Single-cell perturbation-response datasets | Harmonized and preprocessed across 44 original datasets |

## Regulatory Networks
|Name | Description | Comments |
|:-:|---|---|
|[TRRUST](http://www.grnpedia.org/trrust/)| Manually curated database of human transcriptional regulatory networks | |
|[Yeast Network](http://science.sciencemag.org/content/353/6306/aaf1420/tab-pdf)| ~23 million yeast double mutants screened for genetic interactions | |
|[Perturb-Seq](http://www.sciencedirect.com/science/article/pii/S0092867416316105)| Integrated model of perturbations, single cell phenotypes, and epistatic interactions | |
|[KEGG Metabolic Reaction Network (Undirected)](https://archive.ics.uci.edu/ml/datasets/KEGG+Metabolic+Reaction+Network+%28Undirected%29) | 65554 instances, 29 attributes each | |
|[KEGG Metabolic Relation Network (Directed)](https://archive.ics.uci.edu/ml/datasets/KEGG+Metabolic+Relation+Network+%28Directed%29) | 53414 instances, 24 attributes each | |

## Images
|Name | Description | Comments |
|:-:|---|---|
|[The Cancer Imaging Archive](http://www.cancerimagingarchive.net/)| Extracts the images from the TCGA data | |
|[Multiple Myeloma DREAM Challenge](https://www.synapse.org/#!Synapse:syn6187098/wiki/401884)| Challenge to identify multiple myeloma patients | |
|[Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)| Predict whether the cancer is benign or malignant | |
|[DDSM](http://marathon.csee.usf.edu/Mammography/Database.html)| Mammogram database | |
|[Kaggle Soft Tissue Sarcomas](https://www.kaggle.com/4quant/soft-tissue-sarcoma)| Preprocessed subset of the TCIA study "Soft Tissue Sarcoma" | Segmentation task |
|[Kaggle Cervical Cancer Screening](https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening)| Classify cervix type from images | |
|[CAMELYON17](https://camelyon17.grand-challenge.org/) | Pathology challenge: automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
|[Grand Challenges](https://grand-challenge.org/all_challenges/) | Datasets from biomedical image analysis competitions | |
|[Breast Cancer MRI Dataset](https://sites.duke.edu/mazurowski/resources/breast-cancer-mri-dataset/) | Demographic, clinical, pathology, treatment, outcomes, and genomic data + MRI images | |

## fMRI
|Name | Description | Comments |
|:-:|---|---|
|[ENIGMA Cerebellum](https://my.vanderbilt.edu/enigmacerebellum/)| Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
|[Seizure Prediction](https://www.kaggle.com/c/melbourne-university-seizure-prediction/data) | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). | |

## Electronic Medical Records
|Name | Description | Comments |
|:-:|---|---|
|[MIMIC](https://mimic.physionet.org/)| 59,000 EHRs | |
|[UCI Diabetes](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)| Data from 130 US hospitals for 1999-2008 | |
|[i2b2](https://www.i2b2.org/NLP/DataSets/Main.php) | Clinical notes only, designed for NLP tasks | |
|[PhysioNet](https://www.physionet.org/physiobank/database/) | | |
|[Metadata Acquired from Clinical Case Reports (MACCRs)](https://www.nature.com/articles/sdata2018258) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
|[eICU](https://www.nature.com/articles/sdata2018178)| 200k EHRs | |
|[All of Us](https://databrowser.researchallofus.org/)| >250k EHRs, some genomic data | |
|[PMC-Patients](https://www.nature.com/articles/s41597-023-02814-8)| 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations | |

## Radiographs
|Name | Description | Comments |
|:-:|---|---|
|[CheXpert](https://stanfordmlgroup.github.io/competitions/chexpert/) | 200k chest radiographs | Competition and leaderboard associated |
|[MIMIC-CXR](https://arxiv.org/abs/1901.07042) | ~400k chest x-rays, 14 labels | Data on PhysioNet |
|[PadChest](http://bimcv.cipf.es/bimcv-projects/padchest/) | 160k chest x-rays, 174 different findings | |

## Protein-Protein Interactions
|Name | Description | Comments |
|:-:|---|---|
|[HINT (High-quality INTeractomes)](http://hint.yulab.org/) | Curated compilation of high-quality protein-protein interactions from 8 interactome resources | |

## Longitudinal Studies
|Name | Description | Comments |
|:-:|---|---|
|[National Population Health Survey](http://www.statcan.gc.ca/eng/survey/household/3225)| Longitudinal survey that collects health information every two years. | |

## Protein Structure
|Name | Description | Comments |
|:-:|---|---|
|[ProteinNet](https://github.com/aqlaboratory/proteinnet) | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. | |

## Natural Language Data
|Name | Description | Comments |
|:-:|---|---|
|[BioASQ](http://www.bioasq.org/) | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: MLC, QA. |
|[Cases](http://www.casesdatabase.com/) | Articles from medical case studies. | |
|[UPMC Pathology](http://path.upmc.edu/cases.html) | UPMC Pathology case studies. | |

## Therapeutics
|Name | Description | Comments |
|:-:|---|---|
|[Therapeutic Data Commons](https://tdcommons.ai/)| Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. |
|[Cancer Omics Drug Experiment Response Dataset](https://github.com/PNNL-CompBio/coderdata)| Molecular datasets paired with corresponding drug sensitivity data | Seeks to standardize datasets of cancer drug responses into a [standard schema](https://github.com/PNNL-CompBio/coderdata/blob/main/schema/README.md) |

--------------------------------------------------------------------------------