├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 Data Science Group

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# awesome-datasets
Curated datasets for machine learning tasks, organized by use cases adapted from a [now-defunct article on Kaggle](https://www.pinterest.com/pin/541980136382090788/). Also check out this [repo of winning solutions](https://www.kaggle.com/sudalairajkumar/winning-solutions-of-kaggle-competitions).

For each type of analysis, think about:
* What problem does it solve and for whom?
* How is it being solved today?
* What are the data inputs and where do they come from?
* What are the outputs and how are they consumed? Online models, static or dynamic reports?
* Is it a revenue leakage (“saves us money”) or a revenue growth (“makes us money”) problem?

# Use Cases By Functions and Verticals
## Marketing
### Demand Forecasting
Forecast volumes of sales, inventory needed, etc.
* [Rossmann](https://www.kaggle.com/c/rossmann-store-sales) - Drugstore chain sales forecasting
* [Online Product Sales](https://www.kaggle.com/c/online-sales/data) - Self-help product sales forecasting
### Predicting Lifetime Value / Recency-Frequency Matrix
Identify the most lucrative and loyal segments of your customers
* [Lifetimes](https://github.com/CamDavidsonPilon/lifetimes) - Synthetic data and library for calculating CLV
* [CDNow](http://brucehardie.com/datasets/) - CDNow transaction records
### Churn / Up-sell
Identify the characteristics and timing of customer churn and upgrades in order to prevent or encourage them
* [KKBox's Churn Prediction Challenge](https://www.kaggle.com/c/kkbox-churn-prediction-challenge/data)
### Customer Segmentation
Identify main customer clusters and their characteristics
* [Instacart Market Basket Analysis](https://www.kaggle.com/c/instacart-market-basket-analysis)
* [Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail)
* [Loyal Customer Prediction](https://www.kaggle.com/c/loyal-customer-prediction/data) - New customers from the 11/11 event on Tmall
### Product Grouping / Category Tree
Group products into the most reasonable category trees
* [Instacart Market Basket Analysis](https://www.kaggle.com/c/instacart-market-basket-analysis)
* [Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail)
### Cross-selling / Recommendation / Market Basket Analysis
Identify which products a customer is going to buy based on past purchases
#### Explicit Ratings
* [MovieLens](https://movielens.org/) - Movie recommendation dataset
* [Jester](http://eigentaste.berkeley.edu/) - Joke recommendation dataset
* [Book-Crossings](http://www2.informatik.uni-freiburg.de/~cziegler/BX/) - Book recommendation dataset
* [HetRec](https://grouplens.org/datasets/hetrec-2011/) - Music recommendation dataset
#### Implicit Ratings
* [Instacart Market Basket Analysis](https://www.kaggle.com/c/instacart-market-basket-analysis)
* [WikiLens](https://grouplens.org/datasets/wikilens/) - Wiki edits dataset
* [OpenStreetMap](https://planet.openstreetmap.org/planet/full-history/) - OpenStreetMap edits dataset
### Channel Attribution and Optimization
Allocate credit fairly across all ad channels and build a portfolio for your ad spending
* [AnalyzeCore](https://analyzecore.com/2016/08/03/attribution-model-r-part-1/) - Synthetic data and attribution models
### Ad Optimization
Predict and price impressions, clicks, conversions, or any other performance metric for ads
* [Avazu Click-Through Rate Prediction](https://www.kaggle.com/c/avazu-ctr-prediction) - Mobile ads click-through-rate prediction
* [Avito Demand Prediction Challenge](https://www.kaggle.com/c/avito-demand-prediction) - Predict demand for an online classified ad
### Ad Fraud
Detect ad click/install fraud
* [TalkingData AdTracking Fraud Detection Challenge](https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data) - Can you detect fraudulent click traffic for mobile app ads?
### Dynamic Pricing
Find the optimal price for growth, profit, customer retention, etc.
* [AWS Spot Pricing Market](https://www.kaggle.com/noqcks/aws-spot-pricing-market/home)
### Store Layout Optimization
Find the optimal store/website layout for growth, profit, customer retention, etc.
### Customer Feedback
Use text classification to determine customer feedback and sentiment about your products
* [IMDb](https://www.imdb.com/interfaces/) - Movie reviews
* [Amazon Reviews](http://jmcauley.ucsd.edu/data/amazon/)
* [Yelp Open Dataset](https://www.yelp.com/dataset) - Yelp reviews
* [Wongnai Challenge](https://www.kaggle.com/c/wongnai-challenge-review-rating-prediction) - Restaurant reviews
* [OpinRank Review Dataset](https://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset) - TripAdvisor and Edmunds Reviews

## Customer Support
### Question Answering
Generate natural language answers based on given context and questions
* [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) - Stanford Question Answering Dataset
### Wait Time Prediction
Predict wait time based on customer history, time of day, call volumes, products owned, churn risk, LTV, etc.

## Human Resources
### Resume Screening
Score candidates based on resumes and internal records
* [DonorsChoose.org Application Screening](https://www.kaggle.com/c/donorschoose-application-screening)
### Employee Churn
Predict which employees are most likely to leave
* [SAS Employee Turnover](http://shell.cas.usf.edu/~pspector/sasdir/datasets.html) - Synthetic employee churn dataset
* [IBM HR Employee Attrition and Performance](https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/) - Synthetic employee churn dataset
* [Employee Attrition](https://www.kaggle.com/HRAnalyticRepository/employee-attrition-data) - Synthetic employee churn dataset

## Healthcare
### Medical Image Classification
Classify medical images according to conditions
* [Grand Challenges](http://www.grand-challenge.org/) - Collection of Biomedical Image Competitions
* [MURA](https://stanfordmlgroup.github.io/competitions/mura/) - Large Dataset for Abnormality
Detection in Musculoskeletal Radiographs
* [ISIC](https://isic-archive.com/) - International Skin Imaging Collaboration
* [DermNet](http://www.dermnet.com/) - Skin Disease Atlas
* [TCIA](http://www.cancerimagingarchive.net/) - Cancer Imaging Archive
* [OASIS](http://www.oasis-brains.org/#data) - Longitudinal Neuroimaging Dataset
* [DDSM](http://marathon.csee.usf.edu/Mammography/Database.html) - Digital Database for Screening Mammography
* [Breast Histopathology Images](https://www.kaggle.com/paultimothymooney/breast-histopathology-images/)
* [NIH Chest X-rays](https://www.kaggle.com/nih-chest-xrays)
* [HERLEV](http://mde-lab.aegean.gr/downloads/) - Pap-smear Database
* [Stanford Tissue Microarray Database](https://tma.im/cgi-bin/home.pl)
* [CheXpert](https://stanfordmlgroup.github.io/competitions/chexpert/)
* [MIMIC-CXR](https://arxiv.org/abs/1901.07042)
### Readmission Risk
Predict the risk of readmission based on patient attributes, medical history, diagnoses, and treatments
* [Diabetes 130-US hospitals for years 1999-2008 Data Set](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008)
### Patient Report Summary
Generate natural language reports based on tabular data
### Automated Triage
Classify patients according to their initial complaints
### Hospital Operations Management
Optimize or predict operating theatre and bed occupancy based on initial patient visits
* [Healthcare in Washington](https://www.doh.wa.gov/DataandStatisticalReports/HealthcareinWashington/HospitalandPatientData)
* [Mini Heritage Health Prize](https://github.com/jiunjiunma/heritage-health-prize) - Processed version of [Heritage Health Prize dataset](https://www.kaggle.com/c/hhp)
### Real-time Patient Monitoring
Activity monitoring of patients
* [OPPORTUNITY](https://archive.ics.uci.edu/ml/datasets/OPPORTUNITY+Activity+Recognition) -
Dataset for Human Activity Recognition from Wearable, Object, and Ambient Sensors
* [PAMAP2](https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring) - Physical Activity Monitoring Data Set
### Survival Analysis
Predict survival rates of patients
* [Haberman's Survival Data Set](https://www.kaggle.com/gilsousa/habermans-survival-data-set) - Survival of patients who had undergone surgery for breast cancer
### Dosage Effectiveness
Analyse the effects of administering different types and dosages of medication for a disease

## Media
### News Summary
Generate short summaries of news articles
* [NEWS SUMMARY](https://www.kaggle.com/sunnysai12345/news-summary)

## Insurance
### Claim Prediction
Predict timing and size of claims
* [TSA Claims Database](https://www.kaggle.com/terminal-security-agency/tsa-claims-database/home)
* [Allstate Claims Severity](https://www.kaggle.com/c/allstate-claims-severity)
### Claim Fraud
Outlier detection for insurance claim fraud
### Policy Prediction
Predict the type of insurance policy
* [Insurance Company Benchmark (COIL 2000) Data Set](https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29)

## Finance
### Credit Scoring / Loan Approval / Debt Recovery
Predict which customers are going to default
* [Statlog (German Credit Data) Data Set](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data))
* [Statlog (Australian Credit Approval) Data Set](https://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29)
* [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk)
* [A Fin tech fraud transaction classification](https://www.kaggle.com/c/a-fin-tech-fraud-transaction-classification) - Default prediction with anonymized features
### Portfolio Optimization
Optimize
portfolio of assets according to risks and returns
* [quantmod](https://www.quantmod.com/) - Library for financial modeling in R; APIs for downloading fundamental and technical data
* [Stanford EE103](https://stanford.edu/class/ee103/portfolio.html) - Popular ETFs from 2006 to 2016
### Automated Trading
Trade financial assets using automated models
* [quantmod](https://www.quantmod.com/) - Library for financial modeling in R; APIs for downloading fundamental and technical data
* [Get Rich or Die Modelin'](https://datatouille.org/competition/) - Bitcoin trading signals
### Fraud Detection
Identify fraudulent transactions and parties with outlier detection and network analysis
* [Credit Card Fraud Detection](https://www.kaggle.com/mlg-ulb/creditcardfraud) - Anonymized features
* [PaySim Synthetic Financial Datasets For Fraud Detection](https://www.kaggle.com/ntnu-testimon/paysim1)
* [Bitcoin Transactions](http://www.vo.elte.hu/bitcoin/downloads.htm)

## Manufacturing
### Quality Control
Detect malfunctioning pieces with computer vision
### Process Optimization
Find bottlenecks in manufacturing processes
* [Mercedes-Benz Greener Manufacturing](https://www.kaggle.com/c/mercedes-benz-greener-manufacturing)
### Warranty Analytics
Predict the rate and timing of your products' failures
### Design
Design new products
* [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist) - Labeled fashion images

## Agriculture, Geography and Environment
### Yield Forecasting
Forecast agricultural yields
* [Honey Production In The USA (1998-2012)](https://www.kaggle.com/jessicali9530/honey-production)
* [Agricuture Crops Production In india](https://www.kaggle.com/srinivas1/agricuture-crops-production-in-india)
### Satellite Image Classification and Extraction
* [Planet: Understanding the Amazon from
Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space)
* [SpaceNet](https://spacenetchallenge.github.io/datasets/datasetHomePage.html) - Annotated satellite images of buildings and roads
* [Dstl Satellite Imagery Feature Detection](https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection)
### Air Quality
* [Italy Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality)
### Wildlife Classification
Classify wild animals
* [North American Camera Trap Images (NACTI)](https://www.biorxiv.org/content/early/2018/06/14/346809) - Camera-trap images of wild animals

## Real Estate
### Pricing
Predict real estate values based on their characteristics
* [Zillow’s Home Value Prediction (Zestimate)](https://www.kaggle.com/c/zillow-prize-1)

## Education
### Automated Essay Scoring
Score essays based on previously graded essays
* [The Hewlett Foundation: Automated Essay Scoring](https://www.kaggle.com/c/asap-aes)

## Utilities
### Distribution Network Optimization
Optimize distribution networks of electricity, water, etc.
* [Individual household electric power consumption Data Set](https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption)

# Others
* [Analyze Survey Data for Free](http://asdfree.com/)

--------------------------------------------------------------------------------