├── Datasets └── titanic.csv.zip ├── .travis.yml ├── LICENSE ├── Government.rst └── README.rst /Datasets/titanic.csv.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Phonbopit/awesome-public-datasets/master/Datasets/titanic.csv.zip -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | # language: ruby 2 | # rvm: 3 | # - 2.2 4 | # before_script: 5 | # - gem install awesome_bot 6 | # script: 7 | # - site404=www.datawrangling.com,getglue-data.s3.amazonaws.com,archive.org/details/2011-05-calufa-twitter-sql,www.stats4stem.org,lib.stat.cmu.edu,http://www.oecd.org/document/0,census.gov/acs/www/data_documentation/data_release_info/ 8 | # - whtlist=travis,crawdad.cs.dartmouth.edu,data.nasdaq.com,137.189.35.203/WebUI/CatDatabase/catData.html,numbrary.com,www.cmr.osu.edu,gutenberg.org,donnees.gouv.qc.ca,data.rio.rj.gov.br,ntrl.ntis.gov,openflights.org,www.data.gov.bc.ca,earthdata.nasa,pgp-hms,cru.uea.ac.uk,networkdata.ics,datos.argentina,data.gov.ie,isi.edu,data.go.id,wiki.dbpedia,www.laval.ca,www.wunderground.com,data.lexingtonky.gov,arcgis,bixi 9 | # - site503=datamob.org,research.microsoft.com 10 | # - awesome_bot README.rst --allow-dupe --allow-redirect --set-timeout 5 --allow-timeout --white-list $site404,$whtlist,$site503 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014-2015 Xiaming Chen and other contributors to this list. 
4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | 23 | -------------------------------------------------------------------------------- /Government.rst: -------------------------------------------------------------------------------- 1 | Government 2 | ---------- 3 | 4 | * `Alberta, Province of Canada `_ 5 | * `Antwerp, Belgium `_ 6 | * `Argentina (non official) `_ 7 | * `Argentina `_ 8 | * `Austin, TX, US `_ 9 | * `Australia (abs.gov.au) `_ 10 | * `Australia (data.gov.au) `_ 11 | * `Austria (data.gv.at) `_ 12 | * `Baton Rouge, LA, US `_ 13 | * `Belgium `_ 14 | * `Brazil `_ 15 | * `Buenos Aires, Argentina `_ 16 | * `Calgary, AB, Canada `_ 17 | * `Cambridge, MA, US `_ 18 | * `Canada `_ 19 | * `Chicago `_ 20 | * `Chile `_ 21 | * `Dallas Open Data `_ 22 | * `DataBC - data from the Province of British Columbia `_ 23 | * `Denver Open Data `_ 24 | * `Durham, NC Open Data `_ 25 | * `Edmonton, AB, Canada `_ 26 | * `England LGInform `_ 27 | * `EuroStat `_ 28 | * `FedStats `_ 29 | * `Finland `_ 30 | * `France `_ 31 | * `Fredericton, NB, Canada `_ 32 | * `Gatineau, QC, Canada `_ 33 | * `Germany `_ 34 | * `Ghent, Belgium `_ 35 | * `Glasgow, Scotland, UK `_ 36 | * `Greece `_ 37 | * `Guardian world governments `_ 38 | * `Halifax, NS, Canada `_ 39 | * `Helsinki Region, Finland `_ 40 | * `Hong Kong, China `_ 41 | * `Houston Open Data `_ 42 | * `Indian Government Data `_ 43 | * `Indonesian Data Portal `_ 44 | * `Ireland's Open Data Portal `_ 45 | * `Japan `_ 46 | * `Laval, QC, Canada `_ 47 | * `Lexington, KY `_ 48 | * `London Datastore, UK `_ 49 | * `London, ON, Canada `_ 50 | * `Los Angeles Open Data `_ 51 | * `MassGIS, Massachusetts, U.S. 
`_ 52 | * `Mexico `_ 53 | * `Mississauga, ON, Canada `_ 54 | * `Moldova `_ 55 | * `Moncton, NB, Canada `_ 56 | * `Montreal, QC, Canada `_ 57 | * `Netherlands `_ 58 | * `New Zealand `_ 59 | * `NYC betanyc `_ 60 | * `NYC Open Data `_ 61 | * `OECD `_ 62 | * `Oklahoma `_ 63 | * `Open Government Data (OGD) Platform India `_ 64 | * `Oregon `_ 65 | * `Ottawa, ON, Canada `_ 66 | * `Portland, Oregon `_ 67 | * `Portugal - Pordata organization `_ 68 | * `Puerto Rico Government `_ 69 | * `Quebec City, QC, Canada `_ 70 | * `Quebec Province of Canada `_ 71 | * `Regina SK, Canada `_ 72 | * `Rio de Janeiro, Brazil `_ 73 | * `Romania `_ 74 | * `Russia `_ 75 | * `San Francisco Data sets `_ 76 | * `Saskatchewan, Province of Canada `_ 77 | * `Seattle `_ 78 | * `Singapore Government Data `_ 79 | * `South Africa `_ 80 | * `South Africa Trade Statistics `_ 81 | * `State of Utah, US `_ 82 | * `Switzerland `_ 83 | * `Taiwan `_ 84 | * `Taiwan g0v `_ 85 | * `Texas Open Data `_ 86 | * `The World Bank `_ 87 | * `Toronto, ON, Canada `_ 88 | * `U.K. Government Data `_ 89 | * `U.S. American Community Survey `_ 90 | * `U.S. CDC Public Health datasets `_ 91 | * `U.S. Census Bureau `_ 92 | * `U.S. Department of Housing and Urban Development (HUD) `_ 93 | * `U.S. Federal Government Agencies `_ 94 | * `U.S. Federal Government Data Catalog `_ 95 | * `U.S. Food and Drug Administration (FDA) `_ 96 | * `U.S. National Center for Education Statistics (NCES) `_ 97 | * `U.S. Open Government `_ 98 | * `UK 2011 Census Open Atlas Project `_ 99 | * `United Nations `_ 100 | * `Uruguay `_ 101 | * `Vancouver, BC Open Data Catalog `_ 102 | * `Victoria, BC, Canada `_ 103 | * `Vienna, Austria `_ -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | Awesome Public Datasets 2 | ======================= 3 | .. 
image:: https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg 4 | :alt: Awesome 5 | :target: https://github.com/sindresorhus/awesome 6 | 7 | `This list of public data sources `_ 8 | is collected and tidied from blogs, answers, and user responses. 9 | Most of the data sets listed below are free; however, some are not. 10 | Other amazingly awesome lists can be found in the 11 | `awesome-awesomeness `_ and 12 | `sindresorhus's awesome `_ lists. 13 | 14 | .. contents:: Table of Contents 15 | 16 | 17 | Agriculture 18 | ------------ 19 | * `U.S. Department of Agriculture's PLANTS Database `_ 20 | 21 | 22 | Biology 23 | ------- 24 | 25 | * `1000 Genomes `_ 26 | * `American Gut (Microbiome Project) `_ 27 | * `Broad Cancer Cell Line Encyclopedia (CCLE) `_ 28 | * `Broad Bioimage Benchmark Collection (BBBC) `_ 29 | * `Cell Image Library `_ 30 | * `Complete Genomics Public Data `_ 31 | * `EBI ArrayExpress `_ 32 | * `EBI Protein Data Bank in Europe `_ 33 | * `Electron Microscopy Pilot Image Archive (EMPIAR) `_ 34 | * `ENCODE project `_ 35 | * `Ensembl Genomes `_ 36 | * `Gene Expression Omnibus (GEO) `_ 37 | * `Gene Ontology (GO) `_ 38 | * `Global Biotic Interactions (GloBI) `_ 39 | * `Harvard Medical School (HMS) LINCS Project `_ 40 | * `Human Genome Diversity Project `_ 41 | * `Human Microbiome Project (HMP) `_ 42 | * `ICOS PSP Benchmark `_ 43 | * `International HapMap Project `_ 44 | * `Journal of Cell Biology DataViewer `_ 45 | * `MIT Cancer Genomics Data `_ 46 | * `NCBI Proteins `_ 47 | * `NCBI Taxonomy `_ 48 | * `NIH Microarray data `_ or `FTP `_ (see FTP link on `RAW `_) 49 | * `OpenSNP genotypes data `_ 50 | * `Pathguide - Protein-Protein Interactions Catalog `_ 51 | * `Protein Data Bank `_ 52 | * `Psychiatric Genomics Consortium `_ 53 | * `PubChem Project `_ 54 | * `PubGene (now Coremine Medical) `_ 55 | * `Sanger Catalogue of Somatic Mutations in Cancer (COSMIC) `_ 56 | * `Sanger Genomics of Drug Sensitivity in Cancer 
Project (GDSC) `_ 57 | * `Sequence Read Archive (SRA) `_ 58 | * `Stanford Microarray Data `_ 59 | * `Stowers Institute Original Data Repository `_ 60 | * `Systems Science of Biological Dynamics (SSBD) Database `_ 61 | * `The Cancer Genome Atlas (TCGA), available via Broad GDAC `_ 62 | * `The Catalogue of Life `_ 63 | * `The Personal Genome Project `_ or `PGP `_ 64 | * `UCSC Public Data `_ 65 | * `Universal Protein Resource (UniProt) `_ 66 | * `UniGene `_ 67 | 68 | 69 | Climate/Weather 70 | --------------- 71 | 72 | * `Australian Weather `_ 73 | * `Aviation Weather Center - Consistent, timely and accurate weather information for the world airspace system `_ 74 | * `Brazilian Weather - Historical data (In Portuguese) `_ 75 | * `Canadian Meteorological Centre `_ 76 | * `Climate Data from UEA (updated monthly) `_ 77 | * `European Climate Assessment & Dataset `_ 78 | * `Global Climate Data Since 1929 `_ 79 | * `NASA Global Imagery Browse Services `_ 80 | * `NOAA Bering Sea Climate `_ 81 | * `NOAA Climate Datasets `_ 82 | * `NOAA Realtime Weather Models `_ 83 | * `The World Bank Open Data Resources for Climate Change `_ 84 | * `UEA Climatic Research Unit `_ 85 | * `WorldClim - Global Climate Data `_ 86 | * `WU Historical Weather Worldwide `_ 87 | 88 | 89 | Complex Networks 90 | ---------------- 91 | 92 | * `AMiner Citation Network Dataset `_ 93 | * `CrossRef DOI URLs `_ 94 | * `DBLP Citation dataset `_ 95 | * `NBER Patent Citations `_ 96 | * `Network Repository with Interactive Exploratory Analysis Tools `_ 97 | * `NIST complex networks data collection `_ 98 | * `Protein-protein interaction network `_ 99 | * `PyPI and Maven Dependency Network `_ 100 | * `Scopus Citation Database `_ 101 | * `Small Network Data `_ 102 | * `Stanford GraphBase (Steven Skiena) `_ 103 | * `Stanford Large Network Dataset Collection `_ 104 | * `Stanford Longitudinal Network Data Sources `_ 105 | * `The Koblenz Network Collection `_ 106 | * `The Laboratory for Web Algorithmics (UNIMI) `_ 107 | * 
`The Nexus Network Repository `_ 108 | * `UCI Network Data Repository `_ 109 | * `UFL sparse matrix collection `_ 110 | * `WSU Graph Database `_ 111 | * `DIMACS Road Networks Collection `_ 112 | 113 | Computer Networks 114 | ----------------- 115 | 116 | * `3.5B Web Pages from CommonCrawl 2012 `_ 117 | * `53.5B Web clicks of 100K users in Indiana Univ. `_ 118 | * `CAIDA Internet Datasets `_ 119 | * `ClueWeb09 - 1B web pages `_ 120 | * `ClueWeb12 - 733M web pages `_ 121 | * `CommonCrawl Web Data over 7 years `_ 122 | * `CRAWDAD Wireless datasets from Dartmouth Univ. `_ 123 | * `Criteo click-through data `_ 124 | * `Open Mobile Data by MobiPerf `_ 125 | * `Rapid7 Sonar Internet Scans `_ 126 | * `UCSD Network Telescope, IPv4 /8 net `_ 127 | 128 | 129 | Contextual Data 130 | --------------- 131 | 132 | * `Context-aware data sets from five domains `_ or `GitHub `_ 133 | 134 | 135 | Data Challenges 136 | --------------- 137 | 138 | * `Challenges in Machine Learning `_ 139 | * `CrowdANALYTIX dataX `_ 140 | * `D4D Challenge of Orange `_ 141 | * `DrivenData Competitions for Social Good `_ 142 | * `ICWSM Data Challenge (since 2009) `_ 143 | * `Kaggle Competition Data `_ 144 | * `KDD Cup by Tencent 2012 `_ 145 | * `Localytics Data Visualization Challenge `_ 146 | * `Netflix Prize `_ 147 | * `Space Apps Challenge `_ 148 | * `Telecom Italia Big Data Challenge `_ 149 | * `Yelp Dataset Challenge `_ 150 | * `Bruteforce Database `_ 151 | 152 | 153 | Earth Science 154 | ------------- 155 | 156 | * `AQUASTAT - Global water resources and uses `_ 157 | * `BODC - marine data of ~22K vars `_ 158 | * `Earth Models `_ 159 | * `EOSDIS - NASA's earth observing system data `_ 160 | * `Integrated Marine Observing System (IMOS) - roughly 30TB of ocean measurements `_ or `on S3 `_ 161 | * `Marinexplore - Open Oceanographic Data `_ 162 | * `Smithsonian Institution Global Volcano and Eruption Database `_ 163 | * `USGS Earthquake Archives `_ 164 | 165 | 166 | Economics 167 | --------- 168 | 169 | * 
`American Economic Association (AEA) `_ 170 | * `EconData from UMD `_ 171 | * `Economic Freedom of the World Data `_ 172 | * `Historical MacroEconomic Statistics `_ 173 | * `International Economics Database `_ and `various data tools `_ 174 | * `International Trade Statistics `_ 175 | * `Internet Product Code Database `_ 176 | * `Joint External Debt Data Hub `_ 177 | * `Jon Haveman International Trade Data Links `_ 178 | * `OpenCorporates Database of Companies in the World `_ 179 | * `Our World in Data `_ 180 | * `SciencesPo World Trade Gravity Datasets `_ 181 | * `The Atlas of Economic Complexity `_ 182 | * `The Center for International Data `_ 183 | * `The Observatory of Economic Complexity `_ 184 | * `UN Commodity Trade Statistics `_ 185 | * `UN Human Development Reports `_ 186 | 187 | 188 | Education 189 | ------------ 190 | 191 | * `Student Data from Free Code Camp `_ 192 | 193 | 194 | Energy 195 | ------ 196 | 197 | * `AMPds `_ 198 | * `BLUEd `_ 199 | * `COMBED `_ 200 | * `Dataport `_ 201 | * `DRED `_ 202 | * `ECO `_ 203 | * `EIA `_ 204 | * `HES `_ - Household Electricity Study, UK 205 | * `HFED `_ 206 | * `iAWE `_ 207 | * `PLAID `_ - the Plug Load Appliance Identification Dataset 208 | * `REDD `_ 209 | * `Tracebase `_ 210 | * `UK-DALE `_ - UK Domestic Appliance-Level Electricity 211 | * `WHITED `_ 212 | 213 | 214 | 215 | Finance 216 | ------- 217 | 218 | * `CBOE Futures Exchange `_ 219 | * `Google Finance `_ 220 | * `Google Trends `_ 221 | * `NASDAQ `_ 222 | * `OANDA `_ 223 | * `OSU Financial data `_ 224 | * `Quandl `_ 225 | * `St Louis Federal `_ 226 | * `Yahoo Finance `_ 227 | * `NYSE Market Data `_ (see FTP link on `RAW `_) 228 | 229 | 230 | GIS 231 | --- 232 | 233 | * `Cambridge, MA, US, GIS data on GitHub `_ 234 | * `Factual Global Location Data `_ 235 | * `Geo Spatial Data from ASU `_ 236 | * `Geo Wiki Project - Citizen-driven Environmental Monitoring `_ 237 | * `GeoFabrik - OSM data extracted to a variety of formats and areas `_ 238 | * `GeoNames 
Worldwide `_ 239 | * `Global Administrative Areas Database (GADM) `_ 240 | * `Homeland Infrastructure Foundation-Level Data `_ 241 | * `Landsat 8 on AWS `_ 242 | * `List of all countries in all languages `_ 243 | * `National Weather Service GIS Data Portal `_ 244 | * `Natural Earth - vectors and rasters of the world `_ 245 | * `OpenAddresses `_ 246 | * `OpenStreetMap (OSM) `_ 247 | * `Pleiades - Gazetteer and graph of ancient places `_ 248 | * `Reverse Geocoder using OSM data `_ & `additional high-resolution data files `_ 249 | * `TIGER/Line - U.S. boundaries and roads `_ 250 | * `TwoFishes - Foursquare's coarse geocoder `_ 251 | * `TZ Timezones shapefiles `_ 252 | * `UN Environmental Data `_ 253 | * `World boundaries from the U.S. Department of State `_ 254 | * `World countries in multiple formats `_ 255 | 256 | 257 | Government 258 | ---------- 259 | 260 | * `OpenDataSoft's list of 1,600 open data portals `_ 261 | * `A list of cities and countries contributed by community `_ 262 | 263 | 264 | Healthcare 265 | ---------- 266 | 267 | * `EHDP Large Health Data Sets `_ 268 | * `Gapminder World demographic databases `_ 269 | * `Medicare Coverage Database (MCD), U.S. 
`_ 270 | * `Medicare Data Engine of medicare.gov Data `_ 271 | * `Medicare Data File `_ 272 | * `MeSH, the vocabulary thesaurus used for indexing articles for PubMed `_ 273 | * `Number of Ebola Cases and Deaths in Affected Countries (2014) `_ 274 | * `Open-ODS (structure of the UK NHS) `_ 275 | * `OpenPaymentsData, Healthcare financial relationship data `_ 276 | * `The Cancer Genome Atlas project (TCGA) `_ and `BigQuery table `_ 277 | * `World Health Organization Global Health Observatory `_ 278 | 279 | 280 | Image Processing 281 | ---------------- 282 | 283 | * `10k US Adult Faces Database `_ 284 | * `2GB of Photos of Cats `_ or `Archive version `_ 285 | * `Affective Image Classification `_ 286 | * `Animals with attributes `_ 287 | * `Face Recognition Benchmark `_ 288 | * `ImageNet (in WordNet hierarchy) `_ 289 | * `Indoor Scene Recognition `_ 290 | * `International Affective Picture System, UFL `_ 291 | * `Massive Visual Memory Stimuli, MIT `_ 292 | * `Several Shape-from-Silhouette Datasets `_ 293 | * `Stanford Dogs Dataset `_ 294 | * `SUN database, MIT `_ 295 | * `The Oxford-IIIT Pet Dataset `_ 296 | * `YouTube Faces Database `_ 297 | * `Adience Unfiltered faces for gender and age classification `_ 298 | * `The Action Similarity Labeling (ASLAN) Challenge `_ 299 | * `Violent-Flows - Crowd Violence \ Non-violence Database and benchmark `_ 300 | 301 | Machine Learning 302 | ---------------- 303 | 304 | * `Delve Datasets for classification and regression (Univ. 
of Toronto) `_ 305 | * `Discogs Monthly Data `_ 306 | * `eBay Online Auctions (2012) `_ 307 | * `IMDb Database `_ 308 | * `Keel Repository for classification, regression and time series `_ 309 | * `Labeled Faces in the Wild (LFW) `_ 310 | * `Lending Club Loan Data `_ 311 | * `Machine Learning Data Set Repository `_ 312 | * `Million Song Dataset `_ 313 | * `More Song Datasets `_ 314 | * `New Yorker caption contest ratings `_ 315 | * `MovieLens Data Sets `_ 316 | * `RDataMining - "R and Data Mining" ebook data `_ 317 | * `Registered Meteorites on Earth `_ 318 | * `Restaurants Health Score Data in San Francisco `_ 319 | * `UCI Machine Learning Repository `_ 320 | * `Yahoo! Ratings and Classification Data `_ 321 | 322 | 323 | Museums 324 | ------- 325 | 326 | * `Canada Science and Technology Museums Corporation's Open Data `_ 327 | * `Cooper-Hewitt's Collection Database `_ 328 | * `Minneapolis Institute of Arts metadata `_ 329 | * `Natural History Museum (London) Data Portal `_ 330 | * `Rijksmuseum Historical Art Collection `_ 331 | * `Tate Collection metadata `_ 332 | * `The Getty vocabularies `_ 333 | 334 | 335 | Natural Language 336 | ---------------- 337 | 338 | * `Blogger Corpus `_ 339 | * `CLiPS Stylometry Investigation Corpus `_ 340 | * `ClueWeb09 FACC `_ 341 | * `ClueWeb12 FACC `_ 342 | * `DBpedia - 4.58M things with 583M facts `_ 343 | * `Flickr Personal Taxonomies `_ 344 | * `Freebase.com of people, places, and things `_ 345 | * `Google Books Ngrams (2.2TB) `_ 346 | * `Google Web 5gram (1TB, 2006) `_ 347 | * `Gutenberg eBooks List `_ 348 | * `Hansards text chunks of Canadian Parliament `_ 349 | * `Machine Comprehension Test (MCTest) of text from Microsoft Research `_ 350 | * `Machine Translation of European languages `_ 351 | * `Personae Corpus `_ 352 | * `SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles) `_ 353 | * `SMS Spam Collection in English `_ 354 | * `USENET postings corpus of 2005~2011 `_ 355 | * `Wikidata - Wikipedia 
databases `_ 356 | * `Wikipedia Links data - 40 Million Entities in Context `_ 357 | * `Universal Dependencies `_ 358 | * `WordNet databases and tools `_ 359 | * `Open Multilingual Wordnet `_ 360 | 361 | 362 | Neuroscience 363 | ------------- 364 | 365 | * `Allen Institute Datasets `_ 366 | * `Brain Catalogue `_ 367 | * `Brainomics `_ 368 | * `CodeNeuro Datasets `_ 369 | * `Collaborative Research in Computational Neuroscience (CRCNS) `_ 370 | * `FCP-INDI `_ 371 | * `Human Connectome Project `_ 372 | * `NDAR `_ 373 | * `NIMH Data Archive `_ 374 | * `NeuroData `_ 375 | * `OASIS `_ 376 | * `OpenfMRI `_ 377 | * `Neuroelectro `_ 378 | * `Study Forrest `_ 379 | 380 | 381 | Physics 382 | ------- 383 | 384 | * `CERN Open Data Portal `_ 385 | * `Crystallography Open Database `_ 386 | * `NASA Exoplanet Archive `_ 387 | * `NSSDC (NASA) data of 550 spacecraft `_ 388 | * `Sloan Digital Sky Survey (SDSS) - Mapping the Universe `_ 389 | 390 | 391 | Psychology/Cognition 392 | -------------------- 393 | 394 | * `OSU Cognitive Modeling Repository Datasets `_ 395 | 396 | 397 | Public Domains 398 | -------------- 399 | 400 | * `Amazon `_ 401 | * `Archive-it from Internet Archive `_ 402 | * `Archive.org Datasets `_ 403 | * `CMU JASA data archive `_ 404 | * `CMU StatLab collections `_ 405 | * `Data360 `_ 406 | * `Datamob.org `_ 407 | * `Google `_ 408 | * `Infochimps `_ 409 | * `KDNuggets Data Collections `_ 410 | * `Microsoft Azure Data Market Free DataSets `_ 411 | * `Numbrary `_ 412 | * `Open Library Data Dumps `_ 413 | * `Reddit Datasets `_ 414 | * `RevolutionAnalytics Collection `_ 415 | * `Sample R data sets `_ 416 | * `Stats4Stem R data sets `_ 417 | * `StatSci.org `_ 418 | * `The Washington Post List `_ 419 | * `UCLA SOCR data collection `_ 420 | * `UFO Reports `_ 421 | * `Wikileaks 911 pager intercepts `_ 422 | * `Yahoo Webscope `_ 423 | 424 | 425 | Search Engines 426 | -------------- 427 | 428 | * `Academic Torrents of data sharing from UMB `_ 429 | * `Datahub.io `_ 430 | 
* `DataMarket (Qlik) `_ 431 | * `Harvard Dataverse Network of scientific data `_ 432 | * `ICPSR (UMICH) `_ 433 | * `Institute of Education Sciences `_ 434 | * `National Technical Reports Library `_ 435 | * `Open Data Certificates (beta) `_ 436 | * `OpenDataNetwork - A search engine of all Socrata powered data portals `_ 437 | * `Statista.com - statistics and Studies `_ 438 | * `Zenodo - An open dependable home for the long-tail of science `_ 439 | 440 | 441 | Social Networks 442 | --------------- 443 | 444 | * `72 hours #gamergate Twitter Scrape `_ 445 | * `Ancestry.com Forum Dataset over 10 years `_ 446 | * `Cheng-Caverlee-Lee September 2009 - January 2010 Twitter Scrape `_ 447 | * `CMU Enron Email of 150 users `_ 448 | * `EDRM Enron EMail of 151 users, hosted on S3 `_ 449 | * `Facebook Data Scrape (2005) `_ 450 | * `Facebook Social Networks from LAW (since 2007) `_ 451 | * `Foursquare from UMN/Sarwat (2013) `_ 452 | * `GetGlue - users rating TV shows `_ 453 | * `GitHub Collaboration Archive `_ 454 | * `Google Scholar citation relations `_ 455 | * `High-Resolution Contact Networks from Wearable Sensors `_ 456 | * `Mobile Social Networks from UMASS `_ 457 | * `Network Twitter Data `_ 458 | * `Reddit Comments `_ 459 | * `Skytrax' Air Travel Reviews Dataset `_ 460 | * `Social Twitter Data `_ 461 | * `SourceForge.net Research Data `_ 462 | * `Twitter Data for Sentiment Analysis `_ 463 | * `Twitter Data for Online Reputation Management `_ 464 | * `Twitter Graph of entire Twitter site `_ 465 | * `Twitter Scrape Calufa May 2011 `_ 466 | * `UNIMI/LAW Social Network Datasets `_ 467 | * `Yahoo! 
Graph and Social Data `_ 468 | * `Youtube Video Social Graph in 2007,2008 `_ 469 | 470 | 471 | Social Sciences 472 | --------------- 473 | 474 | * `ACLED (Armed Conflict Location & Event Data Project) `_ 475 | * `Canadian Legal Information Institute `_ 476 | * `Center for Systemic Peace Datasets - Conflict Trends, Polities, State Fragility, etc `_ 477 | * `Correlates of War Project `_ 478 | * `Cryptome Conspiracy Theory Items `_ 479 | * `Datacards `_ 480 | * `European Social Survey `_ 481 | * `FBI Hate Crime 2013 - aggregated data `_ 482 | * `GDELT Global Events Database `_ 483 | * `General Social Survey (GSS) since 1972 `_ 484 | * `German Social Survey `_ 485 | * `Global Religious Futures Project `_ 486 | * `Humanitarian Data Exchange `_ 487 | * `Institute for Demographic Studies `_ 488 | * `International Networks Archive `_ 489 | * `International Social Survey Program ISSP `_ 490 | * `International Studies Compendium Project `_ 491 | * `James McGuire Cross National Data `_ 492 | * `MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste `_ 493 | * `Minnesota Population Center `_ 494 | * `MIT Reality Mining Dataset `_ 495 | * `Open Crime and Policing Data in England, Wales and Northern Ireland `_ 496 | * `Paul Hensel General International Data Page `_ 497 | * `PewResearch Internet Survey Project `_ 498 | * `PewResearch Society Data Collection `_ 499 | * `Political Polarity Data `_ 500 | * `StackExchange Data Explorer `_ 501 | * `Terrorism Research and Analysis Consortium `_ 502 | * `Texas Inmates Executed Since 1984 `_ 503 | * `Titanic Survival Data Set `_ or `on Kaggle `_ 504 | * `UCB's Archive of Social Science Data (D-Lab) `_ 505 | * `Uppsala Conflict Data Program `_ 506 | * `UCLA Social Sciences Data Archive `_ 507 | * `UN Civil Society Database `_ 508 | * `Universities Worldwide `_ 509 | * `UPJOHN for Labor Employment Research `_ 510 | * `World Bank Data `_ 511 | * `WorldPop project - Worldwide human population distributions `_ 512 | 513 | 514 | Software 
515 | -------- 516 | 517 | * `FLOSSmole data about free, libre, and open source software development `_ 518 | 519 | Sports 520 | ------ 521 | 522 | * `Basketball (NBA/NCAA/Euro) Player Database and Statistics `_ 523 | * `Betfair Historical Exchange Data `_ 524 | * `Cricsheet Matches (cricket) `_ 525 | * `Ergast Formula 1, from 1950 up to date (API) `_ 526 | * `Football/Soccer resources (data and APIs) `_ 527 | * `Lahman's Baseball Database `_ 528 | * `Pinhooker: Thoroughbred Bloodstock Sale Data `_ 529 | * `Retrosheet Baseball Statistics `_ 530 | 531 | 532 | Time Series 533 | ----------- 534 | 535 | * `Databanks International Cross National Time Series Data Archive `_ 536 | * `Hard Drive Failure Rates `_ 537 | * `Heart Rate Time Series from MIT `_ 538 | * `Time Series Data Library (TSDL) from MU `_ 539 | * `UC Riverside Time Series Dataset `_ 540 | 541 | 542 | Transportation 543 | -------------- 544 | 545 | * `Airlines OD Data 1987-2008 `_ 546 | * `Bay Area Bike Share Data `_ 547 | * `Bike Share Systems (BSS) collection `_ 548 | * `GeoLife GPS Trajectory from Microsoft Research `_ 549 | * `German train system by Deutsche Bahn `_ 550 | * `Hubway Million Rides in MA `_ 551 | * `Marine Traffic - ship tracks, port calls and more `_ 552 | * `Montreal BIXI Bike Share `_ 553 | * `NYC Taxi Trip Data 2009- `_ 554 | * `NYC Taxi Trip Data 2013 (FOIA/FOILed) `_ 555 | * `NYC Uber trip data April 2014 to September 2014 `_ 556 | * `Open Traffic collection `_ 557 | * `OpenFlights - airport, airline and route data `_ 558 | * `Philadelphia Bike Share Stations (JSON) `_ 559 | * `Plane Crash Database, since 1920 `_ 560 | * `RITA Airline On-Time Performance data `_ 561 | * `RITA/BTS transport data collection (TranStat) `_ 562 | * `Toronto Bike Share Stations (XML file) `_ 563 | * `Transport for London (TFL) `_ 564 | * `Travel Tracker Survey (TTS) for Chicago `_ 565 | * `U.S. Bureau of Transportation Statistics (BTS) `_ 566 | * `U.S. Domestic Flights 1990 to 2009 `_ 567 | * `U.S. 
Freight Analysis Framework since 2007 `_ 568 | 569 | 570 | Complementary Collections 571 | ------------------------- 572 | 573 | * `Data Packaged Core Datasets `_ 574 | * `Database of Scientific Code Contributions `_ 575 | * DataWrangling: `Some Datasets Available on the Web `_ 576 | * Inside-r: `Finding Data on the Internet `_ 577 | * OpenDataMonitor: `An overview of available open data resources in Europe `_ 578 | * Quora: `Where can I find large datasets open to the public? `_ 579 | * RS.io: `100+ Interesting Data Sets for Statistics `_ 580 | * StaTrek: `Leveraging open data to understand urban lives `_ 581 | --------------------------------------------------------------------------------