├── LICENSE └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Onur Akpolat 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Awesome Big Data 2 | 3 | A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by [awesome-php](https://github.com/ziadoz/awesome-php), [awesome-python](https://github.com/vinta/awesome-python), [awesome-ruby](https://github.com/Sdogruyol/awesome-ruby), [hadoopecosystemtable](http://hadoopecosystemtable.github.io/) & [big-data](http://blog.andreamostosi.name/big-data/). 4 | 5 | Your contributions are always welcome! 
6 | 7 | - [Awesome Big Data](#awesome-bigdata) 8 | - [Frameworks](#frameworks) 9 | - [Distributed Programming](#distributed-programming) 10 | - [Distributed Filesystem](#distributed-filesystem) 11 | - [Key-Map Data Model](#key-map-data-model) 12 | - [Document Data Model](#document-data-model) 13 | - [Key-value Data Model](#key-value-data-model) 14 | - [Graph Data Model](#graph-data-model) 15 | - [NewSQL Databases](#newsql-databases) 16 | - [Columnar Databases](#columnar-databases) 17 | - [Time-Series Databases](#time-series-databases) 18 | - [SQL-like processing](#sql-like-processing) 19 | - [Integrated Development Environments](#integrated-development-environments) 20 | - [Data Ingestion](#data-ingestion) 21 | - [Service Programming](#service-programming) 22 | - [Scheduling](#scheduling) 23 | - [Machine Learning](#machine-learning) 24 | - [Benchmarking](#benchmarking) 25 | - [Security](#security) 26 | - [System Deployment](#system-deployment) 27 | - [Applications](#applications) 28 | - [Search engine and framework](#search-engine-and-framework) 29 | - [MySQL forks and evolutions](#mysql-forks-and-evolutions) 30 | - [PostgreSQL forks and evolutions](#postgresql-forks-and-evolutions) 31 | - [Memcached forks and evolutions](#memcached-forks-and-evolutions) 32 | - [Embedded Databases](#embedded-databases) 33 | - [Business Intelligence](#business-intelligence) 34 | - [Data Visualization](#data-visualization) 35 | - [Internet of things and sensor data](#internet-of-things-and-sensor-data) 36 | - [Interesting Readings](#interesting-readings) 37 | - [Interesting Papers](#interesting-papers) 38 | - [Other Awesome Lists](#other-awesome-lists) 39 | 40 | ## Frameworks 41 | 42 | * [Apache Hadoop](http://hadoop.apache.org/) - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system). 
43 | 44 | ## Distributed Programming 45 | 46 | * [AddThis Hydra](https://github.com/addthis/hydra) - distributed data processing and storage system originally developed at AddThis. 47 | * [AMPLab SIMR](http://databricks.github.io/simr/) - run Spark on Hadoop MapReduce v1. 48 | * [Apache Crunch](http://crunch.apache.org/) - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. 49 | * [Apache DataFu](http://incubator.apache.org/projects/datafu.html) - collection of user-defined functions for Hadoop and Pig developed by LinkedIn. 50 | * [Apache Flink](http://flink.incubator.apache.org/) - high-performance runtime, and automatic program optimization. 51 | * [Apache Gora](http://gora.apache.org/) - framework for in-memory data model and persistence. 52 | * [Apache Hama](http://hama.apache.org/) - BSP (Bulk Synchronous Parallel) computing framework. 53 | * [Apache MapReduce](http://wiki.apache.org/hadoop/MapReduce/) - programming model for processing large data sets with a parallel, distributed algorithm on a cluster. 54 | * [Apache Pig](https://pig.apache.org/) - high level language to express data analysis programs for Hadoop. 55 | * [Apache S4](http://incubator.apache.org/s4/) - framework for stream processing, implementation of S4. 56 | * [Apache Spark](http://spark.incubator.apache.org/) - framework for in-memory cluster computing. 57 | * [Apache Spark Streaming](http://spark.incubator.apache.org/docs/0.7.3/streaming-programming-guide.html) - framework for stream processing, part of Spark. 58 | * [Apache Storm](http://storm-project.net/) - framework for stream processing by Twitter also on YARN. 59 | * [Apache Tez](http://tez.incubator.apache.org/) - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN. 60 | * [Apache Twill](https://incubator.apache.org/projects/twill.html) - abstraction over YARN that reduces the complexity of developing distributed applications. 
61 | * [Cascalog](http://cascalog.org/) - data processing and querying library. 62 | * [Cheetah](http://vldbarc.org/pvldb/vldb2010/pvldb_vol3/I08.pdf) - High Performance, Custom Data Warehouse on Top of MapReduce. 63 | * [Concurrent Cascading](http://www.cascading.org/) - framework for data management/analytics on Hadoop. 64 | * [Damballa Parkour](https://github.com/damballa/parkour) - MapReduce library for Clojure. 65 | * [Datasalt Pangool](https://github.com/datasalt/pangool) - alternative MapReduce paradigm. 66 | * [DataTorrent StrAM](https://www.datatorrent.com/) - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance. 67 | * [Facebook Corona](https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920) - Hadoop enhancement which removes single point of failure. 68 | * [Facebook Peregrine](http://peregrine_mapreduce.bitbucket.org/) - Map Reduce framework. 69 | * [Facebook Scuba](https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920) - distributed in-memory datastore. 70 | * [Google Dataflow](http://googledevelopers.blogspot.it/2014/06/cloud-platform-at-google-io-new-big.html) - create data pipelines to help them ingest, transform and analyze data. 71 | * [Google MapReduce](http://research.google.com/archive/mapreduce.html) - map reduce framework. 72 | * [Google MillWheel](http://research.google.com/pubs/pub41378.html) - fault tolerant stream processing framework. 73 | * [JAQL](https://code.google.com/p/jaql/) - declarative programming language for working with structured, semi-structured and unstructured data. 74 | * [Kite](http://kitesdk.org/docs/current/) - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem. 
75 | * [Metamarkets Druid](http://druid.io/) - framework for real-time analysis of large datasets. 76 | * [Netflix PigPen](https://github.com/Netflix/PigPen) - map-reduce for Clojure which compiles to Apache Pig. 77 | * [Nokia Disco](http://discoproject.org/) - MapReduce framework developed by Nokia. 78 | * [Pinterest Pinlater](http://engineering.pinterest.com/post/91288882494/pinlater-an-asynchronous-job-execution-system) - asynchronous job execution system. 79 | * [Pydoop](http://pydoop.sourceforge.net/docs/) - Python MapReduce and HDFS API for Hadoop. 80 | * [Stratosphere](http://stratosphere.eu/) - general purpose cluster computing framework. 81 | * [Streamdrill](https://streamdrill.com/) - useful for counting activities of event streams over different time windows and finding the most active one. 82 | * [Twitter Scalding](https://github.com/twitter/scalding) - Scala library for Map Reduce jobs, built on Cascading. 83 | * [Twitter Summingbird](https://github.com/twitter/summingbird) - Streaming MapReduce with Scalding and Storm, by Twitter. 84 | * [Twitter TSAR](https://blog.twitter.com/2014/tsar-a-timeseries-aggregator) - TimeSeries AggregatoR by Twitter. 85 | 86 | ## Distributed Filesystem 87 | 88 | * [Apache HDFS](http://hadoop.apache.org/) - a way to store large files across multiple machines. 89 | * [BeeGFS](http://www.fhgfs.com/cms/) - formerly FhGFS, parallel distributed file system. 90 | * [Ceph Filesystem](http://ceph.com/ceph-storage/file-system/) - software storage platform. 91 | * [Disco DDFS](http://disco.readthedocs.org/en/latest/howto/ddfs.html) - distributed filesystem. 92 | * [Facebook Haystack](https://www.facebook.com/note.php?note_id=76191543919) - object storage system. 93 | * [Google Colossus](https://google.com/) - distributed filesystem (GFS2). 94 | * [Google GFS](https://google.com/) - distributed filesystem. 95 | * [Google Megastore](http://research.google.com/pubs/pub36971.html) - scalable, highly available storage. 
96 | * [GridGain](http://www.gridgain.org/) - GGFS, Hadoop compliant in-memory file system. 97 | * [Lustre file system](http://wiki.lustre.org/) - high-performance distributed filesystem. 98 | * [Quantcast File System QFS](https://www.quantcast.com/engineering/qfs/) - open-source distributed file system. 99 | * [Red Hat GlusterFS](http://www.gluster.org/) - scale-out network-attached storage file system. 100 | * [Tachyon](http://tachyon-project.org/) - reliable file sharing at memory speed across cluster frameworks. 101 | 102 | ## Document Data Model 103 | 104 | * [Actian Versant](http://www.actian.com/products/operational-databases/) - commercial object-oriented database management systems . 105 | * [Crate Data](https://crate.io/) - is an open source massively scalable data store. It requires zero administration. 106 | * [Facebook Apollo](http://www.infoq.com/news/2014/06/facebook-apollo) - Facebook’s Paxos-like NoSQL database. 107 | * [jumboDB](http://comsysto.github.io/jumbodb/) - document oriented datastore over Hadoop. 108 | * [LinkedIn Espresso](http://data.linkedin.com/projects/espresso) - horizontally scalable document-oriented NoSQL data store. 109 | * [MarkLogic](http://www.marklogic.com/) - Schema-agnostic Enterprise NoSQL database technology. 110 | * [MongoDB](http://www.mongodb.org/) - Document-oriented database system. 111 | * [RavenDB](http://www.ravendb.net/) - A transactional, open-source Document Database. 112 | * [RethinkDB](http://www.rethinkdb.com/) - document database that supports queries like table joins and group by. 113 | 114 | ## Key Map Data Model 115 | 116 | **Note**: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. 
In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns"). 117 | 118 | Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored next to each other, "row by row", these systems store all *column* values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column. 119 | 120 | The former group is referred to as "key map data model" here. The line between these and the [Key-value Data Model](#key-value-data-model) stores is fairly blurry. 121 | 122 | The latter, being more about the storage format than about the data model, is listed under [Columnar Databases](#columnar-databases). 123 | 124 | You can read more about this distinction on Prof. Daniel Abadi's blog: [Distinguishing two major types of Column Stores](http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html). 125 | 126 | * [Apache Accumulo](http://accumulo.apache.org/) - distributed key/value store, built on Hadoop. 127 | * [Apache Cassandra](http://cassandra.apache.org/) - column-oriented distributed datastore, inspired by BigTable. 128 | * [Apache HBase](http://hbase.apache.org/) - column-oriented distributed datastore, inspired by BigTable. 129 | * [Facebook HydraBase](https://code.facebook.com/posts/321111638043166/hydrabase-the-evolution-of-hbase-facebook/) - evolution of HBase made by Facebook. 130 | * [Google BigTable](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) - column-oriented distributed datastore. 
131 | * [Google Cloud Datastore](https://developers.google.com/datastore/) - is a fully managed, schemaless database for storing non-relational data over BigTable. 132 | * [Hypertable](http://hypertable.org/) - column-oriented distributed datastore, inspired by BigTable. 133 | * [InfiniDB](http://infinidb.co/) - is accessed through a MySQL interface and uses massively parallel processing to parallelize queries. 134 | * [OhmData C5](http://ohmdata.com/) - improved version of HBase. 135 | * [Tephra](https://github.com/continuuity/tephra) - Transactions for HBase. 136 | * [Twitter Manhattan](https://blog.twitter.com/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale) - real-time, multi-tenant distributed database for Twitter scale. 137 | 138 | 139 | ## Key-value Data Model 140 | 141 | * [Aerospike](http://www.aerospike.com/) - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies." 142 | * [Amazon DynamoDB](http://aws.amazon.com/dynamodb/) - distributed key/value store, implementation of Dynamo paper. 143 | * [Edis](http://inaka.github.io/edis/) - is a protocol-compatible Server replacement for Redis. 144 | * [ElephantDB](https://github.com/nathanmarz/elephantdb) - Distributed database specialized in exporting data from Hadoop. 145 | * [EventStore](http://geteventstore.com) - distributed time series database. 146 | * [LinkedIn Krati](https://github.com/linkedin-sna/sna-page/tree/master/krati) - is a simple persistent data store with very low latency and high throughput. 147 | * [LinkedIn Voldemort](http://www.project-voldemort.com/voldemort/) - distributed key/value storage system. 148 | * [Oracle NoSQL Database](http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html) - distributed key-value database by Oracle Corporation. 149 | * [Redis](http://redis.io) - in memory key value datastore. 
150 | * [Riak](https://github.com/basho/riak) - a decentralized datastore. 151 | * [Storehaus](https://github.com/twitter/storehaus) - library to work with asynchronous key value stores, by Twitter. 152 | * [Tarantool](https://github.com/tarantool/tarantool) - an efficient NoSQL database and a Lua application server. 153 | * [TreodeDB](https://github.com/Treode/store) - key-value store that's replicated and sharded and provides atomic multirow writes. 154 | 155 | 156 | ## Graph Data Model 157 | 158 | * [Apache Giraph](http://giraph.apache.org/) - implementation of Pregel, based on Hadoop. 159 | * [Apache Spark Bagel](http://spark.incubator.apache.org/docs/0.7.3/bagel-programming-guide.html) - implementation of Pregel, part of Spark. 160 | * [ArangoDB](https://www.arangodb.org/) - multi-model distributed database. 161 | * [Facebook TAO](https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920) - TAO is the distributed data store that is widely used at Facebook to store and serve the social graph. 162 | * [Google Cayley](https://github.com/google/cayley) - open-source graph database. 163 | * [Google Pregel](http://kowshik.github.io/JPregel/pregel_paper.pdf) - graph processing framework. 164 | * [GraphLab PowerGraph](http://graphlab.org/projects/source.html) - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API. 165 | * [GraphX](https://amplab.cs.berkeley.edu/publication/graphx-grades/) - resilient Distributed Graph System on Spark. 166 | * [Gremlin](https://github.com/tinkerpop/gremlin) - graph traversal language. 167 | * [Infovore](https://github.com/paulhoule/infovore) - RDF-centric Map/Reduce framework. 168 | * [Intel GraphBuilder](https://01.org/graphbuilder/) - tools to construct large-scale graphs on top of Hadoop. 169 | * [MapGraph](http://mapgraph.io/) - Massively Parallel Graph processing on GPUs. 
170 | * [Neo4j](http://www.neo4j.org/) - graph database written entirely in Java. 171 | * [OrientDB](http://www.orientechnologies.com/) - document and graph database. 172 | * [Phoebus](https://github.com/xslogic/phoebus) - framework for large scale graph processing. 173 | * [Titan](http://thinkaurelius.github.io/titan/) - distributed graph database, built over Cassandra. 174 | * [Twitter FlockDB](https://github.com/twitter/flockdb) - distributed graph database. 175 | 176 | 177 | ## Columnar Databases 178 | 179 | **Note**: please read the note in the [Key-Map Data Model](#key-map-data-model) section. 180 | 181 | * [Columnar Storage](http://the-paper-trail.org/blog/columnar-storage/) - an explanation of what columnar storage is and when you might want it. 182 | * [Actian Vector](http://www.actian.com/) - column-oriented analytic database. 183 | * [C-Store](http://db.lcs.mit.edu/projects/cstore/) - column oriented DBMS. 184 | * [MonetDB](https://www.monetdb.org/) - column store database. 185 | * [Parquet](http://parquet.incubator.apache.org/) - columnar storage format for Hadoop. 186 | * [Pivotal Greenplum](https://www.pivotal.io/big-data/pivotal-greenplum-database) - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one. 187 | * [Vertica](http://www.vertica.com/) - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses. 188 | * [Google BigQuery](https://developers.google.com/bigquery/) - Google's cloud offering backed by their pioneering work on Dremel. 189 | * [Amazon Redshift](http://aws.amazon.com/redshift/) - Amazon's cloud offering, also based on a columnar datastore backend. 190 | 191 | ## NewSQL Databases 192 | 193 | * [Actian Ingres](http://www.actian.com/products/operational-databases/) - commercially supported, open-source SQL relational database management system. 
194 | * [Amazon RedShift](http://aws.amazon.com/redshift/) - data warehouse service, based on PostgreSQL. 195 | * [BayesDB](http://probcomp.csail.mit.edu/bayesdb/index.html) - statistics-oriented SQL database. 196 | * [Cockroach](https://github.com/cockroachdb/cockroach) - Scalable, Geo-Replicated, Transactional Datastore. 197 | * [Datomic](http://www.datomic.com/) - distributed database designed to enable scalable, flexible and intelligent applications. 198 | * [FoundationDB](https://foundationdb.com/) - distributed database, inspired by F1. 199 | * [Google F1](http://research.google.com/pubs/pub41344.html) - distributed SQL database built on Spanner. 200 | * [Google Spanner](http://research.google.com/archive/spanner.html) - globally distributed semi-relational database. 201 | * [H-Store](http://hstore.cs.brown.edu/) - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications. 202 | * [Haeinsa](https://github.com/VCNC/haeinsa) - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator. 203 | * [HandlerSocket](http://www.percona.com/doc/percona-server/5.5/performance/handlersocket.html) - NoSQL plugin for MySQL/MariaDB. 204 | * [InfiniSQL](http://www.infinisql.org/) - infinitely scalable RDBMS. 205 | * [MemSQL](http://www.memsql.com/) - in memory SQL database with optimized columnar storage on flash. 206 | * [NuoDB](http://www.nuodb.com/) - SQL/ACID compliant distributed database. 207 | * [Oracle Database](http://www.oracle.com/us/corporate/features/database-12c/index.html) - object-relational database management system. 208 | * [Oracle TimesTen in-Memory Database](http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html) - in-memory, relational database management system with persistence and recoverability. 
209 | * [Pivotal GemFire XD](http://gemfirexd.docs.gopivotal.com/latest/userguide/index.html?q=about_users_guide.html/) - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS. 210 | * [SAP HANA](http://www.saphana.com/welcome) - is an in-memory, column-oriented, relational database management system. 211 | * [SenseiDB](http://senseidb.com/) - distributed, realtime, semi-structured database. 212 | * [Sky](http://skydb.io/) - database used for flexible, high performance analysis of behavioral data. 213 | * [SymmetricDS](http://www.symmetricds.org/) - open source software for both file and database synchronization. 214 | 215 | ## Time-Series Databases 216 | 217 | * [Cube](http://square.github.io/cube/) - uses MongoDB to store time series data. 218 | * [InfluxDB](http://influxdb.com/) - distributed time series database. 219 | * [Kairosdb](https://code.google.com/p/kairosdb/) - similar to OpenTSDB but allows for Cassandra. 220 | * [OpenTSDB](http://opentsdb.net) - distributed time series database on top of HBase. 221 | 222 | ## SQL-like processing 223 | 224 | * [Actian SQL for Hadoop](http://www.actian.com/products/analytics-platform/) - high performance interactive SQL access to all Hadoop data. 225 | * [AMPLAB Shark](https://github.com/amplab/shark/) - data warehouse system for Spark. 226 | * [Apache Drill](http://incubator.apache.org/drill/) - framework for interactive analysis, inspired by Dremel. 227 | * [Apache HCatalog](http://hive.apache.org/docs/hcat_r0.5.0/) - table and storage management layer for Hadoop. 228 | * [Apache Hive](http://hive.apache.org/) - SQL-like data warehouse system for Hadoop. 229 | * [Apache Optiq](https://wiki.apache.org/incubator/OptiqProposal) - framework that allows efficient translation of queries involving heterogeneous and federated data. 230 | * [Apache Phoenix](http://phoenix.incubator.apache.org/index.html) - SQL skin over HBase. 
231 | * [BlinkDB](http://blinkdb.org/) - massively parallel, approximate query engine. 232 | * [Cloudera Impala](http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html) - framework for interactive analysis, Inspired by Dremel. 233 | * [Concurrent Lingual](http://www.cascading.org/lingual/) - SQL-like query language for Cascading. 234 | * [Datasalt Splout SQL](http://www.datasalt.com/products/splout-sql/) - full SQL query engine for big datasets. 235 | * [Facebook PrestoDB](http://prestodb.io/) - distributed SQL query engine. 236 | * [Google BigQuery](http://research.google.com/pubs/pub36632.html) - framework for interactive analysis, implementation of Dremel. 237 | * [Pivotal HAWQ](http://www.gopivotal.com/pivotal-products/data/pivotal-hd) - SQL-like data warehouse system for Hadoop. 238 | * [RainstorDB](http://rainstor.com/products/rainstor-database/) - database for storing petabyte-scale volumes of structured and semi-structured data. 239 | * [Spark Catalyst](https://github.com/apache/spark/tree/master/sql) - is a Query Optimization Framework for Spark and Shark. 240 | * [SparkSQL](http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html) - Manipulating Structured Data Using Spark. 241 | * [Splice Machine](http://www.splicemachine.com/) - a full-featured SQL-on-Hadoop RDBMS with ACID transactions. 242 | * [Stinger](http://hortonworks.com/labs/stinger/) - interactive query for Hive. 243 | * [Tajo](http://tajo.incubator.apache.org/) - distributed data warehouse system on Hadoop. 244 | * [Trafodion](https://wiki.trafodion.org/wiki/index.php/Main_Page) - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads. 245 | 246 | ## Integrated Development Environments 247 | 248 | * [R-Studio](https://github.com/rstudio/rstudio) - IDE for R. 
249 | 250 | ## Data Ingestion 251 | 252 | * [Amazon Kinesis](http://aws.amazon.com/kinesis/) - real-time processing of streaming data at massive scale. 253 | * [Apache Chukwa](http://incubator.apache.org/chukwa/) - data collection system. 254 | * [Apache Flume](http://flume.apache.org/) - service to manage large amounts of log data. 255 | * [Apache Kafka](http://kafka.apache.org/) - distributed publish-subscribe messaging system. 256 | * [Apache Samza](http://samza.incubator.apache.org/) - stream processing framework, based on Kafka and YARN. 257 | * [Apache Sqoop](http://sqoop.apache.org/) - tool to transfer data between Hadoop and a structured datastore. 258 | * [Cloudera Morphlines](https://github.com/cloudera/cdk/tree/master/cdk-morphlines) - framework that helps ETL to Solr, HBase and HDFS. 259 | * [Facebook Scribe](https://github.com/facebook/scribe) - streamed log data aggregator. 260 | * [Fluentd](http://fluentd.org/) - tool to collect events and logs. 261 | * [Google Photon](http://research.google.com/pubs/pub41318.html) - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency. 262 | * [Heka](https://github.com/mozilla-services/heka) - open source stream processing software system. 263 | * [HIHO](https://github.com/sonalgoyal/hiho) - framework for connecting disparate data sources with Hadoop. 264 | * [Kestrel](http://robey.github.io/kestrel/) - distributed message queue system. 265 | * [LinkedIn Databus](http://data.linkedin.com/projects/databus) - stream of change capture events for a database. 266 | * [LinkedIn Kamikaze](https://github.com/linkedin/kamikaze) - utility package for compressing sorted integer arrays. 267 | * [LinkedIn White Elephant](https://github.com/linkedin/white-elephant) - log aggregator and dashboard. 268 | * [Logstash](http://logstash.net) - a tool for managing events and logs. 
269 | * [Netflix Suro](https://github.com/Netflix/suro) - log aggregator like Storm and Samza based on Chukwa. 270 | * [Pinterest Secor](https://github.com/pinterest/secor) - is a service implementing Kafka log persistence. 271 | 272 | ## Service Programming 273 | 274 | * [Akka Toolkit](http://akka.io/) - runtime for distributed and fault-tolerant event-driven applications on the JVM. 275 | * [Apache Avro](http://avro.apache.org/) - data serialization system. 276 | * [Apache Curator](http://curator.apache.org/) - Java libraries for Apache ZooKeeper. 277 | * [Apache Karaf](http://karaf.apache.org/) - OSGi runtime that runs on top of any OSGi framework. 278 | * [Apache Thrift](http://thrift.apache.org//) - framework to build binary protocols. 279 | * [Apache Zookeeper](http://zookeeper.apache.org/) - centralized service for process management. 280 | * [Google Chubby](http://research.google.com/archive/chubby.html) - a lock service for loosely-coupled distributed systems. 281 | * [LinkedIn Norbert](http://data.linkedin.com/opensource/norbert) - cluster manager. 282 | * [OpenMPI](http://www.open-mpi.org/) - message passing framework. 283 | * [Serf](http://www.serfdom.io/) - decentralized solution for service discovery and orchestration. 284 | * [Spotify Luigi](https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more. 285 | * [Spring XD](https://github.com/spring-projects/spring-xd) - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export. 286 | * [Twitter Elephant Bird](https://github.com/kevinweil/elephant-bird) - libraries for working with LZOP-compressed data. 287 | * [Twitter Finagle](https://twitter.github.io/finagle/) - asynchronous network stack for the JVM. 
288 | 289 | ## Scheduling 290 | 291 | * [Apache Aurora](http://aurora.incubator.apache.org/) - is a service scheduler that runs on top of Apache Mesos. 292 | * [Apache Falcon](http://falcon.incubator.apache.org/) - data management framework. 293 | * [Apache Oozie](http://oozie.apache.org/) - workflow job scheduler. 294 | * [Chronos](http://airbnb.github.io/chronos/) - distributed and fault-tolerant scheduler. 295 | * [Linkedin Azkaban](http://azkaban.github.io/azkaban2/) - batch workflow job scheduler. 296 | * [Sparrow](https://github.com/radlab/sparrow) - scheduling platform. 297 | 298 | ## Machine Learning 299 | 300 | * [Apache Mahout](http://mahout.apache.org/) - machine learning library for Hadoop. 301 | * [brain](https://github.com/harthur/brain) - Neural networks in JavaScript. 302 | * [Cloudera Oryx](https://github.com/cloudera/oryx) - real-time large-scale machine learning. 303 | * [Concurrent Pattern](http://www.cascading.org/pattern/) - machine learning library for Cascading. 304 | * [convnetjs](https://github.com/karpathy/convnetjs) - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser. 305 | * [Decider](https://github.com/danielsdeleo/Decider) - Flexible and Extensible Machine Learning in Ruby. 306 | * [etcML](http://www.etcml.com/) - text classification with machine learning. 307 | * [Etsy Conjecture](https://github.com/etsy/Conjecture) - scalable Machine Learning in Scalding. 308 | * [Google Sibyl](http://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf) - System for Large Scale Machine Learning at Google. 309 | * [H2O](http://0xdata.github.io/h2o/) - statistical, machine learning and math runtime for Hadoop. 310 | * [MLbase](http://www.mlbase.org/) - distributed machine learning libraries for the BDAS stack. 311 | * [MLPNeuralNet](https://github.com/nikolaypavlov/MLPNeuralNet) - Fast multilayer perceptron neural network library for iOS and Mac OS X. 
312 | * [nupic](https://github.com/numenta/nupic) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms. 313 | * [PredictionIO](http://prediction.io/) - machine learning server built on Hadoop, Mahout and Cascading. 314 | * [scikit-learn](https://github.com/scikit-learn/scikit-learn) - scikit-learn: machine learning in Python. 315 | * [Spark MLlib](http://spark.apache.org/docs/0.9.0/mllib-guide.html) - a Spark implementation of some common machine learning (ML) functionality. 316 | * [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki) - learning system sponsored by Microsoft and Yahoo!. 317 | * [WEKA](http://www.cs.waikato.ac.nz/ml/weka/) - suite of machine learning software. 318 | 319 | ## Benchmarking 320 | 321 | * [Apache Hadoop Benchmarking](https://issues.apache.org/jira/browse/MAPREDUCE-3561) - micro-benchmarks for testing Hadoop performance. 322 | * [Berkeley SWIM Benchmark](https://github.com/SWIMProjectUCB/SWIM/wiki) - real-world big data workload benchmark. 323 | * [Intel HiBench](https://github.com/intel-hadoop/HiBench) - a Hadoop benchmark suite. 324 | * [PUMA Benchmarking](https://issues.apache.org/jira/browse/MAPREDUCE-5116) - benchmark suite for MapReduce applications. 325 | * [Yahoo Gridmix3](https://developer.yahoo.com/blogs/hadoop/gridmix3-emulating-production-workload-apache-hadoop-450.html) - Hadoop cluster benchmarking from the Yahoo engineering team. 326 | 327 | ## Security 328 | 329 | * [Apache Knox Gateway](http://knox.apache.org/) - single point of secure access for Hadoop clusters. 330 | * [Apache Sentry](http://incubator.apache.org/projects/sentry.html) - security module for data stored in Hadoop. 331 | 332 | ## System Deployment 333 | 334 | * [Apache Ambari](http://ambari.apache.org/) - operational framework for Hadoop management. 
335 | * [Apache Bigtop](http://bigtop.apache.org//) - system deployment framework for the Hadoop ecosystem. 336 | * [Apache Helix](http://helix.apache.org/) - cluster management framework. 337 | * [Apache Mesos](http://mesos.apache.org/) - cluster manager. 338 | * [Apache Slider](https://github.com/hortonworks/slider) - is a YARN application to deploy existing distributed applications on YARN. 339 | * [Apache Whirr](http://whirr.apache.org/) - set of libraries for running cloud services. 340 | * [Apache YARN](http://hortonworks.com/hadoop/yarn/) - Cluster manager. 341 | * [Brooklyn](http://brooklyncentral.github.io/) - library that simplifies application deployment and management. 342 | * [Buildoop](http://buildoop.github.io/) - Similar to Apache Bigtop, based on the Groovy language. 343 | * [Cloudera HUE](http://gethue.com/) - web application for interacting with Hadoop. 344 | * [Facebook Prism](http://www.wired.com/2012/08/facebook-prism/) - multi-datacenter replication system. 345 | * [Google Borg](http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/) - job scheduling and monitoring system. 346 | * [Google Omega](https://www.youtube.com/watch?v=0ZFMlO98Jkc) - job scheduling and monitoring system. 347 | * [Hortonworks HOYA](http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/) - application that can deploy HBase cluster on YARN. 348 | * [Marathon](https://github.com/mesosphere/marathon) - Mesos framework for long-running services. 349 | 350 | ## Applications 351 | 352 | * [Adobe spindle](https://github.com/adobe-research/spindle) - Next-generation web analytics processing with Scala, Spark, and Parquet. 353 | * [Apache Kiji](http://www.kiji.org/) - framework to collect and analyze data in real-time, based on HBase. 354 | * [Apache Nutch](http://nutch.apache.org/) - open source web crawler. 355 | * [Apache OODT](http://oodt.apache.org/) - capturing, processing and sharing of data for NASA's scientific archives.
356 | * [Apache Tika](https://tika.apache.org/) - content analysis toolkit. 357 | * [Domino](http://www.dominoup.com/) - Run, scale, share, and deploy models — without any infrastructure. 358 | * [Eclipse BIRT](http://www.eclipse.org/birt/) - Eclipse-based reporting system. 359 | * [Eventhub](https://github.com/Codecademy/EventHub) - open source event analytics platform. 360 | * [HIPI Library](http://hipi.cs.virginia.edu/) - API for performing image processing tasks on Hadoop's MapReduce. 361 | * [Hunk](http://www.splunk.com/download/hunk) - Splunk analytics for Hadoop. 362 | * [MADlib](http://madlib.net/community/) - data-processing library of an RDBMS to analyze data. 363 | * [PivotalR](https://github.com/gopivotal/PivotalR) - R on Pivotal HD / HAWQ and PostgreSQL. 364 | * [Qubole](http://www.qubole.com/) - auto-scaling Hadoop cluster, built-in data connectors. 365 | * [Sense](https://senseplatform.com//) - Cloud Platform for Data Science and Big Data Analytics. 366 | * [Snowplow](https://github.com/snowplow/snowplow) - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres. 367 | * [SparkR](http://amplab-extras.github.io/SparkR-pkg/) - R frontend for Spark. 368 | * [Splunk](http://www.splunk.com/) - analyzer for machine-generated data. 369 | * [Talend](http://www.talend.com/products/big-data) - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig. 370 | 371 | ## Search engine and framework 372 | 373 | * [Apache Lucene](http://lucene.apache.org/) - Search engine library. 374 | * [Apache Solr](http://lucene.apache.org/solr/) - Search platform for Apache Lucene. 375 | * [ElasticSearch](http://www.elasticsearch.org/) - Search and analytics engine based on Apache Lucene. 376 | * [Enigma.io](http://enigma.io) – Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
377 | * [Facebook Unicorn](https://www.facebook.com/publications/219621248185635/) - social graph search platform. 378 | * [Google Caffeine](http://googleblog.blogspot.it/2010/06/our-new-search-index-caffeine.html) - continuous indexing system. 379 | * [Google Percolator](http://research.google.com/pubs/pub36726.html) - continuous indexing system. 380 | * [TeraGoogle]() - large search index. 381 | * [HBase Coprocessor](https://blogs.apache.org/hbase/entry/coprocessor_introduction) - implementation of Percolator, part of HBase. 382 | * [Lily HBase Indexer](http://ngdata.github.io/hbase-indexer/) - quickly and easily search for any content stored in HBase. 383 | * [LinkedIn Bobo](http://senseidb.github.io/bobo/) - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene. 384 | * [LinkedIn Cleo](https://github.com/linkedin/cleo) - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search. 385 | * [LinkedIn Galene](http://engineering.linkedin.com/search/did-you-mean-galene) - search architecture at LinkedIn. 386 | * [LinkedIn Zoie](https://github.com/senseidb/zoie) - is a realtime search/indexing system written in Java. 387 | * [Sphinx Search Server](http://sphinxsearch.com/) - fulltext search engine. 388 | 389 | ## MySQL forks and evolutions 390 | 391 | * [Amazon RDS](http://aws.amazon.com/rds/) - MySQL databases in Amazon's cloud. 392 | * [Drizzle](http://www.drizzle.org/) - evolution of MySQL 6.0. 393 | * [Google Cloud SQL](https://developers.google.com/cloud-sql/) - MySQL databases in Google's cloud. 394 | * [MariaDB](https://mariadb.org/) - enhanced, drop-in replacement for MySQL. 395 | * [MySQL Cluster](http://www.mysql.com/products/cluster/) - MySQL implementation using NDB Cluster storage engine. 396 | * [Percona Server](http://www.percona.com/software/percona-server) - enhanced, drop-in replacement for MySQL.
397 | * [ProxySQL](https://github.com/renecannao/proxysql) - High Performance Proxy for MySQL. 398 | * [TokuDB](http://www.tokutek.com/products/tokudb-for-mysql/) - TokuDB is a storage engine for MySQL and MariaDB. 399 | * [WebScaleSQL](http://webscalesql.org/) - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale. 400 | 401 | ## PostgreSQL forks and evolutions 402 | 403 | * [HadoopDB](http://db.cs.yale.edu/hadoopdb/hadoopdb.html) - hybrid of MapReduce and DBMS. 404 | * [IBM Netezza](http://www-01.ibm.com/software/data/netezza/) - high-performance data warehouse appliances. 405 | * [Postgres-XL](http://www.postgres-xl.org/) - Scalable Open Source PostgreSQL-based Database Cluster. 406 | * [RecDB](http://www-users.cs.umn.edu/~sarwat/RecDB/) - Open Source Recommendation Engine Built Entirely Inside PostgreSQL. 407 | * [Stado](http://www.stormdb.com/community/stado) - open source MPP database system solely targeted at data warehousing and data mart applications. 408 | * [Yahoo Everest](http://www.scribd.com/doc/3159239/70-Everest-PGCon-RT) - multi-petabyte database / MPP derived from PostgreSQL. 409 | 410 | ## Memcached forks and evolutions 411 | 412 | * [Facebook McDipper](https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920) - key/value cache for flash storage. 413 | * [Facebook Memcached](https://www.facebook.com/notes/facebook-engineering/scaling-memcache-at-facebook/10151411410803920) - fork of Memcache. 414 | * [Twemproxy](https://github.com/twitter/twemproxy) - A fast, light-weight proxy for memcached and redis. 415 | * [Twitter Fatcache](https://github.com/twitter/fatcache) - key/value cache for flash storage. 416 | * [Twitter Twemcache](https://github.com/twitter/twemcache) - fork of Memcache.
417 | 418 | ## Embedded Databases 419 | 420 | * [Actian PSQL](http://www.actian.com/products/operational-databases/) - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications. 421 | * [BerkeleyDB](http://www.oracle.com/us/products/database/berkeley-db/overview/index.html) - a software library that provides a high-performance embedded database for key/value data. 422 | * [HanoiDB](https://github.com/krestenkrab/hanoidb) - Erlang LSM BTree Storage. 423 | * [LevelDB](https://code.google.com/p/leveldb/) - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. 424 | * [LMDB](http://symas.com/mdb/) - ultra-fast, ultra-compact key-value embedded data store developed by Symas. 425 | * [RocksDB](http://rocksdb.org/) - embeddable persistent key-value store for fast storage based on LevelDB. 426 | 427 | ## Business Intelligence 428 | 429 | * [BIME Analytics](http://www.bimeanalytics.com/?lang=en) - business intelligence platform in the cloud. 430 | * [Chartio](https://chartio.com) - lean business intelligence platform to visualize and explore your data. 431 | * [datapine](http://www.datapine.com/) - self-service business intelligence tool in the cloud. 432 | * [Jaspersoft](https://www.jaspersoft.com/) - powerful business intelligence suite. 433 | * [Jedox Palo](http://www.jedox.com/) - customisable Business Intelligence platform. 434 | * [Microsoft](http://www.microsoft.com/en-us/server-cloud/solutions/business-intelligence/default.aspx) - business intelligence software and platform. 435 | * [Microstrategy](http://www.microstrategy.com/) - software platforms for business intelligence, mobile intelligence, and network applications. 436 | * [Pentaho](http://www.pentaho.com/) - business intelligence platform. 437 | * [Qlik](http://www.qlik.com/) - business intelligence and analytics platform. 
438 | * [SpagoBI](http://www.spagoworld.org/xwiki/bin/view/SpagoBI/) - open source business intelligence platform. 439 | * [Tableau](https://www.tableausoftware.com/) - business intelligence platform. 440 | * [Zoomdata](http://www.zoomdata.com/) - Big Data Analytics. 441 | 442 | ## Data Visualization 443 | 444 | * [Arbor](https://github.com/samizdatco/arbor) - graph visualization library using web workers and jQuery. 445 | * [CartoDB](https://github.com/CartoDB/cartodb) - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API. 446 | * [Chart.js](http://www.chartjs.org/) - open source HTML5 Charts visualizations. 447 | * [Chartist.js](https://github.com/gionkunz/chartist-js) - another open source HTML5 Charts visualization. 448 | * [Crossfilter](http://square.github.io/crossfilter/) - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js. 449 | * [Cubism](https://github.com/square/cubism) - JavaScript library for time series visualization. 450 | * [Cytoscape](http://cytoscape.github.io/) - JavaScript library for visualizing complex networks. 451 | * [DC.js](http://dc-js.github.io/dc.js/) - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3. 452 | * [D3](http://d3js.org/) - JavaScript library for manipulating documents. 453 | * [Envisionjs](https://github.com/HumbleSoftware/envisionjs) - dynamic HTML5 visualization. 454 | * [Freeboard](https://github.com/Freeboard/freeboard) - open source real-time dashboard builder for IoT and other web mashups. 455 | * [Gephi](https://github.com/gephi/gephi) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
456 | * [Google Charts](https://developers.google.com/chart/) - simple charting API. 457 | * [Grafana](http://grafana.org/) - graphite dashboard frontend, editor and graph composer. 458 | * [Graphite](http://graphite.wikidot.com/) - scalable Realtime Graphing. 459 | * [Highcharts](http://www.highcharts.com/) - simple and flexible charting API. 460 | * [IPython](http://ipython.org/) - provides a rich architecture for interactive computing. 461 | * [Matplotlib](https://github.com/matplotlib/matplotlib) - plotting with Python. 462 | * [NVD3](http://nvd3.org/) - chart components for d3.js. 463 | * [Peity](https://github.com/benpickles/peity) - Progressive SVG bar, line and pie charts. 464 | * [Plot.ly](http://plot.ly) - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots. 465 | * [Recline](https://github.com/okfn/recline) - simple but powerful library for building data applications in pure Javascript and HTML. 466 | * [Redash](https://github.com/everythingme/redash) - open-source platform to query and visualize data. 467 | * [Sigma.js](https://github.com/jacomyal/sigma.js) - JavaScript library dedicated to graph drawing. 468 | * [Vega](https://github.com/trifacta/vega) - a visualization grammar. 469 | 470 | ## Internet of things and sensor data 471 | 472 | * [TempoIQ](https://tempoiq.com/) - Cloud-based sensor analytics. 473 | 474 | ## Interesting Readings 475 | 476 | * [Big Data Benchmark](https://amplab.cs.berkeley.edu/benchmark/) - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez. 477 | * [NoSQL Comparison](http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
478 | 479 | ## Interesting Papers 480 | 481 | ### 2013 - 2014 482 | * [2014](http://infolab.stanford.edu/~ullman/mmds/book.pdf) - **Stanford** - Mining of Massive Datasets. 483 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/03/eurosys13-paper83.pdf) - **AMPLab** - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. 484 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf) - **AMPLab** - MLbase: A Distributed Machine-learning System. 485 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf) - **AMPLab** - Shark: SQL and Rich Analytics at Scale. 486 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf) - **AMPLab** - GraphX: A Resilient Distributed Graph System on Spark. 487 | * [2013](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/40671.pdf) - **Google** - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. 488 | * [2013](http://research.microsoft.com/pubs/200169/now-vldb.pdf) - **Microsoft** - Scalable Progressive Analytics on Big Data in the Cloud. 489 | * [2013](http://static.druid.io/docs/druid.pdf) - **Metamarkets** - Druid: A Real-time Analytical Data Store. 490 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf) - **Google** - Online, Asynchronous Schema Change in F1. 491 | * [2013](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf) - **Google** - F1: A Distributed SQL Database That Scales. 492 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p734-akidau.pdf) - **Google** - MillWheel: Fault-Tolerant Stream Processing at Internet Scale. 493 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p767-wiener.pdf) - **Facebook** - Scuba: Diving into Data at Facebook. 
494 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p871-curtiss.pdf) - **Facebook** - Unicorn: A System for Searching the Social Graph. 495 | * [2013](https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf) - **Facebook** - Scaling Memcache at Facebook. 496 | 497 | ### 2011 - 2012 498 | 499 | * [2012](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) - **Twitter** - The Unified Logging Infrastructure 500 | for Data Analytics at Twitter. 501 | * [2012](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/04/blinkdb_vldb12_demo.pdf) - **AMPLab** - Blink and It’s Done: Interactive Queries on Very Large Data. 502 | * [2012](https://www.usenix.org/system/files/login/articles/zaharia.pdf) - **AMPLab** - Fast and Interactive Analytics over Hadoop Data with Spark. 503 | * [2012](https://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf) - **AMPLab** - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory. 504 | * [2012](https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Bolosky.pdf) - **Microsoft** - Paxos Replicated State Machines as the Basis of a High-Performance Data Store. 505 | * [2012](http://research.microsoft.com/pubs/178045/ppaoxs-paper29.pdf) - **Microsoft** - Paxos Made Parallel. 506 | * [2012](http://arxiv.org/pdf/1203.5485.pdf) - **AMPLab** - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. 507 | * [2012](http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf) - **Google** - Processing a trillion cells per mouse click. 508 | * [2012](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf) - **Google** - Spanner: Google’s Globally-Distributed Database. 509 | * [2011](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf) - **AMPLab** - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters. 
510 | * [2011](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) - **AMPLab** - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. 511 | * [2011](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36971.pdf) - **Google** - Megastore: Providing Scalable, Highly Available Storage for Interactive Services. 512 | 513 | ### 2001 - 2010 514 | 515 | * [2010](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf) - **Facebook** - Finding a needle in Haystack: Facebook’s photo storage. 516 | * [2010](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf) - **AMPLab** - Spark: Cluster Computing with Working Sets. 517 | * [2010](http://static.googleusercontent.com/media/research.google.com/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf) - **Google** - Storage Architecture and Challenges. 518 | * [2010](http://kowshik.github.io/JPregel/pregel_paper.pdf) - **Google** - Pregel: A System for Large-Scale Graph Processing. 519 | * [2010](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36726.pdf) - **Google** - Large-scale Incremental Processing Using Distributed Transactions and Notifications base of Percolator and Caffeine. 520 | * [2010](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf) - **Google** - Dremel: Interactive Analysis of Web-Scale Datasets. 521 | * [2010](http://www.4lunas.org/pub/2010-s4.pdf) - **Yahoo** - S4: Distributed Stream Computing Platform. 522 | * [2009](http://www.vldb.org/pvldb/2/vldb09-861.pdf) - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. 
523 | * [2008](http://www.cca08.org/papers/Paper-13-Ariel-Rabkin.pdf) - **AMPLab** - Chukwa: A large-scale monitoring system. 524 | * [2007](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf) - **Amazon** - Dynamo: Amazon’s Highly Available Key-value Store. 525 | * [2006](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/chubby-osdi06.pdf) - **Google** - The Chubby lock service for loosely-coupled distributed systems. 526 | * [2006](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) - **Google** - Bigtable: A Distributed Storage System for Structured Data. 527 | * [2004](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf) - **Google** - MapReduce: Simplified Data Processing on Large Clusters. 528 | * [2003](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/gfs-sosp2003.pdf) - **Google** - The Google File System. 529 | 530 | # Other Awesome Lists 531 | - Other awesome lists [awesome-awesomeness](https://github.com/bayandin/awesome-awesomeness). 532 | - Even more lists [awesome](https://github.com/sindresorhus/awesome). 533 | - Another list? [list](https://github.com/jnv/lists). 534 | - WTF! [awesome-awesome-awesome](https://github.com/t3chnoboy/awesome-awesome-awesome). 535 | - Analytics [awesome-analytics](https://github.com/onurakpolat/awesome-analytics). 536 | --------------------------------------------------------------------------------