├── LICENSE
└── README.md


/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2014 Onur Akpolat
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Awesome Big Data
  2 | 
  3 | A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by [awesome-php](https://github.com/ziadoz/awesome-php), [awesome-python](https://github.com/vinta/awesome-python), [awesome-ruby](https://github.com/Sdogruyol/awesome-ruby), [hadoopecosystemtable](http://hadoopecosystemtable.github.io/) & [big-data](http://blog.andreamostosi.name/big-data/).
  4 | 
  5 | Your contributions are always welcome!
  6 | 
  7 | - [Awesome Big Data](#awesome-big-data)
  8 |     - [Frameworks](#frameworks)
  9 |     - [Distributed Programming](#distributed-programming)
 10 |     - [Distributed Filesystem](#distributed-filesystem)
 11 |     - [Key-Map Data Model](#key-map-data-model)
 12 |     - [Document Data Model](#document-data-model)
 13 |     - [Key-value Data Model](#key-value-data-model)
 14 |     - [Graph Data Model](#graph-data-model)
 15 |     - [NewSQL Databases](#newsql-databases)
 16 |     - [Columnar Databases](#columnar-databases)
 17 |     - [Time-Series Databases](#time-series-databases)
 18 |     - [SQL-like processing](#sql-like-processing)
 19 |     - [Integrated Development Environments](#integrated-development-environments)
 20 |     - [Data Ingestion](#data-ingestion)
 21 |     - [Service Programming](#service-programming)
 22 |     - [Scheduling](#scheduling)
 23 |     - [Machine Learning](#machine-learning)
 24 |     - [Benchmarking](#benchmarking)
 25 |     - [Security](#security)
 26 |     - [System Deployment](#system-deployment)
 27 |     - [Applications](#applications)
 28 |     - [Search engine and framework](#search-engine-and-framework)
 29 |     - [MySQL forks and evolutions](#mysql-forks-and-evolutions)
 30 |     - [PostgreSQL forks and evolutions](#postgresql-forks-and-evolutions)
 31 |     - [Memcached forks and evolutions](#memcached-forks-and-evolutions)
 32 |     - [Embedded Databases](#embedded-databases)
 33 |     - [Business Intelligence](#business-intelligence)
 34 |     - [Data Visualization](#data-visualization)
 35 |     - [Internet of things and sensor data](#internet-of-things-and-sensor-data)
 36 |     - [Interesting Readings](#interesting-readings)
 37 |     - [Interesting Papers](#interesting-papers)
 38 | - [Other Awesome Lists](#other-awesome-lists)
 39 | 
 40 | ## Frameworks
 41 | 
 42 | * [Apache Hadoop](http://hadoop.apache.org/) - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
 43 | 
 44 | ## Distributed Programming
 45 | 
 46 | * [AddThis Hydra](https://github.com/addthis/hydra) - distributed data processing and storage system originally developed at AddThis.
 47 | * [AMPLab SIMR](http://databricks.github.io/simr/) - run Spark on Hadoop MapReduce v1.
 48 | * [Apache Crunch](http://crunch.apache.org/) - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
 49 | * [Apache DataFu](http://incubator.apache.org/projects/datafu.html) - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
 50 | * [Apache Flink](http://flink.incubator.apache.org/) - high-performance runtime with automatic program optimization.
 51 | * [Apache Gora](http://gora.apache.org/) - framework for in-memory data model and persistence.
 52 | * [Apache Hama](http://hama.apache.org/) - BSP (Bulk Synchronous Parallel) computing framework.
 53 | * [Apache MapReduce](http://wiki.apache.org/hadoop/MapReduce/) - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
 54 | * [Apache Pig](https://pig.apache.org/) - high level language to express data analysis programs for Hadoop.
 55 | * [Apache S4](http://incubator.apache.org/s4/) - framework for stream processing, implementation of S4.
 56 | * [Apache Spark](http://spark.incubator.apache.org/) - framework for in-memory cluster computing.
 57 | * [Apache Spark Streaming](http://spark.incubator.apache.org/docs/0.7.3/streaming-programming-guide.html) - framework for stream processing, part of Spark.
 58 | * [Apache Storm](http://storm-project.net/) - framework for stream processing by Twitter, also runs on YARN.
 59 | * [Apache Tez](http://tez.incubator.apache.org/) - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
 60 | * [Apache Twill](https://incubator.apache.org/projects/twill.html) - abstraction over YARN that reduces the complexity of developing distributed applications.
 61 | * [Cascalog](http://cascalog.org/) - data processing and querying library.
 62 | * [Cheetah](http://vldbarc.org/pvldb/vldb2010/pvldb_vol3/I08.pdf) - High Performance, Custom Data Warehouse on Top of MapReduce.
 63 | * [Concurrent Cascading](http://www.cascading.org/) - framework for data management/analytics on Hadoop.
 64 | * [Damballa Parkour](https://github.com/damballa/parkour) - MapReduce library for Clojure.
 65 | * [Datasalt Pangool](https://github.com/datasalt/pangool) - alternative MapReduce paradigm.
 66 | * [DataTorrent StrAM](https://www.datatorrent.com/) - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
 67 | * [Facebook Corona](https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920) - Hadoop enhancement which removes single point of failure.
 68 | * [Facebook Peregrine](http://peregrine_mapreduce.bitbucket.org/) - MapReduce framework.
 69 | * [Facebook Scuba](https://www.facebook.com/notes/facebook-engineering/under-the-hood-data-diving-with-scuba/10150599692628920) - distributed in-memory datastore.
 70 | * [Google Dataflow](http://googledevelopers.blogspot.it/2014/06/cloud-platform-at-google-io-new-big.html) - create data pipelines to help ingest, transform and analyze data.
 71 | * [Google MapReduce](http://research.google.com/archive/mapreduce.html) - MapReduce framework.
 72 | * [Google MillWheel](http://research.google.com/pubs/pub41378.html) - fault tolerant stream processing framework.
 73 | * [JAQL](https://code.google.com/p/jaql/) - declarative programming language for working with structured, semi-structured and unstructured data.
 74 | * [Kite](http://kitesdk.org/docs/current/) - set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
 75 | * [Metamarkets Druid](http://druid.io/) - framework for real-time analysis of large datasets.
 76 | * [Netflix PigPen](https://github.com/Netflix/PigPen) - MapReduce for Clojure which compiles to Apache Pig.
 77 | * [Nokia Disco](http://discoproject.org/) - MapReduce framework developed by Nokia.
 78 | * [Pinterest Pinlater](http://engineering.pinterest.com/post/91288882494/pinlater-an-asynchronous-job-execution-system) - asynchronous job execution system.
 79 | * [Pydoop](http://pydoop.sourceforge.net/docs/) - Python MapReduce and HDFS API for Hadoop.
 80 | * [Stratosphere](http://stratosphere.eu/) - general purpose cluster computing framework.
 81 | * [Streamdrill](https://streamdrill.com/) - useful for counting activities of event streams over different time windows and finding the most active one.
 82 | * [Twitter Scalding](https://github.com/twitter/scalding) - Scala library for MapReduce jobs, built on Cascading.
 83 | * [Twitter Summingbird](https://github.com/twitter/summingbird) - Streaming MapReduce with Scalding and Storm, by Twitter.
 84 | * [Twitter TSAR](https://blog.twitter.com/2014/tsar-a-timeseries-aggregator) - TimeSeries AggregatoR by Twitter.
 85 | 
 86 | ## Distributed Filesystem
 87 | 
 88 | * [Apache HDFS](http://hadoop.apache.org/) - a way to store large files across multiple machines.
 89 | * [BeeGFS](http://www.fhgfs.com/cms/) - formerly FhGFS, parallel distributed file system.
 90 | * [Ceph Filesystem](http://ceph.com/ceph-storage/file-system/) - software storage platform.
 91 | * [Disco DDFS](http://disco.readthedocs.org/en/latest/howto/ddfs.html) - distributed filesystem.
 92 | * [Facebook Haystack](https://www.facebook.com/note.php?note_id=76191543919) - object storage system.
 93 | * [Google Colossus](https://google.com/) - distributed filesystem (GFS2).
 94 | * [Google GFS](https://google.com/) - distributed filesystem.
 95 | * [Google Megastore](http://research.google.com/pubs/pub36971.html) - scalable, highly available storage.
 96 | * [GridGain](http://www.gridgain.org/) - GGFS, Hadoop compliant in-memory file system.
 97 | * [Lustre file system](http://wiki.lustre.org/) - high-performance distributed filesystem.
 98 | * [Quantcast File System QFS](https://www.quantcast.com/engineering/qfs/) - open-source distributed file system.
 99 | * [Red Hat GlusterFS](http://www.gluster.org/) - scale-out network-attached storage file system.
100 | * [Tachyon](http://tachyon-project.org/) - reliable file sharing at memory speed across cluster frameworks.
101 | 
102 | ## Document Data Model
103 | 
104 | * [Actian Versant](http://www.actian.com/products/operational-databases/) - commercial object-oriented database management system.
105 | * [Crate Data](https://crate.io/) - open source, massively scalable data store that requires zero administration.
106 | * [Facebook Apollo](http://www.infoq.com/news/2014/06/facebook-apollo) - Facebook’s Paxos-like NoSQL database.
107 | * [jumboDB](http://comsysto.github.io/jumbodb/) - document oriented datastore over Hadoop.
108 | * [LinkedIn Espresso](http://data.linkedin.com/projects/espresso) - horizontally scalable document-oriented NoSQL data store.
109 | * [MarkLogic](http://www.marklogic.com/) - Schema-agnostic Enterprise NoSQL database technology.
110 | * [MongoDB](http://www.mongodb.org/) - Document-oriented database system.
111 | * [RavenDB](http://www.ravendb.net/) - A transactional, open-source Document Database.
112 | * [RethinkDB](http://www.rethinkdb.com/) - document database that supports queries like table joins and group by.
113 | 
114 | ## Key-Map Data Model
115 | 
116 | **Note**: There is some term confusion in the industry, and two different things are called "Columnar Databases". Some, listed here, are distributed, persistent databases built around the "key-map" data model: all data has a (possibly composite) key, with which a map of key-value pairs is associated. In some systems, multiple such value maps can be associated with a key, and these maps are referred to as "column families" (with value map keys being referred to as "columns").
117 | 
118 | Another group of technologies that can also be called "columnar databases" is distinguished by how it stores data, on disk or in memory -- rather than storing data the traditional way, where all column values for a given key are stored next to each other, "row by row", these systems store all *column* values next to each other. So more work is needed to get all columns for a given key, but less work is needed to get all values for a given column.
119 | 
120 | The former group is referred to as "key map data model" here. The line between these and the [Key-value Data Model](#key-value-data-model) stores is fairly blurry.
121 |  
122 | The latter, being more about the storage format than about the data model, is listed under [Columnar Databases](#columnar-databases).
123 | 
124 | You can read more about this distinction on Prof. Daniel Abadi's blog: [Distinguishing two major types of Column Stores](http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html). 
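
The two meanings of "columnar" described above can be illustrated with a minimal Python sketch. It is not taken from any of the listed systems; the records, the `profile` column family, and the helper names `get_row` and `column_sum` are all hypothetical and chosen only for illustration.

```python
# Hypothetical records; keys and column names are illustrative only.
rows = [
    {"key": "user1", "name": "Ada", "age": 36},
    {"key": "user2", "name": "Alan", "age": 41},
]

# Key-map data model: every key is associated with one or more
# "column families", each of which is a map of column name -> value.
key_map_store = {
    "user1": {"profile": {"name": "Ada", "age": 36}},
    "user2": {"profile": {"name": "Alan", "age": 41}},
}

# Row-oriented storage: all column values for a key sit next to each other.
row_layout = [(r["key"], r["name"], r["age"]) for r in rows]

# Column-oriented storage: all values of one column sit next to each other.
column_layout = {col: [r[col] for r in rows] for col in ("key", "name", "age")}


def get_row(layout, index):
    """Rebuild one record from the column layout (touches every column)."""
    return {col: values[index] for col, values in layout.items()}


def column_sum(layout, col):
    """Aggregate a single column (touches only that column's values)."""
    return sum(layout[col])


print(key_map_store["user1"]["profile"])
print(get_row(column_layout, 1))
print(column_sum(column_layout, "age"))
```

In this sketch, rebuilding a full record from `column_layout` has to visit every column, while aggregating `age` reads a single contiguous list. That is the trade-off the note describes for storage-oriented column stores, and it is independent of the key-map data model shown in `key_map_store`.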
125 | 
126 | * [Apache Accumulo](http://accumulo.apache.org/) - distributed key/value store, built on Hadoop.
127 | * [Apache Cassandra](http://cassandra.apache.org/) - column-oriented distributed datastore, inspired by BigTable.
128 | * [Apache HBase](http://hbase.apache.org/) - column-oriented distributed datastore, inspired by BigTable.
129 | * [Facebook HydraBase](https://code.facebook.com/posts/321111638043166/hydrabase-the-evolution-of-hbase-facebook/) - evolution of HBase made by Facebook.
130 | * [Google BigTable](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) - column-oriented distributed datastore.
131 | * [Google Cloud Datastore](https://developers.google.com/datastore/) - fully managed, schemaless database for storing non-relational data over BigTable.
132 | * [Hypertable](http://hypertable.org/) - column-oriented distributed datastore, inspired by BigTable.
133 | * [InfiniDB](http://infinidb.co/) - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
134 | * [OhmData C5](http://ohmdata.com/) - improved version of HBase.
135 | * [Tephra](https://github.com/continuuity/tephra) - Transactions for HBase.
136 | * [Twitter Manhattan](https://blog.twitter.com/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale) - real-time, multi-tenant distributed database for Twitter scale.
137 | 
138 | 
139 | ## Key-value Data Model
140 | 
141 | * [Aerospike](http://www.aerospike.com/) - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
142 | * [Amazon DynamoDB](http://aws.amazon.com/dynamodb/) - distributed key/value store, implementation of Dynamo paper.
143 | * [Edis](http://inaka.github.io/edis/) - protocol-compatible server replacement for Redis.
144 | * [ElephantDB](https://github.com/nathanmarz/elephantdb) - Distributed database specialized in exporting data from Hadoop.
145 | * [EventStore](http://geteventstore.com) - distributed time series database.
146 | * [LinkedIn Krati](https://github.com/linkedin-sna/sna-page/tree/master/krati) - simple persistent data store with very low latency and high throughput.
147 | * [LinkedIn Voldemort](http://www.project-voldemort.com/voldemort/) - distributed key/value storage system.
148 | * [Oracle NoSQL Database](http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html) - distributed key-value database by Oracle Corporation.
149 | * [Redis](http://redis.io) - in memory key value datastore.
150 | * [Riak](https://github.com/basho/riak) - a decentralized datastore.
151 | * [Storehaus](https://github.com/twitter/storehaus) - library to work with asynchronous key value stores, by Twitter.
152 | * [Tarantool](https://github.com/tarantool/tarantool) - an efficient NoSQL database and a Lua application server.
153 | * [TreodeDB](https://github.com/Treode/store) - key-value store that's replicated and sharded and provides atomic multirow writes.
154 | 
155 | 
156 | ## Graph Data Model
157 | 
158 | * [Apache Giraph](http://giraph.apache.org/) - implementation of Pregel, based on Hadoop.
159 | * [Apache Spark Bagel](http://spark.incubator.apache.org/docs/0.7.3/bagel-programming-guide.html) - implementation of Pregel, part of Spark.
160 | * [ArangoDB](https://www.arangodb.org/) - multi-model distributed database.
161 | * [Facebook TAO](https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920) - distributed data store widely used at Facebook to store and serve the social graph.
162 | * [Google Cayley](https://github.com/google/cayley) - open-source graph database.
163 | * [Google Pregel](http://kowshik.github.io/JPregel/pregel_paper.pdf) - graph processing framework.
164 | * [GraphLab PowerGraph](http://graphlab.org/projects/source.html) - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
165 | * [GraphX](https://amplab.cs.berkeley.edu/publication/graphx-grades/) - resilient Distributed Graph System on Spark.
166 | * [Gremlin](https://github.com/tinkerpop/gremlin) - graph traversal language.
167 | * [Infovore](https://github.com/paulhoule/infovore) - RDF-centric Map/Reduce framework.
168 | * [Intel GraphBuilder](https://01.org/graphbuilder/) - tools to construct large-scale graphs on top of Hadoop.
169 | * [MapGraph](http://mapgraph.io/) - Massively Parallel Graph processing on GPUs.
170 | * [Neo4j](http://www.neo4j.org/) - graph database written entirely in Java.
171 | * [OrientDB](http://www.orientechnologies.com/) - document and graph database.
172 | * [Phoebus](https://github.com/xslogic/phoebus) - framework for large scale graph processing.
173 | * [Titan](http://thinkaurelius.github.io/titan/) - distributed graph database, built over Cassandra.
174 | * [Twitter FlockDB](https://github.com/twitter/flockdb) - distributed graph database.
175 | 
176 | 
177 | ## Columnar Databases
178 | 
179 | **Note**: please read the note in the [Key-Map Data Model](#key-map-data-model) section.
180 | 
181 | * [Columnar Storage](http://the-paper-trail.org/blog/columnar-storage/) - an explanation of what columnar storage is and when you might want it.
182 | * [Actian Vector](http://www.actian.com/) - column-oriented analytic database.
183 | * [C-Store](http://db.lcs.mit.edu/projects/cstore/) - column oriented DBMS.
184 | * [MonetDB](https://www.monetdb.org/) - column store database.
185 | * [Parquet](http://parquet.incubator.apache.org/) - columnar storage format for Hadoop.
186 | * [Pivotal Greenplum](https://www.pivotal.io/big-data/pivotal-greenplum-database) - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
187 | * [Vertica](http://www.vertica.com/) - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
188 | * [Google BigQuery](https://developers.google.com/bigquery/) - Google's cloud offering backed by their pioneering work on Dremel.
189 | * [Amazon Redshift](http://aws.amazon.com/redshift/) - Amazon's cloud offering, also based on a columnar datastore backend.
190 | 
191 | ## NewSQL Databases
192 | 
193 | * [Actian Ingres](http://www.actian.com/products/operational-databases/) - commercially supported, open-source SQL relational database management system.
194 | * [Amazon Redshift](http://aws.amazon.com/redshift/) - data warehouse service, based on PostgreSQL.
195 | * [BayesDB](http://probcomp.csail.mit.edu/bayesdb/index.html) - statistic oriented SQL database.
196 | * [Cockroach](https://github.com/cockroachdb/cockroach) - Scalable, Geo-Replicated, Transactional Datastore.
197 | * [Datomic](http://www.datomic.com/) - distributed database designed to enable scalable, flexible and intelligent applications.
198 | * [FoundationDB](https://foundationdb.com/) - distributed database, inspired by F1.
199 | * [Google F1](http://research.google.com/pubs/pub41344.html) - distributed SQL database built on Spanner.
200 | * [Google Spanner](http://research.google.com/archive/spanner.html) - globally distributed semi-relational database.
201 | * [H-Store](http://hstore.cs.brown.edu/) - experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications.
202 | * [Haeinsa](https://github.com/VCNC/haeinsa) - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
203 | * [HandlerSocket](http://www.percona.com/doc/percona-server/5.5/performance/handlersocket.html) - NoSQL plugin for MySQL/MariaDB.
204 | * [InfiniSQL](http://www.infinisql.org/) - infinitely scalable RDBMS.
205 | * [MemSQL](http://www.memsql.com/) - in-memory SQL database with optimized columnar storage on flash.
206 | * [NuoDB](http://www.nuodb.com/) - SQL/ACID compliant distributed database.
207 | * [Oracle Database](http://www.oracle.com/us/corporate/features/database-12c/index.html) - object-relational database management system.
208 | * [Oracle TimesTen in-Memory Database](http://www.oracle.com/technetwork/database/database-technologies/timesten/overview/index.html) - in-memory, relational database management system with persistence and recoverability.
209 | * [Pivotal GemFire XD](http://gemfirexd.docs.gopivotal.com/latest/userguide/index.html?q=about_users_guide.html/) - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
210 | * [SAP HANA](http://www.saphana.com/welcome) - in-memory, column-oriented, relational database management system.
211 | * [SenseiDB](http://senseidb.com/) - distributed, realtime, semi-structured database.
212 | * [Sky](http://skydb.io/) - database used for flexible, high performance analysis of behavioral data.
213 | * [SymmetricDS](http://www.symmetricds.org/) - open source software for both file and database synchronization.
214 | 
215 | ## Time-Series Databases
216 | 
217 | * [Cube](http://square.github.io/cube/) - uses MongoDB to store time series data.
218 | * [InfluxDB](http://influxdb.com/) - distributed time series database.
219 | * [KairosDB](https://code.google.com/p/kairosdb/) - similar to OpenTSDB but allows for Cassandra.
220 | * [OpenTSDB](http://opentsdb.net) - distributed time series database on top of HBase.
221 | 
222 | ## SQL-like processing
223 | 
224 | * [Actian SQL for Hadoop](http://www.actian.com/products/analytics-platform/) - high performance interactive SQL access to all Hadoop data.
225 | * [AMPLab Shark](https://github.com/amplab/shark/) - data warehouse system for Spark.
226 | * [Apache Drill](http://incubator.apache.org/drill/) - framework for interactive analysis, inspired by Dremel.
227 | * [Apache HCatalog](http://hive.apache.org/docs/hcat_r0.5.0/) - table and storage management layer for Hadoop.
228 | * [Apache Hive](http://hive.apache.org/) - SQL-like data warehouse system for Hadoop.
229 | * [Apache Optiq](https://wiki.apache.org/incubator/OptiqProposal) - framework that allows efficient translation of queries involving heterogeneous and federated data.
230 | * [Apache Phoenix](http://phoenix.incubator.apache.org/index.html) - SQL skin over HBase.
231 | * [BlinkDB](http://blinkdb.org/) - massively parallel, approximate query engine.
232 | * [Cloudera Impala](http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html) - framework for interactive analysis, inspired by Dremel.
233 | * [Concurrent Lingual](http://www.cascading.org/lingual/) - SQL-like query language for Cascading.
234 | * [Datasalt Splout SQL](http://www.datasalt.com/products/splout-sql/) - full SQL query engine for big datasets.
235 | * [Facebook PrestoDB](http://prestodb.io/) - distributed SQL query engine.
236 | * [Google BigQuery](http://research.google.com/pubs/pub36632.html) - framework for interactive analysis, implementation of Dremel.
237 | * [Pivotal HAWQ](http://www.gopivotal.com/pivotal-products/data/pivotal-hd) - SQL-like data warehouse system for Hadoop.
238 | * [RainstorDB](http://rainstor.com/products/rainstor-database/) - database for storing petabyte-scale volumes of structured and semi-structured data.
239 | * [Spark Catalyst](https://github.com/apache/spark/tree/master/sql) - query optimization framework for Spark and Shark.
240 | * [SparkSQL](http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html) - Manipulating Structured Data Using Spark.
241 | * [Splice Machine](http://www.splicemachine.com/) - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
242 | * [Stinger](http://hortonworks.com/labs/stinger/) - interactive query for Hive.
243 | * [Tajo](http://tajo.incubator.apache.org/) - distributed data warehouse system on Hadoop.
244 | * [Trafodion](https://wiki.trafodion.org/wiki/index.php/Main_Page) - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
245 | 
246 | ## Integrated Development Environments
247 | 
248 | * [RStudio](https://github.com/rstudio/rstudio) - IDE for R.
249 | 
250 | ## Data Ingestion
251 | 
252 | * [Amazon Kinesis](http://aws.amazon.com/kinesis/) - real-time processing of streaming data at massive scale.
253 | * [Apache Chukwa](http://incubator.apache.org/chukwa/) - data collection system.
254 | * [Apache Flume](http://flume.apache.org/) - service to manage large amounts of log data.
255 | * [Apache Kafka](http://kafka.apache.org/) - distributed publish-subscribe messaging system.
256 | * [Apache Samza](http://samza.incubator.apache.org/) - stream processing framework, based on Kafka and YARN.
257 | * [Apache Sqoop](http://sqoop.apache.org/) - tool to transfer data between Hadoop and a structured datastore.
258 | * [Cloudera Morphlines](https://github.com/cloudera/cdk/tree/master/cdk-morphlines) - framework that helps with ETL to Solr, HBase and HDFS.
259 | * [Facebook Scribe](https://github.com/facebook/scribe) - streamed log data aggregator.
260 | * [Fluentd](http://fluentd.org/) - tool to collect events and logs.
261 | * [Google Photon](http://research.google.com/pubs/pub41318.html) - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
262 | * [Heka](https://github.com/mozilla-services/heka) - open source stream processing software system.
263 | * [HIHO](https://github.com/sonalgoyal/hiho) - framework for connecting disparate data sources with Hadoop.
264 | * [Kestrel](http://robey.github.io/kestrel/) - distributed message queue system.
265 | * [LinkedIn Databus](http://data.linkedin.com/projects/databus) - stream of change capture events for a database.
266 | * [LinkedIn Kamikaze](https://github.com/linkedin/kamikaze) - utility package for compressing sorted integer arrays.
267 | * [LinkedIn White Elephant](https://github.com/linkedin/white-elephant) - log aggregator and dashboard.
268 | * [Logstash](http://logstash.net) - a tool for managing events and logs.
269 | * [Netflix Suro](https://github.com/Netflix/suro) - log aggregator like Storm and Samza, based on Chukwa.
270 | * [Pinterest Secor](https://github.com/pinterest/secor) - service implementing Kafka log persistence.
271 | 
272 | ## Service Programming
273 | 
274 | * [Akka Toolkit](http://akka.io/) - runtime for distributed and fault-tolerant event-driven applications on the JVM.
275 | * [Apache Avro](http://avro.apache.org/) - data serialization system.
276 | * [Apache Curator](http://curator.apache.org/) - Java libraries for Apache ZooKeeper.
277 | * [Apache Karaf](http://karaf.apache.org/) - OSGi runtime that runs on top of any OSGi framework.
278 | * [Apache Thrift](http://thrift.apache.org//) - framework to build binary protocols.
279 | * [Apache ZooKeeper](http://zookeeper.apache.org/) - centralized service for process management.
280 | * [Google Chubby](http://research.google.com/archive/chubby.html) - a lock service for loosely-coupled distributed systems.
281 | * [LinkedIn Norbert](http://data.linkedin.com/opensource/norbert) - cluster manager.
282 | * [OpenMPI](http://www.open-mpi.org/) - message passing framework.
283 | * [Serf](http://www.serfdom.io/) - decentralized solution for service discovery and orchestration.
284 | * [Spotify Luigi](https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
285 | * [Spring XD](https://github.com/spring-projects/spring-xd) - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
286 | * [Twitter Elephant Bird](https://github.com/kevinweil/elephant-bird) - libraries for working with LZOP-compressed data.
287 | * [Twitter Finagle](https://twitter.github.io/finagle/) - asynchronous network stack for the JVM.
288 | 
289 | ## Scheduling
290 | 
291 | * [Apache Aurora](http://aurora.incubator.apache.org/) - service scheduler that runs on top of Apache Mesos.
292 | * [Apache Falcon](http://falcon.incubator.apache.org/) - data management framework.
293 | * [Apache Oozie](http://oozie.apache.org/) - workflow job scheduler.
294 | * [Chronos](http://airbnb.github.io/chronos/) - distributed and fault-tolerant scheduler.
295 | * [LinkedIn Azkaban](http://azkaban.github.io/azkaban2/) - batch workflow job scheduler.
296 | * [Sparrow](https://github.com/radlab/sparrow) - scheduling platform.
297 | 
298 | ## Machine Learning
299 | 
300 | * [Apache Mahout](http://mahout.apache.org/) - machine learning library for Hadoop.
301 | * [brain](https://github.com/harthur/brain) - Neural networks in JavaScript.
302 | * [Cloudera Oryx](https://github.com/cloudera/oryx) - real-time large-scale machine learning.
303 | * [Concurrent Pattern](http://www.cascading.org/pattern/) - machine learning library for Cascading.
304 | * [convnetjs](https://github.com/karpathy/convnetjs) - Deep Learning in Javascript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
305 | * [Decider](https://github.com/danielsdeleo/Decider) - Flexible and Extensible Machine Learning in Ruby.
306 | * [etcML](http://www.etcml.com/) - text classification with machine learning.
307 | * [Etsy Conjecture](https://github.com/etsy/Conjecture) - scalable Machine Learning in Scalding.
308 | * [Google Sibyl](http://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf) - System for Large Scale Machine Learning at Google.
309 | * [H2O](http://0xdata.github.io/h2o/) - statistical, machine learning and math runtime for Hadoop.
310 | * [MLbase](http://www.mlbase.org/) - distributed machine learning libraries for the BDAS stack.
311 | * [MLPNeuralNet](https://github.com/nikolaypavlov/MLPNeuralNet) - Fast multilayer perceptron neural network library for iOS and Mac OS X.
312 | * [nupic](https://github.com/numenta/nupic) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
313 | * [PredictionIO](http://prediction.io/) - machine learning server built on Hadoop, Mahout and Cascading.
314 | * [scikit-learn](https://github.com/scikit-learn/scikit-learn) - machine learning in Python.
315 | * [Spark MLlib](http://spark.apache.org/docs/0.9.0/mllib-guide.html) - a Spark implementation of some common machine learning (ML) functionality.
316 | * [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki) - learning system sponsored by Microsoft and Yahoo!.
317 | * [WEKA](http://www.cs.waikato.ac.nz/ml/weka/) - suite of machine learning software.
318 | 
319 | ## Benchmarking
320 | 
321 | * [Apache Hadoop Benchmarking](https://issues.apache.org/jira/browse/MAPREDUCE-3561) - micro-benchmarks for testing Hadoop performance.
322 | * [Berkeley SWIM Benchmark](https://github.com/SWIMProjectUCB/SWIM/wiki) - real-world big data workload benchmark.
323 | * [Intel HiBench](https://github.com/intel-hadoop/HiBench) - a Hadoop benchmark suite.
324 | * [PUMA Benchmarking](https://issues.apache.org/jira/browse/MAPREDUCE-5116) - benchmark suite for MapReduce applications.
325 | * [Yahoo Gridmix3](https://developer.yahoo.com/blogs/hadoop/gridmix3-emulating-production-workload-apache-hadoop-450.html) - Hadoop cluster benchmarking from the Yahoo engineering team.
326 | 
327 | ## Security
328 | 
329 | * [Apache Knox Gateway](http://knox.apache.org/) - single point of secure access for Hadoop clusters.
330 | * [Apache Sentry](http://incubator.apache.org/projects/sentry.html) - security module for data stored in Hadoop.
331 | 
332 | ## System Deployment
333 | 
334 | * [Apache Ambari](http://ambari.apache.org/) - operational framework for Hadoop management.
335 | * [Apache Bigtop](http://bigtop.apache.org/) - system deployment framework for the Hadoop ecosystem.
336 | * [Apache Helix](http://helix.apache.org/) - cluster management framework.
337 | * [Apache Mesos](http://mesos.apache.org/) - cluster manager.
338 | * [Apache Slider](https://github.com/hortonworks/slider) - YARN application to deploy existing distributed applications on YARN.
339 | * [Apache Whirr](http://whirr.apache.org/) - set of libraries for running cloud services.
340 | * [Apache YARN](http://hortonworks.com/hadoop/yarn/) - Cluster manager.
341 | * [Brooklyn](http://brooklyncentral.github.io/) - library that simplifies application deployment and management.
342 | * [Buildoop](http://buildoop.github.io/) - similar to Apache Bigtop, based on the Groovy language.
343 | * [Cloudera HUE](http://gethue.com/) - web application for interacting with Hadoop.
344 | * [Facebook Prism](http://www.wired.com/2012/08/facebook-prism/) - multi-datacenter replication system.
345 | * [Google Borg](http://www.wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos/all/) - job scheduling and monitoring system.
346 | * [Google Omega](https://www.youtube.com/watch?v=0ZFMlO98Jkc) - job scheduling and monitoring system.
347 | * [Hortonworks HOYA](http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/) - application that can deploy HBase cluster on YARN.
348 | * [Marathon](https://github.com/mesosphere/marathon) - Mesos framework for long-running services.
349 | 
350 | ## Applications
351 | 
352 | * [Adobe spindle](https://github.com/adobe-research/spindle) - Next-generation web analytics processing with Scala, Spark, and Parquet.
353 | * [Apache Kiji](http://www.kiji.org/) - framework to collect and analyze data in real-time, based on HBase.
354 | * [Apache Nutch](http://nutch.apache.org/) - open source web crawler.
355 | * [Apache OODT](http://oodt.apache.org/) - capturing, processing and sharing of data for NASA's scientific archives.
356 | * [Apache Tika](https://tika.apache.org/) - content analysis toolkit.
357 | * [Domino](http://www.dominoup.com/) - Run, scale, share, and deploy models — without any infrastructure.
358 | * [Eclipse BIRT](http://www.eclipse.org/birt/) - Eclipse-based reporting system.
359 | * [Eventhub](https://github.com/Codecademy/EventHub) - open source event analytics platform.
360 | * [HIPI Library](http://hipi.cs.virginia.edu/) - API for performing image processing tasks on Hadoop's MapReduce.
361 | * [Hunk](http://www.splunk.com/download/hunk) - Splunk analytics for Hadoop.
362 | * [MADlib](http://madlib.net/community/) - data-processing library for analyzing data inside an RDBMS.
363 | * [PivotalR](https://github.com/gopivotal/PivotalR) - R on Pivotal HD / HAWQ and PostgreSQL.
364 | * [Qubole](http://www.qubole.com/) - auto-scaling Hadoop cluster, built-in data connectors.
365 | * [Sense](https://senseplatform.com/) - Cloud Platform for Data Science and Big Data Analytics.
366 | * [Snowplow](https://github.com/snowplow/snowplow) - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
367 | * [SparkR](http://amplab-extras.github.io/SparkR-pkg/) - R frontend for Spark.
368 | * [Splunk](http://www.splunk.com/) - analyzer for machine-generated data.
369 | * [Talend](http://www.talend.com/products/big-data) - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.
370 | 
371 | ## Search engine and framework
372 | 
373 | * [Apache Lucene](http://lucene.apache.org/) - Search engine library.
374 | * [Apache Solr](http://lucene.apache.org/solr/) - Search platform for Apache Lucene.
375 | * [ElasticSearch](http://www.elasticsearch.org/) - Search and analytics engine based on Apache Lucene.
376 | * [Enigma.io](http://enigma.io) - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
377 | * [Facebook Unicorn](https://www.facebook.com/publications/219621248185635/) - social graph search platform.
378 | * [Google Caffeine](http://googleblog.blogspot.it/2010/06/our-new-search-index-caffeine.html) - continuous indexing system.
379 | * [Google Percolator](http://research.google.com/pubs/pub36726.html) - continuous indexing system.
380 | * [TeraGoogle]() - large search index.
381 | * [HBase Coprocessor](https://blogs.apache.org/hbase/entry/coprocessor_introduction) - implementation of Percolator, part of HBase.
382 | * [Lily HBase Indexer](http://ngdata.github.io/hbase-indexer/) - quickly and easily search for any content stored in HBase.
383 | * [LinkedIn Bobo](http://senseidb.github.io/bobo/) - Faceted Search implementation written purely in Java, an extension to Apache Lucene.
384 | * [LinkedIn Cleo](https://github.com/linkedin/cleo) - flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
385 | * [LinkedIn Galene](http://engineering.linkedin.com/search/did-you-mean-galene) - search architecture at LinkedIn.
386 | * [LinkedIn Zoie](https://github.com/senseidb/zoie) - realtime search/indexing system written in Java.
387 | * [Sphinx Search Server](http://sphinxsearch.com/) - fulltext search engine.
388 | 
389 | ## MySQL forks and evolutions
390 | 
391 | * [Amazon RDS](http://aws.amazon.com/rds/) - MySQL databases in Amazon's cloud.
392 | * [Drizzle](http://www.drizzle.org/) - evolution of MySQL 6.0.
393 | * [Google Cloud SQL](https://developers.google.com/cloud-sql/) - MySQL databases in Google's cloud.
394 | * [MariaDB](https://mariadb.org/) - enhanced, drop-in replacement for MySQL.
395 | * [MySQL Cluster](http://www.mysql.com/products/cluster/) - MySQL implementation using NDB Cluster storage engine.
396 | * [Percona Server](http://www.percona.com/software/percona-server) - enhanced, drop-in replacement for MySQL.
397 | * [ProxySQL](https://github.com/renecannao/proxysql) - High Performance Proxy for MySQL.
398 | * [TokuDB](http://www.tokutek.com/products/tokudb-for-mysql/) - storage engine for MySQL and MariaDB.
399 | * [WebScaleSQL](http://webscalesql.org/) - collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
400 | 
401 | ## PostgreSQL forks and evolutions
402 | 
403 | * [HadoopDB](http://db.cs.yale.edu/hadoopdb/hadoopdb.html) - hybrid of MapReduce and DBMS.
404 | * [IBM Netezza](http://www-01.ibm.com/software/data/netezza/) - high-performance data warehouse appliances.
405 | * [Postgres-XL](http://www.postgres-xl.org/) - Scalable Open Source PostgreSQL-based Database Cluster.
406 | * [RecDB](http://www-users.cs.umn.edu/~sarwat/RecDB/) - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
407 | * [Stado](http://www.stormdb.com/community/stado) - open source MPP database system solely targeted at data warehousing and data mart applications.
408 | * [Yahoo Everest](http://www.scribd.com/doc/3159239/70-Everest-PGCon-RT) - multi-petabyte database / MPP derived from PostgreSQL.
409 | 
410 | ## Memcached forks and evolutions
411 | 
412 | * [Facebook McDipper](https://www.facebook.com/notes/facebook-engineering/mcdipper-a-key-value-cache-for-flash-storage/10151347090423920) - key/value cache for flash storage.
413 | * [Facebook Memcached](https://www.facebook.com/notes/facebook-engineering/scaling-memcache-at-facebook/10151411410803920) - fork of Memcached.
414 | * [Twemproxy](https://github.com/twitter/twemproxy) - A fast, light-weight proxy for memcached and redis.
415 | * [Twitter Fatcache](https://github.com/twitter/fatcache) - key/value cache for flash storage.
416 | * [Twitter Twemcache](https://github.com/twitter/twemcache) - fork of Memcached.
417 | 
418 | ## Embedded Databases
419 | 
420 | * [Actian PSQL](http://www.actian.com/products/operational-databases/) - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
421 | * [BerkeleyDB](http://www.oracle.com/us/products/database/berkeley-db/overview/index.html) - a software library that provides a high-performance embedded database for key/value data.
422 | * [HanoiDB](https://github.com/krestenkrab/hanoidb) - Erlang LSM BTree Storage.
423 | * [LevelDB](https://code.google.com/p/leveldb/) - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
424 | * [LMDB](http://symas.com/mdb/) - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
425 | * [RocksDB](http://rocksdb.org/) - embeddable persistent key-value store for fast storage based on LevelDB.
426 | 
427 | ## Business Intelligence
428 | 
429 | * [BIME Analytics](http://www.bimeanalytics.com/?lang=en) - business intelligence platform in the cloud.
430 | * [Chartio](https://chartio.com) - lean business intelligence platform to visualize and explore your data.
431 | * [datapine](http://www.datapine.com/) - self-service business intelligence tool in the cloud.
432 | * [Jaspersoft](https://www.jaspersoft.com/) - powerful business intelligence suite.
433 | * [Jedox Palo](http://www.jedox.com/) - customisable Business Intelligence platform.
434 | * [Microsoft](http://www.microsoft.com/en-us/server-cloud/solutions/business-intelligence/default.aspx) - business intelligence software and platform.
435 | * [Microstrategy](http://www.microstrategy.com/) - software platforms for business intelligence, mobile intelligence, and network applications.
436 | * [Pentaho](http://www.pentaho.com/) - business intelligence platform.
437 | * [Qlik](http://www.qlik.com/) - business intelligence and analytics platform.
438 | * [SpagoBI](http://www.spagoworld.org/xwiki/bin/view/SpagoBI/) - open source business intelligence platform.
439 | * [Tableau](https://www.tableausoftware.com/) - business intelligence platform.
440 | * [Zoomdata](http://www.zoomdata.com/) - Big Data Analytics.
441 | 
442 | ## Data Visualization
443 | 
444 | * [Arbor](https://github.com/samizdatco/arbor) - graph visualization library using web workers and jQuery.
445 | * [CartoDB](https://github.com/CartoDB/cartodb) - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API. 
446 | * [Chart.js](http://www.chartjs.org/) - open source HTML5 Charts visualizations.
447 | * [Chartist.js](https://github.com/gionkunz/chartist-js) - another open source HTML5 Charts visualization.
448 | * [Crossfilter](http://square.github.io/crossfilter/) - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
449 | * [Cubism](https://github.com/square/cubism) - JavaScript library for time series visualization.
450 | * [Cytoscape](http://cytoscape.github.io/) - JavaScript library for visualizing complex networks.
451 | * [DC.js](http://dc-js.github.io/dc.js/) - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
452 | * [D3](http://d3js.org/) - JavaScript library for manipulating documents.
453 | * [Envisionjs](https://github.com/HumbleSoftware/envisionjs) - dynamic HTML5 visualization.
454 | * [Freeboard](https://github.com/Freeboard/freeboard) - open source real-time dashboard builder for IoT and other web mashups.
455 | * [Gephi](https://github.com/gephi/gephi) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X. 
456 | * [Google Charts](https://developers.google.com/chart/) - simple charting API.
457 | * [Grafana](http://grafana.org/) - graphite dashboard frontend, editor and graph composer.
458 | * [Graphite](http://graphite.wikidot.com/) - scalable Realtime Graphing.
459 | * [Highcharts](http://www.highcharts.com/) - simple and flexible charting API.
460 | * [IPython](http://ipython.org/) - provides a rich architecture for interactive computing.
461 | * [Matplotlib](https://github.com/matplotlib/matplotlib) - plotting with Python.
462 | * [NVD3](http://nvd3.org/) - chart components for d3.js.
463 | * [Peity](https://github.com/benpickles/peity) - Progressive SVG bar, line and pie charts.
464 | * [Plot.ly](http://plot.ly) - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
465 | * [Recline](https://github.com/okfn/recline) - simple but powerful library for building data applications in pure JavaScript and HTML.
466 | * [Redash](https://github.com/everythingme/redash) - open-source platform to query and visualize data.
467 | * [Sigma.js](https://github.com/jacomyal/sigma.js) - JavaScript library dedicated to graph drawing.
468 | * [Vega](https://github.com/trifacta/vega) - a visualization grammar.
469 | 
470 | ## Internet of things and sensor data
471 | 
472 | * [TempoIQ](https://tempoiq.com/) - Cloud-based sensor analytics.
473 | 
474 | ## Interesting Readings
475 | 
476 | * [Big Data Benchmark](https://amplab.cs.berkeley.edu/benchmark/) - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
477 | * [NoSQL Comparison](http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis) - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
478 | 
479 | ## Interesting Papers
480 | 
481 | ### 2013 - 2014
482 | * [2014](http://infolab.stanford.edu/~ullman/mmds/book.pdf) - **Stanford** - Mining of Massive Datasets.
483 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/03/eurosys13-paper83.pdf) - **AMPLab** - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
484 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/01/dmx1.pdf) - **AMPLab** - MLbase: A Distributed Machine-learning System.
485 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/02/shark_sigmod2013.pdf) - **AMPLab** - Shark: SQL and Rich Analytics at Scale.
486 | * [2013](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf) - **AMPLab** - GraphX: A Resilient Distributed Graph System on Spark.
487 | * [2013](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/40671.pdf) - **Google** - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
488 | * [2013](http://research.microsoft.com/pubs/200169/now-vldb.pdf) - **Microsoft** - Scalable Progressive Analytics on Big Data in the Cloud.
489 | * [2013](http://static.druid.io/docs/druid.pdf) - **Metamarkets** - Druid: A Real-time Analytical Data Store.
490 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p764-rae.pdf) - **Google** - Online, Asynchronous Schema Change in F1.
491 | * [2013](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/41344.pdf) - **Google** - F1: A Distributed SQL Database That Scales.
492 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p734-akidau.pdf) - **Google** - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
493 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p767-wiener.pdf) - **Facebook** - Scuba: Diving into Data at Facebook.
494 | * [2013](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p871-curtiss.pdf) - **Facebook** - Unicorn: A System for Searching the Social Graph.
495 | * [2013](https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final170_update.pdf) - **Facebook** - Scaling Memcache at Facebook.
496 | 
497 | ### 2011 - 2012
498 | 
499 | * [2012](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) - **Twitter** - The Unified Logging Infrastructure
500 | for Data Analytics at Twitter.
501 | * [2012](https://amplab.cs.berkeley.edu/wp-content/uploads/2013/04/blinkdb_vldb12_demo.pdf) - **AMPLab** - Blink and It’s Done: Interactive Queries on Very Large Data.
502 | * [2012](https://www.usenix.org/system/files/login/articles/zaharia.pdf) - **AMPLab** - Fast and Interactive Analytics over Hadoop Data with Spark.
503 | * [2012](https://amplab.cs.berkeley.edu/wp-content/uploads/2012/03/mod482-xin1.pdf) - **AMPLab** - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
504 | * [2012](https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Bolosky.pdf) - **Microsoft** - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
505 | * [2012](http://research.microsoft.com/pubs/178045/ppaoxs-paper29.pdf) - **Microsoft** - Paxos Made Parallel.
506 | * [2012](http://arxiv.org/pdf/1203.5485.pdf) - **AMPLab** - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
507 | * [2012](http://vldb.org/pvldb/vol5/p1436_alexanderhall_vldb2012.pdf) - **Google** - Processing a trillion cells per mouse click.
508 | * [2012](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf) - **Google** - Spanner: Google’s Globally-Distributed Database.
509 | * [2011](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/euro118-ananthanarayanan.pdf) - **AMPLab** - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
510 | * [2011](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) - **AMPLab** - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
511 | * [2011](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36971.pdf) - **Google** - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
512 | 
513 | ### 2001 - 2010
514 | 
515 | * [2010](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf) - **Facebook** - Finding a needle in Haystack: Facebook’s photo storage.
516 | * [2010](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Spark-Cluster-Computing-with-Working-Sets.pdf) - **AMPLab** - Spark: Cluster Computing with Working Sets.
517 | * [2010](http://static.googleusercontent.com/media/research.google.com/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf) - **Google** - Storage Architecture and Challenges.
518 | * [2010](http://kowshik.github.io/JPregel/pregel_paper.pdf) - **Google** - Pregel: A System for Large-Scale Graph Processing.
519 | * [2010](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36726.pdf) - **Google** - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
520 | * [2010](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/36632.pdf) - **Google** - Dremel: Interactive Analysis of Web-Scale Datasets.
521 | * [2010](http://www.4lunas.org/pub/2010-s4.pdf) - **Yahoo** - S4: Distributed Stream Computing Platform.
522 | * [2009](http://www.vldb.org/pvldb/2/vldb09-861.pdf) - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
523 | * [2008](http://www.cca08.org/papers/Paper-13-Ariel-Rabkin.pdf) - **AMPLab** - Chukwa: A large-scale monitoring system.
524 | * [2007](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf) - **Amazon** - Dynamo: Amazon’s Highly Available Key-value Store.
525 | * [2006](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/chubby-osdi06.pdf) - **Google** - The Chubby lock service for loosely-coupled distributed systems.
526 | * [2006](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf) - **Google** - Bigtable: A Distributed Storage System for Structured Data.
527 | * [2004](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/mapreduce-osdi04.pdf) - **Google** - MapReduce: Simplified Data Processing on Large Clusters.
528 | * [2003](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/gfs-sosp2003.pdf) - **Google** - The Google File System.
529 | 
530 | # Other Awesome Lists
531 | - Other awesome lists [awesome-awesomeness](https://github.com/bayandin/awesome-awesomeness).
532 | - Even more lists [awesome](https://github.com/sindresorhus/awesome).
533 | - Another list? [list](https://github.com/jnv/lists).
534 | - WTF! [awesome-awesome-awesome](https://github.com/t3chnoboy/awesome-awesome-awesome).
535 | - Analytics [awesome-analytics](https://github.com/onurakpolat/awesome-analytics).
536 | 


--------------------------------------------------------------------------------