├── .github
│   └── workflows
│       └── main.yml
└── README.md


/.github/workflows/main.yml:
--------------------------------------------------------------------------------
 1 | name: CI
 2 | 
 3 | on: [push]
 4 | 
 5 | jobs:
 6 |   build:
 7 | 
 8 |     runs-on: ubuntu-latest
 9 | 
10 |     steps:
11 |     - uses: actions/checkout@v1
12 |     - uses: docker://dkhamsing/awesome_bot:latest
13 |       with:
14 |         args: --allow-redirect /github/workspace/README.md
15 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Data-Infra-Projects
  2 | ====================
  3 | 
  4 | This is an attempt to list interesting data infrastructure projects.
  5 | 
  6 | It is intended for anyone who is designing modern large-scale architectures and needs to choose tools, technologies, or frameworks. The purpose is to help make those choices with resources such as comparisons, use cases, features, and maturity, or really anything that helps in making an informed decision.
  7 | 
  8 | 
  9 | ## Abstractions
 10 | * [Spark RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)
 11 | * [Spark Dataframe](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes)
 12 | * [DataFlow](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)
 13 | * [MapReduce](http://en.wikipedia.org/wiki/MapReduce)
 14 | * [Bulk Synchronous Parallel - BSP](http://en.wikipedia.org/wiki/Bulk_synchronous_parallel)
 15 | * [CRDT](http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf)
 16 | * [Merkle Tree](http://en.wikipedia.org/wiki/Merkle_tree)
 17 | * [DHT](http://en.wikipedia.org/wiki/Distributed_hash_table)
 18 | 
 19 | ## Distributed Coordination
 20 | 
 21 | These are implementations and libraries that help in writing distributed applications that require some form of coordination.
 22 | 
 23 | * [ZooKeeper](https://zookeeper.apache.org/)
 24 | * [Raft](http://raftconsensus.github.io/)
 25 | * [Serf](http://www.serfdom.io/)
 26 | * [Doozer](https://github.com/ha/doozerd)
 27 | 
 28 | ## Infrastructure Management
 29 | * [Yarn](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
 30 | * [Mesos](http://mesos.apache.org)
 31 | * [OpenStack](http://www.openstack.org/)
 32 | * [CloudStack](http://cloudstack.apache.org/)
 33 | * [Ganeti](https://code.google.com/p/ganeti/)
 34 | * [CoreOS](https://coreos.com)
 35 | * [Kubernetes](https://github.com/kubernetes/kubernetes)
 36 | * [Eucalyptus](https://github.com/eucalyptus/eucalyptus)
 37 | 
 38 | ### Comparisons
 39 | * [Serf vs *](http://www.serfdom.io/intro/vs-other-sw.html)
 40 | * [Serf vs Mesos](https://groups.google.com/forum/#!topic/serfdom/zFEiXgeGABc)
 41 | * [CoreOS-Fleet vs Mesos/YARN](https://groups.google.com/forum/#!msg/coreos-dev/nHK8irdnmM0/BSwZpV1SNisJ)
 42 | 
 43 | ## File Systems
 44 | * [HDFS](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html)
 45 | * [GlusterFS](http://www.gluster.org/)
 46 | * [Ceph](https://github.com/ceph/ceph)
 47 | * [QFS](http://quantcast.github.io/qfs/)
 48 | 
 49 | ## Distributed Databases
 50 | * [Swift](https://github.com/openstack/swift)
 51 | * [Riak](https://github.com/basho/riak)
 52 | * [Pinot](https://github.com/linkedin/pinot)
 53 | 
 54 | ## Infrastructure Logging/Monitoring
 55 | * [Nagios](http://www.nagios.org/)
 56 | * [Ganglia](http://ganglia.sourceforge.net/)
 57 | * [Chukwa](https://chukwa.apache.org/)
 58 | * [Netflix Suro](https://github.com/Netflix/suro)
 59 | * [Zabbix](https://www.zabbix.org/wiki)
 60 | * [Riemann](http://riemann.io/)
 61 | * [Servo](https://github.com/Netflix/servo)
 62 | 
 63 | ## Infrastructure Helpers
 64 | * [Aurora](http://aurora.incubator.apache.org/)
 65 | * [Marathon](https://github.com/mesosphere/marathon)
 66 | * [Chronos](https://github.com/airbnb/chronos)
 67 | * [Fleet](https://github.com/coreos/fleet)
 68 | * [Helix](http://helix.apache.org/)
 69 | * [Slider](http://slider.incubator.apache.org/)
 70 | 
 71 | ## MultiCloud/CrossCloud utilities
 72 | * [Fog](http://fog.io/)
 73 | * [Multistack](http://multistack.org/)
 74 | * [jClouds](https://jclouds.apache.org/)
 75 | * [Whirr](https://whirr.apache.org/)
 76 | 
 77 | ## Virtualization
 78 | * [LXC](https://linuxcontainers.org/)
 79 | * [ZeroVM](http://www.zerovm.org/)
 80 | * [KVM](http://www.linux-kvm.org/)
 81 | * [XEN](http://www.xenproject.org/)
 82 | * [Solaris Zones](http://en.wikipedia.org/wiki/Solaris_Containers)
 83 | * [BSD Jails](http://en.wikipedia.org/wiki/FreeBSD_jail)
 84 | 
 85 | ## Virtualization++
 86 | * [Docker](https://www.docker.com/)
 87 | * [CloudFoundry](http://cloudfoundry.org/)
 88 | * [Red Hat Project Atomic](http://www.projectatomic.io/)
 89 | * [OSv](http://osv.io/)
 90 | 
 91 | ## Generalized Data Processing
 92 | * [Hadoop](http://hadoop.apache.org/)
 93 | * [Spark](https://spark.apache.org/)
 94 | * [REEF](http://www.reef-project.org/)
 95 | * [Flink](http://flink.incubator.apache.org/) (previously known as Stratosphere)
 96 | * [Tez](http://tez.apache.org/)
 97 | * [Dryad](http://research.microsoft.com/en-us/projects/dryad/)
 98 | * [Hyracks](https://code.google.com/p/hyracks/)
 99 | * [Naiad](http://research.microsoft.com/en-us/projects/naiad/)
100 | * [Vespa](https://github.com/vespa-engine/vespa) 
101 | 
102 | ### Comparisons
103 | * [Tez vs Dryad](http://yhemanth.wordpress.com/2013/11/07/comparing-apache-tez-and-microsoft-dryad/)
104 | * Hadoop vs Spark - Too many differences, no good link.
105 | 
106 | ## Large-scale Distributed ML
107 | * [MLBase/MLlib](http://www.mlbase.org/)
108 | * [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/)
109 | * [Jubatus](http://jubat.us/en/)
110 | * [Mahout](https://mahout.apache.org/)
111 | * [Hama](https://hama.apache.org)
112 | * [h2o](http://0xdata.com/h2o/)
113 | * [Oryx](https://github.com/OryxProject/oryx)
114 | 
115 | ## Pub-Sub / Messaging
116 | * [Kafka](http://kafka.apache.org/)
117 | * [RabbitMQ](http://www.rabbitmq.com/)
118 | * [ZeroMQ](http://zeromq.org/)
119 | * [ActiveMQ](http://activemq.apache.org/)
120 | * [hornetq](http://hornetq.jboss.org/)
121 | 
122 | ## Data Ingest
123 | * [Sqoop](http://sqoop.apache.org/)
124 | * [Flume](http://flume.apache.org/)
125 | * [Gobblin](https://github.com/apache/incubator-gobblin)
126 | 
127 | ## Data change management
128 | * [SpinalTap](https://github.com/airbnb/SpinalTap)
129 | 
130 | ## Graph Storing and/or Processing
131 | * [Turi](https://turi.com/) (previously known as GraphLab)
132 | * [Giraph](http://giraph.apache.org/)
133 | * [Neo4j](http://www.neo4j.org/)
134 | * [Cassovary](https://github.com/twitter/cassovary)
135 | 
136 | ## SQL Engines
137 | * [Hive](https://hive.apache.org/)
138 | * [Impala](http://impala.io/)
139 | * [Spark-SQL](https://spark.apache.org/sql/)
140 | * [Tajo](http://tajo.incubator.apache.org/)
141 | * [Presto](http://prestodb.io/)
142 | 
143 | ## Stream Processing
144 | * [Storm](https://storm.incubator.apache.org/)
145 | * [Spark Streaming](https://spark.apache.org/streaming/)
146 | * [Samza](http://samza.incubator.apache.org/)
147 | * [Borealis](http://cs.brown.edu/research/borealis/public/)
148 | 
149 | ## Security
150 | * [Scumblr](https://github.com/Netflix/Scumblr)
151 | 
152 | ## Performance Analysis
153 | * [Dr. Elephant](https://github.com/linkedin/dr-elephant)
154 |  
155 | ## Workflow engines/DAG-executors/Pipelines
156 | * [Oozie](http://oozie.apache.org/)
157 | * [Luigi](https://github.com/spotify/luigi)
158 | * [Azkaban](https://azkaban.github.io/)
159 | * [Cascading](http://www.cascading.org/)
160 | * [Hamake](https://code.google.com/p/hamake/)
161 | * [Crunch](https://crunch.apache.org/) (Modeled after [FlumeJava](http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf))
162 | 
163 | ### Comparisons
164 |   * [Oozie vs Crunch](http://mail-archives.apache.org/mod_mbox/incubator-general/201205.mbox/%3CCAH29n6MPf4Qxb--9Ayv+U9xoJuJC_QRYkqC+ADisy=YkjjV=Vg@mail.gmail.com%3E)
165 |   * [Oozie vs Azkaban](http://www.quora.com/What-are-the-differences-advantages-disadvantages-of-Azkaban-vs-Oozie) 
166 |   * [Feature comparison](http://sarveshspn.blogspot.in/2012/02/comparison-of-hadoop-workflow-engines.html)
167 | 
168 | ## Configuration Management 
169 | * [Chef](http://www.getchef.com/chef/)
170 | * [Puppet](http://puppetlabs.com/puppet)
171 | * [Salt](https://github.com/saltstack/salt)
172 | * [Ansible](https://github.com/ansible/ansible)
173 | * [Vagrant](http://www.vagrantup.com/)
174 | * [Capistrano](http://capistranorb.com/)
175 | * [Fabric](http://www.fabfile.org/)
176 | * [Bosh](https://github.com/cloudfoundry/bosh)
177 | 
178 | ## Service Discovery
179 | * [etcd](https://github.com/coreos/etcd)
180 | * [confd](https://github.com/kelseyhightower/confd)
181 | * [Vulcand](https://github.com/mailgun/vulcand)
182 | * [Frontrunner](https://github.com/Wizcorp/frontrunner)
183 | * [Consul](https://github.com/hashicorp/consul)
184 | * [SkyDNS](https://github.com/skynetservices/skydns)
185 | * [Synapse](https://github.com/airbnb/synapse) - part of SmartStack combined with [Nerve](https://github.com/airbnb/nerve)
186 | 
187 | ### Comparison
188 | * [Consul vs others](http://www.consul.io/intro/vs/)
189 | 
190 | ## Testing
191 | * [Jepsen](https://github.com/aphyr/jepsen)
192 | * [Simian Army](https://github.com/Netflix/SimianArmy)
193 | 
194 | ## Visualization
195 | * [White Elephant](https://github.com/linkedin/white-elephant)
196 | * [Ambrose](https://github.com/twitter/ambrose)
197 | * [Lipstick](https://github.com/Netflix/Lipstick)
198 | * [Hue](http://gethue.com/) - Hadoop Web UI
199 | * [Inviso](https://github.com/Netflix/inviso)
200 | * [Timberlake](https://github.com/stripe/timberlake)
201 | 
202 | ## Libraries
203 | * [Zoie](https://github.com/senseidb/zoie)
204 | * [Norbert](https://github.com/rhavyn/norbert) - cluster manager and networking layer built on top of ZooKeeper.
205 | * [Okapi](https://github.com/grafos-ml/okapi) - Large-scale ML & graph analytics on Giraph
206 | * [Scalding](https://github.com/twitter/scalding) - A Scala API for Cascading
207 | * [SummingBird](https://github.com/twitter/summingbird) - Streaming MapReduce with Scalding and Storm
208 | * [Curator](http://curator.apache.org/) - set of Java libraries that make using Apache ZooKeeper much easier
209 | * [Turbine](https://github.com/Netflix/Turbine) - Low latency high throughput aggregator for real time streams
210 | * [DataFu](http://datafu.incubator.apache.org/) - collection of MapReduce libraries
211 | * [Twill](http://twill.incubator.apache.org/) (previously known as Weave) - library for writing YARN applications
212 | 
213 | ## Search
214 | * [Lucene+Solr](http://lucene.apache.org/)
215 | * [ElasticSearch](http://www.elasticsearch.org/)
216 | 
217 | ## Others
218 | * [Nutch](http://nutch.apache.org/) - web crawler
219 | * [Ambari](http://ambari.apache.org/) - Hadoop Deployment + Management 
220 | * [Bigtop](http://bigtop.apache.org/) - Hadoop Packaging 
221 | * [Skuld](https://github.com/Factual/skuld)
222 | * [Camus](https://github.com/linkedin/camus) - LinkedIn's Kafka to HDFS pipeline.
223 | * [Kiji](http://www.kiji.org/) - collect, analyze and serve data in real time on Apache Hadoop and HBase
224 | 
225 | 


--------------------------------------------------------------------------------