├── .github
│   └── workflows
│       └── main.yml
└── README.md


/.github/workflows/main.yml:
--------------------------------------------------------------------------------
name: CI

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v1
    - uses: docker://dkhamsing/awesome_bot:latest
      with:
        args: --allow-redirect /github/workspace/README.md

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Data-Infra-Projects
====================

This is an attempt to list all the interesting projects.

It is intended for anyone who designs modern large-scale architectures and needs to choose tools/technologies/frameworks. The purpose is to help in making those choices with resources like comparisons/use-cases/features/maturity or really anything that helps in making an informed decision.


## Abstractions
* [Spark RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds)
* [Spark Dataframe](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes)
* [DataFlow](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)
* [MapReduce](http://en.wikipedia.org/wiki/MapReduce)
* [Bulk Synchronous Parallel - BSP](http://en.wikipedia.org/wiki/Bulk_synchronous_parallel)
* [CRDT](http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf)
* [Merkle Tree](http://en.wikipedia.org/wiki/Merkle_tree)
* [DHT](http://en.wikipedia.org/wiki/Distributed_hash_table)

## Distributed Coordination

These are implementations/libraries that help in writing distributed applications which require some form of coordination.

* [ZooKeeper](https://zookeeper.apache.org/)
* [Raft](http://raftconsensus.github.io/)
* [Serf](http://www.serfdom.io/)
* [Doozer](https://github.com/ha/doozerd)

## Infrastructure Management
* [Yarn](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
* [Mesos](http://mesos.apache.org)
* [OpenStack](http://www.openstack.org/)
* [CloudStack](http://cloudstack.apache.org/)
* [Ganeti](https://code.google.com/p/ganeti/)
* [CoreOS](https://coreos.com)
* [Kubernetes](https://github.com/kubernetes/kubernetes)
* [Eucalyptus](https://github.com/eucalyptus/eucalyptus)

### Comparisons
* [Serf vs *](http://www.serfdom.io/intro/vs-other-sw.html)
* [Serf vs Mesos](https://groups.google.com/forum/#!topic/serfdom/zFEiXgeGABc)
* [CoreOS-Fleet vs Mesos/YARN](https://groups.google.com/forum/#!msg/coreos-dev/nHK8irdnmM0/BSwZpV1SNisJ)

## File Systems
* [HDFS](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html)
* [GlusterFS](http://www.gluster.org/)
* [Ceph](https://github.com/ceph/ceph)
* [QFS](http://quantcast.github.io/qfs/)

## Distributed Databases
* [Swift](https://github.com/openstack/swift)
* [Riak](https://github.com/basho/riak)
* [Pinot](https://github.com/linkedin/pinot)

## Infrastructure Logging/Monitoring
* [Nagios](http://www.nagios.org/)
* [Ganglia](http://ganglia.sourceforge.net/)
* [Chukwa](https://chukwa.apache.org/)
* [Netflix Suro](https://github.com/Netflix/suro)
* [Zabbix](https://www.zabbix.org/wiki)
* [Riemann](http://riemann.io/)
* [Servo](https://github.com/Netflix/servo)

## Infrastructure Helpers
* [Aurora](http://aurora.incubator.apache.org/)
* [Marathon](https://github.com/mesosphere/marathon)
* [Chronos](https://github.com/airbnb/chronos)
* [Fleet](https://github.com/coreos/fleet)
* [Helix](http://helix.apache.org/)
* [Slider](http://slider.incubator.apache.org/)

## MultiCloud/CrossCloud Utilities
* [Fog](http://fog.io/)
* [Multistack](http://multistack.org/)
* [jClouds](https://jclouds.apache.org/)
* [Whirr](https://whirr.apache.org/)

## Virtualization
* [LXC](https://linuxcontainers.org/)
* [ZeroVM](http://www.zerovm.org/)
* [KVM](http://www.linux-kvm.org/)
* [Xen](http://www.xenproject.org/)
* [Solaris Zones](http://en.wikipedia.org/wiki/Solaris_Containers)
* [BSD Jails](http://en.wikipedia.org/wiki/FreeBSD_jail)

## Virtualization++
* [Docker](https://www.docker.com/)
* [CloudFoundry](http://cloudfoundry.org/)
* [Red Hat Project Atomic](http://www.projectatomic.io/)
* [OSv](http://osv.io/)

## Generalized Data Processing
* [Hadoop](http://hadoop.apache.org/)
* [Spark](https://spark.apache.org/)
* [REEF](http://www.reef-project.org/)
* [Flink](http://flink.incubator.apache.org/) (previously known as Stratosphere)
* [Tez](http://tez.apache.org/)
* [Dryad](http://research.microsoft.com/en-us/projects/dryad/)
* [Hyracks](https://code.google.com/p/hyracks/)
* [Naiad](http://research.microsoft.com/en-us/projects/naiad/)
* [Vespa](https://github.com/vespa-engine/vespa)

### Comparisons
* [Tez vs Dryad](http://yhemanth.wordpress.com/2013/11/07/comparing-apache-tez-and-microsoft-dryad/)
* Hadoop vs Spark - Too many differences, no good link.

## Large-scale Distributed ML
* [MLBase/MLlib](http://www.mlbase.org/)
* [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/)
* [Jubatus](http://jubat.us/en/)
* [Mahout](https://mahout.apache.org/)
* [Hama](https://hama.apache.org)
* [h2o](http://0xdata.com/h2o/)
* [Oryx](https://github.com/OryxProject/oryx)

## Pub-Sub / Messaging
* [Kafka](http://kafka.apache.org/)
* [RabbitMQ](http://www.rabbitmq.com/)
* [ZeroMQ](http://zeromq.org/)
* [ActiveMQ](http://activemq.apache.org/)
* [HornetQ](http://hornetq.jboss.org/)

## Data Ingest
* [Sqoop](http://sqoop.apache.org/)
* [Flume](http://flume.apache.org/)
* [Gobblin](https://github.com/apache/incubator-gobblin)

## Data Change Management
* [SpinalTap](https://github.com/airbnb/SpinalTap)

## Graph Storing and/or Processing
* [Turi](https://turi.com/) (previously known as GraphLab)
* [Giraph](http://giraph.apache.org/)
* [Neo4j](http://www.neo4j.org/)
* [Cassovary](https://github.com/twitter/cassovary)

## SQL Engines
* [Hive](https://hive.apache.org/)
* [Impala](http://impala.io/)
* [Spark-SQL](https://spark.apache.org/sql/)
* [Tajo](http://tajo.incubator.apache.org/)
* [Presto](http://prestodb.io/)

## Stream Processing
* [Storm](https://storm.incubator.apache.org/)
* [Spark Streaming](https://spark.apache.org/streaming/)
* [Samza](http://samza.incubator.apache.org/)
* [Borealis](http://cs.brown.edu/research/borealis/public/)

## Security
* [Scumblr](https://github.com/Netflix/Scumblr)

## Performance Analysis
* [Dr. Elephant](https://github.com/linkedin/dr-elephant)

## Workflow Engines/DAG-Executors/Pipelines
* [Oozie](http://oozie.apache.org/)
* [Luigi](https://github.com/spotify/luigi)
* [Azkaban](https://azkaban.github.io/)
* [Cascading](http://www.cascading.org/)
* [Hamake](https://code.google.com/p/hamake/)
* [Crunch](https://crunch.apache.org/) (Modeled after [FlumeJava](http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf))

### Comparisons
* [Oozie vs Crunch](http://mail-archives.apache.org/mod_mbox/incubator-general/201205.mbox/%3CCAH29n6MPf4Qxb--9Ayv+U9xoJuJC_QRYkqC+ADisy=YkjjV=Vg@mail.gmail.com%3E)
* [Oozie vs Azkaban](http://www.quora.com/What-are-the-differences-advantages-disadvantages-of-Azkaban-vs-Oozie)
* [Feature comparison](http://sarveshspn.blogspot.in/2012/02/comparison-of-hadoop-workflow-engines.html)

## Configuration Management
* [Chef](http://www.getchef.com/chef/)
* [Puppet](http://puppetlabs.com/puppet)
* [Salt](https://github.com/saltstack/salt)
* [Ansible](https://github.com/ansible/ansible)
* [Vagrant](http://www.vagrantup.com/)
* [Capistrano](http://capistranorb.com/)
* [Fabric](http://www.fabfile.org/)
* [Bosh](https://github.com/cloudfoundry/bosh)

## Service Discovery
* [etcd](https://github.com/coreos/etcd)
* [confd](https://github.com/kelseyhightower/confd)
* [Vulcand](https://github.com/mailgun/vulcand)
* [Frontrunner](https://github.com/Wizcorp/frontrunner)
* [Consul](https://github.com/hashicorp/consul)
* [SkyDNS](https://github.com/skynetservices/skydns)
* [Synapse](https://github.com/airbnb/synapse) - part of SmartStack combined with [Nerve](https://github.com/airbnb/nerve)

### Comparisons
* [Consul vs others](http://www.consul.io/intro/vs/)

## Testing
* [Jepsen](https://github.com/aphyr/jepsen)
* [Simian Army](https://github.com/Netflix/SimianArmy)

## Visualization
* [White Elephant](https://github.com/linkedin/white-elephant)
* [Ambrose](https://github.com/twitter/ambrose)
* [Lipstick](https://github.com/Netflix/Lipstick)
* [Hue](http://gethue.com/) - Hadoop Web UI
* [Inviso](https://github.com/Netflix/inviso)
* [Timberlake](https://github.com/stripe/timberlake)

## Libraries
* [Zoie](https://github.com/senseidb/zoie)
* [Norbert](https://github.com/rhavyn/norbert) - cluster manager and networking layer built on top of ZooKeeper
* [Okapi](https://github.com/grafos-ml/okapi) - Large-scale ML & graph analytics on Giraph
* [Scalding](https://github.com/twitter/scalding) - A Scala API for Cascading
* [SummingBird](https://github.com/twitter/summingbird) - Streaming MapReduce with Scalding and Storm
* [Curator](http://curator.apache.org/) - set of Java libraries that make using Apache ZooKeeper much easier
* [Turbine](https://github.com/Netflix/Turbine) - Low latency high throughput aggregator for real time streams
* [DataFu](http://datafu.incubator.apache.org/) - Collection of MapReduce libraries
* [Twill](http://twill.incubator.apache.org/) (previously known as Weave) - library for writing YARN applications

## Search
* [Lucene+Solr](http://lucene.apache.org/)
* [ElasticSearch](http://www.elasticsearch.org/)

## Others
* [Nutch](http://nutch.apache.org/) - web crawler
* [Ambari](http://ambari.apache.org/) - Hadoop Deployment + Management
* [Bigtop](http://bigtop.apache.org/) - Hadoop Packaging
* [Skuld](https://github.com/Factual/skuld)
* [Camus](https://github.com/linkedin/camus) - LinkedIn's Kafka to HDFS pipeline
* [Kiji](http://www.kiji.org/) - collect, analyze and serve data in real time on Apache Hadoop and HBase

--------------------------------------------------------------------------------