├── contributing.md
└── README.md

--------------------------------------------------------------------------------
/contributing.md:
--------------------------------------------------------------------------------

# Contribution Guidelines

Please ensure your pull request adheres to the following guidelines:

- Search previous suggestions before making a new one, as yours may be a duplicate.
- Make sure your list is useful before submitting. That means it has enough content and every item has a good, succinct description.
- A link back to this list from yours, so users can discover more lists, would be appreciated.
- Make an individual pull request for each suggestion.
- Titles should be [capitalized](http://grammar.yourdictionary.com/capitalization/rules-for-capitalization-in-titles.html).
- Use the following format: `[List Name](link)`
- Link additions should be added to the bottom of the relevant category.
- New categories or improvements to the existing categorization are welcome.
- Check your spelling and grammar.
- Make sure your text editor is set to remove trailing whitespace.
- The pull request and commit should have a useful title.

Thank you for your suggestions!

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

Awesome Data Engineering
========================

A curated list of data engineering tools for software developers [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)

List of content

1. [Databases](#databases)
2. [Ingestion](#data-ingestion)
3. [File System](#file-system)
4. [Serialization format](#serialization-format)
5. [Stream Processing](#stream-processing)
6. [Batch Processing](#batch-processing)
7. [Charts and Dashboards](#charts-and-dashboards)
8. [Workflow](#workflow)
9. [Datasets](#datasets)
10. [Monitoring](#monitoring)
11. [Docker](#docker)

# Databases
- Relational
    * [RQLite](https://github.com/otoolep/rqlite) Replicated SQLite using the Raft consensus protocol.
    * [MySQL](http://www.mysql.com/) The world's most popular open source database.
    * [TiDB](https://github.com/pingcap/tidb) A distributed NewSQL database compatible with the MySQL protocol.
    * [Percona XtraBackup](https://www.percona.com/software/mysql-database/percona-xtrabackup) A free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.
    * [mysql_utils](https://github.com/pinterest/mysql_utils) Pinterest MySQL management tools.
    * [MariaDB](https://mariadb.org/) An enhanced, drop-in replacement for MySQL.
    * [PostgreSQL](http://www.postgresql.org/) The world's most advanced open source database.
    * [Amazon RDS](http://aws.amazon.com/rds/) Makes it easy to set up, operate, and scale a relational database in the cloud.
    * [Crate.IO](https://crate.io/) Scalable SQL database with NoSQL goodies.
- Key-Value
    * [Redis](http://redis.io/) An open source, BSD-licensed, advanced key-value cache and store.
    * [Riak](https://docs.basho.com/riak/latest/) A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
    * [AWS DynamoDB](http://aws.amazon.com/dynamodb/) A fast and flexible NoSQL database service for applications that need consistent, single-digit-millisecond latency at any scale.
    * [HyperDex](https://github.com/rescrv/HyperDex) A scalable, searchable key-value store.
    * [SSDB](http://ssdb.io) A high-performance NoSQL database supporting many data structures; an alternative to Redis.
    * [Kyoto Tycoon](https://github.com/sapo/kyoto) A lightweight network server on top of the Kyoto Cabinet key-value database, built for high performance and concurrency.
    * [IonDB](https://github.com/iondbproject/iondb) A key-value store for microcontroller and IoT applications.
- Column
    * [Cassandra](http://cassandra.apache.org/) The right choice when you need scalability and high availability without compromising performance.
    * [Cassandra Calculator](http://www.ecyrd.com/cassandracalculator/) A simple form that lets you try out different values for your Apache Cassandra cluster and see the impact on your application.
    * [CCM](https://github.com/pcmanus/ccm) A script to easily create and destroy an Apache Cassandra cluster on localhost.
    * [ScyllaDB](https://github.com/scylladb/scylla) A Cassandra-compatible NoSQL data store built on the Seastar framework. http://www.scylladb.com/
    * [HBase](http://hbase.apache.org/) The Hadoop database: a distributed, scalable big data store.
    * [Infobright](http://www.infobright.org) A column-oriented, open-source analytic database providing both speed and efficiency.
    * [AWS Redshift](http://aws.amazon.com/redshift/) A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
    * [FiloDB](https://github.com/tuplejump/FiloDB) Distributed. Columnar. Versioned. Streaming. SQL.
    * [HPE Vertica](http://www8.hp.com/us/en/software-solutions/advanced-sql-big-data-analytics/index.html) A distributed, MPP columnar database with extensive analytics SQL.
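The tradeoffs explored by tools like the Cassandra Calculator above mostly reduce to a little arithmetic over the replication factor and consistency level. A minimal sketch of that math (function names are ours, not from any of the listed tools):

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM read or write."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor: int) -> int:
    """Replicas you can lose while QUORUM operations still succeed."""
    return replication_factor - quorum(replication_factor)

# A common production setup: RF=3 tolerates one failed replica at QUORUM.
print(quorum(3), tolerated_failures(3))  # 2 1
```

The same arithmetic explains why even replication factors are rarely worth it: RF=4 still tolerates only one failure at QUORUM, while paying for an extra copy.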
- Document
    * [MongoDB](https://www.mongodb.org/) An open-source document database designed for ease of development and scaling.
    * [Percona Server for MongoDB](https://www.percona.com/software/mongo-database/percona-server-for-mongodb) A free, enhanced, fully compatible, open source, drop-in replacement for MongoDB® Community Edition that includes enterprise-grade features and functionality.
    * [MemDB](https://github.com/rain1017/memdb) Distributed transactional in-memory database (based on MongoDB).
    * [Elasticsearch](https://www.elastic.co/) Search and analyze data in real time.
    * [Couchbase](http://www.couchbase.com/) The highest performing NoSQL distributed database.
    * [RethinkDB](http://rethinkdb.com/) The open-source database for the realtime web.
- Graph
    * [Neo4j](http://neo4j.com/) The world's leading graph database.
    * [OrientDB](http://orientdb.com/orientdb/) 2nd-generation distributed graph database with the flexibility of documents, under an open source, commercially friendly license.
    * [ArangoDB](https://www.arangodb.com/) A distributed, free and open-source database with a flexible data model for documents, graphs, and key-values.
    * [Titan](http://thinkaurelius.github.io/titan/) A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
    * [FlockDB](https://github.com/twitter/flockdb) A distributed, fault-tolerant graph database by Twitter.
- Distributed
    * [Datomic](http://www.datomic.com) The fully transactional, cloud-ready, distributed database.
    * [Apache Geode](http://geode.incubator.apache.org) An open source, distributed, in-memory database for scale-out applications.
    * [Gaffer](https://github.com/GovernmentCommunicationsHeadquarters/Gaffer) A large-scale graph database.
- Timeseries
    * [InfluxDB](https://github.com/influxdata/influxdb) Scalable datastore for metrics, events, and real-time analytics.
    * [OpenTSDB](https://github.com/OpenTSDB/opentsdb) A scalable, distributed time series database.
    * [kairosdb](https://github.com/kairosdb/kairosdb) Fast, scalable time series database.
    * [Heroic](https://github.com/spotify/heroic) A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
    * [Druid](https://github.com/druid-io/druid/) Column-oriented distributed data store ideal for powering interactive applications.
    * [Riak-TS](http://basho.com/products/riak-ts/) An enterprise-grade NoSQL time series database optimized specifically for IoT and time series data.
    * [Akumuli](https://github.com/akumuli/Akumuli) A numeric time series database that can capture, store and process time series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
    * [Rhombus](https://github.com/Pardot/Rhombus) A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
    * [Dalmatiner DB](https://github.com/dalmatinerdb/dalmatinerdb) Fast distributed metrics database.
    * [Blueflood](https://github.com/rackerlabs/blueflood) A distributed system designed to ingest and process time series data.
    * [Timely](https://github.com/NationalSecurityAgency/timely) A time series database application that provides secure access to time series data, based on Accumulo and Grafana.
- Other
    * [Tarantool](https://github.com/tarantool/tarantool/) An in-memory database and application server.
    * [GreenPlum](https://github.com/greenplum-db/gpdb) An advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte-scale data volumes.
    * [cayley](https://github.com/google/cayley) An open-source graph database by Google.
    * [SnappyData](https://github.com/SnappyDataInc/snappydata) OLTP + OLAP database built on Apache Spark.

# Data Ingestion
* [Kafka](http://kafka.apache.org/) Publish-subscribe messaging rethought as a distributed commit log.
* [Camus](https://github.com/linkedin/camus) LinkedIn's Kafka-to-HDFS pipeline.
* [BottledWater](https://github.com/confluentinc/bottledwater-pg) Change data capture from PostgreSQL into Kafka.
* [kafkat](https://github.com/airbnb/kafkat) Simplified command-line administration for Kafka brokers.
* [kafkacat](https://github.com/edenhill/kafkacat) Generic command-line non-JVM Apache Kafka producer and consumer.
* [pg-kafka](https://github.com/xstevens/pg_kafka) A PostgreSQL extension to produce messages to Apache Kafka.
* [librdkafka](https://github.com/edenhill/librdkafka) The Apache Kafka C/C++ library.
* [kafka-docker](https://github.com/wurstmeister/kafka-docker) Kafka in Docker.
* [kafka-manager](https://github.com/yahoo/kafka-manager) A tool for managing Apache Kafka.
* [kafka-node](https://github.com/SOHU-Co/kafka-node) Node.js client for Apache Kafka 0.8.
* [Secor](https://github.com/pinterest/secor) Pinterest's Kafka-to-S3 distributed consumer.
* [Kafka-logger](https://github.com/uber/kafka-logger) Kafka-winston logger for Node.js, from Uber.
* [Kafka Awesome List](https://github.com/monksy/awesome-kafka) A list of resources about Apache Kafka.
* [AWS Kinesis](http://aws.amazon.com/kinesis/) A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
* [RabbitMQ](http://www.rabbitmq.com/) Robust messaging for applications.
* [FluentD](http://www.fluentd.org) An open source data collector for a unified logging layer.
* [Embulk](http://www.embulk.org) An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
* [Apache Sqoop](https://sqoop.apache.org) A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
* [Heka](https://github.com/mozilla-services/heka) Data acquisition and processing made easy.
* [Gobblin](https://github.com/linkedin/gobblin) Universal data ingestion framework for Hadoop, from LinkedIn.
* [Nakadi](http://nakadi.io) An open source event messaging platform that provides a REST API on top of Kafka-like queues.
* [Pravega](https://pravega.io) Provides a new storage abstraction, a stream, for continuous and unbounded data.
* [Apache Pulsar](https://pulsar.apache.org/) An open-source distributed pub-sub messaging system.

# File System
* [HDFS](https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
* [Snakebite](https://github.com/spotify/snakebite) A pure Python HDFS client.
* [AWS S3](http://aws.amazon.com/s3/)
* [smart_open](https://github.com/piskvorky/smart_open) Utils for streaming large files (S3, HDFS, gzip, bz2).
* [Tachyon](http://tachyon-project.org/) A memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster frameworks such as Spark and MapReduce.
* [CEPH](http://ceph.com/) A unified, distributed storage system designed for excellent performance, reliability and scalability.
* [OrangeFS](http://www.orangefs.org/) Orange File System is a branch of the Parallel Virtual File System.
* [SnackFS](https://github.com/tuplejump/snackfs-release) A bite-sized, lightweight, HDFS-compatible file system built over Cassandra.
* [GlusterFS](http://www.gluster.org/) Gluster Filesystem.
* [XtreemFS](http://www.xtreemfs.org/) A fault-tolerant distributed file system for all storage needs.
* [SeaweedFS](https://github.com/chrislusf/seaweedfs) A simple and highly scalable distributed file system with two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, SeaweedFS implements only a key-to-file mapping; similar to the word "NoSQL", you can call it "NoFS".
* [S3QL](https://bitbucket.org/nikratio/s3ql) A file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
* [LizardFS](https://lizardfs.com/) Software-defined storage: a distributed, parallel, scalable, fault-tolerant, geo-redundant and highly available file system.

# Serialization format
* [Apache Avro](https://avro.apache.org) A data serialization system.
* [Apache Parquet](https://parquet.apache.org) A columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
* [Snappy](https://github.com/google/snappy) A fast compressor/decompressor. Used with Parquet.
* [PigZ](http://zlib.net/pigz/) A parallel implementation of gzip for modern multi-processor, multi-core machines.
* [Apache ORC](https://orc.apache.org/) The smallest, fastest columnar storage for Hadoop workloads.
* [Apache Thrift](https://thrift.apache.org) A software framework for scalable cross-language services development.
* [ProtoBuf](https://github.com/google/protobuf) Protocol Buffers, Google's data interchange format.
* [SequenceFile](http://wiki.apache.org/hadoop/SequenceFile) A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as an input/output format.
* [Kryo](https://github.com/EsotericSoftware/kryo) A fast and efficient object graph serialization framework for Java.

# Stream Processing
* [Apache Beam](https://beam.apache.org/) A unified programming model for both batch and streaming data processing jobs that run on many execution engines.
* [Spark Streaming](https://spark.apache.org/streaming/) Makes it easy to build scalable, fault-tolerant streaming applications.
* [Apache Flink](https://flink.apache.org/) A streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
* [Apache Storm](https://storm.apache.org) A free and open source distributed realtime computation system.
* [Apache Samza](https://samza.apache.org) A distributed stream processing framework.
* [Apache NiFi](http://nifi.apache.org/) An easy-to-use, powerful, and reliable system to process and distribute data.
* [VoltDB](https://voltdb.com/) An ACID-compliant RDBMS which uses a [shared-nothing architecture](https://en.wikipedia.org/wiki/Shared-nothing_architecture).
* [PipelineDB](https://github.com/pipelinedb/pipelinedb) The streaming SQL database. https://www.pipelinedb.com
* [Spring Cloud Dataflow](http://cloud.spring.io/spring-cloud-dataflow/) Streaming and task execution between Spring Boot apps.
* [Bonobo](https://www.bonobo-project.org/) A data-processing toolkit for Python 3.5+.
* [Robinhood's Faust](https://github.com/robinhood/faust) Forever scalable event processing and an in-memory durable K/V store, as a library with asyncio and static typing.
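The engines above differ widely in execution model, but most expose some form of windowed aggregation over an unbounded stream. A library-free Python sketch of a tumbling-window count, purely illustrative (real engines add partitioning, event time, and fault tolerance):

```python
from collections import Counter
from itertools import islice

def tumbling_window_counts(events, window_size):
    """Group an in-order event stream into fixed-size windows
    and emit a per-key count for each window."""
    it = iter(events)
    while True:
        window = list(islice(it, window_size))
        if not window:
            break
        yield Counter(window)

stream = ["click", "view", "click", "view", "view", "click"]
for counts in tumbling_window_counts(stream, 3):
    print(dict(counts))
# {'click': 2, 'view': 1}
# {'view': 2, 'click': 1}
```

Because the input is a plain iterator, the same function works on an infinite generator, which is the essential difference between stream and batch processing: results are emitted per window rather than after the whole input is consumed.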

# Batch Processing
* [Hadoop MapReduce](http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html) A software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
* [Spark](https://spark.apache.org/)
    * [Spark Packages](http://spark-packages.org) A community index of packages for Apache Spark.
    * [Deep Spark](https://github.com/Stratio/deep-spark) Connecting Apache Spark with different data stores.
    * [Spark RDD API Examples](http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html) by Zhen He.
    * [Livy](https://github.com/cloudera/hue/tree/master/apps/spark/java#welcome-to-livy-the-rest-spark-server) Livy, the REST Spark server.
* [AWS EMR](http://aws.amazon.com/elasticmapreduce/) A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
* [Tez](https://tez.apache.org/) An application framework which allows for a complex directed acyclic graph of tasks for processing data.
* [Bistro](https://github.com/asavinov/bistro) A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model which represents data via *functions* and processes data via *column operations*, as opposed to the set operations of conventional approaches like MapReduce or SQL.
- Batch ML
    * [H2O](http://www.h2o.ai/) Fast, scalable machine learning API for smarter applications.
    * [Mahout](http://mahout.apache.org/) An environment for quickly creating scalable, performant machine learning applications.
    * [Spark MLlib](https://spark.apache.org/docs/1.2.1/mllib-guide.html) Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives.
- Batch Graph
    * [GraphLab Create](https://turi.com/products/create/docs/) A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
    * [Giraph](http://giraph.apache.org/) An iterative graph processing system built for high scalability.
    * [Spark GraphX](https://spark.apache.org/graphx/) Apache Spark's API for graphs and graph-parallel computation.
- Batch SQL
    * [Presto](https://prestodb.io/docs/current/index.html) A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
    * [Hive](http://hive.apache.org) Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
    * [Hivemall](https://github.com/myui/hivemall) Scalable machine learning library for Hive/Hadoop.
    * [PyHive](https://github.com/dropbox/PyHive) Python interface to Hive and Presto.
    * [Drill](https://drill.apache.org/) Schema-free SQL query engine for Hadoop, NoSQL and cloud storage.

# Charts and Dashboards
* [Highcharts](http://www.highcharts.com/) A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
* [ZingChart](http://www.zingchart.com/) Fast JavaScript charts for any data set.
* [C3.js](http://c3js.org) D3-based reusable chart library.
* [D3.js](http://d3js.org/) A JavaScript library for manipulating documents based on data.
* [D3Plus](http://d3plus.org) D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
* [SmoothieCharts](http://smoothiecharts.org) A JavaScript charting library for streaming data.
* [PyXley](https://github.com/stitchfix/pyxley) Python helpers for building dashboards using Flask and React.
* [Plotly Dash](https://github.com/plotly/dash) Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
* [Apache Superset](https://github.com/airbnb/superset) A modern, enterprise-ready business intelligence web application (incubating).
* [Redash](https://redash.io/) Make your company data-driven. Connect to any data source, easily visualize and share your data.
* [Metabase](https://github.com/metabase/metabase) The easy, open source way for everyone in your company to ask questions and learn from data.
* [PyQtGraph](http://www.pyqtgraph.org/) A pure-Python graphics and GUI library built on PyQt4/PySide and NumPy, intended for use in mathematics, scientific and engineering applications.

# Workflow
* [Luigi](https://github.com/spotify/luigi) A Python module that helps you build complex pipelines of batch jobs.
* [CronQ](https://github.com/seatgeek/cronq) An application cron-like system, [used](http://chairnerd.seatgeek.com/building-out-the-seatgeek-data-pipeline/) with Luigi.
* [Cascading](http://www.cascading.org/) Java-based application development platform.
* [Airflow](https://github.com/airbnb/airflow) A system to programmatically author, schedule and monitor data pipelines.
* [Azkaban](https://azkaban.github.io/) A batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
* [Oozie](http://oozie.apache.org/) A workflow scheduler system to manage Apache Hadoop jobs.
* [Pinball](https://github.com/pinterest/pinball) DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.

# ELK (Elasticsearch, Logstash, Kibana)
* [docker-logstash](https://github.com/pblittle/docker-logstash) A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
* [elasticsearch-jdbc](https://github.com/jprante/elasticsearch-jdbc) JDBC importer for Elasticsearch.
* [ZomboDB](https://github.com/zombodb/zombodb) Postgres extension that allows creating an index backed by Elasticsearch.

# Docker
* [Gockerize](https://github.com/aerofs/gockerize) Package golang services into minimal Docker containers.
* [Flocker](https://github.com/ClusterHQ/flocker) Easily manage Docker containers and their data.
* [Rancher](http://rancher.com/rancher-os/) RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.
* [Kontena](http://www.kontena.io/) Application containers for the masses.
* [Weave](https://github.com/weaveworks/weave) Weaving Docker containers into applications. http://www.weave.works/
* [Zodiac](https://github.com/CenturyLinkLabs/zodiac) A lightweight tool for easy deployment and rollback of dockerized applications.
* [cAdvisor](https://github.com/google/cadvisor) Analyzes resource usage and performance characteristics of running containers.
* [Micro S3 persistence](https://github.com/shinymayhem/micro-s3-persistence) Docker microservice for saving/restoring volume data to S3.
* [Dockup](https://github.com/tutumcloud/dockup) Docker image to backup/restore your Docker container volumes to AWS S3.
* [Rocker-compose](https://github.com/grammarly/rocker-compose) Docker composition tool with idempotency features for deploying apps composed of multiple containers.
* [Nomad](https://github.com/hashicorp/nomad) A cluster manager designed for both long-lived services and short-lived batch processing workloads.
* [ImageLayers](https://imagelayers.io/) Visualize Docker images and the layers that compose them.

# Datasets
## Realtime
* [Twitter Realtime](https://dev.twitter.com/streaming/overview) The Streaming APIs give developers low-latency access to Twitter's global stream of Tweet data.
* [Eventsim](https://github.com/Interana/eventsim) Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
* [Reddit](https://www.reddit.com/r/datasets/comments/3mk1vg/realtime_data_is_available_including_comments/) Real-time data is available, including comments, submissions and links posted to Reddit.

## Data Dumps
* [GitHub Archive](https://www.githubarchive.org/) GitHub's public timeline since 2011, updated every hour.
* [Common Crawl](https://commoncrawl.org/) Open source repository of web crawl data.
* [Wikipedia](https://dumps.wikimedia.org/enwiki/latest/) Wikipedia's complete copy of all wikis, in the form of wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
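Simulators like Eventsim above are handy when you need traffic before a pipeline goes live. The core idea fits in a few lines of standard-library Python; this is a hypothetical sketch of the pattern, not Eventsim's actual model, and all names are ours:

```python
import random

def simulate_events(n_events, users, pages, seed=42):
    """Yield pseudo-random page-view events for a fixed user population.
    Seeding the generator makes the stream reproducible, which matters
    when the downstream pipeline is under test."""
    rng = random.Random(seed)
    ts = 0
    for _ in range(n_events):
        ts += rng.randint(1, 30)  # seconds between consecutive events
        yield {"ts": ts,
               "user": rng.choice(users),
               "page": rng.choice(pages)}

events = list(simulate_events(5, ["alice", "bob"], ["/home", "/cart"]))
print(events[0])
```

Because the timestamps are monotonically increasing, the output can be replayed through any of the ingestion tools listed earlier as if it were a live stream.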

# Monitoring

## Prometheus
* [Prometheus.io](https://github.com/prometheus/prometheus) An open-source service monitoring system and time series database.
* [HAProxy Exporter](https://github.com/prometheus/haproxy_exporter) Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.

# Community

## Forums
* [/r/dataengineering](https://www.reddit.com/r/dataengineering/) News, tips and background on data engineering.
* [/r/etl](https://www.reddit.com/r/ETL/) Subreddit focused on ETL.

## Conferences
* [DataEngConf](http://www.dataengconf.com/about) The first technical conference that bridges the gap between data scientists, data engineers and data analysts.

## Podcasts
* [Data Engineering Podcast](https://www.dataengineeringpodcast.com/) The show about modern data infrastructure.

Cheers to [The Data Engineering Ecosystem: An Interactive Map](http://xyz.insightdataengineering.com/blog/pipeline_map.html)

Inspired by the [awesome](https://github.com/sindresorhus/awesome) list. Created by [Insight Data Engineering](http://insightdataengineering.com) fellows.

## License

[![CC0](http://i.creativecommons.org/p/zero/1.0/88x31.png)](http://creativecommons.org/publicdomain/zero/1.0/)

To the extent possible under law, [Igor Barinov](https://github.com/igorbarinov/) has waived all copyright and related or neighboring rights to this work.

[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/igorbarinov/awesome-data-engineering/trend.png)](https://bitdeli.com/free "Bitdeli Badge")

--------------------------------------------------------------------------------