└── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | Sharing interesting and noteworthy Data Engineering content - namely blogs, podcasts, repos, books, videos, and MOOCs.  This was mostly curated by and for Fellows in the [Insight Data Engineering Fellows Program](http://insightdataengineering.com), and inspired by the [repo](https://github.com/igorbarinov/awesome-data-engineering) of one of our Fellows, Igor Barinov.
  2 | 
  3 | If you have ideas or other interesting resources, feel free to open an Issue or Pull Request.
  4 | 
  5 | # Table of Contents
  6 | 
  7 | # Technologies
  8 | All technologies are listed alphabetically in their given section.
  9 | 
 10 | ## Overviews
 11 | * [Data Engineering Ecosystem](http://insightdataengineering.com/blog/pipeline_map.html)
 12 | 
 13 | ## File Formats
 14 | 
 15 | ### Avro
 16 | 
 17 | ### ORCFiles
 18 | 
 19 | ### Parquet
 20 | 
 21 | ### Protocol Buffers
 22 | 
 23 | ### Thrift
 24 | 
 25 | ## File Systems
 26 | 
 27 | ### Hadoop Distributed File System (HDFS)
 28 | 
 29 | ### S3
 30 | 
 31 | #### Blogs
 32 | * [Excellent summary](https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704) of the history of Hadoop by Marko Bonaci. This post is also [read as a podcast](http://softwareengineeringdaily.com/2016/02/06/history-of-hadoop/) by Software Engineering Daily.
 33 | 
 34 | ## Databases
 35 | 
 36 | ### Overviews
 37 | * [Jepsen](https://aphyr.com/tags/Jepsen) - Kyle Kingsbury's (Aphyr) guide on distributed systems and databases, and how they fail. 
 38 | 
 39 | ### Relational Databases
 40 | 
 41 | #### MySQL
 42 | 
 43 | #### Postgres
 44 | 
 45 | ### Key-Value Databases
 46 | 
 47 | #### Redis
 48 | 
 49 | #### Riak
 50 | 
 51 | ### Column-Family Databases
 52 | 
 53 | #### Accumulo
 54 | 
 55 | #### Cassandra
 56 | ##### Blogs
 57 | 
 58 | * [Nice post](http://www.planetcassandra.org/blog/we-shall-have-order/) about using `clustering order by` in Cassandra
 59 | * [Post](http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling) by DataStax about the basics of data modeling in Cassandra
 60 | 
 61 | #### HBase
 62 | 
 63 | ### Graph Databases
 64 | 
 65 | #### Neo4j
 66 | 
 67 | #### OrientDB
 68 | 
 69 | ### Search Tools
 70 | 
 71 | #### Elasticsearch
 72 | 
 73 | #### Lucene
 74 | 
 75 | #### Solr
 76 | 
 77 | ## General Batch Processing
 78 | 
 79 | ### Hadoop MapReduce
 80 | 
 81 | #### Blogs
 82 | * [Excellent summary](https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704) of the history of Hadoop by Marko Bonaci. This post is also [read as a podcast](http://softwareengineeringdaily.com/2016/02/06/history-of-hadoop/) by Software Engineering Daily.
 83 | 
 84 | ### Hadoop Abstractions
 85 | 
 86 | #### Cascalog
 87 | 
 88 | #### Cascading
 89 | 
 90 | #### Hadoop Streaming / mrjob
 91 | 
 92 | #### Hive
 93 | 
 94 | #### Pig
 95 | 
 96 | #### Scalding
 97 | 
 98 | ### Spark
 99 | 
100 | ## Graph Processing
101 | 
102 | ### Giraph
103 | 
104 | ### GraphLab Create
105 | 
106 | ### Spark GraphX
107 | 
108 | ## Machine Learning Tools
109 | 
110 | ### FlinkML
111 | 
112 | ### H2O
113 | 
114 | ### Mahout
115 | 
116 | ### Spark MLlib
117 | 
118 | ## Stream Processing
119 | 
120 | ### Flink
121 | 
122 | #### Slides
123 | * [Strata Talk](http://www.slideshare.net/KostasTzoumas/apache-flink-at-strata-san-jose-2016) by Kostas Tzoumas on Flink Streaming's capabilities.
124 | * [Streaming Benchmark talk](http://www.slideshare.net/JamieGrier/extending-the-yahoo-streaming-benchmark) by Jamie Grier on extending Yahoo's Benchmark, based on this [blog post](http://data-artisans.com/extending-the-yahoo-streaming-benchmark/)
125 | 
126 | #### Blogs
127 | * [Asynchronous Snapshots Blog](http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/) by Data Artisans, and a summary in [the morning paper](https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/)  
128 | 
129 | #### Papers
130 | * [MillWheel Paper](http://research.google.com/pubs/pub41378.html), which discusses low watermarks for exactly-once semantics
131 | * [Asynchronous Snapshots Barrier Paper](http://arxiv.org/abs/1506.08603) describing Flink's snapshot algorithm
132 | * [Chandy-Lamport Paper](http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf) on Distributed Snapshots, and a summary in [the morning paper](https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/)  
133 | 
134 | ### NiFi
135 | 
136 | ### Samza
137 | 
138 | ### Spark Streaming
139 | 
140 | ### Storm
141 | 
142 | 
143 | ## Ingestion Tools
144 | 
145 | ### Flume
146 | 
147 | ### Logstash
148 | 
149 | 
150 | ## Messaging Queues / PubSub
151 | ### Kafka
152 | #### Blogs
153 | * [Part 1](http://www.confluent.io/blog/stream-data-platform-1/) and [Part 2](http://www.confluent.io/blog/stream-data-platform-2/) of Jay Kreps's posts on streams in Kafka
154 | 
155 | * [Part 1](https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/) of a series of three blog posts on how Datadog monitors Kafka. Part 1 is an especially good intro to Kafka's architecture.
156 | 
157 | #### Podcasts
158 | 
159 | #### Videos
160 | * [Video](https://www.youtube.com/watch?v=aJuo_bLSW6s&feature=youtu.be) by Jay Kreps on logs, stream processing, and Kafka
161 | 
162 | ### RabbitMQ
163 | 
164 | ### ZeroMQ
165 | 
166 | ## Workflow and Scheduling
167 | 
168 | ### Airflow
169 | 
170 | #### Podcasts
171 | * Interview with Maxime Beauchemin on Airflow, Airpal, and Caravel on Software Engineering Daily.
172 | 
173 | ### Azkaban
174 | 
175 | ### Luigi
176 | 
177 | ### Oozie
178 | 
179 | ## Cluster Management and Coordination
180 | 
181 | ### Docker
182 | 
183 | ### Kubernetes
184 | 
185 | ### Mesos
186 | 
187 | ### YARN
188 | 
189 | ### Zookeeper
190 | 
191 | # Important Algorithms and Theorems
192 | 
193 | * [List of 100 Seminal Data Engineering Papers](https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan) from Anil Madan
194 | 
195 | ## Distributed Systems
196 | 
197 | * [General Notes](https://github.com/aphyr/distsys-class) from Kyle Kingsbury (Aphyr) on Distributed Systems
198 | 
199 | ### Paxos
200 | 
201 | * [Visualization of Paxos](http://harry.me/blog/2014/12/27/neat-algorithms-paxos/) with explanation
202 | 
203 | ### Raft
204 | 
205 | ### MapReduce
206 | 
207 | ### Distributed graph and machine learning algorithms
208 | 
209 | #### Papers
210 | * [Paper](https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf) on using z-values for implementing approximate k-nearest neighbors in a MapReduce framework. There is also a [background paper](http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf) on the topic, describing the non-distributed version.
211 | * [Paper](http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf) on sPCA -- Scalable Principal Component Analysis
212 | 
213 | ### Gossip Protocol
214 | 
215 | ### Chandy-Lamport
216 | * [Chandy-Lamport Paper](http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf) on Distributed Snapshots, and a summary in [the morning paper](https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/) 
217 | 
218 | ### Load Balancing
219 | 
220 | ### Transactions
221 | 
222 | ### CAP Theorem
223 | * [Blog](http://blog.thislongrun.com/2015/03/the-cap-theorem-series.html) on nuances of the CAP theorem by Nicolas Liochon
224 | 
225 | # Background and Interview Prep
226 | 
227 | * [Repo](https://github.com/prakhar1989/awesome-courses) of awesome computer science courses.
228 | 
229 | ## General Guidance for Interviews
230 | * [Excellent post](http://blog.triplebyte.com/how-to-pass-a-programming-interview) from Triplebyte on preparing for interviews, both technically and strategically
231 | 
232 | ## Data Structures and Algorithms
233 | 
234 | ### MOOCs
235 | * [Part 1](https://www.coursera.org/course/algo) and [Part 2](https://www.coursera.org/course/algo2) of Tim Roughgarden's MOOC, based on his Stanford course.
236 | 
237 | ### Books
238 | * [Cracking the Coding Interview](http://www.amazon.com/Cracking-Coding-Interview-6th-Programming/dp/0984782850/ref=sr_1_1?s=books&ie=UTF8&qid=1460873864&sr=1-1&keywords=cracking+the+coding+interview), with solutions in many languages [here](https://github.com/careercup/ctci)
239 | 
240 | ### Practice Websites
241 | * [Leetcode Online Judge](https://leetcode.com/)
242 | * [HackerRank](https://www.hackerrank.com/)
243 | 
244 | ## SQL and Database Design
245 | 
246 | ### MOOCs
247 | * [Jennifer Widom's self-paced MOOC](https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/about) from first principles, based on her Stanford course.
248 | 
249 | ### Practice Websites
250 | * [Leetcode Online Judge](https://leetcode.com/)
251 | 
252 | ## System Design
253 | * [Repo](https://github.com/checkcheckzz/system-design-interview) of many system design studies, resources, and strategies.
254 | 
255 | ## Software Engineering Best Practices
256 | 
257 | ## Programming Languages
258 | 
259 | ## Learning Linux Commands
260 | 
261 | ## Operating Systems and Networking
262 | 
263 | * Excellent [Review](https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/) of Fair Scheduling in Linux from The Morning Paper. 
264 | * [Blog](http://eng.localytics.com/performance-in-big-data-land-every-cpu-cycle-matters-part-1/) on the impact of saving CPU cycles while processing billions of records and the [effects of tuning CPU](http://eng.localytics.com/performance-in-big-data-land-every-cpu-cycle-matters-part-2/) from the Localytics engineering team.
265 | 


--------------------------------------------------------------------------------