└── README.md

/README.md:
--------------------------------------------------------------------------------
Sharing interesting and noteworthy Data Engineering content - namely blogs, podcasts, repos, books, videos, and MOOCs. This was mostly curated by and for Fellows in the [Insight Data Engineering Fellows Program](http://insightdataengineering.com), and inspired by the [repo](https://github.com/igorbarinov/awesome-data-engineering) of one of our Fellows, Igor Barinov.

If you have ideas or other interesting resources, feel free to open an Issue or Pull Request.

# Table of Contents

# Technologies
All technologies are listed alphabetically in their given section.

## Overviews
* [Data Engineering Ecosystem](http://insightdataengineering.com/blog/pipeline_map.html)

## File Formats

### Avro

### ORCFiles

### Parquet

### Protocol Buffers

### Thrift

## File Systems

### Hadoop Distributed File System (HDFS)

### S3

#### Blogs
* [Excellent summary](https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704) of the history of Hadoop by Marko Bonaci. This post is also [read as a podcast](http://softwareengineeringdaily.com/2016/02/06/history-of-hadoop/) by Software Engineering Daily.

## Databases

### Overviews
* [Jepsen](https://aphyr.com/tags/Jepsen) - Kyle Kingsbury's (Aphyr) guide on distributed systems and databases, and how they fail.

### Relational Databases

#### MySQL

#### Postgres

### Key-Value Databases

#### Redis

#### Riak

### Column-Family Databases

#### Accumulo

#### Cassandra
##### Blogs

* [Nice post](http://www.planetcassandra.org/blog/we-shall-have-order/) about using `clustering order by` in Cassandra
* [Post](http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling) by Datastax about the basics of data modeling in Cassandra

#### HBase

### Graph Databases

#### Neo4j

#### OrientDB

### Search Tools

#### Elasticsearch

#### Lucene

#### Solr

## General Batch Processing

### Hadoop MapReduce

#### Blogs
* [Excellent summary](https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704) of the history of Hadoop by Marko Bonaci. This post is also [read as a podcast](http://softwareengineeringdaily.com/2016/02/06/history-of-hadoop/) by Software Engineering Daily.

### Hadoop Abstractions

#### Cascalog

#### Cascading

#### Hadoop Streaming / mrjob

#### Hive

#### Pig

#### Scalding

### Spark

## Graph Processing

### Giraph

### GraphLab Create

### Spark GraphX

## Machine Learning Tools

### FlinkML

### H2O

### Mahout

### Spark MLlib

## Stream Processing

### Flink

#### Slides
* [Strata Talk](http://www.slideshare.net/KostasTzoumas/apache-flink-at-strata-san-jose-2016) by Kostas Tzoumas on Flink Streaming's capabilities.
* [Streaming Benchmark talk](http://www.slideshare.net/JamieGrier/extending-the-yahoo-streaming-benchmark) by Jamie Grier on extending Yahoo's Benchmark, based on this [blog](http://data-artisans.com/extending-the-yahoo-streaming-benchmark/)

#### Blogs
* [Asynchronous Snapshots Blog](http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/) by Data Artisans, and a summary in [the morning paper](https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/)

#### Papers
* [MillWheel Paper](http://research.google.com/pubs/pub41378.html), which discusses Low Watermarks for Exactly-Once Semantics
* [Asynchronous Snapshots Barrier Paper](http://arxiv.org/abs/1506.08603) describing Flink's snapshot algorithm
* [Chandy-Lamport Paper](http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf) on Distributed Snapshots, and a summary in [the morning paper](https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/)

### NiFi

### Samza

### Spark Streaming

### Storm


## Ingestion Tools

### Flume

### Logstash


## Messaging Queues / PubSub
### Kafka
#### Blogs
* [Part 1](http://www.confluent.io/blog/stream-data-platform-1/) and [Part 2](http://www.confluent.io/blog/stream-data-platform-2/) of Jay Kreps's series on streams in Kafka

* [Part 1](https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/) of a series of 3 blog posts on how Datadog monitors Kafka. Part 1 is an especially good intro to Kafka's architecture.

#### Podcasts

#### Videos
* [Video](https://www.youtube.com/watch?v=aJuo_bLSW6s&feature=youtu.be) by Jay Kreps on logs, stream processing and Kafka

### RabbitMQ

### ZeroMQ

## Workflow and Scheduling

### Airflow

#### Podcasts
* Interview with Maxime Beauchemin on Airflow, Airpal, and Caravel on Software Engineering Daily.

### Azkaban

### Luigi

### Oozie

## Cluster Management and Coordination

### Docker

### Kubernetes

### Mesos

### YARN

### Zookeeper

# Important Algorithms and Theorems

* [List of 100 Seminal Data Engineering Papers](https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan) from Anil Madan

## Distributed Systems

* [General Notes](https://github.com/aphyr/distsys-class) from Kyle Kingsbury (Aphyr) on Distributed Systems

### Paxos

* [Visualization of Paxos](http://harry.me/blog/2014/12/27/neat-algorithms-paxos/) with explanation

### RAFT

### MapReduce

### Distributed graph and machine learning algorithms

#### Papers
* [Paper](https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf) on using z-values for implementing approximate k-nearest neighbors in a MapReduce framework. There is also a [Background paper](http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf) on the topic, describing the non-distributed version.
* [Paper](http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf) on sPCA -- Scalable Principal Component Analysis

### Gossip Protocol

### Chandy-Lamport
* [Chandy-Lamport Paper](http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf) on Distributed Snapshots, and a summary in [the morning paper](https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/)

### Load Balancing

### Transactions

### CAP Theorem
* [Blog](http://blog.thislongrun.com/2015/03/the-cap-theorem-series.html) on nuances of the CAP theorem by Nicolas Liochon

# Background and Interview Prep

* [Repo](https://github.com/prakhar1989/awesome-courses) of awesome computer science courses.

## General Guidance for Interviews
* [Excellent post](http://blog.triplebyte.com/how-to-pass-a-programming-interview) from TripleByte on preparing for interviews, both technically and strategically

## Data Structures and Algorithms

### MOOCs
* [Part 1](https://www.coursera.org/course/algo) and [Part 2](https://www.coursera.org/course/algo2) of Tim Roughgarden's MOOC, based on his Stanford course.

### Books
* [Cracking the Coding Interview](http://www.amazon.com/Cracking-Coding-Interview-6th-Programming/dp/0984782850/ref=sr_1_1?s=books&ie=UTF8&qid=1460873864&sr=1-1&keywords=cracking+the+coding+interview), with solutions in many languages [here](https://github.com/careercup/ctci)

### Practice Websites
* [Leetcode Online Judge](https://leetcode.com/)
* [HackerRank](https://www.hackerrank.com/)

## SQL and Database Design

### MOOCs
* [Jennifer Widom's self-paced MOOC](https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/about) from first principles, based on her Stanford course.

### Practice Websites
* [Leetcode Online Judge](https://leetcode.com/)

## System Design
* [Repo](https://github.com/checkcheckzz/system-design-interview) of many system design studies, resources, and strategies.

## Software Engineering Best Practices

## Programming Languages

## Learning Linux Commands

## Operating Systems and Networking

* Excellent [Review](https://blog.acolyer.org/2016/04/26/the-linux-scheduler-a-decade-of-wasted-cores/) of Fair Scheduling in Linux from The Morning Paper.
* [Blog](http://eng.localytics.com/performance-in-big-data-land-every-cpu-cycle-matters-part-1/) on the impact of saving CPU cycles while processing billions of records and the [effects of tuning CPU](http://eng.localytics.com/performance-in-big-data-land-every-cpu-cycle-matters-part-2/) from the Localytics engineering team.

--------------------------------------------------------------------------------