├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── bigdata.bbl ├── bigdata.bib ├── bigdata.tex ├── docs ├── bigdata.css ├── bigdata10.html ├── bigdata11.html ├── bigdata12.html ├── bigdata13.html ├── bigdata14.html ├── bigdata15.html ├── bigdata16.html ├── bigdata17.html ├── bigdata18.html ├── bigdata19.html ├── bigdata2.html ├── bigdata3.html ├── bigdata4.html ├── bigdata5.html ├── bigdata6.html ├── bigdata7.html ├── bigdata8.html ├── bigdata9.html ├── images └── index.html └── images ├── MapReduce.png ├── PigHive_MR.png ├── PigHive_Tez.png ├── data-management.png ├── hadoop.png ├── hbase-architecture.png ├── hdfs-architecture.png ├── mesos-architecture.jpg ├── mongodb-replica-set.png ├── mongodb-sharding.png ├── mongodb-storage-structure.png ├── riak-data-distribution.png ├── riak-ring.png ├── yarn-architecture.png └── zookeeper.jpg /.gitignore: -------------------------------------------------------------------------------- 1 | bigdata.epub 2 | bigdata.pdf 3 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | GNU GENERAL PUBLIC LICENSE 2 | Version 2, June 1991 3 | 4 | Copyright (C) 1989, 1991 Free Software Foundation, Inc., 5 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA 6 | Everyone is permitted to copy and distribute verbatim copies 7 | of this license document, but changing it is not allowed. 8 | 9 | Preamble 10 | 11 | The licenses for most software are designed to take away your 12 | freedom to share and change it. By contrast, the GNU General Public 13 | License is intended to guarantee your freedom to share and change free 14 | software--to make sure the software is free for all its users. This 15 | General Public License applies to most of the Free Software 16 | Foundation's software and to any other program whose authors commit to 17 | using it. 
(Some other Free Software Foundation software is covered by 18 | the GNU Lesser General Public License instead.) You can apply it to 19 | your programs, too. 20 | 21 | When we speak of free software, we are referring to freedom, not 22 | price. Our General Public Licenses are designed to make sure that you 23 | have the freedom to distribute copies of free software (and charge for 24 | this service if you wish), that you receive source code or can get it 25 | if you want it, that you can change the software or use pieces of it 26 | in new free programs; and that you know you can do these things. 27 | 28 | To protect your rights, we need to make restrictions that forbid 29 | anyone to deny you these rights or to ask you to surrender the rights. 30 | These restrictions translate to certain responsibilities for you if you 31 | distribute copies of the software, or if you modify it. 32 | 33 | For example, if you distribute copies of such a program, whether 34 | gratis or for a fee, you must give the recipients all the rights that 35 | you have. You must make sure that they, too, receive or can get the 36 | source code. And you must show them these terms so they know their 37 | rights. 38 | 39 | We protect your rights with two steps: (1) copyright the software, and 40 | (2) offer you this license which gives you legal permission to copy, 41 | distribute and/or modify the software. 42 | 43 | Also, for each author's protection and ours, we want to make certain 44 | that everyone understands that there is no warranty for this free 45 | software. If the software is modified by someone else and passed on, we 46 | want its recipients to know that what they have is not the original, so 47 | that any problems introduced by others will not reflect on the original 48 | authors' reputations. 49 | 50 | Finally, any free program is threatened constantly by software 51 | patents. 
We wish to avoid the danger that redistributors of a free 52 | program will individually obtain patent licenses, in effect making the 53 | program proprietary. To prevent this, we have made it clear that any 54 | patent must be licensed for everyone's free use or not licensed at all. 55 | 56 | The precise terms and conditions for copying, distribution and 57 | modification follow. 58 | 59 | GNU GENERAL PUBLIC LICENSE 60 | TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 61 | 62 | 0. This License applies to any program or other work which contains 63 | a notice placed by the copyright holder saying it may be distributed 64 | under the terms of this General Public License. The "Program", below, 65 | refers to any such program or work, and a "work based on the Program" 66 | means either the Program or any derivative work under copyright law: 67 | that is to say, a work containing the Program or a portion of it, 68 | either verbatim or with modifications and/or translated into another 69 | language. (Hereinafter, translation is included without limitation in 70 | the term "modification".) Each licensee is addressed as "you". 71 | 72 | Activities other than copying, distribution and modification are not 73 | covered by this License; they are outside its scope. The act of 74 | running the Program is not restricted, and the output from the Program 75 | is covered only if its contents constitute a work based on the 76 | Program (independent of having been made by running the Program). 77 | Whether that is true depends on what the Program does. 78 | 79 | 1. 
You may copy and distribute verbatim copies of the Program's 80 | source code as you receive it, in any medium, provided that you 81 | conspicuously and appropriately publish on each copy an appropriate 82 | copyright notice and disclaimer of warranty; keep intact all the 83 | notices that refer to this License and to the absence of any warranty; 84 | and give any other recipients of the Program a copy of this License 85 | along with the Program. 86 | 87 | You may charge a fee for the physical act of transferring a copy, and 88 | you may at your option offer warranty protection in exchange for a fee. 89 | 90 | 2. You may modify your copy or copies of the Program or any portion 91 | of it, thus forming a work based on the Program, and copy and 92 | distribute such modifications or work under the terms of Section 1 93 | above, provided that you also meet all of these conditions: 94 | 95 | a) You must cause the modified files to carry prominent notices 96 | stating that you changed the files and the date of any change. 97 | 98 | b) You must cause any work that you distribute or publish, that in 99 | whole or in part contains or is derived from the Program or any 100 | part thereof, to be licensed as a whole at no charge to all third 101 | parties under the terms of this License. 102 | 103 | c) If the modified program normally reads commands interactively 104 | when run, you must cause it, when started running for such 105 | interactive use in the most ordinary way, to print or display an 106 | announcement including an appropriate copyright notice and a 107 | notice that there is no warranty (or else, saying that you provide 108 | a warranty) and that users may redistribute the program under 109 | these conditions, and telling the user how to view a copy of this 110 | License. (Exception: if the Program itself is interactive but 111 | does not normally print such an announcement, your work based on 112 | the Program is not required to print an announcement.) 
113 | 114 | These requirements apply to the modified work as a whole. If 115 | identifiable sections of that work are not derived from the Program, 116 | and can be reasonably considered independent and separate works in 117 | themselves, then this License, and its terms, do not apply to those 118 | sections when you distribute them as separate works. But when you 119 | distribute the same sections as part of a whole which is a work based 120 | on the Program, the distribution of the whole must be on the terms of 121 | this License, whose permissions for other licensees extend to the 122 | entire whole, and thus to each and every part regardless of who wrote it. 123 | 124 | Thus, it is not the intent of this section to claim rights or contest 125 | your rights to work written entirely by you; rather, the intent is to 126 | exercise the right to control the distribution of derivative or 127 | collective works based on the Program. 128 | 129 | In addition, mere aggregation of another work not based on the Program 130 | with the Program (or with a work based on the Program) on a volume of 131 | a storage or distribution medium does not bring the other work under 132 | the scope of this License. 133 | 134 | 3. 
You may copy and distribute the Program (or a work based on it, 135 | under Section 2) in object code or executable form under the terms of 136 | Sections 1 and 2 above provided that you also do one of the following: 137 | 138 | a) Accompany it with the complete corresponding machine-readable 139 | source code, which must be distributed under the terms of Sections 140 | 1 and 2 above on a medium customarily used for software interchange; or, 141 | 142 | b) Accompany it with a written offer, valid for at least three 143 | years, to give any third party, for a charge no more than your 144 | cost of physically performing source distribution, a complete 145 | machine-readable copy of the corresponding source code, to be 146 | distributed under the terms of Sections 1 and 2 above on a medium 147 | customarily used for software interchange; or, 148 | 149 | c) Accompany it with the information you received as to the offer 150 | to distribute corresponding source code. (This alternative is 151 | allowed only for noncommercial distribution and only if you 152 | received the program in object code or executable form with such 153 | an offer, in accord with Subsection b above.) 154 | 155 | The source code for a work means the preferred form of the work for 156 | making modifications to it. For an executable work, complete source 157 | code means all the source code for all modules it contains, plus any 158 | associated interface definition files, plus the scripts used to 159 | control compilation and installation of the executable. However, as a 160 | special exception, the source code distributed need not include 161 | anything that is normally distributed (in either source or binary 162 | form) with the major components (compiler, kernel, and so on) of the 163 | operating system on which the executable runs, unless that component 164 | itself accompanies the executable. 
165 | 166 | If distribution of executable or object code is made by offering 167 | access to copy from a designated place, then offering equivalent 168 | access to copy the source code from the same place counts as 169 | distribution of the source code, even though third parties are not 170 | compelled to copy the source along with the object code. 171 | 172 | 4. You may not copy, modify, sublicense, or distribute the Program 173 | except as expressly provided under this License. Any attempt 174 | otherwise to copy, modify, sublicense or distribute the Program is 175 | void, and will automatically terminate your rights under this License. 176 | However, parties who have received copies, or rights, from you under 177 | this License will not have their licenses terminated so long as such 178 | parties remain in full compliance. 179 | 180 | 5. You are not required to accept this License, since you have not 181 | signed it. However, nothing else grants you permission to modify or 182 | distribute the Program or its derivative works. These actions are 183 | prohibited by law if you do not accept this License. Therefore, by 184 | modifying or distributing the Program (or any work based on the 185 | Program), you indicate your acceptance of this License to do so, and 186 | all its terms and conditions for copying, distributing or modifying 187 | the Program or works based on it. 188 | 189 | 6. Each time you redistribute the Program (or any work based on the 190 | Program), the recipient automatically receives a license from the 191 | original licensor to copy, distribute or modify the Program subject to 192 | these terms and conditions. You may not impose any further 193 | restrictions on the recipients' exercise of the rights granted herein. 194 | You are not responsible for enforcing compliance by third parties to 195 | this License. 196 | 197 | 7. 
If, as a consequence of a court judgment or allegation of patent 198 | infringement or for any other reason (not limited to patent issues), 199 | conditions are imposed on you (whether by court order, agreement or 200 | otherwise) that contradict the conditions of this License, they do not 201 | excuse you from the conditions of this License. If you cannot 202 | distribute so as to satisfy simultaneously your obligations under this 203 | License and any other pertinent obligations, then as a consequence you 204 | may not distribute the Program at all. For example, if a patent 205 | license would not permit royalty-free redistribution of the Program by 206 | all those who receive copies directly or indirectly through you, then 207 | the only way you could satisfy both it and this License would be to 208 | refrain entirely from distribution of the Program. 209 | 210 | If any portion of this section is held invalid or unenforceable under 211 | any particular circumstance, the balance of the section is intended to 212 | apply and the section as a whole is intended to apply in other 213 | circumstances. 214 | 215 | It is not the purpose of this section to induce you to infringe any 216 | patents or other property right claims or to contest validity of any 217 | such claims; this section has the sole purpose of protecting the 218 | integrity of the free software distribution system, which is 219 | implemented by public license practices. Many people have made 220 | generous contributions to the wide range of software distributed 221 | through that system in reliance on consistent application of that 222 | system; it is up to the author/donor to decide if he or she is willing 223 | to distribute software through any other system and a licensee cannot 224 | impose that choice. 225 | 226 | This section is intended to make thoroughly clear what is believed to 227 | be a consequence of the rest of this License. 228 | 229 | 8. 
If the distribution and/or use of the Program is restricted in 230 | certain countries either by patents or by copyrighted interfaces, the 231 | original copyright holder who places the Program under this License 232 | may add an explicit geographical distribution limitation excluding 233 | those countries, so that distribution is permitted only in or among 234 | countries not thus excluded. In such case, this License incorporates 235 | the limitation as if written in the body of this License. 236 | 237 | 9. The Free Software Foundation may publish revised and/or new versions 238 | of the General Public License from time to time. Such new versions will 239 | be similar in spirit to the present version, but may differ in detail to 240 | address new problems or concerns. 241 | 242 | Each version is given a distinguishing version number. If the Program 243 | specifies a version number of this License which applies to it and "any 244 | later version", you have the option of following the terms and conditions 245 | either of that version or of any later version published by the Free 246 | Software Foundation. If the Program does not specify a version number of 247 | this License, you may choose any version ever published by the Free Software 248 | Foundation. 249 | 250 | 10. If you wish to incorporate parts of the Program into other free 251 | programs whose distribution conditions are different, write to the author 252 | to ask for permission. For software which is copyrighted by the Free 253 | Software Foundation, write to the Free Software Foundation; we sometimes 254 | make exceptions for this. Our decision will be guided by the two goals 255 | of preserving the free status of all derivatives of our free software and 256 | of promoting the sharing and reuse of software generally. 257 | 258 | NO WARRANTY 259 | 260 | 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY 261 | FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN 262 | OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES 263 | PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED 264 | OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 265 | MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS 266 | TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE 267 | PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, 268 | REPAIR OR CORRECTION. 269 | 270 | 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING 271 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR 272 | REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, 273 | INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING 274 | OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED 275 | TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY 276 | YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER 277 | PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE 278 | POSSIBILITY OF SUCH DAMAGES. 279 | 280 | END OF TERMS AND CONDITIONS 281 | 282 | How to Apply These Terms to Your New Programs 283 | 284 | If you develop a new program, and you want it to be of the greatest 285 | possible use to the public, the best way to achieve this is to make it 286 | free software which everyone can redistribute and change under these terms. 287 | 288 | To do so, attach the following notices to the program. It is safest 289 | to attach them to the start of each source file to most effectively 290 | convey the exclusion of warranty; and each file should have at least 291 | the "copyright" line and a pointer to where the full notice is found. 
292 | 293 | {description} 294 | Copyright (C) {year} {fullname} 295 | 296 | This program is free software; you can redistribute it and/or modify 297 | it under the terms of the GNU General Public License as published by 298 | the Free Software Foundation; either version 2 of the License, or 299 | (at your option) any later version. 300 | 301 | This program is distributed in the hope that it will be useful, 302 | but WITHOUT ANY WARRANTY; without even the implied warranty of 303 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 304 | GNU General Public License for more details. 305 | 306 | You should have received a copy of the GNU General Public License along 307 | with this program; if not, write to the Free Software Foundation, Inc., 308 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 309 | 310 | Also add information on how to contact you by electronic and paper mail. 311 | 312 | If the program is interactive, make it output a short notice like this 313 | when it starts in an interactive mode: 314 | 315 | Gnomovision version 69, Copyright (C) year name of author 316 | Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. 317 | This is free software, and you are welcome to redistribute it 318 | under certain conditions; type `show c' for details. 319 | 320 | The hypothetical commands `show w' and `show c' should show the appropriate 321 | parts of the General Public License. Of course, the commands you use may 322 | be called something other than `show w' and `show c'; they could even be 323 | mouse-clicks or menu items--whatever suits your program. 324 | 325 | You should also get your employer (if you work as a programmer) or your 326 | school, if any, to sign a "copyright disclaimer" for the program, if 327 | necessary. Here is a sample; alter the names: 328 | 329 | Yoyodyne, Inc., hereby disclaims all copyright interest in the program 330 | `Gnomovision' (which makes passes at compilers) written by James Hacker. 
331 | 332 | {signature of Ty Coon}, 1 April 1989 333 | Ty Coon, President of Vice 334 | 335 | This General Public License does not permit incorporating your program into 336 | proprietary programs. If your program is a subroutine library, you may 337 | consider it more useful to permit linking proprietary applications with the 338 | library. If this is what you want to do, use the GNU Lesser General 339 | Public License instead of this License. 340 | 341 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | all: README.md bigdata.epub 2 | 3 | README.md: bigdata.tex 4 | pandoc -s -S --toc --webtex --filter pandoc-citeproc --metadata bibliography=bigdata.bib $< -o $@ 5 | 6 | bigdata.epub: bigdata.tex 7 | pandoc -S --toc $< -o $@ 8 | 9 | bigdata.pdf: bigdata.tex 10 | pandoc --toc $< -o $@ 11 | -------------------------------------------------------------------------------- /bigdata.bbl: -------------------------------------------------------------------------------- 1 | \begin{thebibliography}{10} 2 | 3 | \bibitem{Accenture13Seattle} 4 | Accenture. 5 | \newblock Accenture analytics and smart building solutions are helping seattle 6 | boost energy efficiency downtown, 2013. 7 | 8 | \bibitem{Accenture14SmartGrid} 9 | Accenture. 10 | \newblock Accenture to help thames water prove the benefits of smart monitoring 11 | capabilities, 2014. 12 | 13 | \bibitem{AdpHcm} 14 | ADP. 15 | \newblock Adp hcm solutions for large business. 16 | \newblock \url{http://www.adp.com/solutions/large-business/products.aspx}, 17 | 2014. 18 | 19 | \bibitem{AMPLabBenchmark2014} 20 | AMPLab. 21 | \newblock Big data benchmark. 22 | \newblock \url{http://amplab.cs.berkeley.edu/benchmark/}. 23 | 24 | \bibitem{Accumulo} 25 | Apache. 26 | \newblock Accumulo. 27 | \newblock \url{http://accumulo.apache.org}. 28 | 29 | \bibitem{Cassandra} 30 | Apache. 31 | \newblock Cassandra. 
32 | \newblock \url{http://cassandra.apache.org}. 33 | 34 | \bibitem{Chukwa} 35 | Apache. 36 | \newblock Chukwa. 37 | \newblock \url{http://chukwa.apache.org}. 38 | 39 | \bibitem{HdfsShell} 40 | Apache. 41 | \newblock File system shell guide. 42 | \newblock 43 | \url{http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html}. 44 | 45 | \bibitem{Flume} 46 | Apache. 47 | \newblock Flume. 48 | \newblock \url{http://flume.apache.org}. 49 | 50 | \bibitem{Hadoop} 51 | Apache. 52 | \newblock Hadoop. 53 | \newblock \url{http://hadoop.apache.org}. 54 | 55 | \bibitem{HBase} 56 | Apache. 57 | \newblock Hbase. 58 | \newblock \url{http://hbase.apache.org}. 59 | 60 | \bibitem{HdfsThrift} 61 | Apache. 62 | \newblock {HDFS} {API}s in perl, python, ruby and php. 63 | \newblock \url{http://wiki.apache.org/hadoop/HDFS-APIs}. 64 | 65 | \bibitem{Kafka} 66 | Apache. 67 | \newblock Kafka. 68 | \newblock \url{http://kafka.apache.org}. 69 | 70 | \bibitem{MapReduceTutorial} 71 | Apache. 72 | \newblock Mapreduce tutorial. 73 | \newblock 74 | \url{http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html}. 75 | 76 | \bibitem{Maven} 77 | Apache. 78 | \newblock Maven. 79 | \newblock \url{http://maven.apache.org}. 80 | 81 | \bibitem{Spark} 82 | Apache. 83 | \newblock Spark. 84 | \newblock \url{http://spark.apache.org}. 85 | 86 | \bibitem{Sqoop} 87 | Apache. 88 | \newblock Sqoop. 89 | \newblock \url{http://sqoop.apache.org}. 90 | 91 | \bibitem{Storm} 92 | Apache. 93 | \newblock Storm. 94 | \newblock \url{http://storm.apache.org}. 95 | 96 | \bibitem{Tez} 97 | Apache. 98 | \newblock Tez. 99 | \newblock \url{http://tez.apache.org}. 100 | 101 | \bibitem{Thrift} 102 | Apache. 103 | \newblock Thrift. 104 | \newblock \url{http://thrift.apache.org}. 105 | 106 | \bibitem{ZooKeeper} 107 | Apache. 108 | \newblock Zookeeper. 109 | \newblock \url{http://zookeeper.apache.org}. 110 | 111 | \bibitem{Riak} 112 | Basho. 
113 | \newblock Riak. 114 | \newblock \url{http://basho.com/riak/}. 115 | 116 | \bibitem{opac:2009} 117 | Philip~A. Bernstein and Eric Newcomer. 118 | \newblock {\em Principles of transaction processing}. 119 | \newblock The Morgan Kaufmann series in data management systems. Morgan 120 | Kaufmann Publishers, Burlington, MA, second edition, 2009. 121 | 122 | \bibitem{Bloom:1970:STH} 123 | Burton~H. Bloom. 124 | \newblock Space/time trade-offs in hash coding with allowable errors. 125 | \newblock {\em Commun. ACM}, 13(7):422--426, July 1970. 126 | 127 | \bibitem{Brewer:2000:TRD} 128 | Eric~A. Brewer. 129 | \newblock Towards robust distributed systems (abstract). 130 | \newblock In {\em Proceedings of the Nineteenth Annual ACM Symposium on 131 | Principles of Distributed Computing}, PODC '00, 2000. 132 | 133 | \bibitem{Brewer:2012} 134 | Eric~A. Brewer. 135 | \newblock Cap twelve years later: How the ``rules'' have changed. 136 | \newblock {\em Computer}, 45(2):23--29, February 2012. 137 | 138 | \bibitem{Chang:2006:BDS} 139 | Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson~C. Hsieh, Deborah~A. Wallach, 140 | Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert~E. Gruber. 141 | \newblock Bigtable: A distributed storage system for structured data. 142 | \newblock In {\em Proceedings of the 7th USENIX Symposium on Operating Systems 143 | Design and Implementation - Volume 7}, OSDI '06, 2006. 144 | 145 | \bibitem{ClouderaImpala2014} 146 | Cloudera. 147 | \newblock New sql choices in the apache hadoop ecosystem: Why impala continues 148 | to lead. 149 | \newblock 150 | \url{http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/}. 151 | 152 | \bibitem{LevelDB} 153 | Jeffrey Dean and Sanjay Ghemawat. 154 | \newblock Leveldb. 155 | \newblock \url{http://leveldb.org}. 156 | 157 | \bibitem{Dean:2008:MSD} 158 | Jeffrey Dean and Sanjay Ghemawat. 159 | \newblock Mapreduce: Simplified data processing on large clusters. 
160 | \newblock {\em Commun. ACM}, 51(1):107--113, January 2008. 161 | 162 | \bibitem{DeCandia:2007:DAH} 163 | Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, 164 | Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, 165 | and Werner Vogels. 166 | \newblock Dynamo: {Amazon's} highly available key-value store. 167 | \newblock In {\em Proceedings of Twenty-first ACM SIGOPS Symposium on Operating 168 | Systems Principles}, SOSP '07, pages 205--220, 2007. 169 | 170 | \bibitem{DeWitt:1992:PDS} 171 | David DeWitt and Jim Gray. 172 | \newblock Parallel database systems: The future of high performance database 173 | systems. 174 | \newblock {\em Commun. ACM}, 35(6):85--98, June 1992. 175 | 176 | \bibitem{fidge1988timestamps} 177 | C.~J. Fidge. 178 | \newblock Timestamps in message-passing systems that preserve the partial 179 | ordering. 180 | \newblock {\em Proceedings of the 11th Australian Computer Science Conference}, 181 | 10(1):56--66, 1988. 182 | 183 | \bibitem{Forum:1994:MMI} 184 | Message Passing Interface Forum. 185 | \newblock Mpi: A message-passing interface standard. 186 | \newblock Technical report, University of Tennessee, Knoxville, TN, USA, 1994. 187 | 188 | \bibitem{Gartner2014} 189 | Gartner. 190 | \newblock Gartner's 2014 hype cycle for emerging technologies maps the journey 191 | to digital business. 192 | \newblock \url{http://www.gartner.com/newsroom/id/2819918}, 2014. 193 | 194 | \bibitem{Stinger} 195 | Alan Gates. 196 | \newblock The stinger initiative: Making apache hive 100 times faster. 197 | \newblock \url{http://hortonworks.com/blog/100x-faster-hive/}. 198 | 199 | \bibitem{IndustrialInternetReport2014} 200 | GE and Accenture. 201 | \newblock Industrial internet insights report, 2014. 202 | 203 | \bibitem{Ghemawat:2003:GFS} 204 | Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 205 | \newblock The google file system.
206 | \newblock In {\em Proceedings of the Nineteenth ACM Symposium on Operating 207 | Systems Principles}, SOSP '03, pages 29--43, 2003. 208 | 209 | \bibitem{Ghodsi:2011:DRF} 210 | Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and 211 | Ion Stoica. 212 | \newblock Dominant resource fairness: Fair allocation of multiple resource 213 | types. 214 | \newblock In {\em Proceedings of the 8th USENIX Conference on Networked Systems 215 | Design and Implementation}, NSDI'11, pages 323--336, 2011. 216 | 217 | \bibitem{Gilbert:2002:BCF} 218 | Seth Gilbert and Nancy Lynch. 219 | \newblock Brewer's conjecture and the feasibility of consistent, available, 220 | partition-tolerant web services. 221 | \newblock {\em SIGACT News}, 33(2):51--59, June 2002. 222 | 223 | \bibitem{Gropp:1999:UMA} 224 | William Gropp, Ewing Lusk, and Rajeev Thakur. 225 | \newblock {\em Using MPI-2: Advanced Features of the Message-Passing 226 | Interface}. 227 | \newblock MIT Press, Cambridge, MA, USA, 1999. 228 | 229 | \bibitem{HDFS2010:265} 230 | Apache Hadoop. 231 | \newblock Asf jira hdfs-265, 2010. 232 | 233 | \bibitem{YARN2011:279} 234 | Apache Hadoop. 235 | \newblock Asf jira mapreduce-279, 2011. 236 | 237 | \bibitem{HDFS} 238 | Apache Hadoop. 239 | \newblock Hdfs architecture. 240 | \newblock 241 | \url{http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html}, 242 | 2014. 243 | 244 | \bibitem{Hindman:2011:MPF} 245 | Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony~D. Joseph, 246 | Randy Katz, Scott Shenker, and Ion Stoica. 247 | \newblock Mesos: A platform for fine-grained resource sharing in the data 248 | center. 249 | \newblock In {\em Proceedings of the 8th USENIX Conference on Networked Systems 250 | Design and Implementation}, NSDI'11, pages 295--308, 2011. 251 | 252 | \bibitem{Watson2013Cancer} 253 | IBM. 
254 | \newblock Ibm watson helps fight cancer with evidence-based diagnosis and 255 | treatment suggestions, 2013. 256 | 257 | \bibitem{Watson2013Healthcare} 258 | IBM. 259 | \newblock Putting watson to work: Watson in healthcare, 2013. 260 | 261 | \bibitem{IBM2013} 262 | IBM. 263 | \newblock What is big data? 264 | \newblock 265 | \url{http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html}, 266 | 2013. 267 | 268 | \bibitem{Watson2014} 269 | IBM. 270 | \newblock Watson. 271 | \newblock \url{http://www.ibm.com/smarterplanet/us/en/ibmwatson/}, 2014. 272 | 273 | \bibitem{Isard:2007:DDD} 274 | Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 275 | \newblock Dryad: Distributed data-parallel programs from sequential building 276 | blocks. 277 | \newblock In {\em Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference 278 | on Computer Systems 2007}, EuroSys '07, pages 59--72, 2007. 279 | 280 | \bibitem{Jain:1988:ACD} 281 | Anil~K. Jain and Richard~C. Dubes. 282 | \newblock {\em Algorithms for Clustering Data}. 283 | \newblock Prentice-Hall, Inc., 1988. 284 | 285 | \bibitem{HBaseCoprocessor} 286 | Mingjie Lai, Eugene Koontz, and Andrew Purtell. 287 | \newblock Coprocessor introduction. 288 | \newblock \url{http://blogs.apache.org/hbase/entry/coprocessor_introduction}. 289 | 290 | \bibitem{Lakshman:2010:CDS} 291 | Avinash Lakshman and Prashant Malik. 292 | \newblock Cassandra: A decentralized structured storage system. 293 | \newblock {\em SIGOPS Oper. Syst. Rev.}, 44(2):35--40, April 2010. 294 | 295 | \bibitem{Lamport:1998:PP} 296 | Leslie Lamport. 297 | \newblock The part-time parliament. 298 | \newblock {\em ACM Trans. Comput. Syst.}, 16(2):133--169, May 1998. 299 | 300 | \bibitem{Laney2012} 301 | Douglas Laney. 302 | \newblock The importance of `big data': A definition. 303 | \newblock {\em Gartner}, June 2012. 304 | 305 | \bibitem{HBaseMTTR} 306 | Nicolas Liochon.
307 | \newblock Introduction to hbase mean time to recovery ({MTTR}). 308 | \newblock 309 | \url{http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/}. 310 | 311 | \bibitem{MacLeodClarke2012} 312 | David MacLeod and Nita Clarke. 313 | \newblock Engaging for success: enhancing performance through employee 314 | engagement, 2012. 315 | 316 | \bibitem{Mattern89virtualtime} 317 | Friedemann Mattern. 318 | \newblock Virtual time and global states of distributed systems. 319 | \newblock In {\em Parallel and Distributed Algorithms}, pages 215--226, 1989. 320 | 321 | \bibitem{McKusick:2009:GEF} 322 | Marshall~Kirk McKusick and Sean Quinlan. 323 | \newblock Gfs: Evolution on fast-forward. 324 | \newblock {\em Queue}, 7(7):10--20, August 2009. 325 | 326 | \bibitem{BSON} 327 | MongoDB. 328 | \newblock Bson. 329 | \newblock \url{http://bsonspec.org}. 330 | 331 | \bibitem{MongoDB} 332 | MongoDB. 333 | \newblock Mongodb. 334 | \newblock \url{https://www.mongodb.com}. 335 | 336 | \bibitem{WiredTiger} 337 | MongoDB. 338 | \newblock Wiredtiger. 339 | \newblock \url{http://www.wiredtiger.com}. 340 | 341 | \bibitem{NunesKambil2001} 342 | Paul~F. Nunes and Ajit Kambil. 343 | \newblock Personalization? {No} thanks. 344 | \newblock {\em Harvard Business Review}, April 2001. 345 | 346 | \bibitem{O'Neil96thelog-structured} 347 | Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil. 348 | \newblock The log-structured merge-tree (lsm-tree), 1996. 349 | 350 | \bibitem{Pavlo:2009:CAL} 351 | Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel~J. Abadi, David~J. DeWitt, 352 | Samuel Madden, and Michael Stonebraker. 353 | \newblock A comparison of approaches to large-scale data analysis. 354 | \newblock In {\em Proceedings of the 2009 ACM SIGMOD International Conference 355 | on Management of Data}, SIGMOD '09, pages 165--178, 2009. 
356 | 357 | \bibitem{Preguica:2012:BAE} 358 | Nuno Pregui\c{c}a, Carlos Baquero, Paulo~S{\'e}rgio Almeida, Victor Fonte, and 359 | Ricardo Gon\c{c}alves. 360 | \newblock Brief announcement: Efficient causality tracking in distributed 361 | storage systems with dotted version vectors. 362 | \newblock In {\em Proceedings of the 2012 ACM Symposium on Principles of 363 | Distributed Computing}, PODC '12, pages 335--336, 2012. 364 | 365 | \bibitem{Reed:2008:STO} 366 | Benjamin Reed and Flavio~P. Junqueira. 367 | \newblock A simple totally ordered broadcast protocol. 368 | \newblock In {\em Proceedings of the 2nd Workshop on Large-Scale Distributed 369 | Systems and Middleware}, LADIS '08, 2008. 370 | 371 | \bibitem{ReichheldSasser1990} 372 | Frederick~F. Reichheld and W.~Earl Sasser, Jr. 373 | \newblock Zero defections: Quality comes to services. 374 | \newblock {\em Harvard Business Review}, September 1990. 375 | 376 | \bibitem{Rowstron:2012:NEG} 377 | Antony Rowstron, Dushyanth Narayanan, Austin Donnelly, Greg O'Shea, and Andrew 378 | Douglas. 379 | \newblock Nobody ever got fired for using hadoop on a cluster. 380 | \newblock In {\em Proceedings of the 1st International Workshop on Hot Topics 381 | in Cloud Data Processing}, HotCDP '12, pages 1--5, 2012. 382 | 383 | \bibitem{TezTutorial} 384 | Bikas Saha. 385 | \newblock Apache tez: A new chapter in hadoop data processing. 386 | \newblock 387 | \url{http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/}. 388 | 389 | \bibitem{Schwarzkopf:2013:OFS} 390 | Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 391 | \newblock Omega: Flexible, scalable schedulers for large compute clusters. 392 | \newblock In {\em Proceedings of the 8th ACM European Conference on Computer 393 | Systems}, EuroSys '13, pages 351--364, 2013. 394 | 395 | \bibitem{CRDT2011} 396 | Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. 
397 | \newblock A comprehensive study of convergent and commutative replicated data 398 | types. 399 | \newblock Technical Report RR-7506, INRIA, 2011. 400 | 401 | \bibitem{Bitcask} 402 | Justin Sheehy and David Smith. 403 | \newblock Bitcask: A log-structured hash table for fast key/value data. 404 | \newblock 405 | \url{http://github.com/basho/basho_docs/raw/master/source/data/bitcask-intro.pdf}. 406 | 407 | \bibitem{Tar2Seq} 408 | Stuart Sierra. 409 | \newblock A million little files. 410 | \newblock \url{http://stuartsierra.com/2008/04/24/a-million-little-files}, 411 | 2008. 412 | 413 | \bibitem{Tachyon} 414 | Tachyon Team. 415 | \newblock Tachyon project. 416 | \newblock \url{http://tachyon-project.org}. 417 | 418 | \bibitem{Top500} 419 | Top500.org. 420 | \newblock Numerical wind tunnel: National aerospace laboratory of japan. 421 | \newblock 422 | \url{http://www.top500.org/featured/systems/numerical-wind-tunnel-national-aerospace-laboratory-of-japan/}, 423 | 2014. 424 | 425 | \bibitem{Facebook13Mouse} 426 | Lisa Vaas. 427 | \newblock Facebook mulls silently tracking users' cursor movements to see which 428 | ads we like best, 2013. 429 | 430 | \bibitem{VagateWilfong2014} 431 | Pamela Vagata and Kevin Wilfong. 432 | \newblock Scaling the facebook data warehouse to 300 {PB}. 433 | \newblock 434 | \url{https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/}, 435 | April 2014. 436 | 437 | \bibitem{SmallFiles} 438 | Tom White. 439 | \newblock The small files problem. 440 | \newblock \url{http://blog.cloudera.com/blog/2009/02/the-small-files-problem/}, 441 | 2009. 442 | 443 | \bibitem{Nvidia2014} 444 | Wikipedia. 445 | \newblock Nvidia tesla. 446 | \newblock \url{http://en.wikipedia.org/wiki/Nvidia_Tesla}, 2014. 447 | 448 | \bibitem{Zaharia:2012:RDD} 449 | Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy 450 | McCauley, Michael~J. Franklin, Scott Shenker, and Ion Stoica. 
451 | \newblock Resilient distributed datasets: A fault-tolerant abstraction for 452 | in-memory cluster computing. 453 | \newblock In {\em Proceedings of the 9th USENIX Conference on Networked Systems 454 | Design and Implementation}, NSDI'12, 2012. 455 | 456 | \bibitem{Zaharia:2010:SCC} 457 | Matei Zaharia, Mosharaf Chowdhury, Michael~J. Franklin, Scott Shenker, and Ion 458 | Stoica. 459 | \newblock Spark: Cluster computing with working sets. 460 | \newblock In {\em Proceedings of the 2Nd USENIX Conference on Hot Topics in 461 | Cloud Computing}, HotCloud'10, 2010. 462 | 463 | \end{thebibliography} 464 | -------------------------------------------------------------------------------- /bigdata.bib: -------------------------------------------------------------------------------- 1 | @misc{Gartner2014, 2 | AUTHOR="Gartner", 3 | TITLE="Gartner's 2014 Hype Cycle for Emerging Technologies Maps the Journey to Digital Business", 4 | howpublished={\url{http://www.gartner.com/newsroom/id/2819918}}, 5 | YEAR=2014 6 | } 7 | @misc{VagateWilfong2014, 8 | author = {Pamela Vagata and Kevin Wilfong}, 9 | title = {Scaling the Facebook data warehouse to 300 {PB}}, 10 | month = {April}, 11 | year = {2014}, 12 | howpublished={\url{https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/}} 13 | } 14 | @ARTICLE{Laney2012, 15 | AUTHOR="Douglas Laney", 16 | TITLE="The Importance of `Big Data': A Definition", 17 | JOURNAL="Gartner", 18 | month = {June}, 19 | YEAR=2012 20 | } 21 | @misc{IBM2013, 22 | AUTHOR="IBM", 23 | TITLE="What is big data?", 24 | howpublished={\url{http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html}}, 25 | YEAR=2013 26 | } 27 | @misc{Top500, 28 | AUTHOR="Top500.org", 29 | TITLE="Numerical Wind Tunnel: National Aerospace Laboratory of Japan", 30 | howpublished={\url{http://www.top500.org/featured/systems/numerical-wind-tunnel-national-aerospace-laboratory-of-japan/}}, 31 | YEAR=2014 32 | } 33 | @misc{Nvidia2014, 34 | 
AUTHOR="Wikipedia", 35 | TITLE="Nvidia Tesla", 36 | howpublished={\url{http://en.wikipedia.org/wiki/Nvidia_Tesla}}, 37 | YEAR=2014 38 | } 39 | @ARTICLE{NunesKambil2001, 40 | AUTHOR="Paul F. Nunes and Ajit Kambil", 41 | TITLE="Personalization? {No} Thanks.", 42 | JOURNAL="Harvard Business Review", 43 | howpublished={\url{http://hbr.org/2001/04/personalization-no-thanks/ar/1}}, 44 | month = {April}, 45 | YEAR=2001 46 | } 47 | @inproceedings{Huetal2000, 48 | author = "J. Hu and H.R. Wu and A. Jennings and X. Wang", 49 | title = "Fast and robust equalization: A case study", 50 | booktitle = "Proceedings of the World Multiconference on Systemics, Cybernetics and Informatics, (SCI 2000), Florida, USA, 23-26 July 2000", 51 | publisher = "International Institute of Informatics and Systemics", 52 | address = "FL, USA", 53 | pages = "398--403", 54 | year = "2000" 55 | } 56 | @Book{Conway2000, 57 | author = {Damian Conway}, 58 | title = {Object {O}riented {P}erl: {A} comprehensive guide to concepts and programming techniques}, 59 | publisher = {Manning Publications Co.}, 60 | year = {2000}, 61 | address = {Connecticut, USA} 62 | } 63 | @misc{YARN2011:279, 64 | author = {Apache Hadoop}, 65 | title = {ASF JIRA MAPREDUCE-279}, 66 | year = {2011} 67 | } 68 | @misc{Hadoop, 69 | author = {Apache}, 70 | title = {Hadoop}, 71 | howpublished={\url{http://hadoop.apache.org}} 72 | } 73 | @misc{Tez, 74 | author = {Apache}, 75 | title = {Tez}, 76 | howpublished={\url{http://tez.apache.org}} 77 | } 78 | @misc{TezTutorial, 79 | author = {Bikas Saha}, 80 | title = {Apache Tez: A New Chapter in Hadoop Data Processing}, 81 | howpublished={\url{http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/}} 82 | } 83 | @misc{Stinger, 84 | author = {Alan Gates}, 85 | title = {The Stinger Initiative: Making Apache Hive 100 Times Faster}, 86 | howpublished={\url{http://hortonworks.com/blog/100x-faster-hive/}} 87 | } 88 | @inproceedings{Hindman:2011:MPF, 89 | author = {Hindman, 
Benjamin and Konwinski, Andy and Zaharia, Matei and Ghodsi, Ali and Joseph, Anthony D. and Katz, Randy and Shenker, Scott and Stoica, Ion}, 90 | title = {Mesos: A Platform for Fine-grained Resource Sharing in the Data Center}, 91 | booktitle = {Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation}, 92 | series = {NSDI'11}, 93 | year = {2011}, 94 | location = {Boston, MA}, 95 | pages = {295--308}, 96 | numpages = {14}, 97 | url = {http://dl.acm.org/citation.cfm?id=1972457.1972488}, 98 | } 99 | @inproceedings{Schwarzkopf:2013:OFS, 100 | author = {Schwarzkopf, Malte and Konwinski, Andy and Abd-El-Malek, Michael and Wilkes, John}, 101 | title = {Omega: Flexible, Scalable Schedulers for Large Compute Clusters}, 102 | booktitle = {Proceedings of the 8th ACM European Conference on Computer Systems}, 103 | series = {EuroSys '13}, 104 | year = {2013}, 105 | location = {Prague, Czech Republic}, 106 | pages = {351--364}, 107 | numpages = {14}, 108 | url = {http://doi.acm.org/10.1145/2465351.2465386}, 109 | } 110 | @inproceedings{Ghodsi:2011:DRF, 111 | author = {Ghodsi, Ali and Zaharia, Matei and Hindman, Benjamin and Konwinski, Andy and Shenker, Scott and Stoica, Ion}, 112 | title = {Dominant Resource Fairness: Fair Allocation of Multiple Resource Types}, 113 | booktitle = {Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation}, 114 | series = {NSDI'11}, 115 | year = {2011}, 116 | location = {Boston, MA}, 117 | pages = {323--336}, 118 | numpages = {14}, 119 | url = {http://dl.acm.org/citation.cfm?id=1972457.1972490} 120 | } 121 | @inproceedings{Ghemawat:2003:GFS, 122 | author = {Ghemawat, Sanjay and Gobioff, Howard and Leung, Shun-Tak}, 123 | title = {The Google File System}, 124 | booktitle = {Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles}, 125 | series = {SOSP '03}, 126 | year = {2003}, 127 | location = {Bolton Landing, NY, USA}, 128 | pages = {29--43} 129 | } 130 | 
@techreport{Forum:1994:MMI, 131 | author = {Forum, Message P}, 132 | title = {MPI: A Message-Passing Interface Standard}, 133 | year = {1994}, 134 | institution = {University of Tennessee}, 135 | address = {Knoxville, TN, USA} 136 | } 137 | @article{Dean:2008:MSD, 138 | author = {Dean, Jeffrey and Ghemawat, Sanjay}, 139 | title = {MapReduce: Simplified Data Processing on Large Clusters}, 140 | journal = {Commun. ACM}, 141 | issue_date = {January 2008}, 142 | volume = {51}, 143 | number = {1}, 144 | month = jan, 145 | year = {2008}, 146 | pages = {107--113} 147 | } 148 | @book{Gropp:1999:UMA, 149 | author = {Gropp, William and Lusk, Ewing and Thakur, Rajeev}, 150 | title = {Using MPI-2: Advanced Features of the Message-Passing Interface}, 151 | year = {1999}, 152 | publisher = {MIT Press}, 153 | address = {Cambridge, MA, USA} 154 | } 155 | @article{McKusick:2009:GEF, 156 | author = {McKusick, Marshall Kirk and Quinlan, Sean}, 157 | title = {GFS: Evolution on Fast-forward}, 158 | journal = {Queue}, 159 | issue_date = {August 2009}, 160 | volume = {7}, 161 | number = {7}, 162 | month = Aug, 163 | year = {2009}, 164 | pages = {10--20} 165 | } 166 | @misc{HDFS, 167 | AUTHOR="Apache Hadoop", 168 | TITLE="HDFS Architecture", 169 | howpublished={\url{http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html}}, 170 | YEAR=2014 171 | } 172 | @misc{HDFS2010:265, 173 | author = {Apache Hadoop}, 174 | title = {ASF JIRA HDFS-265}, 175 | year = {2010} 176 | } 177 | @misc{MacLeodClarke2012, 178 | author = {David MacLeod and Nita Clarke}, 179 | title = {Engaging for Success: enhancing performance through employee engagement}, 180 | year = {2012} 181 | } 182 | @ARTICLE{ReichheldSasser1990, 183 | AUTHOR="Frederick F. Reichheld and W. 
Earl Sasser, Jr.", 184 | TITLE="Zero Defections: Quality Comes to Services", 185 | JOURNAL="Harvard Business Review", 186 | month = {September}, 187 | YEAR=1990 188 | } 189 | @misc{Accenture14SmartGrid, 190 | title = {Accenture to Help Thames Water Prove the Benefits of Smart Monitoring Capabilities}, 191 | author = {Accenture}, 192 | year = {2014} 193 | } 194 | @misc{Accenture13Seattle, 195 | title = {Accenture Analytics and Smart Building Solutions are helping Seattle boost energy efficiency downtown}, 196 | author = {Accenture}, 197 | year = {2013} 198 | } 199 | @misc{IndustrialInternetReport2014, 200 | title = {Industrial Internet Insights report}, 201 | author = {GE and Accenture}, 202 | year = {2014} 203 | } 204 | @misc{Watson2014, 205 | AUTHOR="IBM", 206 | TITLE="Watson", 207 | howpublished={\url{http://www.ibm.com/smarterplanet/us/en/ibmwatson/}}, 208 | YEAR=2014 209 | } 210 | @misc{Watson2013Healthcare, 211 | AUTHOR="IBM", 212 | TITLE="Putting Watson to Work: Watson in Healthcare", 213 | YEAR=2013 214 | } 215 | @misc{Watson2013Cancer, 216 | AUTHOR="IBM", 217 | TITLE="IBM Watson Helps Fight Cancer with Evidence-Based Diagnosis and Treatment Suggestions", 218 | YEAR=2013 219 | } 220 | @misc{Facebook13Mouse, 221 | AUTHOR="Lisa Vaas", 222 | TITLE="Facebook mulls silently tracking users' cursor movements to see which ads we like best", 223 | YEAR=2013 224 | } 225 | @misc{AdpHcm, 226 | AUTHOR="ADP", 227 | TITLE="ADP HCM Solutions for Large Business", 228 | howpublished={\url{http://www.adp.com/solutions/large-business/products.aspx}}, 229 | YEAR=2014 230 | } 231 | @misc{SmallFiles, 232 | AUTHOR="Tom White", 233 | institution = {Cloudera Blog}, 234 | TITLE="The Small Files Problem", 235 | howpublished={\url{http://blog.cloudera.com/blog/2009/02/the-small-files-problem/}}, 236 | YEAR=2009 237 | } 238 | @misc{Tar2Seq, 239 | AUTHOR="Stuart Sierra", 240 | TITLE="A Million Little Files", 241 | howpublished={\url{http://stuartsierra.com/2008/04/24/a-million-little-files}}, 
242 | YEAR=2008 243 | } 244 | @misc{Maven, 245 | AUTHOR="Apache", 246 | TITLE="Maven", 247 | howpublished={\url{http://maven.apache.org}} 248 | } 249 | @misc{Thrift, 250 | AUTHOR="Apache", 251 | TITLE="Thrift", 252 | howpublished={\url{http://thrift.apache.org}} 253 | } 254 | @misc{HdfsThrift, 255 | AUTHOR="Apache", 256 | TITLE="{HDFS} {API}s in perl, python, ruby and php", 257 | howpublished={\url{http://wiki.apache.org/hadoop/HDFS-APIs}} 258 | } 259 | @misc{Sqoop, 260 | AUTHOR="Apache", 261 | TITLE="Sqoop", 262 | howpublished={\url{http://sqoop.apache.org}} 263 | } 264 | @misc{Flume, 265 | AUTHOR="Apache", 266 | TITLE="Flume", 267 | howpublished={\url{http://flume.apache.org}} 268 | } 269 | @misc{Kafka, 270 | AUTHOR="Apache", 271 | TITLE="Kafka", 272 | howpublished={\url{http://kafka.apache.org}} 273 | } 274 | @misc{Chukwa, 275 | AUTHOR="Apache", 276 | TITLE="Chukwa", 277 | howpublished={\url{http://chukwa.apache.org}} 278 | } 279 | @misc{Storm, 280 | AUTHOR="Apache", 281 | TITLE="Storm", 282 | howpublished={\url{http://storm.apache.org}} 283 | } 284 | @misc{HdfsShell, 285 | AUTHOR="Apache", 286 | TITLE="File System Shell Guide", 287 | howpublished={\url{http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html}} 288 | } 289 | @inproceedings{Pavlo:2009:CAL, 290 | author = {Pavlo, Andrew and Paulson, Erik and Rasin, Alexander and Abadi, Daniel J. and DeWitt, David J. and Madden, Samuel and Stonebraker, Michael}, 291 | title = {A Comparison of Approaches to Large-scale Data Analysis}, 292 | booktitle = {Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data}, 293 | series = {SIGMOD '09}, 294 | year = {2009}, 295 | location = {Providence, Rhode Island, USA}, 296 | pages = {165--178} 297 | } 298 | @article{DeWitt:1992:PDS, 299 | author = {DeWitt, David and Gray, Jim}, 300 | title = {Parallel Database Systems: The Future of High Performance Database Systems}, 301 | journal = {Commun. 
ACM}, 302 | issue_date = {June 1992}, 303 | volume = {35}, 304 | number = {6}, 305 | month = jun, 306 | year = {1992}, 307 | pages = {85--98} 308 | } 309 | @book{Jain:1988:ACD, 310 | author = {Jain, Anil K. and Dubes, Richard C.}, 311 | title = {Algorithms for Clustering Data}, 312 | year = {1988}, 313 | publisher = {Prentice-Hall, Inc.} 314 | } 315 | @misc{MapReduceTutorial, 316 | AUTHOR="Apache", 317 | TITLE="MapReduce Tutorial", 318 | howpublished={\url{http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html}} 319 | } 320 | @inproceedings{Zaharia:2010:SCC, 321 | author = {Zaharia, Matei and Chowdhury, Mosharaf and Franklin, Michael J. and Shenker, Scott and Stoica, Ion}, 322 | title = {Spark: Cluster Computing with Working Sets}, 323 | booktitle = {Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing}, 324 | series = {HotCloud'10}, 325 | year = {2010}, 326 | location = {Boston, MA}, 327 | } 328 | @inproceedings{Zaharia:2012:RDD, 329 | author = {Zaharia, Matei and Chowdhury, Mosharaf and Das, Tathagata and Dave, Ankur and Ma, Justin and McCauley, Murphy and Franklin, Michael J. 
and Shenker, Scott and Stoica, Ion}, 330 | title = {Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing}, 331 | booktitle = {Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation}, 332 | series = {NSDI'12}, 333 | year = {2012}, 334 | location = {San Jose, CA}, 335 | } 336 | @misc{Spark, 337 | AUTHOR="Apache", 338 | TITLE="Spark", 339 | howpublished={\url{http://spark.apache.org}} 340 | } 341 | @inproceedings{Rowstron:2012:NEG, 342 | author = {Rowstron, Antony and Narayanan, Dushyanth and Donnelly, Austin and O'Shea, Greg and Douglas, Andrew}, 343 | title = {Nobody Ever Got Fired for Using Hadoop on a Cluster}, 344 | booktitle = {Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing}, 345 | series = {HotCDP '12}, 346 | year = {2012}, 347 | location = {Bern, Switzerland}, 348 | pages = {1--5} 349 | } 350 | @inproceedings{Isard:2007:DDD, 351 | author = {Isard, Michael and Budiu, Mihai and Yu, Yuan and Birrell, Andrew and Fetterly, Dennis}, 352 | title = {Dryad: Distributed Data-parallel Programs from Sequential Building Blocks}, 353 | booktitle = {Proceedings of the 2Nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007}, 354 | series = {EuroSys '07}, 355 | year = {2007}, 356 | location = {Lisbon, Portugal}, 357 | pages = {59--72} 358 | } 359 | @misc{Tachyon, 360 | AUTHOR="Tachyon Team", 361 | TITLE="Tachyon Project", 362 | howpublished={\url{http://tachyon-project.org}} 363 | } 364 | @inproceedings{Brewer:2000:TRD, 365 | author = {Brewer, Eric A.}, 366 | title = {Towards Robust Distributed Systems (Abstract)}, 367 | booktitle = {Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing}, 368 | series = {PODC '00}, 369 | year = {2000}, 370 | location = {Portland, Oregon, USA}, 371 | } 372 | @article{Gilbert:2002:BCF, 373 | author = {Gilbert, Seth and Lynch, Nancy}, 374 | title = {Brewer's Conjecture and the Feasibility 
of Consistent, Available, Partition-tolerant Web Services}, 375 | journal = {SIGACT News}, 376 | issue_date = {June 2002}, 377 | volume = {33}, 378 | number = {2}, 379 | month = jun, 380 | year = {2002}, 381 | pages = {51--59} 382 | } 383 | @article{Brewer:2012, 384 | author = {Brewer, Eric A.}, 385 | title = {CAP Twelve Years Later: How the ``Rules'' Have Changed}, 386 | journal = {Computer}, 387 | volume = {45}, 388 | number = {2}, 389 | month = Feb, 390 | year = {2012}, 391 | pages = {23--29} 392 | } 393 | @misc{ZooKeeper, 394 | AUTHOR="Apache", 395 | TITLE="Zookeeper", 396 | howpublished={\url{http://zookeeper.apache.org}} 397 | } 398 | @inproceedings{Reed:2008:STO, 399 | author = {Reed, Benjamin and Junqueira, Flavio P.}, 400 | title = {A Simple Totally Ordered Broadcast Protocol}, 401 | booktitle = {Proceedings of the 2Nd Workshop on Large-Scale Distributed Systems and Middleware}, 402 | series = {LADIS '08}, 403 | year = {2008}, 404 | location = {Yorktown Heights, New York} 405 | } 406 | @article{Lamport:1998:PP, 407 | author = {Lamport, Leslie}, 408 | title = {The Part-time Parliament}, 409 | journal = {ACM Trans. Comput. Syst.}, 410 | issue_date = {May 1998}, 411 | volume = {16}, 412 | number = {2}, 413 | month = may, 414 | year = {1998}, 415 | pages = {133--169} 416 | } 417 | @book{opac:2009, 418 | title = "Principles of transaction processing", 419 | author = "Bernstein, Philip A. and Newcomer, Eric", 420 | series = "The Morgan Kaufmann series in data management systems", 421 | edition = "Second", 422 | publisher = "Morgan Kaufmann Publishers", 423 | address = "Burlington, MA", 424 | year = 2009 425 | } 426 | @inproceedings{Chang:2006:BDS, 427 | author = {Chang, Fay and Dean, Jeffrey and Ghemawat, Sanjay and Hsieh, Wilson C. and Wallach, Deborah A. 
and Burrows, Mike and Chandra, Tushar and Fikes, Andrew and Gruber, Robert E.}, 428 | title = {Bigtable: A Distributed Storage System for Structured Data}, 429 | booktitle = {Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7}, 430 | series = {OSDI '06}, 431 | year = {2006}, 432 | location = {Seattle, WA} 433 | } 434 | @misc{HBase, 435 | AUTHOR="Apache", 436 | TITLE="Hbase", 437 | howpublished={\url{http://hbase.apache.org}} 438 | } 439 | @misc{Accumulo, 440 | AUTHOR="Apache", 441 | TITLE="Accumulo", 442 | howpublished={\url{http://accumulo.apache.org}} 443 | } 444 | @article{Bloom:1970:STH, 445 | author = {Bloom, Burton H.}, 446 | title = {Space/Time Trade-offs in Hash Coding with Allowable Errors}, 447 | journal = {Commun. ACM}, 448 | issue_date = {July 1970}, 449 | volume = {13}, 450 | number = {7}, 451 | month = jul, 452 | year = {1970}, 453 | pages = {422--426} 454 | } 455 | @misc{HBaseMTTR, 456 | AUTHOR="Nicolas Liochon", 457 | TITLE="Introduction to HBase Mean Time to Recovery ({MTTR})", 458 | howpublished={\url{http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/}} 459 | } 460 | @misc{HBaseCoprocessor, 461 | AUTHOR="Mingjie Lai and Eugene Koontz and Andrew Purtell", 462 | TITLE="Coprocessor Introduction", 463 | howpublished={\url{http://blogs.apache.org/hbase/entry/coprocessor_introduction}} 464 | } 465 | @misc{Riak, 466 | AUTHOR="Basho", 467 | TITLE="Riak", 468 | howpublished={\url{http://basho.com/riak/}} 469 | } 470 | @inproceedings{DeCandia:2007:DAH, 471 | author = {DeCandia, Giuseppe and Hastorun, Deniz and Jampani, Madan and Kakulapati, Gunavardhan and Lakshman, Avinash and Pilchin, Alex and Sivasubramanian, Swaminathan and Vosshall, Peter and Vogels, Werner}, 472 | title = {Dynamo: {Amazon's} Highly Available Key-value Store}, 473 | booktitle = {Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles}, 474 | series = {SOSP '07}, 475 | year = {2007}, 476 | 
location = {Stevenson, Washington, USA}, 477 | pages = {205--220} 478 | } 479 | @misc{LevelDB, 480 | AUTHOR="Jeffrey Dean and Sanjay Ghemawat", 481 | TITLE="LevelDB", 482 | howpublished={\url{http://leveldb.org}} 483 | } 484 | @misc{Bitcask, 485 | AUTHOR= "Justin Sheehy and David Smith", 486 | TITLE="Bitcask: A Log-Structured Hash Table for Fast Key/Value Data", 487 | howpublished={\url{http://github.com/basho/basho_docs/raw/master/source/data/bitcask-intro.pdf}} 488 | } 489 | @techreport{CRDT2011, 490 | author = {Marc Shapiro and Nuno Preguiça and Carlos Baquero and Marek Zawirski}, 491 | title = {A comprehensive study of Convergent and Commutative Replicated Data Types}, 492 | year = {2011}, 493 | number = {RR-7506}, 494 | institution = {INRIA} 495 | } 496 | @INPROCEEDINGS{Mattern89virtualtime, 497 | author = {Friedemann Mattern}, 498 | title = {Virtual Time and Global States of Distributed Systems}, 499 | booktitle = {Parallel and Distributed Algorithms}, 500 | year = {1989}, 501 | pages = {215--226} 502 | } 503 | @article{fidge1988timestamps, 504 | author = {Fidge, C. 
J.}, 505 | journal = {Proceedings of the 11th Australian Computer Science Conference}, 506 | number = 1, 507 | pages = {56--66}, 508 | title = {Timestamps in message-passing systems that preserve the partial ordering}, 509 | volume = 10, 510 | year = 1988 511 | } 512 | @inproceedings{Preguica:2012:BAE, 513 | author = {Pregui\c{c}a, Nuno and Baquero, Carlos and Almeida, Paulo S{\'e}rgio and Fonte, Victor and Gon\c{c}alves, Ricardo}, 514 | title = {Brief Announcement: Efficient Causality Tracking in Distributed Storage Systems with Dotted Version Vectors}, 515 | booktitle = {Proceedings of the 2012 ACM Symposium on Principles of Distributed Computing}, 516 | series = {PODC '12}, 517 | year = {2012}, 518 | location = {Madeira, Portugal}, 519 | pages = {335--336} 520 | } 521 | @misc{Cassandra, 522 | AUTHOR="Apache", 523 | TITLE="Cassandra", 524 | howpublished={\url{http://cassandra.apache.org}} 525 | } 526 | @article{Lakshman:2010:CDS, 527 | author = {Lakshman, Avinash and Malik, Prashant}, 528 | title = {Cassandra: A Decentralized Structured Storage System}, 529 | journal = {SIGOPS Oper. Syst. 
Rev.}, 530 | issue_date = {April 2010}, 531 | volume = {44}, 532 | number = {2}, 533 | month = apr, 534 | year = {2010}, 535 | pages = {35--40} 536 | } 537 | @MISC{O'Neil96thelog-structured, 538 | author = {Patrick O'Neil and Edward Cheng and Dieter Gawlick and Elizabeth O'Neil}, 539 | title = {The Log-Structured Merge-Tree (LSM-Tree)}, 540 | year = {1996} 541 | } 542 | @misc{MongoDB, 543 | AUTHOR="MongoDB", 544 | TITLE="MongoDB", 545 | howpublished={\url{https://www.mongodb.com}} 546 | } 547 | @misc{BSON, 548 | AUTHOR="MongoDB", 549 | TITLE="BSON", 550 | howpublished={\url{http://bsonspec.org}} 551 | } 552 | @misc{WiredTiger, 553 | AUTHOR="MongoDB", 554 | TITLE="WiredTiger", 555 | howpublished={\url{http://www.wiredtiger.com}} 556 | } 557 | @misc{ClouderaImpala2014, 558 | AUTHOR="Cloudera", 559 | TITLE="New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead", 560 | howpublished={\url{http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the-apache-hadoop-ecosystem-why-impala-continues-to-lead/}} 561 | } 562 | @misc{AMPLabBenchmark2014, 563 | AUTHOR="AMPLab", 564 | TITLE="Big Data Benchmark", 565 | howpublished={\url{http://amplab.cs.berkeley.edu/benchmark/}} 566 | } 567 | -------------------------------------------------------------------------------- /docs/bigdata.css: -------------------------------------------------------------------------------- 1 | 2 | /* start css.sty */ 3 | .cmr-10{font-size:83%;} 4 | .cmr-17x-x-120{font-size:170%;} 5 | .cmr-12x-x-120{font-size:120%;} 6 | .pcrr7t-x-x-120{font-family: monospace;} 7 | .cmmi-12{font-style: italic;} 8 | .cmmi-8{font-size:66%;font-style: italic;} 9 | .cmr-10x-x-109{font-size:90%;} 10 | .cmti-12{ font-style: italic;} 11 | .cmbx-12{ font-weight: bold;} 12 | .pcrr7t-x-x-109{font-size:90%;font-family: monospace;} 13 | .pcrr7t-{font-size:83%;font-family: monospace;} 14 | .pcrb7t-x-x-120{ font-family: monospace; font-weight: bold;} 15 | p.noindent { text-indent: 0em } 16 | td p.noindent { 
text-indent: 0em; margin-top:0em; } 17 | p.nopar { text-indent: 0em; } 18 | p.indent{ text-indent: 1.5em } 19 | @media print {div.crosslinks {visibility:hidden;}} 20 | a img { border-top: 0; border-left: 0; border-right: 0; } 21 | center { margin-top:1em; margin-bottom:1em; } 22 | td center { margin-top:0em; margin-bottom:0em; } 23 | .Canvas { position:relative; } 24 | img.math{vertical-align:middle;} 25 | li p.indent { text-indent: 0em } 26 | li p:first-child{ margin-top:0em; } 27 | li p:last-child, li div:last-child { margin-bottom:0.5em; } 28 | li p~ul:last-child, li p~ol:last-child{ margin-bottom:0.5em; } 29 | .enumerate1 {list-style-type:decimal;} 30 | .enumerate2 {list-style-type:lower-alpha;} 31 | .enumerate3 {list-style-type:lower-roman;} 32 | .enumerate4 {list-style-type:upper-alpha;} 33 | div.newtheorem { margin-bottom: 2em; margin-top: 2em;} 34 | .obeylines-h,.obeylines-v {white-space: nowrap; } 35 | div.obeylines-v p { margin-top:0; margin-bottom:0; } 36 | .overline{ text-decoration:overline; } 37 | .overline img{ border-top: 1px solid black; } 38 | td.displaylines {text-align:center; white-space:nowrap;} 39 | .centerline {text-align:center;} 40 | .rightline {text-align:right;} 41 | div.verbatim {font-family: monospace; white-space: nowrap; text-align:left; clear:both; } 42 | .fbox {padding-left:3.0pt; padding-right:3.0pt; text-indent:0pt; border:solid black 0.4pt; } 43 | div.fbox {display:table} 44 | div.center div.fbox {text-align:center; clear:both; padding-left:3.0pt; padding-right:3.0pt; text-indent:0pt; border:solid black 0.4pt; } 45 | div.minipage{width:100%;} 46 | div.center, div.center div.center {text-align: center; margin-left:1em; margin-right:1em;} 47 | div.center div {text-align: left;} 48 | div.flushright, div.flushright div.flushright {text-align: right;} 49 | div.flushright div {text-align: left;} 50 | div.flushleft {text-align: left;} 51 | .underline{ text-decoration:underline; } 52 | .underline img{ border-bottom: 1px solid black; 
margin-bottom:1pt; } 53 | .framebox-c, .framebox-l, .framebox-r { padding-left:3.0pt; padding-right:3.0pt; text-indent:0pt; border:solid black 0.4pt; } 54 | .framebox-c {text-align:center;} 55 | .framebox-l {text-align:left;} 56 | .framebox-r {text-align:right;} 57 | span.thank-mark{ vertical-align: super } 58 | span.footnote-mark sup.textsuperscript, span.footnote-mark a sup.textsuperscript{ font-size:80%; } 59 | div.tabular, div.center div.tabular {text-align: center; margin-top:0.5em; margin-bottom:0.5em; } 60 | table.tabular td p{margin-top:0em;} 61 | table.tabular {margin-left: auto; margin-right: auto;} 62 | td p:first-child{ margin-top:0em; } 63 | td p:last-child{ margin-bottom:0em; } 64 | div.td00{ margin-left:0pt; margin-right:0pt; } 65 | div.td01{ margin-left:0pt; margin-right:5pt; } 66 | div.td10{ margin-left:5pt; margin-right:0pt; } 67 | div.td11{ margin-left:5pt; margin-right:5pt; } 68 | table[rules] {border-left:solid black 0.4pt; border-right:solid black 0.4pt; } 69 | td.td00{ padding-left:0pt; padding-right:0pt; } 70 | td.td01{ padding-left:0pt; padding-right:5pt; } 71 | td.td10{ padding-left:5pt; padding-right:0pt; } 72 | td.td11{ padding-left:5pt; padding-right:5pt; } 73 | table[rules] {border-left:solid black 0.4pt; border-right:solid black 0.4pt; } 74 | .hline hr, .cline hr{ height : 1px; margin:0px; } 75 | .tabbing-right {text-align:right;} 76 | span.TEX {letter-spacing: -0.125em; } 77 | span.TEX span.E{ position:relative;top:0.5ex;left:-0.0417em;} 78 | a span.TEX span.E {text-decoration: none; } 79 | span.LATEX span.A{ position:relative; top:-0.5ex; left:-0.4em; font-size:85%;} 80 | span.LATEX span.TEX{ position:relative; left: -0.4em; } 81 | div.float, div.figure {margin-left: auto; margin-right: auto;} 82 | div.float img {text-align:center;} 83 | div.figure img {text-align:center;} 84 | .marginpar {width:20%; float:right; text-align:left; margin-left:auto; margin-top:0.5em; font-size:85%; text-decoration:underline;} 85 | .marginpar 
p{margin-top:0.4em; margin-bottom:0.4em;} 86 | table.equation {width:100%;} 87 | .equation td{text-align:center; } 88 | td.equation { margin-top:1em; margin-bottom:1em; } 89 | td.equation-label { width:5%; text-align:center; } 90 | td.eqnarray4 { width:5%; white-space: normal; } 91 | td.eqnarray2 { width:5%; } 92 | table.eqnarray-star, table.eqnarray {width:100%;} 93 | div.eqnarray{text-align:center;} 94 | div.array {text-align:center;} 95 | div.pmatrix {text-align:center;} 96 | table.pmatrix {width:100%;} 97 | span.pmatrix img{vertical-align:middle;} 98 | div.pmatrix {text-align:center;} 99 | table.pmatrix {width:100%;} 100 | span.bar-css {text-decoration:overline;} 101 | img.cdots{vertical-align:middle;} 102 | .partToc a, .partToc, .likepartToc a, .likepartToc {line-height: 200%; font-weight:bold; font-size:110%;} 103 | .chapterToc a, .chapterToc, .likechapterToc a, .likechapterToc, .appendixToc a, .appendixToc {line-height: 200%; font-weight:bold;} 104 | .index-item, .index-subitem, .index-subsubitem {display:block} 105 | div.caption {text-indent:-2em; margin-left:3em; margin-right:1em; text-align:left;} 106 | div.caption span.id{font-weight: bold; white-space: nowrap; } 107 | h1.partHead{text-align: center} 108 | p.bibitem { text-indent: -2em; margin-left: 2em; margin-top:0.6em; margin-bottom:0.6em; } 109 | p.bibitem-p { text-indent: 0em; margin-left: 2em; margin-top:0.6em; margin-bottom:0.6em; } 110 | .paragraphHead, .likeparagraphHead { margin-top:2em; font-weight: bold;} 111 | .subparagraphHead, .likesubparagraphHead { font-weight: bold;} 112 | .quote {margin-bottom:0.25em; margin-top:0.25em; margin-left:1em; margin-right:1em; text-align:justify;} 113 | .verse{white-space:nowrap; margin-left:2em} 114 | div.maketitle {text-align:center;} 115 | h2.titleHead{text-align:center;} 116 | div.maketitle{ margin-bottom: 2em; } 117 | div.author, div.date {text-align:center;} 118 | div.thanks{text-align:left; margin-left:10%; font-size:85%; font-style:italic; } 119 | 
div.author{white-space: nowrap;} 120 | .quotation {margin-bottom:0.25em; margin-top:0.25em; margin-left:1em; } 121 | h1.partHead{text-align: center} 122 | .chapterToc, .likechapterToc {margin-left:0em;} 123 | .chapterToc ~ .likesectionToc, .chapterToc ~ .sectionToc, .likechapterToc ~ .likesectionToc, .likechapterToc ~ .sectionToc {margin-left:2em;} 124 | .chapterToc ~ .likesectionToc ~ .likesubsectionToc, .chapterToc ~ .likesectionToc ~ .subsectionToc, .chapterToc ~ .sectionToc ~ .likesubsectionToc, .chapterToc ~ .sectionToc ~ .subsectionToc, .likechapterToc ~ .likesectionToc ~ .likesubsectionToc, .likechapterToc ~ .likesectionToc ~ .subsectionToc, .likechapterToc ~ .sectionToc ~ .likesubsectionToc, .likechapterToc ~ .sectionToc ~ .subsectionToc {margin-left:4em;} 125 | .chapterToc ~ .likesectionToc ~ .likesubsectionToc ~ .likesubsubsectionToc, .chapterToc ~ .likesectionToc ~ .likesubsectionToc ~ .subsubsectionToc, .chapterToc ~ .likesectionToc ~ .subsectionToc ~ .likesubsubsectionToc, .chapterToc ~ .likesectionToc ~ .subsectionToc ~ .subsubsectionToc, .chapterToc ~ .sectionToc ~ .likesubsectionToc ~ .likesubsubsectionToc, .chapterToc ~ .sectionToc ~ .likesubsectionToc ~ .subsubsectionToc, .chapterToc ~ .sectionToc ~ .subsectionToc ~ .likesubsubsectionToc, .chapterToc ~ .sectionToc ~ .subsectionToc ~ .subsubsectionToc, .likechapterToc ~ .likesectionToc ~ .likesubsectionToc ~ .likesubsubsectionToc, .likechapterToc ~ .likesectionToc ~ .likesubsectionToc ~ .subsubsectionToc, .likechapterToc ~ .likesectionToc ~ .subsectionToc ~ .likesubsubsectionToc, .likechapterToc ~ .likesectionToc ~ .subsectionToc ~ .subsubsectionToc, .likechapterToc ~ .sectionToc ~ .likesubsectionToc ~ .likesubsubsectionToc, .likechapterToc ~ .sectionToc ~ .likesubsectionToc ~ .subsubsectionToc, .likechapterToc ~ .sectionToc ~ .subsectionToc ~ .likesubsubsectionToc, .likechapterToc ~ .sectionToc ~ .subsectionToc ~ .subsubsectionToc {margin-left:6em;} 126 | .likesectionToc , .sectionToc
{margin-left:0em;} 127 | .likesectionToc ~ .likesubsectionToc, .likesectionToc ~ .subsectionToc, .sectionToc ~ .likesubsectionToc, .sectionToc ~ .subsectionToc {margin-left:2em;} 128 | .likesectionToc ~ .likesubsectionToc ~ .likesubsubsectionToc, .likesectionToc ~ .likesubsectionToc ~ .subsubsectionToc, .likesectionToc ~ .subsectionToc ~ .likesubsubsectionToc, .likesectionToc ~ .subsectionToc ~ .subsubsectionToc, .sectionToc ~ .likesubsectionToc ~ .likesubsubsectionToc, .sectionToc ~ .likesubsectionToc ~ .subsubsectionToc, .sectionToc ~ .subsectionToc ~ .likesubsubsectionToc, .sectionToc ~ .subsectionToc ~ .subsubsectionToc {margin-left:4em;} 129 | .likesubsectionToc, .subsectionToc {margin-left:0em;} 130 | .likesubsectionToc ~ .subsubsectionToc, .subsectionToc ~ .subsubsectionToc {margin-left:2em;} 131 | .figure img.graphics {margin-left:10%;} 132 | .lstlisting .label{margin-right:0.5em; } 133 | div.lstlisting{font-family: monospace; white-space: nowrap; margin-top:0.5em; margin-bottom:0.5em; } 134 | div.lstinputlisting{ font-family: monospace; white-space: nowrap; } 135 | .lstinputlisting .label{margin-right:0.5em;} 136 | /* end css.sty */ 137 | 138 | -------------------------------------------------------------------------------- /docs/bigdata10.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

1Google BigQuery is the public implementation of Dremel. 18 | BigQuery provides the core set of features available in Dremel to 20 | third-party developers via a REST API.

22 | 23 | 24 | -------------------------------------------------------------------------------- /docs/bigdata11.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

1In the asynchronous model, there is no clock, and nodes must 18 | make decisions based only on the messages received and local computation. 20 | In the asynchronous model an algorithm has no way of determining 22 | whether a message has been lost, or has been arbitrarily delayed in the 24 | transmission channel.

26 | 27 | 28 | -------------------------------------------------------------------------------- /docs/bigdata12.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

2In a partially synchronous model, every node has a clock, and all 18 | clocks increase at the same rate. However, the clocks themselves are not 20 | synchronized, in that they may display different values at the same real 22 | time.

24 | 25 | 26 | -------------------------------------------------------------------------------- /docs/bigdata13.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

3Once the system enters partition mode, one approach is to limit 18 | some operations, thereby reducing availability. The alternative allows 20 | inconsistency but records extra information about the operations that will 22 | be helpful during partition recovery.

24 | 25 | 26 | -------------------------------------------------------------------------------- /docs/bigdata14.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

4In contrast, the BlockCache keeps data blocks resident in memory 18 | after they’re read.

20 | 21 | 22 | -------------------------------------------------------------------------------- /docs/bigdata15.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

5Earlier versions used the MapFile format. A MapFile is actually 20 | a directory that contains two SequenceFiles: the data file and the index file. 22 |

23 | 24 | 25 | -------------------------------------------------------------------------------- /docs/bigdata16.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

6Riak was originally created to serve as a highly scalable session 18 | store.

20 | 21 | 22 | -------------------------------------------------------------------------------- /docs/bigdata17.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

7Storing objects larger than 50 MB is not recommended for performance 18 | reasons.

20 | 21 | 22 | -------------------------------------------------------------------------------- /docs/bigdata18.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

8The first reads will return the error 404 Not Found.

18 | 19 | 20 | -------------------------------------------------------------------------------- /docs/bigdata19.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

1YARN is a monolithic system, whereas Google’s Omega is a 20 | shared-state scheduler [71].

27 | 28 | 29 | -------------------------------------------------------------------------------- /docs/bigdata2.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

1Interestingly, people were even against the idea of personalization 18 | back in 2001 [63].

25 | 26 | 27 | -------------------------------------------------------------------------------- /docs/bigdata3.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

2Facebook is even testing data mining methods that would 18 | silently follow users’ mouse movements to see not only where we 20 | click but even where we pause, where we hover and for how long 22 | [77].

28 | 29 | 30 | -------------------------------------------------------------------------------- /docs/bigdata4.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

1Running multiple DataNodes on the same machine is possible but rare in 18 | production deployments.

20 | 21 | 22 | -------------------------------------------------------------------------------- /docs/bigdata5.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

2Archiving does not delete the input files. The user has to delete them 18 | manually to reduce namespace usage.

20 | 21 | 22 | -------------------------------------------------------------------------------- /docs/bigdata6.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

3Apache Thrift [20] is a software framework for scalable 23 | cross-language services development. It combines a software stack with a 25 | code generation engine to build services that work efficiently and 27 | seamlessly across C++, Java, Python, PHP, Ruby, Erlang, Perl, 29 | C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, 31 | and other languages.

33 | 34 | 35 | -------------------------------------------------------------------------------- /docs/bigdata7.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

4MPI-I/O [41] was added in MPI-2, but it is not easy to use, and 23 | good performance is difficult to achieve with it.

25 | 26 | 27 | -------------------------------------------------------------------------------- /docs/bigdata8.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

5Also frequently referred to as fold.

20 | 21 | 22 | -------------------------------------------------------------------------------- /docs/bigdata9.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 |
15 |

1Tachyon is a memory-centric distributed file system enabling reliable 18 | file sharing at memory speed.

20 | 21 | -------------------------------------------------------------------------------- /docs/images: -------------------------------------------------------------------------------- 1 | ../images -------------------------------------------------------------------------------- /images/MapReduce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/MapReduce.png -------------------------------------------------------------------------------- /images/PigHive_MR.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/PigHive_MR.png -------------------------------------------------------------------------------- /images/PigHive_Tez.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/PigHive_Tez.png -------------------------------------------------------------------------------- /images/data-management.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/data-management.png -------------------------------------------------------------------------------- /images/hadoop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/hadoop.png -------------------------------------------------------------------------------- /images/hbase-architecture.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/hbase-architecture.png -------------------------------------------------------------------------------- /images/hdfs-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/hdfs-architecture.png -------------------------------------------------------------------------------- /images/mesos-architecture.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/mesos-architecture.jpg -------------------------------------------------------------------------------- /images/mongodb-replica-set.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/mongodb-replica-set.png -------------------------------------------------------------------------------- /images/mongodb-sharding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/mongodb-sharding.png -------------------------------------------------------------------------------- /images/mongodb-storage-structure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/mongodb-storage-structure.png -------------------------------------------------------------------------------- /images/riak-data-distribution.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/riak-data-distribution.png -------------------------------------------------------------------------------- /images/riak-ring.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/riak-ring.png -------------------------------------------------------------------------------- /images/yarn-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/yarn-architecture.png -------------------------------------------------------------------------------- /images/zookeeper.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/haifengl/bigdata/52a4be5b2ebb64c8a99f5f9a7b6596a78f54f1af/images/zookeeper.jpg --------------------------------------------------------------------------------