├── BigDataProcessing.md
├── DatabaseImplementation.md
├── GraphProcessing.md
├── README.md
├── SparkGPU.md
└── SysML.md

/BigDataProcessing.md:
--------------------------------------------------------------------------------
# Distributed Data Processing Systems

**Note: papers in bold describe representative distributed data processing systems and are a good starting point for a basic understanding.**

## Distributed File Systems

* GFS
  + **Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google File System. In SOSP (pp. 29–43).**

## Coordination Services

* ZooKeeper
  + **Hunt, P., Konar, M., Junqueira, F. P., & Reed, B. (2010). ZooKeeper: Wait-free coordination for Internet-scale systems. In USENIX Annual Technical Conference (pp. 1–14).**
* Chubby
  + Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems. In OSDI (pp. 335–350).

## Resource Management Systems

* YARN
  + **Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., … Saha, B. (2013). Apache Hadoop YARN: Yet another resource negotiator. In SoCC (pp. 5:1–5:16).**
* Mesos
  + Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., … Stoica, I. (2010). Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI.

## Batch Processing Systems

* MapReduce
  + **Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. In OSDI (pp. 137–149).**
  + Yang, H., Dasdan, A., Hsiao, R., & Parker, D. S. (2007). Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In SIGMOD Conference (pp. 1029–1040).
  + Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S., Qiu, J., & Fox, G. (2010). Twister: a runtime for iterative MapReduce. In HPDC (pp. 810–818).
  + Bu, Y., Howe, B., & Ernst, M. D. (2010). HaLoop: Efficient Iterative Data Processing on Large Clusters. PVLDB, 3(1), 285–296.
  + Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., & Schad, J. (2010).
    Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1), 515–529.
  + Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R. R., Bradshaw, R., & Weizenbaum, N. (2010). FlumeJava: Easy, Efficient Data-parallel Pipelines. In PLDI (Vol. 45, pp. 363–375).

* Spark
  + **Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., … Stoica, I. (2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI (pp. 15–28).**
  + Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster Computing with Working Sets. In HotCloud (pp. 1–7).

* Stratosphere
  + Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., & Warneke, D. (2010). Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. In SoCC (pp. 119–130).
  + Ewen, S., Tzoumas, K., Kaufmann, M., & Markl, V. (2012). Spinning Fast Iterative Data Flows. PVLDB, 5(11), 1268–1279.
  + Hueske, F., Peters, M., Sax, M. J., Rheinländer, A., Bergmann, R., Krettek, A., & Tzoumas, K. (2012). Opening the Black Boxes in Data Flow Optimization. PVLDB, 5(11), 1256–1267.
  + Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J. C., Hueske, F., Heise, A., … Warneke, D. (2014). The Stratosphere platform for big data analytics. VLDB Journal, 23(6), 939–964.

* Dryad
  + Isard, M., Budiu, M., Yu, Y., Birrell, A., & Fetterly, D. (2007). Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys (pp. 59–72).

## Stream Processing Systems

* Early streaming system prototypes
  + Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., & Sears, R. (2010). MapReduce Online. In NSDI (pp. 313–328).
  + Lam, W., Liu, L., Prasad, S., Rajaraman, A., Vacheri, Z., & Doan, A. (2012). Muppet: MapReduce-style Processing of Fast Data. PVLDB, 5(12), 1814–1825.
  + Neumeyer, L., Robbins, B., Nair, A., & Kesari, A. (2010). S4: Distributed Stream Computing Platform. In ICDMW (pp. 170–177).

* Storm
  + **Toshniwal, A., Donham, J., Bhagat, N., Mittal, S., Ryaboy, D., Taneja, S., … Fu, M. (2014). Storm@twitter. In SIGMOD Conference (pp. 147–156).**
  + Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., … Taneja, S. (2015). Twitter Heron: Stream Processing at Scale. In SIGMOD Conference (pp. 239–250).
  + Fu, M., Agrawal, A., Floratou, A., Graham, B., Jorgensen, A., Li, M., … Wang, C. (2017). Twitter Heron: Towards Extensible Streaming Engines. In ICDE (pp. 1165–1172).

* Spark Streaming
  + Zaharia, M., Das, T., Li, H., Shenker, S., & Stoica, I. (2012). Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In HotCloud (pp. 10–10).
  + **Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., & Stoica, I. (2013). Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In SOSP (pp. 423–438).**

* MillWheel
  + Akidau, T., Balikov, A., Bekiroğlu, K., Chernyak, S., Haberman, J., Lax, R., … Whittle, S. (2013). MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB, 6(11), 1033–1044.

## Unified Batch and Stream Processing Systems

* Google Dataflow
  + **Akidau, T., Bradshaw, R., Chambers, C., Chernyak, S., Fernández-Moctezuma, R. J., Lax, R., … Whittle, S. (2015). The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. PVLDB, 8(12), 1792–1803.**

* Flink
  + Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., & Tzoumas, K. (2015). Apache Flink: Unified Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull., 38(4), 28–38.
  + Carbone, P., Ewen, S., Fóra, G., Haridi, S., Richter, S., & Tzoumas, K. (2017). State Management in Apache Flink. PVLDB, 10(12), 1718–1729.

* Spark Structured Streaming
  + Armbrust, M., Das, T., & Torres, J. (2018).
    Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In SIGMOD Conference (pp. 465–476).

* Beam
  + FlumeJava
  + MillWheel
  + Google Dataflow

## Graph Processing Systems

* Pregel
  + **Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: A System for Large-Scale Graph Processing. In SIGMOD Conference (pp. 135–145).**
  + Zhou, C., Gao, J., Sun, B., & Yu, J. X. (2014). MOCgraph: Scalable Distributed Graph Processing Using Message Online Computing. PVLDB, 8(4), 377–388.
  + Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., & McPherson, J. (2013). From “Think Like a Vertex” to “Think Like a Graph.” PVLDB, 7(3), 193–204.

* GraphX
  + Xin, R. S., Gonzalez, J. E., Franklin, M. J., & Stoica, I. (2013). GraphX: A Resilient Distributed Graph System on Spark. In GRADES (pp. 2:1–2:6).
  + Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., & Stoica, I. (2014). GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI (pp. 599–613).


## Surveys

* Wang Shan, Wang Huiju, Qin Xiongpai, Zhou Xuan. Architecting big data: challenges, current state, and outlook[J]. Chinese Journal of Computers, 2011, 34(10): 1741-1752. (In Chinese.)
* Qin Xiongpai, Wang Huiju, Du Xiaoyong, Wang Shan. Big data analysis: competition and symbiosis between RDBMS and MapReduce[J]. Journal of Software, 2012, 23(01): 32-45. (In Chinese.)
* Shen Derong, Yu Ge, Wang Xite, Nie Tiezheng, Kou Yue. A survey of research on NoSQL systems for big data management[J]. Journal of Software, 2013, 24(08): 1786-1803. (In Chinese.)
* Cheng Xueqi, Jin Xiaolong, Wang Yuanzhuo, Guo Jiafeng, Zhang Tieying, Li Guojie. A survey of big data systems and analytics techniques[J]. Journal of Software, 2014, 25(09): 1889-1908. (In Chinese.)
* Sun Dawei, Zhang Guangyan, Zheng Weimin. Big data stream computing: key technologies and system examples[J]. Journal of Software, 2014, 25(04): 839-862. (In Chinese.)
* Du Xiaoyong, Lu Wei, Zhang Feng. Big data management systems: history, current state, and future[J]. Journal of Software, 2019, 30(01): 127-141. (In Chinese.)
* Cui Bin, Gao Jun, Tong Yongxin, Xu Jianqiu, Zhang Dongxiang, Zou Lei. Research progress and trends in new types of data management systems[J]. Journal of Software, 2019, 30(01): 164-193. (In Chinese.)

--------------------------------------------------------------------------------
/DatabaseImplementation.md:
--------------------------------------------------------------------------------
# Database System Implementation


Classic textbook: [*Database System Implementation*](https://book.douban.com/subject/4838430/)

## Query Processing

- Query optimization
  - Begoli E, Camacho-Rodríguez J, Hyde J, et al. Apache Calcite: A foundational framework for optimized query processing over heterogeneous data sources[C]//Proceedings of the 2018 International Conference on Management of Data. ACM, 2018: 221-230.
- Query execution
  - tuple-at-a-time
    - Graefe G. Volcano - an extensible and parallel query evaluation system[J]. IEEE Transactions on Knowledge and Data Engineering, 1994, 6(1): 120-135.
    - Graefe G. Encapsulation of parallelism in the Volcano query processing system[M]. ACM, 1990.
  - vector-at-a-time
    - Padmanabhan S, Malkemus T, Jhingran A, et al. Block oriented processing of relational database operations in modern computer architectures[C]//Proceedings 17th International Conference on Data Engineering. IEEE, 2001: 567-574.
    - Boncz P A, Zukowski M, Nes N. MonetDB/X100: Hyper-Pipelining Query Execution[C]//CIDR. 2005, 5: 225-237.
    - Zukowski M, Boncz P A, Nes N, et al. MonetDB/X100-A DBMS In The CPU Cache[J]. IEEE Data Eng. Bull., 2005, 28(2): 17-22.
  - Code Generation
    - Neumann T. Efficiently compiling efficient query plans for modern hardware[J]. Proceedings of the VLDB Endowment, 2011, 4(9): 539-550.
    - Chrysogelos P, Karpathiotakis M, Appuswamy R, et al. HetExchange: Encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines[J]. Proceedings of the VLDB Endowment, 2019, 12(5): 544-556.
  - SIMD
    - Polychroniou O, Raghavan A, Ross K A. Rethinking SIMD vectorization for in-memory databases[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1493-1508.


## Transaction Management

--------------------------------------------------------------------------------
/GraphProcessing.md:
--------------------------------------------------------------------------------
# Graph Processing Systems


* Survey

  - Robert Ryan McCune, Tim Weninger, Greg Madey: Thinking Like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing. ACM Comput. Surv. 48(2): 25:1-25:39 (2015)
  - Vasiliki Kalavri, Vladimir Vlassov, Seif Haridi: High-Level Programming Abstractions for Distributed Graph Processing. IEEE Trans. Knowl. Data Eng. 30(2): 305-324 (2018)
  - Safiollah Heidari, Yogesh Simmhan, Rodrigo N. Calheiros, Rajkumar Buyya: Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges. ACM Comput. Surv. 51(3): 60:1-60:53 (2018)
  - Ge Yu, Yu Gu, Yubin Bao, Zhigang Wang: Large-scale graph data processing techniques in cloud computing environments. Chinese Journal of Computers 34(10): 1753–1767 (2011) (in Chinese)
  - Tongtong Wang, Chuitian Rong, Wei Lu, Xiaoyong Du: A survey of distributed graph processing system techniques. Journal of Software 29(3): 569–586 (2018) (in Chinese)
* System

  * Pregel

    * **Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010). Pregel: A System for Large-Scale Graph Processing. In SIGMOD Conference (pp. 135–145).**
    * Zhou, C., Gao, J., Sun, B., & Yu, J. X. (2014). MOCgraph: Scalable Distributed Graph Processing Using Message Online Computing. PVLDB, 8(4), 377–388.
    * Tian, Y., Balmin, A., Corsten, S. A., Tatikonda, S., & McPherson, J. (2013). From “Think Like a Vertex” to “Think Like a Graph.” PVLDB, 7(3), 193–204.
  * GraphX

    * Xin, R. S., Gonzalez, J. E., Franklin, M. J., & Stoica, I. (2013). GraphX: A Resilient Distributed Graph System on Spark. In GRADES (pp. 2:1–2:6).
    * Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., & Stoica, I. (2014). GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI (pp. 599–613).
  * GraphLab

    * Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein: Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB 5(8): 716-727 (2012)
    * Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin: PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012: 17-30
* Fault Tolerance
  - Yanyan Shen, Gang Chen, H. V. Jagadish, Wei Lu, Beng Chin Ooi, Bogdan Marius Tudor: Fast Failure Recovery in Distributed Graph Processing Systems. PVLDB 8(4): 437-448 (2014)
  - Peng Wang, Kaiyuan Zhang, Rong Chen, Haibo Chen, Haibing Guan: Replication-Based Fault-Tolerance for Large-Scale Graph Processing. DSN 2014: 562-573

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ReadingList

Paper lists organized by topic. When reading a group of papers, pay close attention to the connections and differences among the papers in that group.


[Database System Implementation](DatabaseImplementation.md): maintained by [@LiefB](https://github.com/LiefB)

[Distributed Data Processing Systems](BigDataProcessing.md)

* [Graph Processing Systems](GraphProcessing.md): maintained by [@ikroal](https://github.com/ikroal)

[Machine Learning Systems](SysML.md): maintained by [@Allen-Czyysx](https://github.com/Allen-Czyysx)

--------------------------------------------------------------------------------
/SparkGPU.md:
--------------------------------------------------------------------------------
# SparkGPU: Accelerate Spark with GPU

- Offload Spark computations to acceleration devices
  - Li P, Luo Y, Zhang N, et al. HeteroSpark: A heterogeneous CPU/GPU Spark platform for machine learning algorithms[C]//2015 IEEE International Conference on Networking, Architecture and Storage (NAS). IEEE, 2015: 347-348.
  - Manzi D, Tompkins D. Exploring GPU acceleration of Apache Spark[C]//2016 IEEE International Conference on Cloud Engineering (IC2E). IEEE, 2016: 222-223.
  - Choi W, Hong S, Jeong W K.
    Vispark: GPU-accelerated distributed visual computing using Spark[J]. SIAM Journal on Scientific Computing, 2016, 38(5): S700-S719.
  - Yuan Y, Salmi M F, Huai Y, et al. Spark-GPU: An accelerated in-memory data processing engine on clusters[C]//2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016: 273-283.
- Reduce data transfer over PCIe while accelerating Spark with GPU
  - Hong S, Choi W, Jeong W K. GPU in-memory processing using Spark for iterative computation[C]//Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE Press, 2017: 31-41.
  - Asai R, Okita M, Ino F, et al. Transparent Avoidance of Redundant Data Transfer on GPU-enabled Apache Spark[C]//Proceedings of the 11th Workshop on General Purpose GPUs. ACM, 2018: 22-30.
- Reduce data transfer over GMem by kernel fusion
  - Wahib M, Maruyama N. Scalable kernel fusion for memory-bound GPU applications[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 2014: 191-202.
  - Funke H, Breß S, Noll S, et al. Pipelined query processing in coprocessor environments[C]//Proceedings of the 2018 International Conference on Management of Data. ACM, 2018: 1603-1618.

--------------------------------------------------------------------------------
/SysML.md:
--------------------------------------------------------------------------------
## SysML: The New Frontier of Machine Learning Systems

* Training System
  + Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., & Hellerstein, J. M. (2012). Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB, 5(8), 716–727.
  + Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., … Su, B.-Y. (2014). Scaling Distributed Machine Learning with the Parameter Server. In OSDI (pp. 583–598).
  + Boehm, M., Surve, A.
    C., Tatikonda, S., Dusenberry, M. W., Eriksson, D., Evfimievski, A. V., … Sen, P. (2016). SystemML: Declarative Machine Learning on Spark. PVLDB, 9(13), 1425–1436.
  + Parameter Server
    - Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … Zheng, X. (2016). TensorFlow: A System for Large-Scale Machine Learning. In OSDI (pp. 265–284).
    - Huang, Y., Jin, T., Wu, Y., Cai, Z., Yan, X., Yang, F., … Cheng, J. (2018). FlexPS: Flexible parallelism control in parameter server architecture. Proceedings of the VLDB Endowment, 11(5), 566–579.

* Inference/Serving System
  + Lee, Y., Scolari, A., Chun, B.-G., Santambrogio, M. D., Weimer, M., & Interlandi, M. (2018). PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. In OSDI.
  + Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E., & Stoica, I. (2017). Clipper: A Low-Latency Online Prediction Serving System. In NSDI.
  + Olston, C., & Harmsen, J. (2017). TensorFlow-Serving: Flexible, High-Performance ML Serving. In NIPS (pp. 1–8).
  + Crankshaw, D., Bailis, P., Gonzalez, J. E., Li, H., Zhang, Z., Franklin, M. J., … Jordan, M. I. (2015). The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox. In CIDR.

* ML Pipelines
  + Shang, Z., Zgraggen, E., Buratti, B., Chung, Y., Eichmann, P., Binnig, C., … Kraska, T. (2019). Democratizing Data Science through Interactive Curation of ML Pipelines. In SIGMOD Conference (pp. 1171–1188).
  + Kunft, A., Katsifodimos, A., & Schelter, S. (2019). An Intermediate Representation for Optimizing Machine Learning Pipelines. PVLDB, 12(11), 1553–1567.
  + Xin, D., Macke, S., Ma, L., Liu, J., Song, S., & Parameswaran, A. (2018). Helix: Holistic Optimization for Accelerating Iterative Machine Learning. PVLDB, 12(4), 446–460.

* Reinforcement Learning
  + Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., … Stoica, I. (2018). Ray: A Distributed Framework for Emerging AI Applications. In OSDI (pp. 561–577).

* Relation with General-purpose Frameworks
  + Zhang, Z., Cui, B., Shao, Y., Yu, L., Jiang, J., & Miao, X. (2019). PS2: Parameter Server on Spark. In Proceedings of the 2019 International Conference on Management of Data - SIGMOD ’19 (pp. 376–388).
  + Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson, Eric P. Xing: Litz: Elastic Framework for High-Performance Distributed Machine Learning. USENIX Annual Technical Conference 2018: 631-644
  + Xing, E. P., Yu, Y., Ho, Q., Dai, W., Kim, J.-K., Wei, J., … Kumar, A. (2015). Petuum: A New Platform for Distributed Machine Learning on Big Data. KDD ’15 (pp. 1335–1344).

--------------------------------------------------------------------------------