# Marco's SysML reading list

A curated reading list of computer science research for work at the intersection of machine learning and systems. PRs are welcome.

## Review

A Berkeley View of Systems Challenges for AI
https://arxiv.org/pdf/1712.05855.pdf

Strategies and Principles of Distributed Machine Learning on Big Data
https://arxiv.org/abs/1512.09295

## Background

Deep learning
Nature volume 521, 2015
https://www.nature.com/articles/nature14539

Deep learning reading list
http://deeplearning.net/reading-list

## Measurement

Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications
https://www.microsoft.com/en-us/research/uploads/prod/2018/05/gpu_sched_tr.pdf

## Frameworks

TensorFlow: A System for Large-Scale Machine Learning
OSDI 2016
https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

Ray: A Distributed Framework for Emerging AI Applications
OSDI 2018
https://www.usenix.org/system/files/osdi18-moritz.pdf

## Tuning

HyperDrive: Exploring Hyperparameters with POP Scheduling
Middleware 2017
https://dl.acm.org/citation.cfm?id=3135994

Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads
VLDB 2018
http://www.vldb.org/pvldb/vol11/p607-li.pdf

Automating Model Search for Large Scale Machine Learning
SoCC 2015
http://dl.acm.org/authorize?N91362

Google Vizier: A Service for Black-Box Optimization
KDD 2017
https://dl.acm.org/citation.cfm?id=3098043

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
Journal of Machine Learning Research 18 (2018)
https://arxiv.org/pdf/1603.06560.pdf

Hyperopt: A Python Library for Model Selection and Hyperparameter Optimization
Computational Science & Discovery, 8(1) 2015
http://iopscience.iop.org/article/10.1088/1749-4699/8/1/014008

Auto-Keras: Efficient Neural Architecture Search with Network Morphism
https://arxiv.org/pdf/1806.10282v2.pdf

## Runtime execution

Cavs: An Efficient Runtime System for Dynamic Neural Networks
ATC 2018
https://www.usenix.org/system/files/conference/atc18/atc18-xu-shizhen.pdf

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
OSDI 2018
https://www.usenix.org/system/files/osdi18-chen.pdf

PipeDream: Fast and Efficient Pipeline Parallel DNN Training
https://arxiv.org/pdf/1806.03377.pdf

STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning
EuroSys 2016
https://dl.acm.org/citation.cfm?id=2901331

Dynamic Control Flow in Large-Scale Machine Learning
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190551

Improving the Expressiveness of Deep Learning Frameworks with Recursion
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190530

Continuum: A Platform for Cost-Aware, Low-Latency Continual Learning
SoCC 2018
https://dl.acm.org/citation.cfm?id=3267817

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics
ICDE 2017
https://amplab.cs.berkeley.edu/wp-content/uploads/2017/01/ICDE_2017_CameraReady_475.pdf

Owl: A General-Purpose Numerical Library in OCaml
https://arxiv.org/pdf/1707.09616.pdf

## Distributed learning

Large Scale Distributed Deep Networks
NIPS 2012
https://ai.google/research/pubs/pub40565.pdf

Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics
SoCC 2015
http://dl.acm.org/authorize?N91363

Ako: Decentralised Deep Learning with Partial Gradient Exchange
SoCC 2016
https://lsds.doc.ic.ac.uk/sites/default/files/ako-socc16.pdf

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
ATC 2017
https://www.usenix.org/system/files/conference/atc17/atc17-zhang.pdf

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
SoCC 2018
https://dl.acm.org/citation.cfm?id=3267840

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
ML Systems Workshop at NIPS 2016
https://arxiv.org/pdf/1512.01274.pdf

Scaling Distributed Machine Learning with the Parameter Server
OSDI 2014
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf

Project Adam: Building an Efficient and Scalable Deep Learning Training System
OSDI 2014
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf

Orpheus: Efficient Distributed Machine Learning via System and Algorithm Co-design
SoCC 2018
https://dl.acm.org/citation.cfm?id=3267810

Petuum: A New Platform for Distributed Machine Learning on Big Data
KDD 2015
https://arxiv.org/pdf/1312.7651.pdf

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
https://arxiv.org/pdf/1811.06965.pdf

## Serving systems and inference

DeepCPU: Serving RNN-based Deep Learning Models 10x Faster
ATC 2018
https://www.usenix.org/system/files/conference/atc18/atc18-zhang-minjia.pdf

Clipper: A Low-Latency Online Prediction Serving System
NSDI 2017
https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf

Research for Practice: Prediction-Serving Systems
ACM Queue 16(1), 2018
https://queue.acm.org/detail.cfm?id=3210557

InferLine: ML Inference Pipeline Composition
https://arxiv.org/pdf/1812.01776.pdf

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
OSDI 2018
https://www.usenix.org/system/files/osdi18-lee.pdf

Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System
Middleware 2018
https://dl.acm.org/citation.cfm?id=3274813

Low Latency RNN Inference with Cellular Batching
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190541

SERF: Efficient Scheduling for Fast Deep Neural Network Serving via Judicious Parallelism
SC 2016
https://ieeexplore.ieee.org/document/7877104

NoScope: Optimizing Neural Network Queries over Video at Scale
VLDB 2017
https://dl.acm.org/citation.cfm?id=3137664

## Scheduling

Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190517

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
SoCC 2017
https://dl.acm.org/authorize?N46878

Proteus: Agile ML Elasticity through Tiered Reliability in Dynamic Resource Markets
EuroSys 2017
https://dl.acm.org/citation.cfm?id=3064182

Gandiva: Introspective Cluster Scheduling for Deep Learning
OSDI 2018
https://www.usenix.org/system/files/osdi18-xiao.pdf

Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments
SC 2017
https://dl.acm.org/citation.cfm?id=3126933

## Algorithmic aspects in scalable ML

Hemingway: Modeling Distributed Optimization Algorithms
ML Systems Workshop at NIPS 2016
https://arxiv.org/pdf/1702.05865.pdf

Asynchronous Methods for Deep Reinforcement Learning
ICML 2016
http://proceedings.mlr.press/v48/mniha16.pdf

Don't Use Large Mini-Batches, Use Local SGD
https://arxiv.org/pdf/1808.07217.pdf

GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server
EuroSys 2016
https://dl.acm.org/citation.cfm?id=2901323

ImageNet Training in Minutes
ICPP 2018
https://dl.acm.org/citation.cfm?id=3225069

Semantics-Preserving Parallelization of Stochastic Gradient Descent
IPDPS 2018
https://ieeexplore.ieee.org/abstract/document/8425176

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
NIPS 2011
https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
NIPS 2017
https://papers.nips.cc/paper/6768-qsgd-communication-efficient-sgd-via-gradient-quantization-and-encoding.pdf

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
NIPS 2017
https://papers.nips.cc/paper/7117-can-decentralized-algorithms-outperform-centralized-algorithms-a-case-study-for-decentralized-parallel-stochastic-gradient-descent.pdf

Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD
AISTATS 2018
https://arxiv.org/pdf/1803.01113.pdf

Probabilistic Synchronous Parallel
https://arxiv.org/pdf/1709.07772.pdf

## AI Testing and Verification

DeepXplore: Automated Whitebox Testing of Deep Learning Systems
SOSP 2017
https://dl.acm.org/authorize?N47145

Programmatically Interpretable Reinforcement Learning
ICML 2018
https://arxiv.org/pdf/1804.02477.pdf

AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation
IEEE S&P 2018
https://ieeexplore.ieee.org/document/8418593

## Interpretability and Explainability

"Why Should I Trust You?": Explaining the Predictions of Any Classifier
KDD 2016
https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf

Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
ICML 2018
https://arxiv.org/pdf/1802.07814.pdf

A Unified Approach to Interpreting Model Predictions
NIPS 2017
https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf

The Mythos of Model Interpretability
WHI 2016
https://arxiv.org/pdf/1606.03490.pdf

## Model Management

MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
SIGMOD 2018
https://dl.acm.org/citation.cfm?id=3196934

ModelDB: A System for Machine Learning Model Management
HILDA 2016
https://mitdbg.github.io/modeldb/papers/hilda_modeldb.pdf

Model Governance: Reducing the Anarchy of Production ML
ATC 2018
https://www.usenix.org/system/files/conference/atc18/atc18-sridhar.pdf

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox
CIDR 2015
http://www.bailis.org/papers/velox-cidr2015.pdf

Bandana: Using Non-volatile Memory for Storing Deep Learning Models
SysML 2019
https://arxiv.org/abs/1811.05922

## Hardware

Deep Learning with Limited Numerical Precision
ICML 2015
http://proceedings.mlr.press/v37/gupta15.pdf

In-Datacenter Performance Analysis of a Tensor Processing Unit
ISCA 2017
https://dl.acm.org/citation.cfm?id=3080246

Serving DNNs in Real Time at Datacenter Scale with Project Brainwave
IEEE Micro 38(2), Mar./Apr. 2018
https://ieeexplore.ieee.org/document/8344479

## Security aspects

Efficient Deep Learning on Multi-Source Private Data
https://arxiv.org/pdf/1807.06689.pdf

Chiron: Privacy-preserving Machine Learning as a Service
https://arxiv.org/pdf/1803.05961.pdf

MLCapsule: Guarded Offline Deployment of Machine Learning as a Service
https://arxiv.org/pdf/1808.00590.pdf

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
https://arxiv.org/pdf/1806.03287.pdf

Privado: Practical and Secure DNN Inference
https://arxiv.org/pdf/1810.00602.pdf

## ML Platforms (Applied)

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
HPCA 2018
https://research.fb.com/publications/applied-machine-learning-at-facebook-a-datacenter-infrastructure-perspective/

Machine Learning at Facebook: Understanding Inference at the Edge
HPCA 2019
https://research.fb.com/publications/machine-learning-at-facebook-understanding-inference-at-the-edge/

Meet Michelangelo: Uber's Machine Learning Platform
https://eng.uber.com/michelangelo/

Introducing FBLearner Flow: Facebook's AI backbone
https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
http://dl.acm.org/authorize?N33328

Horovod: fast and easy distributed deep learning in TensorFlow
https://arxiv.org/pdf/1802.05799v3.pdf

## ML for Systems

Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms
SOSP 2017
https://dl.acm.org/authorize?N47144

Adaptive Execution of Continuous and Data-intensive Workflows with Machine Learning
Middleware 2018
https://dl.acm.org/citation.cfm?id=3274827

AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization
SIGCOMM 2018
https://dl.acm.org/citation.cfm?id=3230551

Neural Adaptive Video Streaming with Pensieve
SIGCOMM 2017
https://dl.acm.org/citation.cfm?id=3098843

Neural Adaptive Content-aware Internet Video Delivery
OSDI 2018
https://www.usenix.org/system/files/osdi18-yeo.pdf

## Workshops

Systems for ML and Open Source Software Workshop at NeurIPS 2018
http://learningsys.org/nips18/acceptedpapers.html

SysML 2018
http://www.sysml.cc/2018/index.html

Engineering Dependable and Secure Machine Learning Systems 2019
https://sites.google.com/view/edsmls2019/program

Engineering Dependable and Secure Machine Learning Systems 2018
https://sites.google.com/edu.haifa.ac.il/edsmls/program

Workshop on Distributed Machine Learning 2017
https://distributedml2017.wordpress.com/schedule/

ML Systems Workshop at NIPS 2016
https://sites.google.com/site/mlsysnips2016/accepted-papers

## Upcoming 2019

ColumnML: Column Store Machine Learning with On-The-Fly Data Transformation
VLDB 2019

Continuous Integration of Machine Learning Models: A Rigorous Yet Practical Treatment
SysML 2019

Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices
ASPLOS 2019

RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning
SysML 2019
https://arxiv.org/pdf/1810.09028.pdf

## For adding/updating the list

1. Fork the repository
2. Update this file
3. Send a pull request
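
The three steps above correspond to the standard GitHub fork-and-branch workflow. A sketch from the command line, with `<your-username>`, the repository name, and the branch name as placeholders (not the actual repo path):

```shell
# 1. Fork the repository in the GitHub UI, then clone your fork:
git clone https://github.com/<your-username>/sysml-reading-list.git
cd sysml-reading-list

# 2. Update this file on a feature branch:
git checkout -b add-new-paper
"$EDITOR" README.md          # add the entry under the right section
git add README.md
git commit -m "Add <paper title> to <section>"

# 3. Push the branch and open a pull request against the upstream repo:
git push origin add-new-paper
```

New entries should follow the existing format: title, venue and year (when published), then the URL, each on its own line.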