# Marco's SysML reading list

A curated reading list of computer science research for work at the intersection of machine learning and systems. PRs are welcome.

## Review

A Berkeley View of Systems Challenges for AI
https://arxiv.org/pdf/1712.05855.pdf

Strategies and Principles of Distributed Machine Learning on Big Data
https://arxiv.org/abs/1512.09295

## Background

Deep learning
Nature volume 521, 2015
https://www.nature.com/articles/nature14539

Deep learning reading list
http://deeplearning.net/reading-list

## Measurement

Multi-tenant GPU Clusters for Deep Learning Workloads: Analysis and Implications
https://www.microsoft.com/en-us/research/uploads/prod/2018/05/gpu_sched_tr.pdf

## Frameworks

TensorFlow: A System for Large-Scale Machine Learning
OSDI 2016
https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

Ray: A Distributed Framework for Emerging AI Applications
OSDI 2018
https://www.usenix.org/system/files/osdi18-moritz.pdf

## Tuning

HyperDrive: Exploring Hyperparameters with POP Scheduling
Middleware 2017
https://dl.acm.org/citation.cfm?id=3135994

Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads
VLDB 2018
http://www.vldb.org/pvldb/vol11/p607-li.pdf

Automating Model Search for Large Scale Machine Learning
SoCC 2015
http://dl.acm.org/authorize?N91362

Google Vizier: A Service for Black-Box Optimization
KDD 2017
https://dl.acm.org/citation.cfm?id=3098043

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
Journal of Machine Learning Research 18 (2018)
https://arxiv.org/pdf/1603.06560.pdf

Hyperopt: A Python Library for Model Selection and Hyperparameter Optimization
Computational Science & Discovery, 8(1) 2015
http://iopscience.iop.org/article/10.1088/1749-4699/8/1/014008

Auto-Keras: Efficient Neural Architecture Search with Network Morphism
https://arxiv.org/pdf/1806.10282v2.pdf

## Runtime execution

Cavs: An Efficient Runtime System for Dynamic Neural Networks
ATC 2018
https://www.usenix.org/system/files/conference/atc18/atc18-xu-shizhen.pdf

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
OSDI 2018
https://www.usenix.org/system/files/osdi18-chen.pdf

PipeDream: Fast and Efficient Pipeline Parallel DNN Training
https://arxiv.org/pdf/1806.03377.pdf

STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning
EuroSys 2016
https://dl.acm.org/citation.cfm?id=2901331

Dynamic Control Flow in Large-Scale Machine Learning
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190551

Improving the Expressiveness of Deep Learning Frameworks with Recursion
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190530

Continuum: A Platform for Cost-Aware, Low-Latency Continual Learning
SoCC 2018
https://dl.acm.org/citation.cfm?id=3267817

KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics
ICDE 2017
https://amplab.cs.berkeley.edu/wp-content/uploads/2017/01/ICDE_2017_CameraReady_475.pdf

Owl: A General-Purpose Numerical Library in OCaml
https://arxiv.org/pdf/1707.09616.pdf

## Distributed learning

Large Scale Distributed Deep Networks
NIPS 2012
https://ai.google/research/pubs/pub40565.pdf

Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics
SoCC 2015
http://dl.acm.org/authorize?N91363

Ako: Decentralised Deep Learning with Partial Gradient Exchange
SoCC 2016
https://lsds.doc.ic.ac.uk/sites/default/files/ako-socc16.pdf

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
ATC 2017
https://www.usenix.org/system/files/conference/atc17/atc17-zhang.pdf

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
SoCC 2018
https://dl.acm.org/citation.cfm?id=3267840

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
ML Systems Workshop at NIPS 2016
https://arxiv.org/pdf/1512.01274.pdf

Scaling Distributed Machine Learning with the Parameter Server
OSDI 2014
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf

Project Adam: Building an Efficient and Scalable Deep Learning Training System
OSDI 2014
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf

Orpheus: Efficient Distributed Machine Learning via System and Algorithm Co-design
SoCC 2018
https://dl.acm.org/citation.cfm?id=3267810

Petuum: A New Platform for Distributed Machine Learning on Big Data
KDD 2015
https://arxiv.org/pdf/1312.7651.pdf

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
https://arxiv.org/pdf/1811.06965.pdf

## Serving systems and inference

DeepCPU: Serving RNN-based Deep Learning Models 10x Faster
ATC 2018
https://www.usenix.org/system/files/conference/atc18/atc18-zhang-minjia.pdf

Clipper: A Low-Latency Online Prediction Serving System
NSDI 2017
https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf

Research for Practice: Prediction-Serving Systems
ACM Queue 16(1), 2018
https://queue.acm.org/detail.cfm?id=3210557

InferLine: ML Inference Pipeline Composition
https://arxiv.org/pdf/1812.01776.pdf

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
OSDI 2018
https://www.usenix.org/system/files/osdi18-lee.pdf

Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System
Middleware 2018
https://dl.acm.org/citation.cfm?id=3274813

Low Latency RNN Inference with Cellular Batching
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190541

SERF: Efficient Scheduling for Fast Deep Neural Network Serving via Judicious Parallelism
SC 2016
https://ieeexplore.ieee.org/document/7877104

NoScope: Optimizing Neural Network Queries over Video at Scale
VLDB 2017
https://dl.acm.org/citation.cfm?id=3137664

## Scheduling

Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters
EuroSys 2018
https://dl.acm.org/citation.cfm?id=3190517

SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
SoCC 2017
https://dl.acm.org/authorize?N46878

Proteus: Agile ML Elasticity through Tiered Reliability in Dynamic Resource Markets
EuroSys 2017
https://dl.acm.org/citation.cfm?id=3064182

Gandiva: Introspective Cluster Scheduling for Deep Learning
OSDI 2018
https://www.usenix.org/system/files/osdi18-xiao.pdf

Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments
SC 2017
https://dl.acm.org/citation.cfm?id=3126933

## Algorithmic aspects in scalable ML

Hemingway: Modeling Distributed Optimization Algorithms
ML Systems Workshop at NIPS 2016
https://arxiv.org/pdf/1702.05865.pdf

Asynchronous Methods for Deep Reinforcement Learning
ICML 2016
http://proceedings.mlr.press/v48/mniha16.pdf

Don't Use Large Mini-Batches, Use Local SGD
https://arxiv.org/pdf/1808.07217.pdf

GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server
EuroSys 2016
https://dl.acm.org/citation.cfm?id=2901323

ImageNet Training in Minutes
ICPP 2018
https://dl.acm.org/citation.cfm?id=3225069

Semantics-Preserving Parallelization of Stochastic Gradient Descent
IPDPS 2018
https://ieeexplore.ieee.org/abstract/document/8425176

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
NIPS 2011
https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
NIPS 2017
https://papers.nips.cc/paper/6768-qsgd-communication-efficient-sgd-via-gradient-quantization-and-encoding.pdf

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
NIPS 2017
https://papers.nips.cc/paper/7117-can-decentralized-algorithms-outperform-centralized-algorithms-a-case-study-for-decentralized-parallel-stochastic-gradient-descent.pdf

Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD
AISTATS 2018
https://arxiv.org/pdf/1803.01113.pdf

Probabilistic Synchronous Parallel
https://arxiv.org/pdf/1709.07772.pdf

## AI Testing and Verification

DeepXplore: Automated Whitebox Testing of Deep Learning Systems
SOSP 2017
https://dl.acm.org/authorize?N47145

Programmatically Interpretable Reinforcement Learning
ICML 2018
https://arxiv.org/pdf/1804.02477.pdf

AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation
IEEE S&P 2018
https://ieeexplore.ieee.org/document/8418593

## Interpretability and Explainability

"Why Should I Trust You?": Explaining the Predictions of Any Classifier
KDD 2016
https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf

Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
ICML 2018
https://arxiv.org/pdf/1802.07814.pdf

A Unified Approach to Interpreting Model Predictions
NIPS 2017
https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf

The Mythos of Model Interpretability
WHI 2016
https://arxiv.org/pdf/1606.03490.pdf

## Model Management

MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis
SIGMOD 2018
https://dl.acm.org/citation.cfm?id=3196934

ModelDB: A System for Machine Learning Model Management
HILDA 2016
https://mitdbg.github.io/modeldb/papers/hilda_modeldb.pdf

Model Governance: Reducing the Anarchy of Production ML
ATC 2018
https://www.usenix.org/system/files/conference/atc18/atc18-sridhar.pdf

The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox
CIDR 2015
http://www.bailis.org/papers/velox-cidr2015.pdf

Bandana: Using Non-volatile Memory for Storing Deep Learning Models
SysML 2019
https://arxiv.org/abs/1811.05922

## Hardware

Deep Learning with Limited Numerical Precision
ICML 2015
http://proceedings.mlr.press/v37/gupta15.pdf

In-Datacenter Performance Analysis of a Tensor Processing Unit
ISCA 2017
https://dl.acm.org/citation.cfm?id=3080246

Serving DNNs in Real Time at Datacenter Scale with Project Brainwave
IEEE Micro 38(2), Mar./Apr. 2018
https://ieeexplore.ieee.org/document/8344479

## Security aspects

Efficient Deep Learning on Multi-Source Private Data
https://arxiv.org/pdf/1807.06689.pdf

Chiron: Privacy-preserving Machine Learning as a Service
https://arxiv.org/pdf/1803.05961.pdf

MLCapsule: Guarded Offline Deployment of Machine Learning as a Service
https://arxiv.org/pdf/1808.00590.pdf

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
https://arxiv.org/pdf/1806.03287.pdf

Privado: Practical and Secure DNN Inference
https://arxiv.org/pdf/1810.00602.pdf

## ML Platforms (Applied)

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
HPCA 2018
https://research.fb.com/publications/applied-machine-learning-at-facebook-a-datacenter-infrastructure-perspective/

Machine Learning at Facebook: Understanding Inference at the Edge
HPCA 2019
https://research.fb.com/publications/machine-learning-at-facebook-understanding-inference-at-the-edge/

Meet Michelangelo: Uber's Machine Learning Platform
https://eng.uber.com/michelangelo/

Introducing FBLearner Flow: Facebook's AI backbone
https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
http://dl.acm.org/authorize?N33328

Horovod: fast and easy distributed deep learning in TensorFlow
https://arxiv.org/pdf/1802.05799v3.pdf

## ML for Systems

Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms
SOSP 2017
https://dl.acm.org/authorize?N47144

Adaptive Execution of Continuous and Data-intensive Workflows with Machine Learning
Middleware 2018
https://dl.acm.org/citation.cfm?id=3274827

AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization
SIGCOMM 2018
https://dl.acm.org/citation.cfm?id=3230551

Neural Adaptive Video Streaming with Pensieve
SIGCOMM 2017
https://dl.acm.org/citation.cfm?id=3098843

Neural Adaptive Content-aware Internet Video Delivery
OSDI 2018
https://www.usenix.org/system/files/osdi18-yeo.pdf

## Workshops

Systems for ML and Open Source Software Workshop at NeurIPS 2018
http://learningsys.org/nips18/acceptedpapers.html

SysML 2018
http://www.sysml.cc/2018/index.html

Engineering Dependable and Secure Machine Learning Systems 2019
https://sites.google.com/view/edsmls2019/program

Engineering Dependable and Secure Machine Learning Systems 2018
https://sites.google.com/edu.haifa.ac.il/edsmls/program

Workshop on Distributed Machine Learning 2017
https://distributedml2017.wordpress.com/schedule/

ML Systems Workshop at NIPS 2016
https://sites.google.com/site/mlsysnips2016/accepted-papers

## Upcoming 2019

ColumnML: Column Store Machine Learning with On-The-Fly Data Transformation
VLDB 2019

Continuous Integration of Machine Learning Models: A Rigorous Yet Practical Treatment
SysML 2019

Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices
ASPLOS 2019

RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning
SysML 2019
https://arxiv.org/pdf/1810.09028.pdf

## For adding/updating the list

1. Fork the repository
2. Update this file
3. Send a pull request
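
The three steps above correspond to the standard GitHub fork-and-branch workflow. A sketch from the command line, with `<your-username>`, the repository name, and the branch name as placeholders (not the actual repo path):

```shell
# 1. Fork the repository in the GitHub UI, then clone your fork:
git clone https://github.com/<your-username>/sysml-reading-list.git
cd sysml-reading-list

# 2. Update this file on a feature branch:
git checkout -b add-new-paper
"$EDITOR" README.md          # add the entry under the right section
git add README.md
git commit -m "Add <paper title> to <section>"

# 3. Push the branch and open a pull request against the upstream repo:
git push origin add-new-paper
```

New entries should follow the existing format: title, venue and year (when published), then the URL, each on its own line.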