├── 20180621_technical_debt_ml_rubric
│   ├── citation.bibtex
│   └── ml_test_score_notes.md
├── README.md
├── 20180712_mixed_effects_forests
│   └── MERF_notes.md
└── 20180823_data_tyranny_for_social_good
    └── tyranny_notes.md

/20180621_technical_debt_ml_rubric/citation.bibtex:
--------------------------------------------------------------------------------
@inproceedings{46555,
  title = {The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction},
  author = {Eric Breck and Shanqing Cai and Eric Nielsen and Michael Salib and D. Sculley},
  year = {2017},
  booktitle = {Proceedings of IEEE Big Data}
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# dsapp-reading-group
Proceedings of the Center for Data Scientists arguing (about) Public Papers


# Reading schedule

| Date  | Paper | Link | Presenter |
|-------|:---------------------:|:-----------:|:---------------------:|
| 06/28 | Planning Discussion | `TBD` | All |
| 07/05 | *Lack of group-to-individual generalizability is a threat to human subjects research* | [Link](http://www.pnas.org/content/early/2018/06/15/1711978115) | Iván |
| 07/12 | *Mixed-effects random forest for clustered data* | [Link](https://www.tandfonline.com/doi/full/10.1080/00949655.2012.741599?scroll=top&needAccess=true) | Cristoph |
| 07/19 | *Statistically Controlling for Confounding Constructs Is Harder than You Think* | [Link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152719) | Erika |
| 07/26 | *A Generalization Error for Q-Learning* | [Link](http://www.jmlr.org/papers/volume6/murphy05a/murphy05a.pdf) | Jeremy |
| 08/02 | *Statistics and Causal Inference* | [Link](https://www.jstor.org/stable/2289064?seq=1#page_scan_tab_contents) | Iván |
| 08/23 | *The Tyranny of Data? The Bright and Dark Sides of Data-Driven Decision-Making for Social Good* | [Link](https://link.springer.com/chapter/10.1007/978-3-319-54024-5_1) | Erica |
| 08/30 | *Targeted maximum likelihood estimation for causal inference in observational studies* | [Link](https://www.ncbi.nlm.nih.gov/pubmed/27941068) | Iván |
| 09/06 | *Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction* | [Link](http://papers.nips.cc/paper/5957-mind-the-gap-a-generative-approach-to-interpretable-feature-selection-and-extraction.pdf) | Eddie |
| 09/13 | *A Dynamic Pipeline for Spatio-Temporal Fire Risk Prediction* | [Link](http://www.kdd.org/kdd2018/accepted-papers/view/a-dynamic-pipeline-for-spatio-temporal-fire-risk-prediction) | Avishek |
| 09/20 | *Comparing cluster-level dynamic treatment regimens using sequential, multiple assignment, randomized trials: Regression estimation and sample size considerations* | [Link](https://arxiv.org/pdf/1607.04039.pdf) | Jeremy |

--------------------------------------------------------------------------------
/20180712_mixed_effects_forests/MERF_notes.md:
--------------------------------------------------------------------------------
## Mixed Effects Regression Trees and Forests

#### Hajjem et al. 2011, 2014

Motivation

* Hierarchical/nested data structure
* Cluster variable is not useful as a feature
  + often many categories/clusters
  + want predictions for new clusters as well
* Improve prediction performance

MERT

* Previous work: some papers on longitudinal trends and trees, Sela and Simonoff (2012)
* MERT model
  + y_i = f(X_i) + Z_i b_i + e_i
  + b_i ~ N(0, D), e_i ~ N(0, R_i)
* Fixed part approximated by a tree structure
* Linear random part allows for random intercepts and random slopes
* Fitted via a modified EM algorithm in which y* (the response with the current random-effect estimates removed) is handed to CART to estimate f(X_i); see the sketch after this list
  1. random effects known: estimate the CART tree
  2. CART tree known: estimate the random effects
* Prediction
  + known cluster: get the fixed part from the tree node and add the random part based on cluster membership
  + new cluster: use only the fixed part
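A minimal sketch of the alternating scheme, not the authors' implementation: it assumes a single grouping variable with random intercepts only and skips the shrinkage step (the paper also estimates the variance components D and R_i and shrinks the random effects accordingly). Names such as `fit_mert` and `n_iter` are illustrative.

```python
# Illustrative sketch of the MERT-style alternating (EM-like) fit described above.
# Assumptions: one grouping variable, random intercepts only, and no shrinkage of
# the random effects (the paper estimates D and R_i and shrinks the estimates).
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_mert(X, y, clusters, n_iter=20):
    X, y, clusters = np.asarray(X), np.asarray(y), np.asarray(clusters)
    b = {c: 0.0 for c in np.unique(clusters)}          # random intercepts b_i
    tree = DecisionTreeRegressor(min_samples_leaf=20)  # fixed part f(X_i)
    for _ in range(n_iter):
        # Step 1: random effects held fixed -> fit the tree on y* = y - Z_i b_i
        y_star = y - np.array([b[c] for c in clusters])
        tree.fit(X, y_star)
        # Step 2: tree held fixed -> re-estimate the random effects from residuals
        resid = y - tree.predict(X)
        for c in b:
            b[c] = resid[clusters == c].mean()
    return tree, b


def predict_mert(tree, b, X_new, clusters_new):
    # Known cluster: fixed part + random intercept; new cluster: fixed part only.
    fixed = tree.predict(np.asarray(X_new))
    return fixed + np.array([b.get(c, 0.0) for c in clusters_new])
```

For MERF, the same loop would swap the single tree for a random forest and use its out-of-bag predictions in step 2, as described in the next section.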
MERF

* Previous work: cluster-based bootstrapping
* Same as MERT, but f(X_i) is now estimated by a random forest (RF), and the out-of-bag predictions f*(X_i) are used to estimate the random effects in the next step
* Standard bootstrapping is used, based on the assumption that the random effects fully explain the intracluster correlation

Extensions

* Generalized Mixed Effects Regression Trees (GMERT)
* GMERF?

Application

* Simulations: MERT and MERF outperform CART and RF when random effects are present
* Real-data examples

Implementation

* Python: https://pypi.org/project/merf/ (a hypothetical usage sketch follows this list)
* R: only packages for related methods
  + https://cran.r-project.org/web/packages/glmertree/index.html
  + https://cran.r-project.org/web/packages/REEMtree/REEMtree.pdf
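A hypothetical usage sketch of the `merf` package linked above, on synthetic data. The constructor arguments and the `fit(X, Z, clusters, y)` signature are recalled from the package's documentation and may differ across versions, so treat this as an illustration of the interface shape rather than a verified recipe.

```python
# Hypothetical example with the merf package (pip install merf) on synthetic data.
# Constructor arguments and fit/predict signatures may differ by package version.
import numpy as np
import pandas as pd
from merf import MERF

rng = np.random.RandomState(0)
n, n_clusters = 500, 10
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
clusters = pd.Series(rng.randint(0, n_clusters, size=n), name="cluster")
cluster_effect = {c: rng.normal(scale=2.0) for c in range(n_clusters)}
y = X["x1"] - 2 * X["x2"] + clusters.map(cluster_effect) + rng.normal(scale=0.5, size=n)
Z = pd.DataFrame({"intercept": np.ones(n)})  # random intercepts only

model = MERF(max_iterations=10)
model.fit(X, Z, clusters, y)
y_hat = model.predict(X, Z, clusters)  # known clusters get fixed + random part
```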
Comments

* Problems of multilevel models may apply here as well
  + issues with small level-2 sample sizes (Stegmueller 2013, Elff et al. 2016)
  + estimation with many random effects can become cumbersome
* Tuning MERFs?

--------------------------------------------------------------------------------
/20180823_data_tyranny_for_social_good/tyranny_notes.md:
--------------------------------------------------------------------------------
# The Tyranny of Data? The Bright and Dark Sides of Data-Driven Decision-Making for Social Good

## Summary Notes

Data tyranny, adapted from Easterly's tyranny of experts:

- expert justifications are considered objective
- beneficiaries are uninformed about the process
- experts act coercively without systems for checking them and are allowed to validate their own systems against their own "objective" rules

Rise of data for good:

- human decision-making is riddled with bias
- algorithms can be used for processes too complex for human understanding
- already in use in policing, healthcare, and economics

Problems:

- lots of missing private information can be inferred (identity, sexual orientation, ethnicity, mental-health diagnosis)
- information asymmetry deepens power differentials between people and data holders
- types of lack of transparency:
  - intentional opacity: to protect IP
  - illiterate opacity: people lack the technical understanding
  - intrinsic opacity: algorithms are simply hard to interpret
- social exclusion and discrimination:
  - disparate impact if data are not weighted properly (e.g., indirect discrimination by overweighting zip codes)
  - discrimination built in when creating the algorithm in the first place (e.g., direct discrimination through disparate treatment)
  - misuse of models in inappropriate contexts
  - biased training data used to justify and validate algorithms
  - losing access to, or being denied, benefits on the basis of others' behavior
- lack of transparency means that detecting the source of errors (e.g., incorrect entity resolution, bad data values) is much harder
- siloed data prevents users from accessing and controlling their own data

Human-centric requirements for social good algorithms:

- grant people "personal data vaults" where they can own and control their personal data
- improve transparency and accountability
- tools to identify and rectify bias
- explanation tools
- send the code to the data (instead of vice versa; OPAL)
- "living labs" to experiment with data
- communities of volunteers willing to participate in policy experiments

Organizations working on transparent algorithms:

- Data Transparency Lab
- DARPA Explainable Artificial Intelligence (XAI) project
- New York University’s Information Law Institute, such as Helen Nissenbaum and Solon Barocas
- Microsoft Research, such as Kate Crawford and Tarleton Gillespie
- Open Algorithms (OPAL) project (Orange, the MIT Media Lab, Data-Pop Alliance, Imperial College London, and the World Economic Forum)

## Questions

- How do we properly consent people to the use of their data for thousands of applications, many not yet invented? How do we obtain consent in a way that people will attend to and understand, and that isn't purely compliance-based like IRB consent forms or dense like terms of service?
- What education is required to make people informed about data? Who is responsible for providing that education? Where will it be delivered? Can we ethically develop algorithms for populations that are unlikely or unable to receive that education?
- How can people make informed decisions about their data? Given the millions of data points they produce in a day, how can they view, explore, and understand their data sufficiently to give *informed* consent?
- How could we make data users accountable to human-centric requirements? What incentives are there for following those requirements? What organizations (governments?) are responsible for policing users? What powers could they have for enforcement? How would they discover data misuse?
- For many of the applications DSaPP works on, consent requirements would either delay development and implementation of algorithms for years (e.g., needing to wait five years for enough consents from people entering jail, homeless services, etc.) or be highly ambiguous (who needs to consent to data use for multiple housing inspections? building managers? tenants? building owners? what happens when they turn over?). Is this simply a cost we should pay? Are there some cases where consent is not required?

--------------------------------------------------------------------------------
/20180621_technical_debt_ml_rubric/ml_test_score_notes.md:
--------------------------------------------------------------------------------
# Reading group 2018-06-21

## The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction (Proceedings of IEEE Big Data, 2017)

Link: https://ai.google/research/pubs/pub46555

* Introduction
  * Goal: 28 'actionable' tests for ML systems
  * "prior work has identified problems but been largely silent on how to address them"
  * "this paper details actionable advice drawn from practice and verified with interviews"
* Tests for features and data
  * Data 1: feature expectations are captured in a schema (a minimal check is sketched after this section)
    * What are sensible 'feature expectations'? Think in terms of statistical priors:
      * Prior support - hard boundary on feasible data values
      * Prior density - "soft boundary" on the likelihood of data values
  * Data 2: all features are beneficial
    * Very vague; reduces to "test feature importance/significance"
  * Data 3: no feature's cost is too much
    * How do you quantify the cost of adding a feature?
      * Highly model dependent
      * Generally nonlinear?
    * How do you measure instability when adding noisy features?
    * Sparsity constraints are not typically enforced in overtrained ML models - how to control this?
  * Data 4: features adhere to meta-level requirements
    * i.e., "data constraints are uniformly applied throughout the pipeline"
  * Data 5: the data pipeline has privacy controls
  * Data 6: new features can be added quickly
  * Data 7: test ETL for feature generation
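Data 1 raises the question of what a checkable "feature expectation" actually looks like. A minimal pandas sketch under our own framing of prior support (hard bounds) and prior density (a cap on tail mass); the column names, bounds, and 5% threshold are made-up illustrations, not the paper's tooling. The meeting notes at the end of this file point to `great_expectations` as an existing package for the same idea.

```python
# Minimal sketch of schema-style feature expectations (Data 1), using pandas.
# Column names, bounds, and the 5% tail threshold are illustrative assumptions.
import pandas as pd

EXPECTATIONS = {
    # feature: (hard_min, hard_max, soft_min, soft_max)
    "age":            (0,   120, 18, 90),
    "n_prior_visits": (0,  1000,  0, 50),
}

def check_feature_expectations(df: pd.DataFrame, max_soft_violation=0.05):
    failures = []
    for col, (hard_lo, hard_hi, soft_lo, soft_hi) in EXPECTATIONS.items():
        s = df[col].dropna()
        # Prior support: hard boundary - any value outside is an error.
        if ((s < hard_lo) | (s > hard_hi)).any():
            failures.append(f"{col}: values outside feasible range [{hard_lo}, {hard_hi}]")
        # Prior density: soft boundary - too much mass in the tails is suspicious.
        tail_frac = ((s < soft_lo) | (s > soft_hi)).mean()
        if tail_frac > max_soft_violation:
            failures.append(f"{col}: {tail_frac:.1%} of values outside the expected range")
    return failures
```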
* Tests for model development
  * Model 1: every model specification undergoes a code review and is checked into a repository
    * Best way to version-control config files? Private repos?
  * Model 2: offline proxy metrics correlate with actual online impact metrics
  * Model 3: all hyperparameters have been tuned
    * This is an entire field of research, and hardly a trivial question
  * Model 4: the impact of model staleness is known
    * Evaluating "stale" models - what determines staleness?
    * How much do we expect target distributions to evolve over time?
    * Does a fast decay of "stale" models imply recency bias/overfitting?
  * Model 5: complex models outperform simple models
    * How can we design model groups to best test this?
  * Model 6: model quality is sufficient on all important data slices (a per-slice evaluation sketch follows this section)
    * Slicing by bias categories, at-risk categories, and other pubpol-specific topics
  * Model 7: the model has been tested for considerations of inclusion - see #6
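For Model 6, a small sketch of what "quality on all important data slices" could mean operationally: compute the evaluation metric per subgroup rather than only in aggregate. The slice column, column names, and the choice of AUC are placeholders, not the paper's procedure.

```python
# Sketch of per-slice evaluation (Model 6): report a metric for each subgroup
# instead of only the aggregate. `slice_col` (e.g., a protected attribute or an
# at-risk flag) and the metric choice are illustrative assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

def score_by_slice(df: pd.DataFrame, slice_col: str, label_col="label", score_col="score"):
    rows = []
    for value, grp in df.groupby(slice_col):
        if grp[label_col].nunique() < 2:
            continue  # AUC is undefined when a slice contains only one class
        rows.append({slice_col: value,
                     "n": len(grp),
                     "auc": roc_auc_score(grp[label_col], grp[score_col])})
    return pd.DataFrame(rows).sort_values("auc")
```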
* Tests for ML infrastructure
  * Infra 1: training is reproducible (a seed/fingerprint sketch follows this outline)
    * What goes into complete reproducibility?
      * Controlling pseudo-random number generation?
      * Controlling live data sources?
  * Infra 2: model specification code is unit tested
    * Testing implementations - do we currently have integration tests for `catwalk` models?
  * Infra 3: the full ML pipeline is integration tested
  * Infra 4: inspect data quality before serving it to the model
  * Infra 5: the model allows debugging by observing the step-by-step computation of training or inference on a single example
    * On the upside: could be useful for showing how `individual_importances` translate into a decision
    * On the downside: could bias how feature importances are interpreted
  * Infra 6: models are tested via a canary process before they enter production serving environments
    * What distinguishes a "canary process" from a critical integration test?
  * Infra 7: models can be quickly and safely rolled back to a previous serving version
* Monitoring tests for ML
  * Monitor 1: dependency changes result in a notification
  * Monitor 2: data invariants hold in training and serving inputs - duplicate?
  * Monitor 3: training and serving features compute the same values - duplicate?
  * Monitor 4: models are not too stale - duplicate?
  * Monitor 5: the model is numerically stable
    * Numerical stability is not limited to ETL issues - see the model monitor deep dive
  * Monitor 6: the model has not experienced a dramatic or slow-leak regression in training speed, serving latency, throughput, or RAM usage
  * Monitor 7: the model has not experienced a regression in prediction quality on served data - duplicate?
* Survey insights
  * Goal is to grade ML systems based on their adherence to the traits above
  * Performed interviews at Google
  * Aside: the authors literally include a citation for the statement "checklists are helpful even for expert teams [18]"
  * Common themes:
    * Teams have trouble communicating data ownership, which reduces incentives to do integration tests
    * Model "canarying" occurs long after engineering decisions were made
  * Survey self-assessment
    * "Teams using purely image or audio data did not feel many of the Data tests were applicable" ???
  * How to handle the cost (time, data, money, etc.) of acquiring labels? Not addressed at all!
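On Infra 1, one concrete (if partial) answer to "what goes into complete reproducibility" is pinning every source of randomness and recording a fingerprint of the training data. A minimal sketch below; the seeding scope and hash granularity are our own assumptions, and it does not address live data sources or library versions.

```python
# Partial reproducibility checklist in code (Infra 1): fix seeds and fingerprint
# the training data. Does not cover live/changing data sources or library versions.
import hashlib
import random

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

SEED = 20180621
random.seed(SEED)      # python stdlib randomness
np.random.seed(SEED)   # numpy global RNG

def data_fingerprint(df: pd.DataFrame) -> str:
    # Hash of the training frame so silent input changes are detectable later.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()

model = RandomForestClassifier(n_estimators=100, random_state=SEED)  # model-level seed
```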
## Meeting notes:

Introduction (Eddie):
* Paper is a follow-up to earlier work
* Thinking of data as "part of the source code"
* Can help us determine new features in `triage`

Discussion notes:
* Going through `triage` and assessing whether we meet the requirements; worth it to implement
* Some packages already exist: [great_expectations](https://github.com/great-expectations/great_expectations)
* Data requirements are ill-defined - creates ambiguity
* When we get data dumps, the data has already undergone unobserved transformations
* Organizational problems: ownership of tools, common practices often undocumented
* Enrichment of data with metadata
* How do we define "confidence" in what the data should look like? (i.e., prior confidence)
* Expanding feature groups
  * Looking at time aggregations/blocking to better investigate subsets of features
  * Example: examine the smallest 10% of feature importances for all models in a model group
  * Improve using `audition`/`postmodeling`?
  * More difficult to gauge impacts for nonlinear models
* Feature selection using ML? Should it be implemented in `triage`?
* Pipelines instead of models
  * What parts of pipelines are algorithmic, and what could be considered preprocessing or postprocessing?
  * How do we change the feature space?
* Problem with `triage`: no way of processing lagged data (but Adolfo is working on it)
* Information is lost when aggregated - how much, and how do we control it?
* Trading off computational costs vs. ML costs - how do we measure this?
* Data ownership is important, and in general people are bad at it!

Takeaways:
* We should explore feature selection in `triage`
* We should explore converting model objects into pipeline objects
* We should explore ETL checks at places other than only the model output
* We should explore new ways to integrate partner `domain knowledge` into models

--------------------------------------------------------------------------------