├── 20180621_technical_debt_ml_rubric
│   ├── citation.bibtex
│   └── ml_test_score_notes.md
├── README.md
├── 20180712_mixed_effects_forests
│   └── MERF_notes.md
└── 20180823_data_tyranny_for_social_good
    └── tyranny_notes.md

/20180621_technical_debt_ml_rubric/citation.bibtex:
--------------------------------------------------------------------------------
@inproceedings{46555,
  title = {The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction},
  author = {Eric Breck and Shanqing Cai and Eric Nielsen and Michael Salib and D. Sculley},
  year = {2017},
  booktitle = {Proceedings of IEEE Big Data}
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# dsapp-reading-group
Proceedings of the Center for Data Scientists arguing (about) Public Papers


# Reading schedule

| Date  | Paper | Link | Presenter |
|-------|:---------------------:|:-----------:|:---------------------:|
| 06/28 | Planning Discussion | `TBD` | All |
| 07/05 | *Lack of group-to-individual generalizability is a threat to human subjects research* | [Link](http://www.pnas.org/content/early/2018/06/15/1711978115) | Iván |
| 07/12 | *Mixed-effects random forest for clustered data* | [Link](https://www.tandfonline.com/doi/full/10.1080/00949655.2012.741599?scroll=top&needAccess=true) | Cristoph |
| 07/19 | *Statistically Controlling for Confounding Constructs Is Harder than You Think* | [Link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152719) | Erika |
| 07/26 | *A Generalization Error for Q-Learning* | [Link](http://www.jmlr.org/papers/volume6/murphy05a/murphy05a.pdf) | Jeremy |
| 08/02 | *Statistics and Causal Inference* | [Link](https://www.jstor.org/stable/2289064?seq=1#page_scan_tab_contents) | Iván |
| 08/23 | *The Tyranny of Data? The Bright and Dark Sides of Data-Driven Decision-Making for Social Good* | [Link](https://link.springer.com/chapter/10.1007/978-3-319-54024-5_1) | Erica |
| 08/30 | *Targeted maximum likelihood estimation for causal inference in observational studies* | [Link](https://www.ncbi.nlm.nih.gov/pubmed/27941068) | Iván |
| 09/06 | *Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction* | [Link](http://papers.nips.cc/paper/5957-mind-the-gap-a-generative-approach-to-interpretable-feature-selection-and-extraction.pdf) | Eddie |
| 09/13 | *A Dynamic Pipeline for Spatio-Temporal Fire Risk Prediction* | [Link](http://www.kdd.org/kdd2018/accepted-papers/view/a-dynamic-pipeline-for-spatio-temporal-fire-risk-prediction) | Avishek |
| 09/20 | *Comparing cluster-level dynamic treatment regimens using sequential, multiple assignment, randomized trials: Regression estimation and sample size considerations* | [Link](https://arxiv.org/pdf/1607.04039.pdf) | Jeremy |

--------------------------------------------------------------------------------
/20180712_mixed_effects_forests/MERF_notes.md:
--------------------------------------------------------------------------------
## Mixed Effects Regression Trees and Forests

#### Hajjem et al. 2011, 2014

Motivation

* Hierarchical/nested data structure
* Cluster variable is not useful as a feature
  + often many categories/clusters
  + want predictions for new clusters as well
* Improve prediction performance

MERT

* Previous work: some papers on longitudinal trends and trees, Sela and Simonoff (2012)
* MERT model
  + y_i = f(X_i) + Z_i b_i + e_i
  + b_i ~ N(0, D), e_i ~ N(0, R_i)
* Fixed part approximated by a tree structure
* Linear random part allows for random intercepts and random slopes
* Fitted via a modified EM algorithm in which y* (the response with the current random-effect estimates removed) is handed to CART to estimate f(X_i); see the sketch after this list
  1. random effects known: estimate the CART tree
  2. CART tree known: estimate the random effects
* Prediction
  + known cluster: get the fixed part from the tree node and add the random part based on cluster membership
  + new cluster: use only the fixed part
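A minimal sketch of the alternating scheme, not the authors' implementation: it assumes a single grouping variable with random intercepts only and skips the shrinkage step (the paper also estimates the variance components D and R_i and shrinks the random effects accordingly). Names such as `fit_mert` and `n_iter` are illustrative.

```python
# Illustrative sketch of the MERT-style alternating (EM-like) fit described above.
# Assumptions: one grouping variable, random intercepts only, and no shrinkage of
# the random effects (the paper estimates D and R_i and shrinks the estimates).
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_mert(X, y, clusters, n_iter=20):
    X, y, clusters = np.asarray(X), np.asarray(y), np.asarray(clusters)
    b = {c: 0.0 for c in np.unique(clusters)}          # random intercepts b_i
    tree = DecisionTreeRegressor(min_samples_leaf=20)  # fixed part f(X_i)
    for _ in range(n_iter):
        # Step 1: random effects held fixed -> fit the tree on y* = y - Z_i b_i
        y_star = y - np.array([b[c] for c in clusters])
        tree.fit(X, y_star)
        # Step 2: tree held fixed -> re-estimate the random effects from residuals
        resid = y - tree.predict(X)
        for c in b:
            b[c] = resid[clusters == c].mean()
    return tree, b


def predict_mert(tree, b, X_new, clusters_new):
    # Known cluster: fixed part + random intercept; new cluster: fixed part only.
    fixed = tree.predict(np.asarray(X_new))
    return fixed + np.array([b.get(c, 0.0) for c in clusters_new])
```

For MERF, the same loop would swap the single tree for a random forest and use its out-of-bag predictions in step 2, as described in the next section.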
MERF

* Previous work: cluster-based bootstrapping
* Same as MERT, but f(X_i) is now estimated by a random forest (RF), and the out-of-bag predictions f*(X_i) are used to estimate the random effects in the next step
* Standard bootstrapping is used, based on the assumption that the random effects fully explain the intracluster correlation

Extensions

* Generalized Mixed Effects Regression Trees (GMERT)
* GMERF?

Application

* Simulations: MERT and MERF outperform CART and RF when random effects are present
* Real-data examples

Implementation

* Python: https://pypi.org/project/merf/ (a hypothetical usage sketch follows this list)
* R: only packages for related methods
  + https://cran.r-project.org/web/packages/glmertree/index.html
  + https://cran.r-project.org/web/packages/REEMtree/REEMtree.pdf
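A hypothetical usage sketch of the `merf` package linked above, on synthetic data. The constructor arguments and the `fit(X, Z, clusters, y)` signature are recalled from the package's documentation and may differ across versions, so treat this as an illustration of the interface shape rather than a verified recipe.

```python
# Hypothetical example with the merf package (pip install merf) on synthetic data.
# Constructor arguments and fit/predict signatures may differ by package version.
import numpy as np
import pandas as pd
from merf import MERF

rng = np.random.RandomState(0)
n, n_clusters = 500, 10
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
clusters = pd.Series(rng.randint(0, n_clusters, size=n), name="cluster")
cluster_effect = {c: rng.normal(scale=2.0) for c in range(n_clusters)}
y = X["x1"] - 2 * X["x2"] + clusters.map(cluster_effect) + rng.normal(scale=0.5, size=n)
Z = pd.DataFrame({"intercept": np.ones(n)})  # random intercepts only

model = MERF(max_iterations=10)
model.fit(X, Z, clusters, y)
y_hat = model.predict(X, Z, clusters)  # known clusters get fixed + random part
```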
Comments

* Problems of multilevel models may apply here as well
  + issues with small level-2 sample sizes (Stegmueller 2013, Elff et al. 2016)
  + estimation with many random effects can become cumbersome
* Tuning MERFs?

--------------------------------------------------------------------------------
/20180823_data_tyranny_for_social_good/tyranny_notes.md:
--------------------------------------------------------------------------------
# The Tyranny of Data? The Bright and Dark Sides of Data-Driven Decision-Making for Social Good

## Summary Notes

Data tyranny, adapted from Easterly's tyranny of experts:

- expert justifications are considered objective
- beneficiaries are uninformed about the process
- experts act coercively without systems for checking them and are allowed to validate their own systems against their own "objective" rules

Rise of data for good:

- human decision-making is riddled with bias
- algorithms can be used for processes too complex for human understanding
- already in use in policing, healthcare, and economics

Problems:

- lots of missing private information can be inferred (identity, sexual orientation, ethnicity, mental-health diagnosis)
- information asymmetry deepens power differentials between people and data holders
- types of lack of transparency:
  - intentional opacity: to protect IP
  - illiterate opacity: people lack the technical understanding
  - intrinsic opacity: algorithms are simply hard to interpret
- social exclusion and discrimination:
  - disparate impact if data are not weighted properly (e.g., indirect discrimination by overweighting zip codes)
  - discrimination built in when creating the algorithm in the first place (e.g., direct discrimination through disparate treatment)
  - misuse of models in inappropriate contexts
  - biased training data used to justify and validate algorithms
  - losing access to, or being denied, benefits on the basis of others' behavior
- lack of transparency means that detecting the source of errors (e.g., incorrect entity resolution, bad data values) is much harder
- siloed data prevents users from accessing and controlling their own data

Human-centric requirements for social good algorithms:

- grant people "personal data vaults" where they can own and control their personal data
- improve transparency and accountability
- tools to identify and rectify bias
- explanation tools
- send the code to the data (instead of vice versa; OPAL)
- "living labs" to experiment with data
- communities of volunteers willing to participate in policy experiments

Organizations working on transparent algorithms:

- Data Transparency Lab
- DARPA Explainable Artificial Intelligence (XAI) project
- New York University’s Information Law Institute, such as Helen Nissenbaum and Solon Barocas
- Microsoft Research, such as Kate Crawford and Tarleton Gillespie
- Open Algorithms (OPAL) project (Orange, the MIT Media Lab, Data-Pop Alliance, Imperial College London, and the World Economic Forum)

## Questions

- How do we properly consent people to the use of their data for thousands of applications, many not yet invented? How do we obtain consent in a way that people will attend to and understand, and that isn't purely compliance-based like IRB consent forms or dense like terms of service?
- What education is required to make people informed about data? Who is responsible for providing that education? Where will it be delivered? Can we ethically develop algorithms for populations that are unlikely or unable to receive that education?
- How can people make informed decisions about their data? Given the millions of data points they produce in a day, how can they view, explore, and understand their data sufficiently to give *informed* consent?
- How could we make data users accountable to human-centric requirements? What incentives are there for following those requirements? What organizations (governments?) are responsible for policing users? What powers could they have for enforcement? How would they discover data misuse?
- For many of the applications DSaPP works on, consent requirements would either delay development and implementation of algorithms for years (e.g., needing to wait five years for enough consents from people entering jail, homeless services, etc.) or be highly ambiguous (who needs to consent to data use for multiple housing inspections? building managers? tenants? building owners? what happens when they turn over?). Is this simply a cost we should pay? Are there some cases where consent is not required?

--------------------------------------------------------------------------------
/20180621_technical_debt_ml_rubric/ml_test_score_notes.md:
--------------------------------------------------------------------------------
# Reading group 2018-06-21

## The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction (Proceedings of IEEE Big Data, 2017)

Link: https://ai.google/research/pubs/pub46555

* Introduction
  * Goal: 28 'actionable' tests for ML systems
  * "prior work has identified problems but been largely silent on how to address them"
  * "this paper details actionable advice drawn from practice and verified with interviews"
* Tests for features and data
  * Data 1: feature expectations are captured in a schema (a minimal check is sketched after this section)
    * What are sensible 'feature expectations'? Think in terms of statistical priors:
      * Prior support - hard boundary on feasible data values
      * Prior density - "soft boundary" on the likelihood of data values
  * Data 2: all features are beneficial
    * Very vague; reduces to "test feature importance/significance"
  * Data 3: no feature's cost is too much
    * How do you quantify the cost of adding a feature?
      * Highly model dependent
      * Generally nonlinear?
    * How do you measure instability when adding noisy features?
    * Sparsity constraints are not typically enforced in overtrained ML models - how to control this?
  * Data 4: features adhere to meta-level requirements
    * i.e., "data constraints are uniformly applied throughout the pipeline"
  * Data 5: the data pipeline has privacy controls
  * Data 6: new features can be added quickly
  * Data 7: test ETL for feature generation
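Data 1 raises the question of what a checkable "feature expectation" actually looks like. A minimal pandas sketch under our own framing of prior support (hard bounds) and prior density (a cap on tail mass); the column names, bounds, and 5% threshold are made-up illustrations, not the paper's tooling. The meeting notes at the end of this file point to `great_expectations` as an existing package for the same idea.

```python
# Minimal sketch of schema-style feature expectations (Data 1), using pandas.
# Column names, bounds, and the 5% tail threshold are illustrative assumptions.
import pandas as pd

EXPECTATIONS = {
    # feature: (hard_min, hard_max, soft_min, soft_max)
    "age":            (0,   120, 18, 90),
    "n_prior_visits": (0,  1000,  0, 50),
}

def check_feature_expectations(df: pd.DataFrame, max_soft_violation=0.05):
    failures = []
    for col, (hard_lo, hard_hi, soft_lo, soft_hi) in EXPECTATIONS.items():
        s = df[col].dropna()
        # Prior support: hard boundary - any value outside is an error.
        if ((s < hard_lo) | (s > hard_hi)).any():
            failures.append(f"{col}: values outside feasible range [{hard_lo}, {hard_hi}]")
        # Prior density: soft boundary - too much mass in the tails is suspicious.
        tail_frac = ((s < soft_lo) | (s > soft_hi)).mean()
        if tail_frac > max_soft_violation:
            failures.append(f"{col}: {tail_frac:.1%} of values outside the expected range")
    return failures
```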
* Tests for model development
  * Model 1: every model specification undergoes a code review and is checked into a repository
    * Best way to version-control config files? Private repos?
  * Model 2: offline proxy metrics correlate with actual online impact metrics
  * Model 3: all hyperparameters have been tuned
    * This is an entire field of research, and hardly a trivial question
  * Model 4: the impact of model staleness is known
    * Evaluating "stale" models - what determines staleness?
    * How much do we expect target distributions to evolve over time?
    * Does a fast decay of "stale" models imply recency bias/overfitting?
  * Model 5: complex models outperform simple models
    * How can we design model groups to best test this?
  * Model 6: model quality is sufficient on all important data slices (a per-slice evaluation sketch follows this section)
    * Slicing by bias categories, at-risk categories, and other pubpol-specific topics
  * Model 7: the model has been tested for considerations of inclusion - see #6
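For Model 6, a small sketch of what "quality on all important data slices" could mean operationally: compute the evaluation metric per subgroup rather than only in aggregate. The slice column, column names, and the choice of AUC are placeholders, not the paper's procedure.

```python
# Sketch of per-slice evaluation (Model 6): report a metric for each subgroup
# instead of only the aggregate. `slice_col` (e.g., a protected attribute or an
# at-risk flag) and the metric choice are illustrative assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

def score_by_slice(df: pd.DataFrame, slice_col: str, label_col="label", score_col="score"):
    rows = []
    for value, grp in df.groupby(slice_col):
        if grp[label_col].nunique() < 2:
            continue  # AUC is undefined when a slice contains only one class
        rows.append({slice_col: value,
                     "n": len(grp),
                     "auc": roc_auc_score(grp[label_col], grp[score_col])})
    return pd.DataFrame(rows).sort_values("auc")
```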
* Tests for ML infrastructure
  * Infra 1: training is reproducible (a seed/fingerprint sketch follows this outline)
    * What goes into complete reproducibility?
      * Controlling pseudo-random number generation?
      * Controlling live data sources?
  * Infra 2: model specification code is unit tested
    * Testing implementations - do we currently have integration tests for `catwalk` models?
  * Infra 3: the full ML pipeline is integration tested
  * Infra 4: inspect data quality before serving it to the model
  * Infra 5: the model allows debugging by observing the step-by-step computation of training or inference on a single example
    * On the upside: could be useful for showing how `individual_importances` translate into a decision
    * On the downside: could bias how feature importances are interpreted
  * Infra 6: models are tested via a canary process before they enter production serving environments
    * What distinguishes a "canary process" from a critical integration test?
  * Infra 7: models can be quickly and safely rolled back to a previous serving version
* Monitoring tests for ML
  * Monitor 1: dependency changes result in a notification
  * Monitor 2: data invariants hold in training and serving inputs - duplicate?
  * Monitor 3: training and serving features compute the same values - duplicate?
  * Monitor 4: models are not too stale - duplicate?
  * Monitor 5: the model is numerically stable
    * Numerical stability is not limited to ETL issues - see the model monitor deep dive
  * Monitor 6: the model has not experienced a dramatic or slow-leak regression in training speed, serving latency, throughput, or RAM usage
  * Monitor 7: the model has not experienced a regression in prediction quality on served data - duplicate?
* Survey insights
  * Goal is to grade ML systems based on their adherence to the traits above
  * Performed interviews at Google
  * Aside: the authors literally include a citation for the statement "checklists are helpful even for expert teams [18]"
  * Common themes:
    * Teams have trouble communicating data ownership, which reduces incentives to do integration tests
    * Model "canarying" occurs long after engineering decisions were made
  * Survey self-assessment
    * "Teams using purely image or audio data did not feel many of the Data tests were applicable" ???
  * How to handle the cost (time, data, money, etc.) of acquiring labels? Not addressed at all!
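On Infra 1, one concrete (if partial) answer to "what goes into complete reproducibility" is pinning every source of randomness and recording a fingerprint of the training data. A minimal sketch below; the seeding scope and hash granularity are our own assumptions, and it does not address live data sources or library versions.

```python
# Partial reproducibility checklist in code (Infra 1): fix seeds and fingerprint
# the training data. Does not cover live/changing data sources or library versions.
import hashlib
import random

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

SEED = 20180621
random.seed(SEED)      # python stdlib randomness
np.random.seed(SEED)   # numpy global RNG

def data_fingerprint(df: pd.DataFrame) -> str:
    # Hash of the training frame so silent input changes are detectable later.
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()

model = RandomForestClassifier(n_estimators=100, random_state=SEED)  # model-level seed
```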
## Meeting notes:

Introduction (Eddie):
* Paper is a follow-up to earlier work
* Thinking of data as "part of the source code"
* Can help us determine new features in `triage`

Discussion notes:
* Going through `triage` and assessing whether we meet the requirements; worth it to implement
* Some packages already exist: [great_expectations](https://github.com/great-expectations/great_expectations)
* Data requirements are ill-defined - creates ambiguity
* When we get data dumps, the data has already undergone unobserved transformations
* Organizational problems: ownership of tools, common practices often undocumented
* Enrichment of data with metadata
* How do we define "confidence" in what the data should look like? (i.e., prior confidence)
* Expanding feature groups
  * Looking at time aggregations/blocking to better investigate subsets of features
  * Example: examine the smallest 10% of feature importances for all models in a model group
  * Improve using `audition`/`postmodeling`?
  * More difficult to gauge impacts for nonlinear models
* Feature selection using ML? Should it be implemented in `triage`?
* Pipelines instead of models
  * What parts of pipelines are algorithmic, and what could be considered preprocessing or postprocessing?
  * How do we change the feature space?
* Problem with `triage`: no way of processing lagged data (but Adolfo is working on it)
* Information is lost when aggregated - how much, and how do we control it?
* Trading off computational costs vs. ML costs - how do we measure this?
* Data ownership is important, and in general people are bad at it!

Takeaways:
* We should explore feature selection in `triage`
* We should explore converting model objects into pipeline objects
* We should explore ETL checks at places other than only the model output
* We should explore new ways to integrate partner `domain knowledge` into models

--------------------------------------------------------------------------------