├── .gitignore ├── Gemfile ├── _config.yml ├── _includes └── head-custom-google-analytics.html ├── scores.md ├── qualitative-vs-quantitative.md ├── productivity-concepts.md ├── README.md ├── persona-champions.md ├── driving-decisions.md ├── goals-signals-metrics.md ├── metric-pitfalls.md ├── data-vs-insights.md ├── audiences.md ├── metrics-and-performance-reviews.md ├── developer-personas.md ├── data-collection-principles.md ├── dph-goals-and-signals.md ├── metric-principles.md ├── example-metrics.md ├── why-our-metrics.md └── LICENSE /.gitignore: -------------------------------------------------------------------------------- 1 | Gemfile.lock 2 | _site/ 3 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | source 'https://rubygems.org' 2 | gem 'github-pages', group: :jekyll_plugins 3 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | remote_theme: pages-themes/primer 2 | plugins: 3 | - jekyll-remote-theme 4 | title: The LinkedIn DPH Framework 5 | -------------------------------------------------------------------------------- /_includes/head-custom-google-analytics.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 10 | -------------------------------------------------------------------------------- /scores.md: -------------------------------------------------------------------------------- 1 | # What’s wrong with scores? 2 | 3 | Resist the temptation to aggregate metrics into scores. Scores are problematic 4 | for a variety of reasons. 5 | 6 | # Issues 7 | 8 | **Added Indirection** — Creating scores is tempting because it appears to reduce 9 | the amount of information communicated to users. But, the issue is that a score 10 | adds another layer of indirection. Rather than reducing the amount of 11 | information needed to make decisions, scores add to it. 12 | 13 | **Aggregates and Weights** — Scores require additional decisions to be made 14 | about combining and weighting metrics. Agreeing to a standard set of weights 15 | distracts from the main goal of using metrics to drive priorities and 16 | improvements. It’s okay to let different teams prioritize different metrics. 17 | 18 | **Stakeholder Buy-in** — Scores also require a lot of "buy-in" from 19 | stakeholders. Getting the buy-in necessary for metrics is often difficult enough 20 | that creating scores as "meta-metrics" isn’t worth the effort. Another option is 21 | to have a strong sponsor for the score. But these top-down efforts are rare. 22 | 23 | **Expensive to Develop** — When creating a metric, lots of decisions need to be 24 | made about the telemetry and data pipeline. What is included or excluded from 25 | the metric? When does the duration start or end? Prioritize and iterate on these 26 | decisions before considering an aggregate score. 27 | 28 | **Unactionable Changes** — Scores are less directly actionable and meaningful 29 | than metrics. Creating an alert on a score is error prone as one metric’s 30 | improvement may conceal another metric’s worsening. 31 | 32 | **Increased Noise** — The changes in contributing metrics also make variance 33 | calculations more difficult. As fluctuations are bound to occur, the combination 34 | of those changes may increase the noise in the score. 
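To make the "Unactionable Changes" and "Increased Noise" points above concrete, here is a minimal sketch of how a weighted score can stay flat while one of its inputs regresses. The metric names, weights, and normalization are invented purely for illustration:

```python
# Hypothetical example: two metrics rolled up into one weighted score.
weights = {"build_success_rate": 0.5, "p90_build_minutes": 0.5}

def score(metrics):
    # Normalize build time so that lower is better (20 min -> 0.0, 0 min -> 1.0).
    normalized = {
        "build_success_rate": metrics["build_success_rate"],
        "p90_build_minutes": 1 - min(metrics["p90_build_minutes"], 20) / 20,
    }
    return sum(weights[name] * value for name, value in normalized.items())

last_month = {"build_success_rate": 0.90, "p90_build_minutes": 10}
this_month = {"build_success_rate": 0.95, "p90_build_minutes": 11}  # builds got slower

print(round(score(last_month), 2), round(score(this_month), 2))  # 0.7 0.7
```

An alert or review built on the score alone would never surface the build-time regression; only the underlying metrics make it visible.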
35 | 36 | **Missing Units** — Scores often have no units and therefore are not meaningful 37 | or intuitive. To say that a score went from 50 to 60 is meaningless without more 38 | context. Is the overall score out of 100 or 1,000? Is a 20% improvement typical 39 | or remarkable? 40 | 41 | **Potential for Gaming** — When a set of metrics are reduced to a single number, 42 | there might be an incentive to "game" the system to achieve higher scores, 43 | potentially at the expense of other important factors. 44 | 45 | **Difficulty in Comparison** — Due to inherent differences in services or 46 | systems, it may be difficult to meaningfully compare scores. The effort to 47 | improve the score in one scenario by 10% may require ten times more effort in 48 | another scenario for the same benefit. 49 | 50 | Next: [Metrics and Performance Reviews](metrics-and-performance-reviews.md) -------------------------------------------------------------------------------- /qualitative-vs-quantitative.md: -------------------------------------------------------------------------------- 1 | # Qualitative vs Quantitative Measurements 2 | 3 | We collect both quantitative data and qualitative data. They have different uses 4 | and trade-offs. 5 | 6 | - [Qualitative Data](#qualitative-data) 7 | - [Quantitative Data](#quantitative-data) 8 | - [Conflicts](#conflicts) 9 | 10 | ## Qualitative Data 11 | 12 | Usually qualitative data is in the form of personal opinions or sentiments about 13 | things. It is collected by asking people questions, usually through surveys or 14 | by interviewing people. 15 | 16 | Qualitative data is our most "full-coverage" data. It can tell us about almost 17 | anything that can be framed as a survey question. However, it sometimes is hard 18 | to take direct action on, depending on how the survey questions are constructed. 19 | Also, you usually can't survey people _frequently_ (they don't want to be 20 | surveyed that often) which means that your data points are coming in every few 21 | months or quarters, rather than every day. So you can't see, for example, if a 22 | change you made yesterday caused an improvement today. 23 | 24 | Qualitative data usually ends up being best for discovering general areas for 25 | action, but often requires specific follow-up analysis to discover specifically 26 | what action needs to be taken in those areas. It is most often appropriate when 27 | you are measuring something abstract or social (like “happiness,” 28 | “productivity,” or “how do people feel about my product”), as opposed to 29 | something concrete or technical (like “how many requests do I get per second” or 30 | “how long is the median incremental build time of this specific binary?”). 31 | 32 | That said, it is usually easier to survey/interview users than to instrument 33 | systems for quantitative data. So when first analyzing an area, doing surveys or 34 | interviews is usually the best way to get started. However, eventually you do 35 | want to have quantitative data, because it's very useful when figuring out what 36 | specific actions you need to take to fix productivity issues. 37 | 38 | ## Quantitative Data 39 | 40 | Quantitative data is easier to take direct action on, but usually harder to 41 | implement. It's important to know that [somebody will _use_ the 42 | data](driving-decisions.md) before you go through the work to put numbers on a 43 | graph. 
44 | 45 | Quantitative data is especially helpful when an engineer needs to do an 46 | investigation that will lead to an engineering team taking direct action. For 47 | example, "What are the slowest tests in my codebase?" An engineer asking that 48 | question likely is about to go fix those specific tests, and then wants to see 49 | the improvement reflected in the same metric. 50 | 51 | It is dangerous to use quantitative methods on things that should be 52 | qualitative. For example, **there is no general quantitative metric for 53 | "developer productivity."** "Productivity" is inherently a _quality_ that 54 | you’re measuring (“how easily are people able to do their jobs?”), and there’s 55 | no quantitative measure of it. "Code complexity" is similar---that is a quality 56 | that is experienced by human beings; you can't measure it with a number. 57 | 58 | ## Conflicts 59 | 60 | When the qualitative data and the quantitative data disagree, usually there is 61 | something wrong with the quantitative data. For example, if developers all say 62 | they are unhappy with build times, and our build time metrics all look fine, our 63 | build time metrics are probably wrong or missing data. 64 | 65 | Next: [Audiences: Always Know Who Your Data is For](audiences.md) -------------------------------------------------------------------------------- /productivity-concepts.md: -------------------------------------------------------------------------------- 1 | # Productivity Concepts for Software Developers 2 | 3 | There are some common concepts used when discussing the productivity of software 4 | developers. These concepts aren't specific to designing metrics, but they 5 | frequently come up when we think about which metrics to choose. 6 | 7 | - [Iteration Time (aka Cycle Time)](#iteration-time-aka-cycle-time) 8 | - [Context Switching](#context-switching) 9 | 10 | ## Iteration Time (aka Cycle Time) 11 | 12 | One of the most important things to optimize about software engineering is 13 | "iteration time," which is the time it takes for an engineer to make an 14 | observation, decide what to do about that observation, and act on it. There are 15 | small iterations, like when a person is coding, they might write a line of code, 16 | see the IDE give it a squiggly red underline, and fix their typo. And there are 17 | huge iterations, like the time between having an idea for a whole product and 18 | then finally releasing that product to its users, getting feedback, and making 19 | improvements to the product. There are many types and sizes of iterations, like 20 | doing a build and seeing if your code compiles, posting a change and waiting for 21 | a code review comment, deploying an experiment and analyzing feedback from 22 | users, etc. 23 | 24 | Sometimes, we don’t have a set expectation for how many iterations a process 25 | should take, but we know that if we speed up the iterations, the process as a 26 | whole will get faster. It’s a safe general assumption to make in almost all 27 | cases. 28 | 29 | Also, there are times when you fundamentally change the nature of the work by 30 | reducing iteration time. For example, if it takes 2 seconds to run all my tests, 31 | I can run them every time I save a file. But if it takes 10 minutes, I might not 32 | even run them before submitting my code for review. In particular, work can 33 | dramatically change when iterations become short enough to eliminate context 34 | switching, as described below. 
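As a rough back-of-envelope sketch (the iteration counts and durations below are invented, not measured), the cost of a single slow iteration compounds quickly across a working day:

```python
# Invented numbers purely for illustration: an engineer who runs the test
# suite 30 times a day, comparing a 2-second run with a 10-minute run.
iterations_per_day = 30

def daily_wait_minutes(iteration_seconds):
    return iterations_per_day * iteration_seconds / 60

print(daily_wait_minutes(2))    # 1.0   -- the wait is barely noticeable
print(daily_wait_minutes(600))  # 300.0 -- in practice, people stop iterating instead
```

The raw waiting time is only part of the cost, because long waits also trigger the context switching described next.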
35 | 36 | ## Context Switching 37 | 38 | Developers will "context switch" to another activity if they have to wait a 39 | certain amount of time for something. For example, they might go read their 40 | email or work on something else if they have to wait longer than 30 - 60 seconds 41 | for a build to complete. The specific time depends on various factors, such as a 42 | person's expectations for how long the task _should_ take, how the task informs 43 | the developer of its progress, etc. 44 | 45 | When you interrupt a developer for long enough, they could take up to 10 - 15 46 | minutes to mentally "reload" the "context" that they had for the change they 47 | were working on. Basically, it can take some time to figure out what you were 48 | working on, what you were thinking about it, and what you intended to do. 49 | 50 | So if you combine those two, there’s a chance that every time you make somebody 51 | wait longer than a certain amount (let's say 30 seconds, on average), you 52 | actually could lose fifteen minutes of their time. Now, this isn’t an absolute 53 | thing--you might only lose two or three minutes for a 30-second wait, but you 54 | might lose 5 - 10 minutes for a 3-minute wait. 55 | 56 | Also, sometimes important context is entirely lost from the developer’s mind 57 | when you make them context switch, especially when the context switch comes from 58 | a sudden interruption (like having some service go down that they depend on for 59 | development). They come back to what they are doing and have forgotten something 60 | important, leading to bugs or missed opportunities in your actual production 61 | systems. 62 | 63 | So it’s important to avoid making developers context switch. 64 | 65 | Next: [Example Metrics](example-metrics.md) 66 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Welcome to the [LinkedIn Developer Productivity and Happiness 2 | Framework](https://linkedin.github.io/dph-framework/)! 3 | 4 | At LinkedIn, we have a fairly advanced system for understanding our developers, 5 | the success of our engineering work, and where we should focus our 6 | infrastructure efforts to be most effective. 7 | 8 | This repository contains the documents that describe this system. They are 9 | mostly direct copies of the internal documents that our own engineers read to 10 | understand how this works. 11 | 12 | This document set explains how to define metrics and feedback systems for 13 | software developers, how to get action taken on that data, and provides examples 14 | of a few internal metrics that we use. 15 | 16 | You can read the documents in any order. Each one is designed to be able to be 17 | read and referenced independently. 
However, we provide a suggested sequence and
18 | hierarchy here:
19 | 
20 | * **Goals, Signals, and Metrics: A Framework for Defining Metrics**
21 |   * [Goals, Signals, and Metrics](goals-signals-metrics.md)
22 |   * [Developer Productivity & Happiness Goals and Signals](dph-goals-and-signals.md)
23 | * **Developer Personas: A system for categorizing and understanding developers**
24 |   * [Developer Personas](developer-personas.md)
25 |   * [Persona Champions](persona-champions.md)
26 | * **Guidelines for Teams Who Create Metrics and Feedback Systems**
27 |   * [Data vs Insights](data-vs-insights.md)
28 |   * [Qualitative vs Quantitative Measurements](qualitative-vs-quantitative.md)
29 |   * [Audiences: Always Know Who Your Data is For](audiences.md)
30 |   * [Driving Decisions With Data](driving-decisions.md)
31 |   * [Data Collection Principles](data-collection-principles.md)
32 | * **Quantitative Metrics: General Tips and Guidelines**
33 |   * [Principles and Guidelines for Metric Design](metric-principles.md)
34 |   * [Common Pitfalls When Designing Metrics](metric-pitfalls.md)
35 |   * [What's wrong with "scores?"](scores.md)
36 |   * [Metrics and Performance Reviews](metrics-and-performance-reviews.md)
37 | * **Example Metrics**
38 |   * [Productivity Concepts for Software Developers](productivity-concepts.md)
39 |   * [Example Metrics](example-metrics.md)
40 |   * [Why Did We Choose Our Metrics?](why-our-metrics.md)
41 | 
42 | ## Forking, Modifying, and Contributing
43 | 
44 | We have made the DPH Framework open-source so that you can fork, modify, and
45 | re-use these documents however you wish, as long as you respect the license that
46 | is on the repository. You can see the source in our [GitHub
47 | repo](https://github.com/linkedin/dph-framework/).
48 | 
49 | We welcome community contributions that help move forward the state of the art
50 | in understanding software developers across the entire software industry. If
51 | there’s something missing in the documents that you’d like to see added, feel
52 | free to file an issue via [GitHub
53 | Issues](https://github.com/linkedin/dph-framework/issues)! If you just have
54 | questions or a discussion you’d like to have, participate in our [GitHub
55 | Discussions](https://github.com/linkedin/dph-framework/discussions).
56 | 
57 | And of course, if you want to contribute new text or improvements to the
58 | existing text, we welcome your contributions! Keep in mind that we hold this
59 | framework to a very high standard---we want it to be validated by real
60 | experience in the software industry, to be generally applicable across a wide
61 | range of software development environments, and we want any additions to be both
62 | interesting and accessible to a broad audience. If you think you have content
63 | that meets that bar and fits in with these documents, we would love to have your
64 | contribution! If you’re not sure, start a
65 | [discussion](https://github.com/linkedin/dph-framework/issues) or just send us a
66 | PR and we can discuss it.
67 | 
68 | ## License
69 | 
70 | The LinkedIn Developer Productivity & Happiness Framework is licensed under [CC
71 | BY 4.0](http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1).
72 | 
73 | ![CC](
74 | https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1)![BY](https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1)
75 | 
76 | Copyright © 2023 LinkedIn Corporation
77 | 
--------------------------------------------------------------------------------
/persona-champions.md:
--------------------------------------------------------------------------------
1 | # Persona Champions
2 | 
3 | There is a person or small group assigned as the "champion" for each of our
4 | [Developer Personas](developer-personas.md). They are a _member_ of that persona.
5 | For example, the persona champion for backend developers _is_ a backend
6 | developer or a manager of backend developers.
7 | 
8 | Any time somebody at the company has a question about a particular persona that
9 | isn't answered by our automated systems, the persona champion is their point of
10 | contact. If you want some custom data analysis done about a persona, want to
11 | know what their requirements are, want to understand the best way to engage with
12 | a persona, etc., then the persona champion is the best person to ask.
13 | 
14 | - [Duties of Persona Champions](#duties-of-persona-champions)
15 |   - [Workflow Mapping](#workflow-mapping)
16 |   - [Feedback Analysis](#feedback-analysis)
17 |   - [Point of Contact](#point-of-contact)
18 | 
19 | ## Duties of Persona Champions
20 | 
21 | ### Workflow Mapping
22 | 
23 | When first creating the [Developer Personas](developer-personas.md), it is often
24 | helpful to provide infrastructure owners with some basic information about how
25 | that persona does work. A persona champion produces and maintains a document
26 | that answers the questions, "What is the most common workflow that a developer
27 | has, in this persona? What tools do they use as part of this workflow, and how
28 | do they use them?"
29 | 
30 | For most traditional software engineers, this would include the tools they use
31 | for:
32 | 
33 | * Design and Planning
34 | * Source Control
35 | * Editing Code
36 | * Building/Compiling
37 | * Debugging
38 | * Dependency Management
39 | * Testing
40 | * Code Review
41 | * Release / Deployment
42 | * Experimentation
43 | * Bug Tracking / Fixing
44 | * Monitoring and Alerting
45 | 
46 | Each persona will have custom steps in addition to (or instead of) those. Some
47 | personas, like data scientists, will have completely different workflow steps.
48 | 
49 | This isn't just a list of tools, but sentences that describe how those tools are
50 | used in each "phase" of development above.
51 | 
52 | Sometimes you will find that a persona has different ways of working for each
53 | person or team. In that case, you should document the broadest things the
54 | persona has in common, and note that otherwise there are a lot of differences
55 | between individuals. If there are _large_ sub-categories (essentially
56 | "sub-personas" who have unique workflows _within_ a larger persona), you can call
57 | out how those large sub-personas work, too. The trick is to keep the document
58 | small enough that people can read it and you can easily maintain it in the
59 | future, while still being useful for its readers.
60 | 
61 | These documents are useful for infrastructure owners who are engaging in new
62 | strategic initiatives and need to understand if their plans are going to fit in
63 | with how developers across the company work. For example, if I'm going to work
64 | on a new web framework, does it fit in with how web developers work?
Are there 65 | any other developers who might benefit from it, at the company? 66 | 67 | ### Feedback Analysis 68 | 69 | Persona champions are the primary people responsible for processing the feedback 70 | from our surveys and real-time feedback systems. They get access to the raw 71 | feedback data in order to process and categorize it into actionable 72 | [insights](data-vs-insights.md). 73 | 74 | After each survey, this process occurs: 75 | 76 | 1. A persona champion produces an analysis document listing out the top pain 77 | points for that persona. This involves categorizing the free-text feedback, 78 | looking at the 1 - 5 satisfaction scores produced by the survey questions, 79 | and doing direct follow-up interviews with developers as necessary (when we 80 | need more clarity on what the specific pain points are). 81 | 2. We share the analysis document with the relevant stakeholders. This always 82 | includes specifically notifying any team that might be involved in fixing one 83 | of the pain points. They should get an immediate "heads up" that their tool 84 | is being named as a top pain point for a persona. 85 | 3. We schedule a meeting that includes all those relevant stakeholders. The 86 | attendees should include people who have priority control over the work of 87 | the relevant infrastructure teams (i.e., the managers of the stuff named in 88 | the pain points). We go over the analysis document, answer questions, and 89 | agree on next steps. Somebody (such as a Technical Program Manager) is 90 | responsible for following up and making sure those next steps get executed. 91 | 92 | It is important that this whole process happens _before_ the planning cycle of 93 | the infrastructure teams, so that there is enough time for them to consider 94 | developer feedback in their planning. 95 | 96 | Persona champions also help curate the questions that go into surveys for their 97 | persona. They get lists of questions to review, and can also propose new 98 | questions for their persona when necessary. 99 | 100 | ### Point of Contact 101 | 102 | We expect persona champions to be available to answer questions about a persona 103 | from various parts of the organization. People might want more specifics about a 104 | particular pain point noted in the feedback analysis, or they might have 105 | questions about behavior or data not included in any of our documents. 106 | 107 | We expect persona champions to put in reasonable effort to answer reasonable 108 | questions posed to them, as long as those questions will actually assist in 109 | developer productivity if answered. If the work to answer a question would be 110 | too extensive, it is fine for the persona champion to decline to answer it, 111 | citing the amount of work that would be involved in getting the data. 112 | 113 | Next: [Data vs. Insights](data-vs-insights.md) 114 | -------------------------------------------------------------------------------- /driving-decisions.md: -------------------------------------------------------------------------------- 1 | # Driving Decisions With Data 2 | 3 | It is important when designing metrics, systems, or processes involving data 4 | that we understand how they will be used to drive what decisions, by who. The 5 | data needs to be something that [engineering leadership, senior ICs, front-line 6 | managers, or front-line engineers](audiences.md) can _do_ something effective 7 | with. 
8 | 9 | **When you propose a metric, system, or process involving data, you should 10 | always explain how it will drive decisions.** 11 | 12 | - [Bad Examples](#bad-examples) 13 | - [Misleading Numbers](#misleading-numbers) 14 | - [No Audience](#no-audience) 15 | - [Good Example](#good-example) 16 | - [Driving Decisions Differently for Different Groups](#driving-decisions-differently-for-different-groups) 17 | - [Focus on Causing Action](#focus-on-causing-action) 18 | 19 | ## Bad Examples 20 | 21 | Let's look at some bad examples to demonstrate this. 22 | 23 | ### Misleading Numbers 24 | 25 | Most of us know that "lines of code written per software engineer" is a bad 26 | metric. But it's especially bad in this context of effectively driving the right 27 | decisions. 28 | 29 | _Who_ does _what_ with that number? It tells nothing to engineering leadership 30 | and senior engineers--there is no specific action they can take to fix it. The 31 | front-line engineers don't care that much, other than to look at their own stats 32 | and feel proud of themselves for having written a lot of code. It's only the 33 | front-line managers who will be able to _do_ something with it, by trying to 34 | talk to their engineers or figure out why some of their engineers are more 35 | "productive" than others. 36 | 37 | Now, to be clear, it's fine to make a metric, system, or process that only one 38 | of these groups can use. The problem with _this_ metric is that when you give it 39 | to a front-line manager, it often either (a) misleads them or (b) is useless. 40 | They may see that one engineer writes fewer lines of code than another, but the 41 | engineer writing fewer lines actually has more impact on the users of the 42 | product, or is doing more research, or is writing better-crafted code. 43 | 44 | Basically, what you've actually done is sent the front-line manager on an 45 | unreliable wild-goose chase that wasted their time and taught them to distrust 46 | your metrics and you! You drove no decision for that manager--in fact, you 47 | instead [created a problem](data-vs-insights.md) that they have to solve: 48 | constantly double-checking your metric and doing their own investigations 49 | against it, since it's so often wrong. Worse yet, if the manager trusts this 50 | metric without any question, it could lead to bad code, rewarding the wrong 51 | behavior, a bad engineering culture, and low morale. 52 | 53 | ### No Audience 54 | 55 | Another way to do this wrong is to have some metric but not have any 56 | [_individual_ who will make a decision based on it](audiences.md). For example, 57 | measuring test coverage on a codebase that nobody works on would not be very 58 | useful. 59 | 60 | ## Good Example 61 | 62 | 1. You discover, through surveys or other valid feedback mechanisms, that mobile 63 | developers feel their releases are too slow. 64 | 2. You properly instrument the release pipeline to accurately measure the length 65 | of each part of the pipeline. 66 | 3. You do some basic investigation to see what the major pain points are that 67 | cause this slowness along each part of the pipeline, to get an idea of how 68 | much work would be involved in actually fixing each piece, and what would be 69 | the most important pieces to fix first. 70 | 4. 
You boil this information down into an understandable report that concisely: 71 | * Proves the problem exists 72 | * Explains why it's important 73 | * Gives a birds-eye view of where time is being spent in the release 74 | pipeline 75 | * Provides very rough estimates of how much work it is to solve each piece 76 | 5. This report is presented to [engineering leadership](audiences.md), who 77 | decides who to assign, at what priority, to fix the most important parts of 78 | the release pipeline to address this issue. 79 | 6. Front-line engineers use the instrumentation we developed in order to 80 | understand and solve the specifics of the problem. 81 | 7. After the work is done, the same data can be referenced to show how 82 | successful the optimizations were. 83 | 84 | ## Driving Decisions Differently for Different Groups 85 | 86 | At the level of front-line managers and front-line engineers, it is sufficient 87 | to provide information that allows people to figure out _where_ a problem is, so 88 | that front-line engineers can track down the actual problem and solve it. This 89 | can sometimes be [more "data" and less "insights."](data-vs-insights.md) 90 | 91 | At more senior levels, [more "insights" are required as opposed to raw 92 | data](data-vs-insights.md). In general, the more senior a person you are 93 | presenting to, the more work you should have done up front to provide insights 94 | gathered from data, as opposed to just showing raw data. 95 | 96 | ## Focus on Causing Action 97 | 98 | When you deliver data or insights to people, it should actually be something 99 | that will influence their decisions. They should be interested in having data 100 | for their decisions, and be willing to act based on the data. It's important to 101 | check this _up front_ before doing an analysis--otherwise, you can end up doing 102 | a lot of analysis work that ends up having no impact, because the recipient was 103 | never going to take action on it in the first place. 104 | 105 | For example, sometimes teams have mandates of things they _must_ do, such as 106 | legal or policy compliance, where it doesn't matter what you say about the work 107 | they do, they still have to do it in exactly the way they plan to do it. It's 108 | not useful to try to change their mind with data--your work will not result in 109 | any different decision. 110 | 111 | In general, if a person has already made up their mind and no data or insight 112 | will realistically sway them, it's not worth doing the work to provide them data 113 | or insights. We might have some opinion about the way the world should be, but 114 | that doesn't matter, because a _person has to change their mind_ in order for 115 | action to happen. If we won't even _potentially_ change somebody's mind, we 116 | should not do the work. 117 | 118 | It can be useful to ask people who request data: 119 | 120 | 1. What do you think the data will say? (What is their prediction before they 121 | look at the data.) 122 | 2. If it is different than what you think, will that change your decision in any 123 | way? 124 | 125 | If the answer to question #2 is "no," then it's not worth working on that 126 | analysis. 
127 | 128 | Next: [Data Collection Principles](data-collection-principles.md) -------------------------------------------------------------------------------- /goals-signals-metrics.md: -------------------------------------------------------------------------------- 1 | # Goals, Signals, and Metrics 2 | 3 | There is a framework we use for picking metrics called “Goals-Signals-Metrics.” 4 | 5 | Basically, first you decide what the **goals** are that you want to achieve for 6 | your product or system. Then you decide on the **signals** you want to examine 7 | that tell you if you’re achieving your goal--essentially, these are what you 8 | would measure if you had perfect knowledge of everything. Then you choose 9 | **metrics** that give you some proxy or idea of that signal, since few signals 10 | can be measured perfectly. 11 | 12 | - [Goals](#goals) 13 | - [Example Goal](#example-goal) 14 | - [Uncertainty](#uncertainty) 15 | - [Signals](#signals) 16 | - [Metrics](#metrics) 17 | 18 | ## Goals 19 | 20 | A goal should be framed as the thing that you want your team, product, or 21 | project to accomplish. It should not be framed as "I want to measure X." 22 | 23 | **Most of the trouble that teams have in defining metrics comes from defining 24 | unclear goals.** 25 | 26 | _Webster's Third New International Dictionary_ defines a "goal" as: 27 | 28 | > The end toward which effort or ambition is directed; a condition or state to 29 | > be brought about through a course of action. 30 | 31 | This needs to be stated in a fashion specific enough that it _could be 32 | measured_. Not that you have to know in advance what all the metrics will be, 33 | but that conceptually, a person could know whether you were getting closer to 34 | (or further from) your goal. 35 | 36 | ### Example Goal 37 | 38 | In developing a goal, you can start with a vague statement of your desire. For 39 | example: 40 | 41 | **LinkedIn's systems should be reliable.** 42 | 43 | However, that's not measurable. So the first thing you do is **clarify 44 | definitions**. 45 | 46 | First off, what does "LinkedIn's systems" mean? What does "reliable" mean? How 47 | do we choose which of our systems we want to measure? 48 | 49 | Well, to figure this out, we have to ask ourselves **why** we have this goal. 50 | The answer could be "We are the team that assures reliability of all the 51 | products that are used by LinkedIn's users." In that case, that would clarify 52 | our goal to be: 53 | 54 | **The products that are used by LinkedIn's users should be reliable.** 55 | 56 | That's still not measurable. Nobody could tell you, concretely, if you were 57 | accomplishing that goal. Here's the remaining problem: what does "reliable" 58 | mean? 59 | 60 | Once again, we have to ask ourselves **why** we have this goal. To do this, we 61 | might look at the larger goals of the company. At LinkedIn, our [top-level 62 | vision](https://about.linkedin.com/) is: "Create economic opportunity for every 63 | member of the global workforce." This gives us some context to define 64 | reliability: we somehow want to look at things that prevent us from 65 | accomplishing that vision. Of course, in a broad sense, there are _many_ factors 66 | that could prevent us: social, cultural, human, economic, etc. So we say, well, 67 | our scope is what we can do _technically_ with our software development and 68 | production-management processes, systems, and tools. What sort of technical 69 | issues would members experience as "unreliability" in that context? 
Probably 70 | bugs, performance issues, and downtime. 71 | 72 | So we could update our goal to be: 73 | 74 | **LinkedIn's users have an experience of our products that is free from bugs, 75 | performance issues, and downtime.** 76 | 77 | We could get more specific and define "bug," "performance issue," and 78 | "downtime," if it's not clear to the team what those specifically mean. The 79 | trick here is to get something that's clear and measurable, without it being 80 | super long. What I would recommend, if you wanted to clarify those terms, is to 81 | create _sub-goals_ for each of those terms. That is, keep this as the overall 82 | goal, and then state three more goals, one for bugs, one for performance issues, 83 | and one for downtime, which do a better job of spelling out what each of those 84 | means. 85 | 86 | One of the things that you'll notice about this exercise is that not only does 87 | it help us define our metrics, it actually helps clarify what is the most 88 | important work we should be doing. For example, this goal tells us that we 89 | should be paying _more_ attention to the experience of our users than the 90 | specific availability numbers of low-level services (even though we might care 91 | about those, too). 92 | 93 | ### Uncertainty 94 | 95 | **If you aren’t sure how to measure something, it’s very likely that you haven’t 96 | defined what it is that you are measuring.** This was the problem with 97 | "developer productivity" measurements from the past--they didn’t define what 98 | “developer productivity” actually meant, concretely, exactly, in the physical 99 | universe. They attempted to measure an abstract nothing, so they had no real 100 | metrics. This is why it is so important to understand and clarify your goals 101 | before you start to think about metrics. 102 | 103 | ## Signals 104 | 105 | Signals are what you would measure if you had perfect knowledge---if you knew 106 | everything in the world, including everything that was inside of everybody 107 | else's mind. These do not have to actually be measurable. They are a useful 108 | mental tool to help understand the areas one wants to measure. 109 | 110 | Signals are the answer to the question, "How would you know you were achieving 111 | your goal(s)?" 112 | 113 | For example, some signals around reliability might be: 114 | 115 | * Human effort spent resolving production incidents. 116 | * Human time spent debugging deployment failures. 117 | * Number of users who experienced a failure in their workflow due to a bug 118 | * Amount of time lost by users trying to work around failures. 119 | * How reliable LinkedIn's products are according to the _belief_ of our users. 120 | * User-perceived latency for each action taken by users. 121 | 122 | The concept of "signals" can also be useful to differentiate them from "metrics" 123 | (things that can actually be measured). Sometimes people will write down a 124 | signal in a doc and then claim it is a metric, and this distinction in terms can 125 | help clarify that. 126 | 127 | ## Metrics 128 | 129 | Metrics are numbers over time that can actually be measured. A metric has the 130 | following qualities: 131 | 132 | * It can actually be implemented, in the real world, and produce a concrete 133 | number. 134 | * The number can be trended over time. 135 | * It is meaningful when the number goes up or goes down, and that meaning is 136 | clear. 137 | 138 | All metrics are _proxies_ for your signal. There are no perfect metrics. 
It is a
139 | waste of time to try to find the "one true metric" for anything. Instead, create
140 | multiple metrics and triangulate the truth from looking at different metrics.
141 | All metrics will have flaws, but _sets_ of metrics can collectively provide
142 | insight.
143 | 
144 | Example metrics for our "reliability" goal from above might be something like:
145 | 
146 | * The percentage of user sessions that do not experience an error as determined
147 |   by our product telemetry.
148 | * Percentage of deployments that do not experience a failure requiring human
149 |   intervention.
150 | * Conduct a survey of a cross-section of users to ask them their opinion about
151 |   our reliability, where they give us a score, and aggregate the scores into a
152 |   metric.
153 | * Define what the "acceptable" highest latency would be for each UI action, and
154 |   then count how often UI interactions happen _under_ those thresholds. (Display
155 |   a percentage of how many UI interactions have an "acceptable" latency.)
156 | 
157 | If you look at the signals above, you will see that some of these metrics map back
158 | to those signals.
159 | 
160 | Overall, there is a _lot_ to know about metrics, including what makes metrics
161 | good or bad, how to take action on them, what types of metrics to use in what
162 | situation, etc. It would be impossible to cover all of it in this doc, but we
163 | attempt to cover some of it in other docs on this site.
164 | 
165 | Next: [Developer Productivity & Happiness Goals and Signals](dph-goals-and-signals.md)
166 | 
--------------------------------------------------------------------------------
/metric-pitfalls.md:
--------------------------------------------------------------------------------
1 | # Common Pitfalls When Designing Metrics
2 | 
3 | There are some common pitfalls that teams fall into when they start trying to
4 | measure developer productivity, the success of their engineering efforts, etc.
5 | It would be impossible to list every way to "do it wrong," but here are some of
6 | the most common ones we've seen.
7 | 
8 | - [Measuring Whatever You Can Measure](#measuring-whatever-you-can-measure)
9 | - [Measuring Too Many Things](#measuring-too-many-things)
10 | - [Measuring Work Instead of Impact](#measuring-work-instead-of-impact)
11 | - [Measuring Something That is Not Acted On](#measuring-something-that-is-not-acted-on)
12 | - [Defining Signals and Calling them "Metrics"](#defining-signals-and-calling-them-metrics)
13 | - [If Everybody Does This, We Can Have a Metric!](#if-everybody-does-this-we-can-have-a-metric)
14 | - [Metrics That Create a Mystery](#metrics-that-create-a-mystery)
15 | - [Vanity Metrics](#vanity-metrics)
16 | 
17 | ## Measuring Whatever You Can Measure
18 | 
19 | When a team first starts looking into metrics, often they just decide to measure
20 | whatever is easiest to measure---the data they currently have available. Over
21 | time, this metric can even become entrenched as "the metric" that the team uses.
22 | 
23 | There are lots of problems that occur from measuring whatever number is handy.
24 | 
25 | For example, it's easy for a company to measure only revenue. Does that tell you
26 | that you're actually accomplishing the goal of your company, though? Does it
27 | tell you how happy your users are with your product? Does it tell you that your
28 | company is going to succeed in the long term? No.
29 | 
30 | Or let's say you're a team that manages servers in production.
What if you only 31 | measured "hits per second" on those servers? After all, that's usually a number 32 | that's easy to get. But what about the reliability of those servers? What about 33 | their performance, and how that impacts the user experience? Overall, what does 34 | this have to do with the goals of your team? 35 | 36 | The solution here is to [think about your goals](goals-signals-metrics.md) and 37 | derive a set of metrics from those, instead of just measuring whatever you can 38 | measure. 39 | 40 | ## Measuring Too Many Things 41 | 42 | If you had a dashboard with 100 graphs that were all the same size and color, it 43 | would be very hard for you to determine which of those numbers was important. It 44 | would be hard to focus a team around improving specific metrics. Everybody would 45 | get lost in the data, and very likely, no action would be taken at all. 46 | 47 | This is even worse if those 100 graphs are spread across 20 different 48 | dashboards. Everybody gets confused about where to look, which dashboard is 49 | correct (when they disagree), etc. 50 | 51 | The solution here is to have fewer dashboards that have a small set of relevant 52 | metrics on them, by understanding your [audience](audiences.md) and what [level 53 | of metrics](audiences.md) you're showing to each audience. 54 | 55 | ## Measuring Work Instead of Impact 56 | 57 | Would "number of times the saw moved" be a good metric for a carpenter? No. 58 | Similarly, any metric that simply measures the _amount of work_ done is not a 59 | good metric. It gets you an organization that tries to make work _more 60 | difficult_ so they can show you that they did _more of it_. 61 | 62 | Often, people measure work when they do not understand the _job function_ of the 63 | people being measured. In our carpenter example, the job function of a carpenter 64 | who builds houses is very different from a carpenter who repairs furniture. They 65 | both work with wood and use some of the same tools, but the _intended output_ is 66 | very different. A house builder might measure "number of projects completed that 67 | passed inspection," while the repairman might measure, "repaired furniture 68 | delivered to customers." 69 | 70 | Instead of measuring work, you want to focus on [measuring the impact or result 71 | of the work](metric-principles.md). In order to do this, you have to understand 72 | the [goals](goals-signals-metrics.md) of the team you are measuring. 73 | 74 | ## Measuring Something That is Not Acted On 75 | 76 | Sometimes we have a great theory about some metric that would help somebody. But 77 | unless that metric has an [audience](audiences.md) that will actually [drive 78 | decisions](driving-decisions.md) based on it, you should not implement the 79 | metric. 80 | 81 | Even when it has an audience, sometimes that audience does not actually plan to 82 | act on it. You can ask, "If this metric shows you something different than you 83 | expect, or if it starts to get worse, will you change your course of action?" If 84 | the answer is no, you shouldn't implement the metric. 85 | 86 | ## Defining Signals and Calling them "Metrics" 87 | 88 | Sometimes you will see a document with a list of "metrics" that can't actually 89 | be implemented. For example, somebody writes down "documentation quality" or 90 | "developer happiness" as proposed metrics, with no further explanation. 91 | 92 | The problem is that these are not [metrics](goals-signals-metrics.md), they are 93 | [signals](goals-signals-metrics.md). 
They will never be implemented, because 94 | they are not concretely defined as something that can be specifically measured. 95 | 96 | If you look at a "metric" and it's not _immediately clear_ how to implement it, 97 | it's probably a signal, and not a metric. 98 | 99 | ## If Everybody Does This, We Can Have a Metric! 100 | 101 | Don't expect people to change their behavior _just_ so you can measure it. 102 | 103 | For example, don't expect that everybody will tag their bugs, PRs, etc. in some 104 | special way _just_ so that you can count them. There's no incentive for them to 105 | do that correctly, no matter how much _you think_ it would be good. What really 106 | happens is you get very spotty data with questionable accuracy. Nobody believes 107 | your metric, and thus nobody takes action on it. And probably, they shouldn't 108 | believe your metric, because the data will almost certainly be missing or wrong. 109 | 110 | Sometimes people try to solve this by saying, "we will just make it mandatory to 111 | fill out the special tag!" This does _not_ solve the problem. It creates a 112 | roadblock for users that they solve by putting random values into the field 113 | (often just whatever is easiest to put into the field). You can start to engage 114 | in an "arms race," where you keep trying to add validation to prevent bad data. 115 | But why? At some point, you are actually _harming_ productivity in the name of 116 | measuring it. 117 | 118 | If you want people to change their behavior, you need to figure out a change 119 | that would be beneficial to them or the company, not just to you. Otherwise, you 120 | need to figure out ways to measure the behavior they currently have. 121 | 122 | ## Metrics That Create a Mystery 123 | 124 | Metrics should help [solve a mystery](data-vs-insights.md), not create one. 125 | 126 | The basic problem here is when a viewer says, "I don't know [what this metric 127 | means when it goes up or down](metric-principles.md)." For example, counting the 128 | total number of bugs filed in the company creates this sort of mystery. Is it 129 | good when it goes up, because we have improved our QA processes? Is it good when 130 | it goes down, because we have a less buggy system? Or (as usually happens) does 131 | it just mean some team changed the administrative details of how they track 132 | bugs, and thus the upward or downward movement is meaningless? 133 | 134 | The most common offenders here are [metrics that aggregate multiple numbers into 135 | a score](scores.md). 136 | 137 | ## Vanity Metrics 138 | 139 | Metrics must [drive decisions](driving-decisions.md). That means they must show 140 | when things are going well and when they are not going well. It is _especially_ 141 | important that they indicate when things are not going well, because those are 142 | your opportunities for improvement! If a metric hides problems, it is actually 143 | harmful to the team's goals. 144 | 145 | Imagine counting "number of happy customers" as your only metric. In January, I 146 | have 100 customers, and 10 of them are happy. In February, I have 200 customers, 147 | and 15 of them are happy. Our metric goes up, but our actual situation is 148 | getting worse! 149 | 150 | Sometimes teams are worried about metrics that "make them look bad." But 151 | look---if things are bad, they are bad. Only by acknowledging that they are bad 152 | and working to improve them will anything actually get better. 
By hiding the 153 | badness behind a vanity metric, we are worsening the situation for the company 154 | and our customers. 155 | 156 | Especially with developers, it is very hard to fool them. They _know_ when 157 | things are bad. If you show them some beautiful number when they are all 158 | suffering, all that will happen is you will damage your 159 | [credibility](https://www.codesimplicity.com/post/effective-engineering-productivity/) 160 | and nobody will ever believe you again in the future. 161 | 162 | Next: [What's Wrong with "Scores?"](scores.md) 163 | -------------------------------------------------------------------------------- /data-vs-insights.md: -------------------------------------------------------------------------------- 1 | # Data vs Insights 2 | 3 | What is the difference between "insights" and "data?" 4 | 5 | Data is simply raw information: numbers, graphs, survey responses, etc. A viewer 6 | has to _analyze_ data in order to come to a conclusion or solve a mystery. 7 | 8 | An "insight" is: **a presentation of information that solves a mystery for the 9 | viewer**. Now, one can do this _more_ or _less_--not all insights will fully 10 | answer _every_ question. Some might just help clarify, and the viewer has to do 11 | the rest of the analysis to fully answer their question. 12 | 13 | **When you propose a system or process that provides insights, it is a good idea 14 | to explain:** 15 | 16 | 1. **What mysteries does this solve?**  17 | 2. **Are there mysteries it intentionally doesn't solve?** 18 | 3. **How can you assure that this system currently provides (and continues to 19 | provide) trustworthy insights to its users?** 20 | 4. **How will you change this system in the future if you discover a flaw in the 21 | metric, system, or process you’ve proposed?** 22 | 5. **How do you ensure the accuracy of the data now and in the future?** 23 | 24 | Let's explain this a bit more, with some examples. 25 | 26 | - [Example: Graphs](#example-graphs) 27 | - [Example: Free-text Feedback](#example-free-text-feedback) 28 | - [Categorization](#categorization) 29 | - [Algorithmic Analysis](#algorithmic-analysis) 30 | - [Trustworthiness of Insights](#trustworthiness-of-insights) 31 | - [Accuracy of Data](#accuracy-of-data) 32 | 33 | ## Example: Graphs 34 | 35 | If you have ever heard somebody say, "But what does this graph _mean_?", you 36 | will understand the difference between data and insights. For example, let's say 37 | you have a graph that shows the average page-load latency of all pages on 38 | linkedin.com. This graph creates a huge mystery when it changes, for several 39 | reasons: 40 | 41 | 1. Because it's an average, outliers can drastically affect it, meaning people 42 | who look at it have no idea if the spikes and dips are from outliers or from 43 | actual, relevant changes. 44 | 2. Because it covers so many different areas of the site, it's hard to figure 45 | out _who_ should even be digging into it. 46 | 3. It, by itself, provides no avenue for further investigation--you have to go 47 | look at a _lot_ of other data to be able to actually understand what you 48 | should do. 49 | 50 | The _most_ that this graph could do is allow an [engineering 51 | leader](audiences.md) to say, "One person should go investigate this and tell us 52 | what is up." That's not a very impactful decision or a good use of anybody's 53 | time. Basically, this graph, if it's all we have, _creates a problem_. 
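To make the first point concrete, here is a minimal sketch (with invented latency numbers) of how a few outliers can move an average while the typical experience stays the same:

```python
import statistics

# Invented page-load latencies in milliseconds; "today" differs only by one outlier.
yesterday = [250, 260, 270, 280, 300, 310, 320, 330, 340, 350]
today = yesterday[:-1] + [9000]

for label, samples in (("yesterday", yesterday), ("today", today)):
    print(label, statistics.mean(samples), statistics.median(samples))
# The mean jumps from roughly 301 to roughly 1166, while the median stays at 305
# in both cases -- the "spike" says nothing about the typical page load.
```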
54 | 55 | But what if, instead, we had a system that _accurately_ informed front-line 56 | engineers and front-line managers when there was a significant difference in 57 | median (or 90th percentile) page-load latency between one release of a server 58 | and another? Or even better, if it could analyze changes between those two 59 | releases and tell you specifically which change introduced the latency? That's a 60 | much harder engineering problem, and may or may not be realistically possible. 61 | The point, though, is that that is an _insight_. It does its best to point 62 | specific individuals (in this case, the front-line engineers who own the system) 63 | toward specific work. 64 | 65 | To be clear, if you want to develop a system that provides insights into 66 | latency, there are many levels at which you should do this and many different 67 | stakeholders who want to know many different things. These are just two examples 68 | to compare the difference between mystery and insight. 69 | 70 | ## Example: Free-text Feedback 71 | 72 | One of the biggest sources of mystery is free-text answers in surveys. If there 73 | are only a few answers that are relevant to your work, it's possible to read all 74 | of them. But once you have thousands of free-text answers, it's hard to process 75 | them all and make a decision based off of them. If you just give an [engineering 76 | leader](audiences.md) a spreadsheet with 1000 free-text answers in it, you have 77 | created a problem and a mystery for them. You have to process them _somehow_ in 78 | order for the data to become _understandable_. 79 | 80 | That's a reasonable way to think about the job of our team, by the way: make 81 | data understandable. 82 | 83 | In this specific instance, there are lots of ways to make it understandable. 84 | Each of these leaves behind different types of mysteries. For example: 85 | 86 | ### Categorization 87 | 88 | One common way of understanding free-text feedback is to have a person go 89 | through the negative comments and categorize them according to categories that 90 | they determine while reading the comments. Then, you count up the number of 91 | comments in each category and display them to [engineering 92 | leadership](audiences.md) as a way to decide where to assign engineers. 93 | 94 | However, this leaves behind mysteries like, "_What_ were the people complaining 95 | about, specifically? What was the actual problem with the system that we need to 96 | address?" For example, it might say that "code review" was the problem. But what 97 | _about_ code review? If you're an engineer working on the code review system, 98 | you need those answers in order to do effective work. Engineering leaders also 99 | might want to know some specifics, so they understand how much work is involved 100 | in fixing the problem, who needs to be assigned to it, etc. 101 | 102 | This could be solved by further analysis that summarizes the free-text feedback. 103 | Also, usually the team that works on the tool itself (the code review tool, in 104 | our example here) wants to see all of the raw free-text comments that relate to 105 | their tool. 106 | 107 | ### Algorithmic Analysis 108 | 109 | There are various programmatic ways to analyze free-text feedback. You could use 110 | a Machine Learning system to do "sentiment analysis" that attempts to determine 111 | how people feel about various things and pull out the relevant data. You could 112 | use an LLM to summarize the feedback. 
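As a deliberately crude illustration of the programmatic end of this spectrum, even a simple keyword pass can turn a pile of free-text comments into counts per theme. The comments, categories, and keywords here are all invented, and a real system would be far more careful:

```python
from collections import Counter

comments = [
    "Code review takes days to get a response",
    "My builds are painfully slow",
    "Flaky tests keep blocking my PRs",
    "Build times doubled since last quarter",
]
buckets = {"code review": ["review"], "builds": ["build"], "tests": ["test", "flaky"]}

counts = Counter()
for comment in comments:
    text = comment.lower()
    for bucket, keywords in buckets.items():
        if any(keyword in text for keyword in keywords):
            counts[bucket] += 1

print(counts.most_common())  # e.g. [('builds', 2), ('code review', 1), ('tests', 1)]
```

A pass like this can suggest roughly where complaints cluster, but it says nothing about _what_ is wrong with builds, which is exactly the kind of mystery discussed below.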
113 | 
114 | Each of these leaves behind some degree of mystery. Readers usually wonder how
115 | accurate the analysis is. Summaries and sentiment analysis often leave out
116 | specifics that teams need in order to fully understand the feedback.
117 | 
118 | That said, these methods can be sufficient for certain situations and for
119 | certain [audiences](audiences.md), like when you just want to know the general
120 | area of a problem and can accept some inaccuracy or lack of detail.
121 | 
122 | ## Trustworthiness of Insights
123 | 
124 | The insights that you provide **must be _trustworthy_**. You do _not_ want to
125 | train your users to ignore the insights that you provide. If the insights you
126 | provide are wrong often enough, your users will learn to distrust them. Avoiding
127 | false insights is one of the _most important_ duties of any system that
128 | generates insights and data, because providing too much false insight for too
129 | long can destroy all the usefulness of your system.
130 | 
131 | To be clear: _if your system frequently provides false insights to its users, it
132 | would have been better if you hadn't made the system at all_, because you will
133 | have spent a lot of effort to give people a system that confuses them,
134 | frustrates them, takes up their time, and which they eventually want to abandon
135 | and just "do it themselves."
136 | 
137 | This isn't just a one-time thing you have to think about when you first write a
138 | system for generating insights. Your own monitoring, testing, and maintenance of
139 | the system should assure that it continues to provide trustworthy insights to
140 | its users throughout its life.
141 | 
142 | ## Accuracy of Data
143 | 
144 | The most important attribute of **data** is that it needs to be as accurate as
145 | reasonably possible. We often combine data from many different sources in order
146 | to create insights. If each of these data sources is inaccurate in different,
147 | significant ways, then it’s impossible to trust the insights we produce from
148 | them. Inaccuracy (and compensating for it) also makes life very difficult and
149 | complex for people doing data analysis--the data that you are providing now
150 | creates a problem for its consumers rather than solving a problem for them.
151 | 
152 | The degree of accuracy required depends on the
153 | [purpose](data-collection-principles.md) for which the
154 | data will be used. If you don’t know how it’s going to be used, then you should
155 | make the data as accurate as you can reasonably accomplish with the engineering
156 | resources that you have.
157 | 
158 | If there is anything about your data that would not be **obvious to a casual
159 | viewer** (such as low accuracy in some areas), then you should publish that fact
160 | and make it known to your users somehow. For example, if you have a system that
161 | is accurate for large sample sizes but inaccurate for small sample sizes, it
162 | should say so on the page that presents the data, or it should print a warning
163 | (one that actually makes sense and explains things to the user) any time it's
164 | displaying information about small sample sizes.
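As one minimal sketch of that kind of warning (the 1 - 5 satisfaction scores and the threshold of 30 responses are arbitrary placeholders, not recommendations):

```python
# Arbitrary placeholder threshold; a real system would justify this number.
MIN_RESPONSES = 30

def format_satisfaction(persona, scores):
    average = sum(scores) / len(scores)
    line = f"{persona}: average satisfaction {average:.1f} (n={len(scores)})"
    if len(scores) < MIN_RESPONSES:
        line += "  [low sample size -- treat this as a rough indication only]"
    return line

print(format_satisfaction("Backend developers", [4, 5, 3, 4, 4] * 12))  # n=60, no warning
print(format_satisfaction("Data scientists", [2, 4, 3, 5]))             # n=4, warning shown
```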
165 | 
166 | Next: [Qualitative vs Quantitative Measurements](qualitative-vs-quantitative.md)
167 | 
--------------------------------------------------------------------------------
/audiences.md:
--------------------------------------------------------------------------------
1 | # Audiences: Always Know Who Your Data Is For
2 | 
3 | When designing any system for metrics or data collection, you need to know who
4 | your _audience_ is. Otherwise, it is hard to get _action_ to be taken on the
5 | data.
6 | 
7 | For example, let's take the metric "number of pastries consumed by employees."
8 | If you're the pastry chef at the company cafeteria, that's an interesting
9 | metric. If you're the head of Engineering, it's not an interesting metric.
10 | 
11 | Ultimately, all our work serves _people_. Any organization is basically just an
12 | agreement between people. The pieces of the organization that actually exist in
13 | the physical universe are material objects (buildings, computers, etc.) and
14 | _individuals_. We serve individuals.
15 | 
16 | For engineering metrics and feedback systems, there are a few broad categories
17 | of individuals that we serve, who have very different requirements. How we
18 | provide [data and insights](data-vs-insights.md) to these audiences is very
19 | different for each audience.
20 | 
21 | **When you propose a metric, system, or process, you should always say which of
22 | the groups below you are serving and how you are serving them.**
23 | 
24 | - [Front-Line Developers](#front-line-developers)
25 | - [Front-Line Managers](#front-line-managers)
26 | - [Engineering Leadership](#engineering-leadership)
27 | - [Tool Owners](#tool-owners)
28 | - [Productivity Champions](#productivity-champions)
29 | - [Levels of Metrics](#levels-of-metrics)
30 | 
31 | ## Front-Line Developers
32 | 
33 | A "front-line developer" is any person who directly writes or reviews code.
34 | 
35 | Developers are best served by delivering [insights](data-vs-insights.md) to them
36 | within their natural workflow. For example, insights that directly help them
37 | during a code review, displayed in the code review tool. Imagine if we could
38 | tell a developer, "This change you are about to make will increase the build
39 | time of your codebase by 50%." Those are the sort of insights that most help
40 | developers---actionable information displayed right when they can act on it.
41 | 
42 | Developers also need [data](data-vs-insights.md) when they are making decisions
43 | about how to do specific work (such as "what's the most important performance
44 | problem to tackle for my users?" or "how many users are being affected by this
45 | bug?"). Developers do not usually need [metrics](goals-signals-metrics.md) in
46 | dashboards. Instead, they need analytical tools---systems that allow them to
47 | dive into data or debug some specific problem.
48 | 
49 | Note that this is actually the _largest_ group of people that we serve, and
50 | the place where we can often make the most impact. It’s easy to
51 | assume that because Directors and VPs are "important people," it is more
52 | important to serve them, but **we move more of the organization and make more
53 | change by providing actionable insights to developers who are doing the actual
54 | work of writing our systems**.
55 | 
56 | ## Front-Line Managers
57 | 
58 | By this, we mean managers who directly manage teams of developers.
Often, 59 | managers of managers have similar requirements to front-line managers, and so 60 | could also be covered as part of this audience. 61 | 62 | We provide [data](data-vs-insights.md) that front-line managers can use to form 63 | their own insights about what their team should be working on or how they should 64 | prioritize work on their team. Front-Line Managers usually have the time (and 65 | desire) to process the data relevant to their team and turn it into insights 66 | themselves. We do also provide _some_ insights that front-line managers can use 67 | directly to inform their decisions. 68 | 69 | Managers tend to have a regular cadence of meetings where they can look at 70 | dashboards, so putting [metrics](goals-signals-metrics.md) in dashboards is 71 | helpful for them. 72 | 73 | The information we provide for front-line managers should feed into decisions as 74 | small as "what should we work on for the next two weeks?" 75 | 76 | ## Engineering Leadership 77 | 78 | By "engineering leadership," we usually mean SVPs, VPs, Sr. Directors, and 79 | Directors in Engineering. This can also include very senior engineers at the 80 | company who will have similar requirements (though they also end up fitting into 81 | multiple other audiences, as well). Essentially, this category includes anybody 82 | who is distant from the day-to-day details of the thing they are in charge of. 83 | 84 | We provide [insights](data-vs-insights.md) to engineering leadership that 85 | either: 86 | 87 | 1. Allow them to choose what direction the organization should go in, or 88 | 2. Convince them what direction to go in based on sound reasoning we provide 89 | (which usually would mean an argument based on data). 90 | 91 | The result here should be that an engineering leader does one of these three 92 | things: 93 | 94 | 1. Tells actual people to go do actual work. 95 | 2. Decides that no work needs to be done. 96 | 3. Decides on the _prioritization_ of work--deciding _when_ work will be done, 97 | by who. 98 | 99 | Engineering Leaders often just need a system that shows them that a problem 100 | _exists_, so that they can ask for a more detailed investigation to be done by 101 | somebody who reports to them. 102 | 103 | The decisions made at this level are usually _strategic_ decisions--at the 104 | shortest, multi-week, at the longest, multi-year, and so the insights we provide 105 | to engineering leadership should guide decisions at that level. 106 | 107 | ## Tool Owners 108 | 109 | This is a manager or senior developer whose team works on developer tools or 110 | infrastructure. 111 | 112 | Tool Owners need [data and insights](data-vs-insights.md) that help them 113 | understand their users and how to make the most impact with their 114 | infrastructure. When in doubt, err on the side of data instead of insights, 115 | because the requirements of Tool Owners are complex, and they often can spend 116 | the time to dive into the data themselves. 117 | 118 | Tool Owners need analytical systems that allow them to dive into data to 119 | understand the specifics of tool usage, workflows, problems developers are 120 | having, etc. While Front-Line Developers need such tools to analyze their own 121 | codebases, Tool Owners need these tools for analyzing _the whole company_. 122 | 123 | For example, an owner of the build tool might need to ask, "Which team is having 124 | the worst build experience?" 
They would need to be able to define what "worst
125 | build experience" means themselves (so basically, just slicing the data any way
126 | they need to). Then, they would need to understand the detailed specifics of
127 | what is happening with individual builds, so their teams can write code to solve
128 | the problem.
129 | 
130 | Tool Owners also benefit from having [metrics](goals-signals-metrics.md) in
131 | dashboards. They may need many metrics (maybe 10 or more) to be able to
132 | understand if the experience of their users is getting better or worse. At any
133 | given time, their teams might be focused on one or two metrics, but usually the
134 | front-line developers _within_ the Tool Owner team will need to look at other
135 | metrics to do complete investigations.
136 | 
137 | ## Productivity Champions
138 | 
139 | This is a person who cares about the developer productivity of a team or a set
140 | of teams, and takes it as their responsibility to do something about it
141 | directly, advise a senior executive what to do about it, or encourage other
142 | teams to take action on it.
143 | 
144 | We don’t think of this person as a Tool Owner (even though one person may on
145 | rare occasion be both a Tool Owner and a Productivity Champion).
146 | 
147 | This audience has the most complex set of requirements. Essentially, they need
148 | all the tools of all the other audiences, combined. They need to look at broad
149 | overviews of data to understand where there are problems, and then they need
150 | detailed dashboards and analytical tools to dive into the specifics of the
151 | problem, themselves.
152 | 
153 | ## Levels of Metrics
154 | 
155 | When designing [metrics](goals-signals-metrics.md), there is another thing to
156 | think about besides the above audiences, which is "what level of the org chart
157 | is this audience at?" For example, the entire developer tools org might have a
158 | set of metrics that measure the overall success of that org. However, individual
159 | teams within that org would have more detailed, lower-level metrics.
160 | 
161 | In general, teams should know what their top-level metrics are---what are the
162 | most important metrics that they are driving and which measure how well they are
163 | achieving their [goals](goals-signals-metrics.md). There can be many top-level
164 | metrics, so long as they are a good representation of the aggregate
165 | accomplishments of the team toward the team's goals.
166 | 
167 | There can be many other metrics that a team has besides their top-level metrics.
168 | Front-line developers might need a large set of metrics to be able to judge the
169 | effectiveness of their changes or what area should be worked on next. The
170 | reasons to have top-level metrics are:
171 | 
172 | 1. So that a team can focus on specific numbers that they are driving---focusing
173 | a team around "these are the numbers we are driving" is one of the most
174 | effective ways to get action taken on metrics.
175 | 2. So that people don't have to look at _so many_ graphs that each graph
176 | individually becomes meaningless. No matter how smart a leader is, they can't
177 | look at 100 graphs _simultaneously_ and make any sensible decision about
178 | them. They _could_ look at 10 graphs, though.
179 | 
180 | We call the top-level metrics of a team the "Key Impact Metrics."
181 | 
182 | Next: [Driving Decisions With Data](driving-decisions.md)
183 | 
--------------------------------------------------------------------------------
/metrics-and-performance-reviews.md:
--------------------------------------------------------------------------------
1 | # Metrics and Performance Reviews
2 | 
3 | It is dangerous to use numbers representing the _volume of output_ of a software
4 | engineer to determine their job performance—numbers like "lines of code
5 | produced," "number of changes submitted to the repository," "number of bugs
6 | fixed," etc. This doc explains why and offers some alternative suggestions for
7 | how to understand and manage the performance of software engineers.
8 | 
9 | - [Why Is This Dangerous?](#why-is-this-dangerous)
10 | - [Perverse Incentives](#perverse-incentives)
11 | - [Doesn’t Actually Do What You Want](#doesnt-actually-do-what-you-want)
12 | - [Clouding the Metrics](#clouding-the-metrics)
13 | - [What Do You Do Instead?](#what-do-you-do-instead)
14 | - [Examples of Good Metrics](#examples-of-good-metrics)
15 | 
16 | ## Why Is This Dangerous?
17 | 
18 | ### Perverse Incentives
19 | 
20 | When you measure the output of a software engineer instead of their impact, you
21 | create perverse incentives for software engineers to behave in ways that
22 | ultimately damage your business.
23 | 
24 | Let’s say that you see that a software engineer has submitted very few pieces of
25 | code to the repository this quarter. If you use this to manage their
26 | performance, you might say, "Let’s figure out how you can submit more code."
27 | Sometimes, this will result in good behaviors—perhaps they were spending too
28 | long on one large change that they will now submit as three smaller changes,
29 | making everybody happier. Nice.
30 | 
31 | However, there’s absolutely no guarantee that this _will_ result in good
32 | behavior, because all you’re measuring is this absolute number that doesn’t
33 | actually tie back to the business impact that the developer is having. The same
34 | person, at a different time, could look at a change they have that really
35 | _should_ be submitted all at once, and split it into 100 different changes just
36 | because it will cause their numbers to go up. That causes a ton of unnecessary
37 | work for them and their code reviewer, but it looks great on their performance
38 | review. Then they may even sit back and delay the rest of their work until
39 | _next_ quarter because they "already got their change count up." This may sound
40 | extreme, but it is actually very rational behavior for a person whose primary
41 | concern is their performance review and who knows specifically that this metric
42 | is being used as an important factor in that review.
43 | 
44 | ### Doesn’t Actually Do What You Want
45 | 
46 | It’s assumed that we are managing the performance of software engineers because
47 | we have some goal as a business, and software engineers are here to contribute
48 | to that goal. So we are managing their performance to make sure that everybody
49 | contributes to that goal as much as possible and that the business succeeds.
50 | 
51 | The goal of the business is not "to write code." Thus, when you measure how
52 | _much_ code somebody is writing, you don’t actually get a sense of how much they
53 | are contributing to the business. Sometimes, you do get a general idea—person X
54 | writes 100 changes a month and person Y writes 4. But honestly, that’s not even
55 | meaningful.
What if those 4 changes required a ton of background research and 56 | made the company a million dollars, while those 100 changes were all sloppily 57 | done and cost the company a million dollars in lost productivity and lost users? 58 | 59 | The same happens with measuring "how many bugs got fixed." It’s possible that 60 | somebody spent a lot of time fixing one bug that was very valuable, and somebody 61 | else spent the same time fixing 10 bugs that didn’t actually matter at all, but 62 | were easy to handle. The point here isn’t how much time got spent—it’s about 63 | how valuable the work ended up being. If you measure _how much effort_ people 64 | are putting into things, you will get an organization whose focus is on _making 65 | things more difficult_ so that they can show you _how hard it was to solve their 66 | problems_. 67 | 68 | ### Clouding the Metrics 69 | 70 | All of the metrics that we have around our developer tools, such as volume of 71 | code reviews, speed of code reviews, speed of builds, etc. have definite 72 | purposes that have nothing to do with individual performance reviews. Taken in 73 | aggregate across a large number of developers, these numbers show trends that 74 | allow business leaders and tool developers to make intelligent decisions about 75 | how best to serve the company. When you look at a group of 1000 developers, the 76 | differences between "I worked on one important thing for 100 hours" and "I 77 | worked on 100 unimportant things for 1 hour each" all even out and fade away, 78 | because you’re looking at such a large sample. So you can actually make 79 | intelligent statements and analysis about what’s happening, at that scale. If 80 | the volume of code reviews drops _for the whole company_ for a significant 81 | period of time, that’s something we need to investigate. 82 | 83 | However, if you make people behave in unusual ways because their _individual 84 | performance_ is being measured by the same metrics, then suddenly it’s hard for 85 | business leaders to know if the numbers they are looking at are even _valid_. 86 | How do we know if code review volume is going up for some good reason related to 87 | our tooling improvements, or just because suddenly everybody started behaving in 88 | some weird way due to their performance being measured on these numbers? Maybe 89 | code review times went down on Team A because their performance was measured on 90 | that, so they all started rubber-stamping all code reviews and not really doing 91 | code review. But then some executive comes along and says, "Hey, Team A has much 92 | lower code review times than our other teams, can we find out what they are 93 | doing and bring that practice to other teams?" Obviously, in this situation what 94 | would really happen is that we would find out what Team A was doing and would 95 | correct it. But ideally this confusion and investigation from leadership would 96 | never have to happen in the first place, because nobody should be measuring the 97 | performance of individual software engineers based on such a metric. 98 | 99 | ## What Do You Do Instead? 100 | 101 | There are two types of measurements that you can use, qualitative (subjective) 102 | measurements, like surveys and talking with your reports, and quantitative 103 | (objective) measurements, like numbers on graphs. This document is mostly about 104 | the quantitative side of things, because we’ve found a particular problem with 105 | that. We won’t cover the qualitative aspects here. 
106 | 107 | If you really want quantitative measurements for your team or developers, the 108 | best thing to do is to figure out the goal of the projects that they are working 109 | on, and determine a metric that measures the success of that goal. This will be 110 | different for every team, because every team works on something different. 111 | 112 | The truth is, "programming" is a _skill_, not a _job._ We wouldn’t measure the 113 | _performance of a skill_ to understand the _success of a job._ Let me give an 114 | analogy. Let’s say you are a carpenter. What’s the metric of a carpenter? You 115 | can think about that for a second, but I’ll tell you, you won’t come up with a 116 | good answer, because it’s a fundamentally bad question. I have told you that a 117 | person has a _skill_ (carpentry) but I haven’t actually told you what their 118 | _job_ is. If their job is that they own a furniture shop that produces custom 119 | furniture, then the success there is probably measured by "furniture delivered" 120 | and "income produced greater than expenses." But what if they are a carpenter on 121 | a construction job site? Then their success is probably measured by "projects 122 | that are complete and have passed inspection." As you can see, a skill is hard 123 | to measure, but a _job_ is something that you can actually understand. 124 | 125 | So to measure the success of a programmer, you have to understand what their job 126 | is. Hopefully, it’s tied to some purpose that your team has, which is part of 127 | accomplishing the larger purpose of the whole company somehow. That purpose 128 | results in some product, that product has some effect, and you can measure 129 | something about that product or that effect. 130 | 131 | Yes, sometimes it’s hard to tie that back to an individual software engineer. 132 | That’s where your individual judgment, understanding, communication, and skills 133 | as a manager come into play. It’s up to you to understand the work that’s 134 | actually being done and how that work affected the metrics you’ve defined. 135 | 136 | And yes, it’s also possible to design success metrics for your work that are 137 | hard to understand, difficult to take action on, or that don’t really convince 138 | anybody. There is a [whole system of designing and using 139 | metrics](goals-signals-metrics.md) that can help get around those problems. 140 | 141 | ### Examples of Good Metrics 142 | 143 | These are just examples. There could be as many metrics in the world as there 144 | are projects. 145 | 146 | **User-Facing Project:** Usually, your work is intended to impact some business 147 | metric. For example, maybe you’re trying to improve user engagement. You 148 | can do an experiment to prove how much your work affects that business metric, 149 | and then use that impact as the metric for your work, as long as you can see 150 | that same impact after you actually release the new feature. Bugs could be 151 | thought of as impacting some metric negatively, even if it’s just user 152 | sentiment, and thus one can figure out a metric for bug fixing that way. 153 | 154 | **Refactoring Projects:** Let’s say that you have an engineer who has to 155 | refactor 100 files across 25 different codebases. You could measure how many of 156 | those refactorings are done. You could count it by file, by codebase, or by 157 | whatever makes sense. You could also get qualitative feedback from developers 158 | about how much easier the code was to use or read afterward. 
Some refactoring
159 | projects improve reliability or other metrics, and can be tracked that way. It
160 | just depends on what the intent is behind the project.
161 | 
162 | Next: [Productivity Concepts for Software Developers](productivity-concepts.md)
--------------------------------------------------------------------------------
/developer-personas.md:
--------------------------------------------------------------------------------
1 | # Developer Personas
2 | 
3 | We segment developers into "personas" based on their development workflow.
4 | 
5 | - [Why Personas?](#why-personas)
6 | - [How to Define Personas](#how-to-define-personas)
7 | - [Define the Categories](#define-the-categories)
8 | - [Sub-Personas](#sub-personas)
9 | - [Initial Research](#initial-research)
10 | - [Categorizing Developers Into Personas](#categorizing-developers-into-personas)
11 | - [Example Personas](#example-personas)
12 | - [What To Do With Personas](#what-to-do-with-personas)
13 | - [What About Using Personas For Quantitative Metrics?](#what-about-using-personas-for-quantitative-metrics)
14 | 
15 | ## Why Personas?
16 | 
17 | It's very easy to assume, as somebody who works on developer productivity, that
18 | one knows all about software development—after all, one is a software developer!
19 | However, it turns out that different types of development require very different
20 | workflows. If you've never done mobile development, web development, or ML
21 | development, for example, you might be very surprised to learn how different the
22 | workflows are!
23 | 
24 | One of the most common mistakes that developer productivity teams make is only
25 | focusing on the largest group of developers at the company. For example, many
26 | companies have _far_ more backend server engineers than they have mobile frontend
27 | engineers, and so they assume that most (or all) of the developer productivity
28 | work should go toward those backend engineers.
29 | 
30 | What this misses out on is the _importance to the business_ of the different
31 | types of developers at the company. You might only employ a few
32 | mobile engineers, but how much impact does their work have for your customers?
33 | Similarly, Machine Learning Engineers have a _very_ different workflow and set
34 | of pain points from backend server developers—in fact, there's often no overlap
35 | at all in their pain points and commonly-used tools. But in many companies
36 | today, machine learning and AI are key to the success of their business.
37 | 
38 | If your developer productivity team has been focusing on only one type of
39 | developer, and it seems like some parts of the company are very upset with you,
40 | this might be why.
41 | 
42 | ## How to Define Personas
43 | 
44 | ### Define the Categories
45 | 
46 | We segment developers into _broad_ categories by their _workflow_. Obviously,
47 | each developer works in a slightly different way. But you will find groups that
48 | have a lot in common in terms of the *type* of work that they do.
49 | 
50 | For example, you may see that there is a large group of developers who all work
51 | in Java making "backend" servers. They might use different frameworks in Java,
52 | different editors, different CI systems, or even different deployment platforms.
53 | But you'll find that a _lot_ of their tooling is common within the group, or
54 | fits into 2-3 categories (like one team uses The New CI System and another team uses
55 | The Old CI System).
56 | 
57 | Since we use this data for survey analysis (as described below), we try to have
58 | our Persona groups be large—at least 200 people, so that if only a small
59 | percentage respond to the survey, we can still get statistically significant
60 | data about the persona as a whole. Of course, if your company is smaller, you
61 | might be doing interviews instead of surveys, in which case the personas should
62 | be whatever size makes sense for you. The key is: don't have _too many_
63 | personas, because that makes survey analysis hard. We currently have ten
64 | developer personas for an engineering org with thousands of people in it, and
65 | that number has seemed manageable.
66 | 
67 | Of course, if you have a small or medium-sized engineering team, you won't be
68 | able to get groups of 200 people (that might be larger than your whole
69 | engineering team!). If so, segment them out by the workflows that are most
70 | important to the business.
71 | 
72 | #### Sub-Personas
73 | 
74 | People often ask if they can split the personas down even further and have
75 | "sub-personas." Sure, you can totally do that. For example, you might have one
76 | overall persona for people whose job it is to maintain production systems
77 | (called SREs, DevOps Engineers, or System Administrators). However, you might
78 | have one large sub-group that works on creating _tools_ for production
79 | management, and another large sub-group that works on handling major incidents
80 | in production. Those could be two sub-personas, because their workflows and
81 | needs are very different, even though they have _some_ things in common.
82 | 
83 | The key here is to not make your survey analysis too complex. If the person
84 | doing the survey analysis thinks there is value in separating out feedback and
85 | scores for the SRE persona into these two categories, that's fine. However, you
86 | might want to still review both of these sub-personas in the same meeting or
87 | document or whatever process you decide on for doing your survey analysis.
88 | 
89 | ### Initial Research
90 | 
91 | Once you have some idea of what categories exist, you'll likely want to do some
92 | initial research into the current workflows of those engineers. Don't go too
93 | overboard with this. It often is enough just to interview a set of engineers
94 | (not just managers or executives) who are members of that persona. You'll want
95 | to ask them about their workflows and what parts of that workflow are the most
96 | frustrating. This should be a live dialogue, not an email or a survey, so that
97 | you can ask follow-up questions to clarify. You will also learn what questions
98 | you should ask this persona in future surveys.
99 | 
100 | Then you can synthesize the collected information into a document, and present
101 | this document to the relevant stakeholders.
102 | 
103 | One thing that can be useful here is simply describing what the usual workflow
104 | is for this type of developer. This is useful because when a tool developer
105 | wants to support all the personas at the company, they can start off by reading
106 | the descriptions that you've written, instead of having to figure out those
107 | workflows themselves all over again.
108 | 
109 | You'll also want to try to get a count of how many people are in each persona,
110 | even just an approximation, as this question comes up frequently in our
111 | experience.
112 | 113 | ### Categorizing Developers Into Personas 114 | 115 | Now you will need some sort of system that categorizes developers into personas. 116 | That is, some system that will tell you what personas a person is part of. Don't 117 | create a system where managers or engineers have to fill out this data manually. 118 | It will get out of date or be missing data. Instead, there are multiple data 119 | sources you can combine to figure this out: 120 | 121 | * A person's title. 122 | * A person's management chain. 123 | * The "cost center" that they belong to (usually tracked by your Finance team). 124 | * Which systems / tools they use (possibly taking into account _how much_ they 125 | use them). 126 | * What codebases or file types they contribute to (or review). 127 | 128 | When you're starting off, it is simplest to start off with whatever data is 129 | easiest to get, and accept some percentage of inaccuracy in the system. (For 130 | example, we accepted a 10% inaccuracy in the early days of our Personas system, 131 | and it didn't cause any real problems.) 132 | 133 | Note: One developer can have multiple personas, that's fine. Sometimes people 134 | ask if we should have a "master" persona for each person, but _usually_ when we 135 | investigate, we discover the requester has misunderstood the purpose of 136 | personas, or they are trying to work around a limitation in some other system. 137 | Usually when a person has multiple personas, it's because they genuinely work in 138 | all those workflows. Of course, it's always possible a valid use case for the 139 | "master persona" concept comes up at some point in the future. 140 | 141 | #### Example Personas 142 | 143 | Here are some of the personas we have defined at LinkedIn: 144 | 145 | * Backend Developer 146 | * Data Scientist 147 | * Machine Learning / AI Engineer 148 | * Android Developer 149 | * iOS Developer 150 | * SRE 151 | * Tools Developer 152 | * Web Developer 153 | 154 | Plus we have two other personas that are unique to how our internal systems 155 | work (they are not included here because that would require too much explanation 156 | for too little value). 157 | 158 | ## What To Do With Personas 159 | 160 | For us, the Developer Personas system is used primarily as part of our survey 161 | analysis, to split out pain points by developer persona. 162 | 163 | We have a person called a "Persona Champion" who is a member of that persona 164 | (for example, for the "Backend Developer" persona, the Persona Champion is a 165 | backend developer). They do the analysis of the comments and the survey scores, 166 | and work with infrastructure teams to help them understand the needs of that 167 | Persona. 168 | 169 | It is helpful to have a person who is familiar with the tools and workflow of 170 | the persona doing the analysis, because they can pick up important details that 171 | others will miss, and they have the context to "fill in the gaps" of vague or 172 | incomplete comments. (Or, they know who they can go talk to to fill in those 173 | gaps, because they know other developers who share their workflow.) 174 | 175 | For more about persona champions, see the [detailed description of their 176 | duties](persona-champions.md) (which go beyond just survey analysis). 177 | 178 | ### What About Using Personas For Quantitative Metrics? 179 | 180 | We have found minimal value in splitting our quantitative metrics by persona. 
181 | Even when it seems like you would want to do that, there is usually another 182 | dimension that would be better for doing analysis of the quantitative metrics. 183 | 184 | For example, let's say you want to analyze build times for Java Backend 185 | Developers vs JavaScript Web Developers. Splitting the data by persona would 186 | get you confusing overlaps—you might have some full-stack engineers who write 187 | both Java and JavaScript. It makes the data confusing and hard to analyze. 188 | 189 | What you would want to do instead in that situation is analyze the build speed 190 | based on what language is being compiled. Then you would have actionable data 191 | that the owners of the build tools could use to speed things up, as opposed to 192 | a confusing mash of mixed signals. 193 | 194 | Next: [Persona Champions](persona-champions.md) 195 | -------------------------------------------------------------------------------- /data-collection-principles.md: -------------------------------------------------------------------------------- 1 | # Data Collection Principles 2 | 3 | Teams who make metrics and do data analysis often wonder: how much data should I 4 | collect? How long should I retain it for? Are there important aspects of how I 5 | should structure the data? 6 | 7 | In general, the concepts here are: 8 | 9 | 1. Collect as much data as you possibly can. This becomes the "lowest layer" of 10 | your data system. 11 | 2. Refine that, at higher "layers," into data stores designed for specific 12 | purposes. 13 | 3. Always be able to understand how two data points connect to each other, 14 | across any data sets. (That is, always know what your "join key" is.) 15 | 16 | For example, imagine that you are collecting data about your code review tool. 17 | 18 | Ideally, you record every interaction with the tool, every important touchpoint, 19 | etc. into one data set that knows the exact time of each event, the type of 20 | event, all the context around that event, and important "join keys" that you 21 | might need to connect this data with other data sources (for example, the ID of 22 | commits, the IDs of code review requests, the username of the person taking 23 | actions, the username of the author, etc.). 24 | 25 | Then you figure out specific business requirements you have around the data. For 26 | example, you want to know how long it takes reviewers to respond to requests for 27 | review. So you create a higher-level data source, derived from this "master" 28 | data source, which contains only the events relevant for understanding review 29 | responses, with the fields structured in such a way that makes answering 30 | questions about code review response time really easy. 31 | 32 | That's a basic example, but there's more to know about all of this. 33 | 34 | - [When In Doubt, Collect Everything](#when-in-doubt-collect-everything) 35 | - [Purpose](#purpose) 36 | - [Boundaries, Intentions, and Join Keys](#boundaries-intentions-and-join-keys) 37 | 38 | ## When In Doubt, Collect Everything 39 | 40 | It is impossible to go back in time and instrument systems to answer a question 41 | that you have in the present. It is also impossible to predict every question 42 | you will want to answer. You must _already have_ the data. 43 | 44 | As a result, you should strive to collect every piece of data you can collect 45 | about every system you are instrumenting. The only exceptions to this are: 46 | 47 | 1. 
Don't collect so much data that it becomes extremely expensive to store, or
48 | _very_ slow to query. Sometimes there are low-value events that you can
49 | pre-aggregate, only store a sample of, or simply not track at all. Note
50 | that you have to actually prove that storage would be expensive or
51 | slow---often, people _believe_ something will be expensive or slow when it's
52 | really not. Storage is cheap and query systems can be faster than you expect.
53 | 2. Some data has security or privacy restrictions. You need to work with the
54 | appropriate people inside the company to determine how this data is supposed
55 | to be treated.
56 | 
57 | As an extreme example, you could imagine a web-logging system that stored the
58 | entirety of every request and response. After all, that's "everything!" But it
59 | would be impossible to search, impossible to store, and an extremely complex
60 | privacy nightmare.
61 | 
62 | The only other danger of "collecting everything" is storing the data in such a
63 | disorganized or complicated way that you can't make any sense of it. You can
64 | solve that by keeping in mind that no matter what you're doing, you always want
65 | to produce [insights](data-vs-insights.md) from the data at some point. Keep in
66 | mind a few questions that you know people want to answer, and make sure that
67 | it's at least theoretically possible to answer those questions with the data
68 | you're collecting, with the fields you have, and with the format you're storing
69 | the data in.
70 | 
71 | If your data layout is well-thought-out and provides sufficient coverage to
72 | answer almost any question that you could imagine about the system (even if it
73 | would take some future work to actually understand the answers to those
74 | questions), then you should be at least somewhat future-proof.
75 | 
76 | ## Purpose
77 | 
78 | In general, you always want to have some idea of why you are collecting data. At
79 | the "lowest level" of your data system, the telemetry that "collects
80 | everything," this is less important. But as you derive higher-level tables from
81 | that raw data, you want to ask yourself things like:
82 | 
83 | 1. What questions are people going to want to answer with this data?
84 | 2. _Why_ are people going to ask those questions? How does answering those
85 | questions help people? (This gives you insight into how to present the data.)
86 | 3. What dimensions might people want to use to slice the data?
87 | 
88 | This is where you take the underlying raw data and massage it into a format that
89 | is designed to solve specific problems. In general, you don't want to expose the
90 | underlying complex "collect everything" data store to the world. You don't even
91 | want to expose it directly to your dashboards. You want to have simpler tables
92 | _derived from_ the "everything" data store---tables that are designed for some
93 | specific purpose.
94 | 
95 | You can have a hierarchy of these tables. Taking our code review example:
96 | 
97 | 1. You start off with the "everything" table that contains time series events of
98 | every action taken with the tool.
99 | 2. From that, you derive a set of tables that let you view just comments,
100 | approvals, and new code pushes, with the "primary key" being the Code Review
101 | ID (so it's easy to group these into actions that happened during a
102 | particular code review process).
103 | 3. Then you want to make a dashboard that shows how quickly reviewers responded
104 | on each code review.
You could actually now make one table that just contains
105 | the specific derived information the dashboard needs.
106 | 
107 | You'll find that people rarely want to directly query the table from Step 1
108 | (because it's hard to do so), sometimes want to query the table from Step 2, and
109 | that the table from Step 3 becomes a useful tool in and of itself, even beyond
110 | just the dashboard. That is, the act of creating a table specifically for the
111 | dashboard makes a useful data source that people sometimes want to query
112 | directly.
113 | 
114 | Sometimes, trying to build one of these purpose-built tables will also show you
115 | gaps in your data-collection systems. It can be a good idea to have one of these
116 | purpose-built tables or use cases in mind even when you're designing your
117 | systems for "collecting everything," because they can make you realize that you
118 | missed some important data.
119 | 
120 | In general, the better you know the requirements of your consumers, the better
121 | job you can do at designing these purpose-built tables. It's important to
122 | understand the current and potential requirements of your consumers when you
123 | design data-gathering systems. This should be accomplished by actual research
124 | into requirements, not just by guessing.
125 | 
126 | ## Boundaries, Intentions, and Join Keys
127 | 
128 | It should be possible to know when a large workflow starts, and when it ends. We
129 | should know that a developer intended something to happen, the steps involved in
130 | accomplishing that intention, when that whole workflow started, and when it
131 | ended. We should not have to develop a complex algorithm to determine these
132 | things from looking at the stored data. The stored data should contain
133 | sufficient information that it is very easy to answer these questions.
134 | 
135 | We need to be able to connect every event within a workflow as being part of
136 | that workflow, and we need to know its boundaries---its start point and end
137 | point.
138 | 
139 | For example, imagine that we have a deployment system. Here's a set of events
140 | that represent a _bad_ data layout:
141 | 
142 | 1. User Alice requested that we start a deployment workflow named "Deploy It" at
143 | 10:00.
144 | 2. Binary Foo was started on Host Bar at 10:05.
145 | 3. Binary Baz was started on Host Bar at 10:10.
146 | 4. Binary Baz responded "OK" to a health check at 10:15.
147 | 5. Binary Foo responded "OK" to a health check at 10:20.
148 | 
149 | We have no idea that "Deploy It" means to deploy those two binaries. What if
150 | there are a hundred simultaneous workflows going on? What if "Deploy It" has
151 | been run more than once in the last five minutes? We have no idea that those
152 | health checks signal the end of the deployment. In fact, _do_ they signal the
153 | end of the deployment? Are there other actions that "Deploy It" is supposed to
154 | do? I'm sure the author of "Deploy It" knows the answers to that, but we, a
155 | central data team, have _no way_ of knowing that, because it's not recorded in
156 | the data store.
157 | 
158 | A better data layout would look like:
159 | 
160 | 1. User Alice started the deployment workflow "Deploy It" at 10:00. This
161 | indicates an intent to deploy Binary Foo and Binary Baz to 200 machines. We
162 | give this specific instance of the workflow an ID: "Deploy It 2543."
163 | 2. We record all of the actions taken by "Deploy It 2543," making sure to tag
164 | them in our data store with that ID.
165 | 3. The deployment workflow itself contains a configuration variable that allows 166 | users to specify how many hosts the machine must successfully deploy to 167 | before we consider the deployment "successful." For example, let's say this 168 | one requires only 175 hosts to be deployed to be "successful." Once 175 hosts 169 | are deployed, we record an event in the data store indicating successful 170 | completion of the deployment workflow. (We also record failure in a similar 171 | way, if the workflow fails, but noting that it's a failure along with any 172 | necessary details about the failure.) 173 | 174 | You don't have to figure out every workflow in advance that you might want to 175 | measure. When you know a workflow exists, record its start and end point. But 176 | even when you don't know that a workflow exists, make sure that you can always 177 | see _in the data store_ that two data points are related when they are related. 178 | For example, record that a merge is related to a particular PR. Record that a 179 | particular PR was part of a deployment. Record that an alert was fired against a 180 | binary that was part of a particular deployment. And so forth. Any two objects 181 | that _are_ related should be able to be easily connected by querying your data 182 | store. 183 | 184 | Next: [Principles and Guidelines for Metric Design](metric-principles.md) -------------------------------------------------------------------------------- /dph-goals-and-signals.md: -------------------------------------------------------------------------------- 1 | # Developer Productivity and Happiness: Goals and Signals 2 | 3 | This is a proposal, at the highest level, of what we try to measure to improve 4 | the productivity of engineering teams. This should help guide the creation of 5 | metrics and analysis systems. This is not a complete listing of every single 6 | thing that we want to measure regarding developer productivity. It is an 7 | aspirational description of _how_ we would ideally be measuring things and _how_ 8 | we should come up with new metrics. 9 | 10 | It is based on the [Goals, Signals, Metrics system](goals-signals-metrics.md). 11 | This document contains our primary Goals and Signals. 12 | 13 | - [Goals](#goals) 14 | - [Productive](#productive) 15 | - [Happy](#happy) 16 | - [Not Limited to Tools and Infrastructure](#not-limited-to-tools-and-infrastructure) 17 | - [Signals](#signals) 18 | - [Productive](#productive-1) 19 | - [Efficient](#efficient) 20 | - [Caution: Don't Compare Individual Efficiency](#caution-dont-compare-individual-efficiency) 21 | - [Effective](#effective) 22 | - [Over-Indexing On A Single Effectiveness Metric](#over-indexing-on-a-single-effectiveness-metric) 23 | - [Happy](#happy-1) 24 | 25 | ## Goals 26 | 27 | The simplest statement of our goal would be: 28 | 29 | Developers at LinkedIn are **productive** and **happy**. 30 | 31 | That could use some clarification, though. 32 | 33 | ### Productive 34 | 35 | What "productive" means isn't very clear, though—what are we actually talking 36 | about? So, we can refine "Developers at LinkedIn are productive" to: 37 | 38 | **Developers at LinkedIn are able to _effectively_ and _efficiently_ accomplish 39 | their _intentions_ regarding LinkedIn's _software systems_**. 40 | 41 | This is a more precise way of stating exactly what we mean by "productive." A 42 | person is productive, by definition, if they produce products efficiently. 
43 | "Efficiently" implies that you want to measure something about how long it takes 44 | or how often a product can be produced. So that means that we have to have a 45 | time that we start measuring from, and a time that we stop measuring at. The 46 | earliest point we can think of wanting to measure is the moment when the 47 | developer directly _intends_ to do something (as in, the intention they have 48 | right before they start to take action, not the first time they have some idle 49 | thought), and the last moment is when that action is totally complete. 50 | 51 | There are many different intentions a developer could have: gathering 52 | requirements, writing code, running tests, releasing software, creating a whole 53 | feature end-to-end, etc. They exist on different levels and different scopes. 54 | It's "fractal," basically—there are larger intentions (like "build a whole 55 | software system") that have smaller intentions within them (like "write this 56 | code change"), which themselves have smaller intentions within them ("run a 57 | build"), etc. 58 | 59 | The goal also states "effectively," because we don't just care that a result was 60 | _fast_, we care that the intention was _actually carried out_ in the most 61 | complete sense. For example, if I can release my software quickly but my service 62 | has 3 major production incidents a day, then I'm not as effective as I could be, 63 | even if I'm efficient. It's reasonable to assume that no developer intends to 64 | release a service that fails catastrophically multiple times a day, so that 65 | system isn't accomplishing a developer's intention. 66 | 67 | ### Happy 68 | 69 | Happy about what? Well, here's a more precise statement: 70 | 71 | **Developers at LinkedIn are happy with the tools, systems, processes, 72 | facilities, and activities involved in software development at LinkedIn.** 73 | 74 | ### Not Limited to Tools and Infrastructure 75 | 76 | Note that although the primary focus of our team is on Tools & Infrastructure, 77 | neither of these goals absolutely limit us to Tools & Infrastructure. If we 78 | discover that something outside of our area is impacting developer productivity 79 | in a significant way, like a facilities issue or process issue, we should feel 80 | empowered to raise that issue to the group that handles that problem. 81 | 82 | ## Signals 83 | 84 | Let's break down each goal by its parts. 85 | 86 | ### Productive 87 | 88 | We'll break this down into signals for "effective" and signals for "efficient." 89 | 90 | #### Efficient 91 | 92 | Essentially, we want to measure the time between when a developer has an 93 | intention and the time when they accomplish that intention. This is probably 94 | best framed as:  95 | 96 | **The time from when a developer starts taking an action to when they accomplish 97 | what they were intending to accomplish, with that action.** 98 | 99 | It's worth remembering that there are many different types of intentions and 100 | actions that a developer takes, some large, some small. Here are some examples 101 | of more specific signals that you might want to examine: 102 | 103 | * The time between when a developer encounters a problem or confusion and when 104 | they get the answer they are looking for (such as via docs, support, etc.). 105 | * How long it takes from when developers start writing a piece of code to when 106 | it is released in production. 107 | 108 | There are many other signals that you could come up with—those are just 109 | examples. 
110 | 111 | It's also worth keeping in mind: 112 | 113 | **The most important resource we are measuring, in terms of efficiency, is the 114 | amount of time that software engineers have to spend on something.** 115 | 116 | That's _sort of_ a restatement of the above signal, but it's a clearer way to 117 | think about some types of specific signals. In particular, it considers the 118 | whole sum of time spent on things. For example, these could be specific signals 119 | that you want to measure: 120 | 121 | - How much time developers actually spend waiting for builds each day. 122 | - How much time developers spend on non-coding activities. 123 | 124 | The general point is that we care more about the time that human beings have to 125 | spend (or wait) to do a task, and less about the time that machines have to 126 | spend. 127 | 128 | And there are many, many more signals that you could come up with in this 129 | category, too. 130 | 131 | #### Caution: Don't Compare Individual Efficiency 132 | 133 | Don't propose metrics, systems, or processes that are intended to rate the 134 | efficiency of individual engineers. We [aren't trying to create systems for 135 | performance ratings of employees](metrics-and-performance-reviews.md). We are 136 | creating systems that drive our developer productivity work. 137 | 138 | It's even worth thinking about whether or not your system _could_ be used this 139 | way, and either prevent it from being used that way in how the system is 140 | designed, or forbid that usage via cautions in the UI/docs. 141 | 142 | #### Effective 143 | 144 | Essentially, we want to know that when a developer tries to do something, they 145 | actually accomplish the thing they were intending to accomplish. Things like 146 | crashes, bugs, difficulty gathering information—these are all things that 147 | prevent an engineer from being effective. 148 | 149 | When problems occur in a developer's workflow, we want to know how often they 150 | occur and how much time they waste. Probably the best high-level signal for this 151 | would be phrased like: 152 | 153 | **The probability that an individual developer will be able to accomplish their 154 | intention successfully. (Or on the inverse, the frequency with which developers 155 | experience a failure.)** 156 | 157 | You have to take into account the definition of "success" appropriately for the 158 | thing you're measuring. For example, let's say a developer, for some reason, has 159 | the intention "I run all my tests and they pass." In that case, if a test fails, 160 | they've failed to accomplish their intention. But most of the time, a developer 161 | runs tests to know if the code they are working on is broken. So success would 162 | be, "I ran the tests and they only failed if I broke them." Thus, flaky tests, 163 | infrastructure failures, or being broken by a dependency would _not_ be the 164 | developer accomplishing their intention successfully. Defining the intention 165 | here is important. 166 | 167 | We use _probability_ here because what we care about is how individual engineers 168 | are actually impacted by a problem. For example, let's say you have a piece of 169 | testing infrastructure that's flaky 10% of the time. What does that actually 170 | mean for engineers? How often is an engineer impacted by that flakiness? Maybe 171 | the system mostly just runs tests in the background that affect very few people. 
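As a small illustration of measuring what developers actually experience rather than what the tool reports, here is a minimal sketch in Python (the test-run records are made up) that computes, for each developer, the fraction of their own test runs that were hit by a flaky failure:

```python
from collections import defaultdict

# Hypothetical records of test runs triggered by humans; "flaky" marks runs
# that failed for reasons unrelated to the developer's change.
runs = [
    {"developer": "alice", "flaky": False},
    {"developer": "alice", "flaky": True},
    {"developer": "alice", "flaky": False},
    {"developer": "bob",   "flaky": False},
    {"developer": "bob",   "flaky": False},
]

def flaky_impact_per_developer(runs):
    """Fraction of each developer's runs that hit a flaky failure."""
    totals = defaultdict(int)
    flaky = defaultdict(int)
    for run in runs:
        totals[run["developer"]] += 1
        flaky[run["developer"]] += 1 if run["flaky"] else 0
    return {dev: round(flaky[dev] / totals[dev], 2) for dev in totals}

print(flaky_impact_per_developer(runs))  # {'alice': 0.33, 'bob': 0.0}
```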
172 | 
173 | In order to define this probability appropriately, you have to know what group
174 | of engineers you're looking at. You could be looking at the whole company, or
175 | some specific [persona](developer-personas.md), area, org, or team.
176 | 
177 | Some specific examples of possible signals here would be:
178 | 
179 | - The probability that when a developer runs a test in CI, it will produce valid
180 | results (that is, if it fails, it's not because of flakiness).
181 | - The probability that a developer will run a build without the build tool
182 | crashing.
183 | - How often (such as the median number of times per day) a developer experiences
184 | build tool crashes.
185 | 
186 | It's important to keep in mind here that what we care about most is what
187 | developers actually experience, not just what's happening with the tool.
188 | 
189 | Very often, just knowing _how often_ something happens isn't enough to
190 | understand its impact, though. You also might want to tie things back to
191 | efficiency here, by noting:
192 | 
193 | **How much extra human time was spent as a result of failures.**
194 | 
195 | For example, it would be good to know that 15 engineers were caught up for three
196 | days handling a production incident, wouldn't it? That changes the importance of
197 | addressing the root causes there. Some signals here could be:
198 | 
199 | - How much time was spent re-running tests after they had flaky failures.
200 | - How much time a developer had to spend shepherding a change through the
201 | release pipeline due to release infrastructure failures.
202 | 
203 | Not all failures are black and white, though. Very often, an intention
204 | _partially_ succeeds. For example, I might release a feature with a bug that
205 | affects only 0.01% of my users. Thus, it is sometimes also useful to know:
206 | 
207 | **The percentage _degree_ of success of each action that a developer takes.**
208 | 
209 | For example, one way to look at flakiness is how many tests per run actually
210 | flake. That is, if I have 10 tests in my test suite but only one fails due to
211 | flakiness, then that specific run of that specific test suite had a 90% success
212 | rate.
213 | 
214 | #### Over-Indexing On A Single Effectiveness Metric
215 | 
216 | It's important not to over-index on any one of these "effectiveness" signals.
217 | Sometimes, the probability of a failure is low, but it is very impactful in
218 | terms of human time when the failure occurs. It's helpful to have data for _all_
219 | of the signals, as appropriate for the thing you're measuring.
220 | 
221 | ### Happy
222 | 
223 | Usually, we rate happiness via subjective signals. Basically, we want to know:
224 | 
225 | 1. **The percentage of software engineers that are happy with the tools,
226 | systems, processes, facilities, and activities involved in software
227 | engineering at LinkedIn.**
228 | 2. **A score for _how_ happy those engineers are.**
229 | 
230 | And you can break it down for specific tools, systems, processes, and
231 | facilities. Very often we say "satisfied" instead of "happy," as a more concrete
232 | term that people can respond to more easily.
233 | 
234 | One of our assumptions is that if you improve the quantitative metrics in the
235 | "Productive" section, you should increase developer happiness.
If happiness 236 | _doesn't_ increase, then that probably means there's something wrong with your 237 | quantitative signals for productivity—either you've picked the wrong 238 | signal/metric, or there's some inaccuracy in the data. 239 | 240 | Next: [Developer Personas](developer-personas.md) -------------------------------------------------------------------------------- /metric-principles.md: -------------------------------------------------------------------------------- 1 | # Principles and Guidelines for Metrics Design 2 | 3 | This document covers some of our general guidelines that we use when designing 4 | metrics. 5 | 6 | This document covers only the principles for quantitative metrics covered under 7 | 'Productivity' in [Developer Productivity and Happiness: Goals & 8 | Signals](dph-goals-and-signals.md). 9 | 10 | These principles are not about the operational metrics of various tools and 11 | services. "Operational metrics" are the ones that a team owning a tool uses to 12 | understand if it's currently working, how it's broken, its current performance, 13 | etc. Think of Operational metrics as something you would use in monitoring and 14 | alerting of a production service. 15 | 16 | This doc also doesn't cover the business metrics for tools—the ones that measure 17 | their business goals/success—like how many users they have. 18 | 19 | There might be some overlap of operational, business, and productivity metrics, 20 | though. That is, some productivity metrics might also be business metrics, and a 21 | very few of them could also be used as operational metrics. 22 | 23 | - [Measuring Results vs. Measuring Work](#measuring-results-vs-measuring-work) 24 | - [Time Series](#time-series) 25 | - [Exclude Weekends](#exclude-weekends) 26 | - [Defined Weeks](#defined-weeks) 27 | - [Timestamps](#timestamps) 28 | - [Use the End Time, not the Start Time](#use-the-end-time-not-the-start-time) 29 | - [Developers vs. Machines](#developers-vs-machines) 30 | - [Common Dimensions](#common-dimensions) 31 | - [Going Up or Down Must Be Meaningful](#going-up-or-down-must-be-meaningful) 32 | - [Prefer Graphs That Are Good When They Go Up](#prefer-graphs-that-are-good-when-they-go-up) 33 | - [Retention](#retention) 34 | - [Business Hours](#business-hours) 35 | - [Fallback System](#fallback-system) 36 | 37 | ## Measuring Results vs. Measuring Work 38 | 39 | We want to measure the impact, results, or effects of our work, rather than just 40 | measuring how much work gets done. Otherwise, our metrics just tell us we're 41 | doing work—they don't tell us if we're doing the _right_ work. 42 | 43 | As an analogy, let's say we owned a packing plant (like, a place where they put 44 | things in boxes) and we're installing new conveyor belts on the assembly line. 45 | We could measure the progress of installing the belts (i.e., "how many belts 46 | have been fully installed?") or we could measure the _effect_ of the belts on 47 | the quantity and quality of our packing ("how many boxes get packed?" and "what 48 | percentage of boxes pass quality inspection?"). We would rather measure the 49 | latter.  50 | 51 | ## Time Series 52 | 53 | Unless stated otherwise, all of our metrics are a time series. That is, they are 54 | plotted on a graph against time—we count them per day or per week as 55 | appropriate. Most commonly, we count them per week, as that eliminates 56 | confusions around weekends and minor fluctuations between days. 
In particular, 57 | with some metrics, Mondays and Fridays tend to have lower or higher values than 58 | other days, and aggregating the data over a week eliminates those fluctuations. 59 | 60 | ### Exclude Weekends 61 | 62 | If you show daily graphs, you should exclude weekends from all metrics unless 63 | they seem very relevant for your particular metric. We exclude weekends because 64 | they make the graphs very hard to read. (We don't exclude holidays—those are 65 | _usually_ understandable anomalies that we actually _want_ to be able to see in 66 | our graphs. Also, not all offices have the same holidays.) You can include 67 | weekends in all other types of graphs other than daily graphs.  68 | 69 | ### Defined Weeks 70 | 71 | Unless you have good reason for another requirement, weeks should be measured 72 | from **12:00am Friday to the end of Thursday**. Cutting off our metrics at the 73 | end of Thursday gives executives enough time to do reviews and investigations 74 | for Monday or Tuesday meetings. When we cannot specify a time zone, we should 75 | assume we are measuring things in the `America/Los_Angeles` time zone. However, 76 | all data should be _stored_ in the **UTC** time zone. 77 | 78 | ### Timestamps 79 | 80 | All times should be stored ideally as microseconds, but if that's not possible, 81 | then as milliseconds. As engineering systems grow, there can be more and more 82 | events that happen _very_ close in time, but which we need to put in order for 83 | certain metrics to work. (For example, a metric might need to know when each 84 | test in a test suite ended, and if all the tests are running in parallel, these 85 | end times could be very close together.) 86 | 87 | ### Use the End Time, not the Start Time 88 | 89 | When you display a metric on a graph, the event you are measuring should be 90 | assigned to a date or time on the graph by the _end time_ of the event. For 91 | example, let's imagine we have a daily graph that shows how long our CI jobs 92 | take. If a CI job takes two days to finish, it should show up on the graph on 93 | the day that it _finished_, not the day that it started. 94 | 95 | This prevents having to "go back in time" and update previous data points on the 96 | graph in a confusing way. (This is especially bad if it changes whether the 97 | graph is going up or going down multiple days _after_ a manager has already 98 | looked at the graph and made plans based on its trend.) 99 | 100 | ## Developers vs. Machines 101 | 102 | Almost all our productivity metrics are defined by saying "a developer" does 103 | something. **For each metric,** we need to have two different versions of the 104 | same metric: one for when machines do the task, and one for when people do the 105 | task. For example, we need to measure human beings doing `git clone` and 106 | the CI system (or any other automated system) doing `git clone` separately. 107 | 108 | **The most important metrics are the ones where people do the task, not the ones 109 | where machines do the task.** But we still need to be able to measure the 110 | machines doing the task, as that is sometimes relevant for answering questions 111 | about developer productivity. 112 | 113 | Sometimes, a question comes up on whether to count something as being machine 114 | time or developer time. For example, let's say a developer types a command on 115 | their machine that causes the CI system to run a bunch of tests. 
The question to
116 | ask is, "Most of the time, is this action costing us developer time?" A
117 | developer doesn't have to just be sitting there and waiting for the action to
118 | finish, for it to be costing us developer time. Maybe it takes so long that they
119 | switch contexts, go read email, go eat lunch, etc. That's still developer cost.
120 | On the other hand, let's say there is a daily automated process that runs a test
121 | and then reports the test result to a developer. We would count the time spent
122 | running that test as "machine time," because it wasn't actively spending
123 | developer time.
124 | 
125 | This is particularly relevant when looking at things that happen in parallel.
126 | When we measure "developer time" for parallel actions, we only care about how
127 | much wall-clock time was taken up by the action, not the sum total of machine
128 | hours used to execute it. For example, if I run 10 tests in parallel that each
129 | take 10 minutes, I only spent 10 minutes of developer time, even though I spent
130 | 100 minutes of machine time.
131 | 
132 | ## Common Dimensions
133 | 
134 | There are a few common dimensions that we have found useful for slicing most of
135 | our metrics:
136 | 
137 | 1. By Team (i.e., the team that the affected developer is on)
138 | 2. By code repository (or codebase, when considering a large repo)
139 | 3. By location the developer sits in
140 | 
141 | Most metrics should also support being sliced by the above dimensions. There are
142 | many other dimensions that we slice metrics by, some of which are special to
143 | each metric; these are just the _common_ dimensions we have found to be the most
144 | useful.
145 | 
146 | ## Going Up or Down Must Be Meaningful
147 | 
148 | It should be clearly good or bad when a metric goes up, and clearly good or bad
149 | when it goes down. One direction should be good, and the other direction should
150 | be bad.
151 | 
152 | This should not change at a certain threshold. For example, let’s say you have a
153 | metric that’s "bad" when it goes down until it’s at 90%, and then above 90% it’s
154 | either meaningless or it’s actually good when it goes down (like you want it
155 | always to be at 90%).
156 | 
157 | Consider redesigning such metrics so that they are always good when they go up
158 | and always bad when they go down. For example, in the case of the metric that’s
159 | supposed to always be at 90%, perhaps measure how far away teams are from being
160 | at 90%.
161 | 
162 | ### Prefer Graphs That Are Good When They Go Up
163 | 
164 | We should strive to make metrics that are "good" when they go up and "bad" when
165 | they go down. This makes it consistent to read graphs, which makes life easier
166 | for executives and other people looking at dashboards. This isn't always
167 | possible, but when we have the choice, this is the way we should make the
168 | choice.
169 | 
170 | ## Retention
171 | 
172 | Prefer to retain data about developer productivity metrics forever. Each of
173 | these metrics could potentially need to be analyzed several years back into the
174 | past, especially when questions come up about how we have affected productivity
175 | over the long term. The actual cost to LinkedIn of storing these forever is
176 | _extremely_ small, whereas the potential benefit to the business from being able
177 | to do long-term analysis is huge.
178 | 
179 | Of course, if there are legitimate legal or regulatory constraints on retention,
180 | make sure to follow those.
However, there are rarely such constraints on the type
181 | of data we want to retain.
182 | 
183 | ## Business Hours
184 | 
185 | Some metrics are defined in terms of "business hours."
186 | 
187 | Basically, we are measuring the _perceived_ wait time experienced by a
188 | developer. For example, imagine Alice sends off a code review request at 5pm
189 | and then goes home. Bob, in another time zone, reviews that code while Alice is
190 | sleeping. Alice comes back into work the next day and experiences a code review
191 | that, for her, she received in zero hours.
192 | 
193 | The ideal way to measure business hours would be to do an automated analysis for
194 | each individual to determine what their normal working hours are (making sure to
195 | keep this information confidential, only use it as an input to our productivity
196 | metrics, and never expose it outside of our team). Once you have established
197 | this baseline for each individual (which you will have to update on some regular
198 | basis), you only count time spent by individuals during those hours.
199 | 
200 | For any calendar day in the developer's local time zone, do not count more than
201 | 8 hours a day. It is confusing to have "business days" that are longer than 8
202 | hours. So if a task takes two days to complete, and a developer was somehow
203 | working on it full time for each day, it would show up as 16 business hours
204 | (even if they worked 9 hours each day). This helps normalize out the differences
205 | in schedules between engineers. (For some metrics we may need to relax this
206 | restriction---we would determine that on a case-by-case basis when we develop
207 | the metric.)
208 | 
209 | The reason that we are picking 8 hours is that what we care about most here
210 | is the cost to the company, and even though most software engineers are
211 | salaried, we are measuring their cost to the company in 8-hour-a-day increments.
212 | It would be unusual for any salaried engineer anywhere in the world to have
213 | _contractual obligations_ to work more than 8 hours a day.
214 | 
215 | ### Fallback System
216 | 
217 | If you cannot generate an automated analysis of working hours for each
218 | individual, then set the "business day" as being 7am to 7pm in their local time
219 | zone. We don't use 9am to 5pm because the schedule of developers varies, and you
220 | don't want to count 0 for a lot of business hour metrics when a person regularly
221 | starts working at 8am or regularly works until 6pm. We have found that setting
222 | the range as 7am to 7pm eliminates most of those anomalies.
223 | 
224 | One still needs to limit the maximum "business hours" counted in a day to eight
225 | hours, though, for most metrics.
226 | 
227 | Next: [Common Pitfalls When Designing Metrics](metric-pitfalls.md)
228 | 
--------------------------------------------------------------------------------
/example-metrics.md:
--------------------------------------------------------------------------------
1 | # Example Metrics
2 | 
3 | Here are some actual metrics we have used inside of LinkedIn, along with their
4 | complete definitions. We are not holding these up as the "right" metrics to
5 | use--only as examples of how to precisely define a metric. There is also another
6 | document that explains [why we chose each metric](why-our-metrics.md), if you'd
7 | like to get some insight into the thinking process behind each one.
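Before the individual definitions, here is a minimal, hypothetical sketch of the pieces each definition below tries to pin down: a name and abbreviation, a TL;DR, a unit, the percentiles we report, and the common dimensions we slice by. The field names are illustrative only, not an internal schema; they are just a compact way to see the shape of a "complete definition."

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricDefinition:
    """Illustrative skeleton of what a "complete definition" pins down."""

    name: str          # e.g., "Developer Build Time"
    abbreviation: str  # e.g., "DBT"
    tldr: str          # one-sentence summary for casual readers
    unit: str          # e.g., "seconds", "minutes", "business hours"
    percentiles: List[int] = field(default_factory=lambda: [50, 90])
    dimensions: List[str] = field(
        default_factory=lambda: ["team", "code repository", "location"]
    )
    humans_only: bool = True  # people and machines are measured separately


# One of the definitions below, expressed in this illustrative form:
dbt = MetricDefinition(
    name="Developer Build Time",
    abbreviation="DBT",
    tldr="How much time developers spend waiting for their build tool "
         "to finish its work.",
    unit="seconds",
)
```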
8 | 9 | These are not all the metrics we use internally; only a subset that we think are 10 | good examples for showing how to define metrics, and which we have found useful. 11 | 12 | We give each metric an abbreviation that we can use to refer to it. For example, 13 | internally, when we talk about "Developer Build Time," most people familiar with 14 | the metric just call it "DBT." 15 | 16 | Every metric gets a "TL;DR" summary, which most casual readers appreciate. 17 | 18 | These definitions are written so that implementers can implement them, but 19 | _also_ so that people who have questions about "What specifically is this 20 | actually measuring?" can have a place to get that question answered. 21 | 22 | - [Company-Wide Engineering Metrics](#company-wide-engineering-metrics) 23 | - [Developer Build Time (DBT)](#developer-build-time-dbt) 24 | - [Post-Merge CI Duration (PMCID)](#post-merge-ci-duration-pmcid) 25 | - [CI Determinism (CID)](#ci-determinism-cid) 26 | - [Code Reviewer Response Time (RRT)](#code-reviewer-response-time-rrt) 27 | - [Definitions](#definitions) 28 | - [Metric](#metric) 29 | - [Metrics for the Developer Platform Team](#metrics-for-the-developer-platform-team) 30 | - [CI Reliability (CIR)](#ci-reliability-cir) 31 | - [Deployment Reliability (DR)](#deployment-reliability-dr) 32 | - [Number of Insights Metrics in SLO (NIMS)](#number-of-insights-metrics-in-slo-nims) 33 | 34 | ## Company-Wide Engineering Metrics 35 | 36 | These are some examples of metrics we measure for the whole company. 37 | 38 | ### Developer Build Time (DBT) 39 | 40 | **TL;DR: How much time developers spend waiting for their build tool to finish 41 | its work.** 42 | 43 | The intention of this metric is to measure the amount of time that human beings 44 | spend waiting for their build tool to complete. This is measured as the 45 | wall-clock time from when the build tool starts a "build" to when it completes. 46 | 47 | This measures all human-triggered builds. At present, this includes all 48 | human-triggered builds with the following tools: 49 | 50 | - Gradle 51 | - Bazel 52 | - Ember  53 | - Xcode  54 | 55 | This duration is measured and reported in _seconds_. 56 | 57 | We count this only for builds invoked by human beings, that we reasonably assume 58 | they are waiting on. To be clear, this means we exclude all builds run on the CI 59 | infrastructure. 60 | 61 | We report this as P50 (median) and P90, so we have “DBT P50” and “DBT P90.” 62 | 63 | ### Post-Merge CI Duration (PMCID) 64 | 65 | **TL;DR: How long is it between when I say I want to submit a change and when 66 | its post-merge CI job fully completes?** 67 | 68 | The time it takes for each PR merge to get through CI, during post-commit. 69 | Counted whether the CI job passes or fails. 70 | 71 | The start point is when the user expresses the intent to merge a PR, and the end 72 | point is when the CI job delivers its final signal (passing or failing) to the 73 | developer who authored the PR. 74 | 75 | Reported in minutes, with one decimal place. 76 | 77 | We report the P50 and P90 of this, so we have “PMCID P50” and “PMCID P90.” 78 | 79 | ### CI Determinism (CID) 80 | 81 | **TL;DR: Test flakiness (just the inverse, so it’s a number that’s good when it 82 | goes up).** 83 | 84 | Each codebase at LinkedIn has a CI job that blocks merges if it fails. 
Each 85 | week, at some time during the week while CI machines are otherwise idle, we run 86 | each of these CI jobs many times at the same version of the repository (keeping 87 | everything in the environment identical between runs, as much as possible). We 88 | are looking to see if any of the runs returns a different result from the others 89 | (passes when the others fail, or fails when the others pass). 90 | 91 | The system that runs these jobs is called **CID**. 92 | 93 | Each CI job gets a **Determinism Score**, which is a percentage. The 94 | **Denominator** is the total number of builds run for that CI job by CID during 95 | the week. The **Numerator** is the number of times the CI job passed. So for 96 | example, let's say we run the CI job 10 times, 3 of them fail, and 7 of them 97 | pass. The score would be 7/10 (70%). 98 | 99 | However, if _all_ runs of a job fail, then its Determinism Score is 100%, and 100 | its Numerator should be set equal to its Denominator. 101 | 102 | When we aggregate this metric (for example, to show a Determinism score for a 103 | whole team that owns multiple MPs) we average all the Determinism Scores to get 104 | an overall Determinism Score. Taking the average means that codebases that run 105 | less frequently but are still flaky have their flakiness equally represented in 106 | the metric as codebases that are run frequently. 107 | 108 | Note that this is intended to only count CI jobs that block deployments or 109 | library publishing. It’s not intended to count flakiness in other things that 110 | might coincidentally run on the CI infrastructure. 111 | 112 | ### Code Reviewer Response Time (RRT) 113 | 114 | **TL;DR: How quickly do reviewers respond to each update from a developer, 115 | during a code review?** 116 | 117 | One of the most important qualities of any code review process is that 118 | [reviewers respond _quickly_](why-our-metrics.md). Often, [any complaints about 119 | the code review process (such as complaints about strictness) vanish when you 120 | just make it fast 121 | enough](https://google.github.io/eng-practices/review/reviewer/speed.html). This 122 | metric measures how quickly reviewers respond to each update that a developer 123 | posts. 124 | 125 | #### Definitions 126 | 127 | **Author**: The person who wrote the code that is being reviewed. If there are 128 | multiple contributors, all of them count as an Author. 129 | 130 | **Reviewer**: A person listed as an assigned reviewer on a PR. 131 | 132 | **Code Owner**: A person listed in the ACL files of a code repository. A person 133 | who has the power to approve changes to at least one of the files in a PR. 134 | 135 | **Request**: When a Reviewer gets a notification that an Author has taken some 136 | action, and now the Author is blocked while they are waiting for a response. 137 | Usually this means an Author has pushed a set of changes to the repository and 138 | the Reviewer has been sent a notification to review those changes. (Note: The 139 | Request Time is tracked as when a notification was sent, not when the changes 140 | were pushed. If a PR has no assigned Reviewer and changes are pushed, the 141 | Request Time only is tracked when the Reviewer gets assigned to the PR.) 142 | 143 | **Response**: The first time after a Request that a Reviewer or Code Owner 144 | responds on the PR and sends that response to the author. This could be a 145 | comment in the conversation, a comment on a line of code, or even just an 146 | approval with no comments. 
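To make the Request/Response pairing concrete, here is a minimal, hypothetical sketch that pairs each Request with the first Response that follows it. The event format and function names are illustrative only (they are not our internal tooling), and converting each pair into business hours is described under "Metric" below.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class ReviewEvent:
    timestamp: datetime
    kind: str  # "request": the Author pushed changes and a Reviewer was notified
               # "response": a Reviewer or Code Owner replied to the Author


def pair_requests_with_responses(
    events: List[ReviewEvent],
) -> List[Tuple[ReviewEvent, ReviewEvent]]:
    """Pair each Request with the first Response that follows it.

    Additional Responses before the next Request are ignored, because the
    metric is counted only once per Request. (Simplification: an unanswered
    Request is simply superseded by the next Request.)
    """
    pairs: List[Tuple[ReviewEvent, ReviewEvent]] = []
    open_request: Optional[ReviewEvent] = None
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.kind == "request":
            open_request = event
        elif event.kind == "response" and open_request is not None:
            pairs.append((open_request, event))
            open_request = None
    return pairs


# The worked example under "Metric" below (the date here is made up) produces
# exactly two Request/Response pairs: 10:00 -> 10:20 and 10:45 -> 10:55.
events = [
    ReviewEvent(datetime(2023, 6, 1, 10, 0), "request"),    # Alice requests review
    ReviewEvent(datetime(2023, 6, 1, 10, 20), "response"),  # Ravi comments
    ReviewEvent(datetime(2023, 6, 1, 10, 25), "response"),  # Rob comments (ignored)
    ReviewEvent(datetime(2023, 6, 1, 10, 45), "request"),   # Alice pushes an update
    ReviewEvent(datetime(2023, 6, 1, 10, 55), "response"),  # Rob approves
]
assert len(pair_requests_with_responses(events)) == 2
```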
147 | 148 | #### Metric 149 | 150 | Measure the [business hours](metric-principles.md) between each Request and 151 | Response. Note that in the process of a code review, there are many Requests and 152 | Responses. We count this metric only once for each Request within that code 153 | review. 154 | 155 | For example, imagine this sequence of events: 156 | 157 | 1. 10:00: Author Alice posts a new change and requests review. 158 | 2. 10:20: Reviewer Ravi posts a comment. 159 | 3. 10:25: Reviewer Rob posts a comment. 160 | 4. 10:45: Author Alice posts an updated set of changes in response to the 161 | Reviewers' comments. 162 | 5. 10:55: Reviewer Rob approves the PR with no comments. 163 | 164 | That creates two events. The first is 20 minutes long (Step 1 to Step 2), and 165 | the second is 10 minutes long. (Step 4 to Step 5) The response at 10:25 (Step 3) 166 | is ignored and doesn't have anything to do with this metric. 167 | 168 | We then take every single event like this and take the overall P50 (median) and 169 | P90 of them, so we have “RRT P50” and “RRT P90” that we report. 170 | 171 | This metric is reported in [business hours](metric-principles.md) as the unit, 172 | with one decimal place. 173 | 174 | When displaying this metric, show it to the managers of the _reviewers_, not the 175 | managers of the _authors_. That is, show people how long they took to review 176 | code, not how long they waited for a review. We have found it to be more 177 | actionable for managers, this way. 178 | 179 | ## Metrics for the Developer Platform Team 180 | 181 | These are a few example metrics that the Developer Platform team uses to measure 182 | their success. Unlike the company-wide metrics, these metrics tend to be things 183 | that the developer platform team has more direct control over and that more 184 | accurately represent the specific work they do. 185 | 186 | We would like to reiterate: we have many more metrics for our Developer Platform 187 | team besides these. These are just a few examples to show how one might define 188 | metrics for an individual team, as opposed to the whole company. 189 | 190 | ### CI Reliability (CIR) 191 | 192 | **TL;DR: How often does CI fail because of a problem with CI infrastructure?** 193 | 194 | How often do users experience a CI failure that is not due to an error they 195 | made, but instead was due to a breakage or bug in our infrastructure? 196 | 197 | The only failures that are counted are infrastructure failures. So if a job 198 | succeeds or fails only due to a user error (like a test failure--something the 199 | system is supposed to do) then that doesn’t count as a failure. Any failure that 200 | we don’t know is a user failure is assumed to be an infrastructure failure. 201 | 202 | Reported as the success percentage. (That is, a number that is good when it goes 203 | up.) 204 | 205 | ### Deployment Reliability (DR) 206 | 207 | **TL;DR: How often do deployments fail because of a problem with the deployment 208 | infrastructure?** 209 | 210 | How often do users experience a deployment failure that is not due to an error 211 | they made, but instead was due to a breakage or bug in our infrastructure? 212 | 213 | Measured as how often a deployment job experiences no deployment infrastructure 214 | failures. A “successful” deployment job is one where one version of one 215 | deployable has been deployed to all machines (or locations) it was intended to 216 | be deployed to per its deployment configuration. 
However, we only count failures 217 | of the deployment infrastructure as failures, for this particular metric. 218 | 219 | Deployments can partially fail. For example, one machine might deploy 220 | successfully while another fails due to infrastructure-related reasons. In our 221 | deployment infrastructure, individual teams can define what constitutes a 222 | “successful” deployment in terms of the percentage of machines that successfully 223 | deployed. We count a deployment as successful if it deployed to that percentage 224 | of machines. 225 | 226 | Infrastructure failures do include machine-based failures, like bad hosts that 227 | won’t accept deployments. They also include failures of the config 228 | infrastructure (but not config syntax errors or things like that). (This is not 229 | a complete list of what is included--we just note these specific things that are 230 | included so there isn’t confusion about those specific things.) 231 | 232 | Reported as the success percentage. (That is, a number that is good when it goes 233 | up.) 234 | 235 | ### Number of Insights Metrics in SLO (NIMS) 236 | 237 | **TL;DR: How many metrics are presented in our central developer productivity 238 | dashboard, and are they updating regularly on time every day?** 239 | 240 | To calculate this metric, first we count up all the metrics in our central 241 | developer productivity dashboard that have been fully implemented. "Implemented" 242 | means, for each metric: 243 | 244 | 1. We have an automated pipeline that does not require human interaction, which 245 | dumps data to some standard data store. 246 | 2. The data is processed by data pipelines to create metrics which are then 247 | displayed in our central developer productivity dashboard. 248 | 3. The metric displayed in our dashboard has passed User Acceptance Testing 249 | (UAT) and is viewable in production. 250 | 251 | We count the number of metrics in our central dashboard, not the number of 252 | pipelines. One pipeline might produce multiple metrics. 253 | 254 | The current SLO for our metrics is: each workflow for the metrics will compute 255 | data that is no older than **30 hours**. The definition of “compute” here is 256 | “calculate and provide in a format that is consumed by the dashboard.” 257 | 258 | For every metric where the latency is within SLO, we count "1" for this metric. 259 | 260 | When displaying this metric for a time period longer than a day, we average out 261 | the latency of each update over the time period we want to measure. For example, 262 | if we are measuring this for a quarter, we average them out over the whole 263 | quarter. By "each update" we mean every time the pipeline updated itself such 264 | that the dashboard would show an updated number. By "latency" we mean the time 265 | between each update, or the time between the last update and present time. 266 | 267 | Next: [Why Did We Choose Our Metrics?](why-our-metrics.md) 268 | -------------------------------------------------------------------------------- /why-our-metrics.md: -------------------------------------------------------------------------------- 1 | # Why Did We Choose Our Metrics? 2 | 3 | This document explains why we chose the metrics described in the [example 4 | metrics](example-metrics.md). It is here to give you an idea of the reasoning 5 | process used to pick a metric, and why we consider some metrics good and other 6 | metrics bad. 
It also contains some insights around the value (or lack thereof) 7 | of certain common developer productivity metrics. 8 | 9 | - [Company-Wide Engineering Metrics](#company-wide-engineering-metrics) 10 | - [Developer Build Time (DBT)](#developer-build-time-dbt) 11 | - [Post-Merge CI Duration (PMCID)](#post-merge-ci-duration-pmcid) 12 | - [Why Not "CI Success Rate?"](#why-not-ci-success-rate) 13 | - [Code Reviewer Response Time (RRT)](#code-reviewer-response-time-rrt) 14 | - [Why not "code review volume?"](#why-not-code-review-volume) 15 | - [Why not "PR Creation to Merge Time?"](#why-not-pr-creation-to-merge-time) 16 | - [Metrics for Developer Platform Team](#metrics-for-developer-platform-team) 17 | - [CI Reliability (CIR)](#ci-reliability-cir) 18 | - [Deployment Reliability (DR)](#deployment-reliability-dr) 19 | - [Number of Insights Metrics in SLO (NIMS)](#number-of-insights-metrics-in-slo-nims) 20 | 21 | ## Company-Wide Engineering Metrics 22 | 23 | ### Developer Build Time (DBT) 24 | 25 | One thing that slows down [iterations](productivity-concepts.md) for software 26 | developers is them having to wait for the build tool to finish. They make a 27 | change and they want to see if it compiles, which is a very important and 28 | frequent iteration. Because it happens so frequently, you can potentially save a 29 | ton of engineering time and make engineers much more efficient by improving 30 | build time. 31 | 32 | You can also fundamentally change how they work if you get build time low 33 | enough. For example, wouldn't it be nice if you set up a system that 34 | automatically built your code every time you saved a file, and returned instant 35 | feedback to you about whether or not your code compiles and passes the basic 36 | build checks? 37 | 38 | ### Post-Merge CI Duration (PMCID) 39 | 40 | This is, to a large degree, another [iteration time](productivity-concepts.md) 41 | issue. It's less common that a developer is actively _waiting_ for a post-merge 42 | to finish, so this isn't about [context switching](productivity-concepts.md). 43 | However, a developer may want to know that there is some problem in CI (if there 44 | is). In particular, if the signal from the CI system comes _after_ the 45 | developer's working hours, that means you've potentially lost a lot of 46 | real-world time in terms of getting your feature or deployment done. 47 | 48 | Also, just generally, just imagine the frustration and difficulty of submitting 49 | something and only finding out that there is something wrong with the change two 50 | or three _hours_ later (if the CI pipeline is that long), when you're already 51 | working on something totally different. Granted, this will happen even if the CI 52 | pipeline takes five or ten minutes, it's just not as drastically bad. If it were 53 | just ten minutes and I _really_ wanted to wait for it to finish because I was 54 | sending out something _really_ important that I was worried about, I could just 55 | go get a coffee and come back. But if it takes two hours, I'm _definitely_ 56 | working on something else, or the work day has ended. 57 | 58 | Also, long CI times mean that deployments take longer for people who do 59 | automated continuous deployment. This means experimentation takes longer, it 60 | takes longer to get feedback from users, etc. 61 | 62 | This metric is not as important as Developer Build Time, because that more 63 | directly impacts iteration time and context switching. 
But we do care about 64 | making CI faster, for the reasons specified above. There are probably also other 65 | benefits that we don't see, which only show up in specific cases. 66 | 67 | #### Why Not "CI Success Rate?" 68 | 69 | The whole purpose of a CI system is that it's supposed to fail sometimes. The 70 | problem with making this a metric is that it's [not really clear what it means 71 | when it goes up or down](metric-principles.md). Like, what number is good? 72 | Should it be 100% all the time? If it should be, then why does the system exist 73 | at all? 74 | 75 | There _is_ a good metric here that can be used called "CI Greenness" where you 76 | measure what percentage of the time a CI build is "green," or passing. We aren't 77 | sure this makes sense at LinkedIn, though, for a few reasons: 78 | 79 | 1. We don't actually run our builds _continuously_, but only when somebody 80 | merges a PR. So just one flaky test suddenly can make your "greenness" very 81 | bad, because you have to wait for somebody to submit a new change to fix it. 82 | Sometimes that might happen quickly, but it's not like there's an automated 83 | system that is just going to re-run the tests in a few hours anyway, to 84 | guarantee that flakiness has minimal impact. 85 | 2. At LinkedIn, our post-commit CI system doesn't actually _block_ developers 86 | from continuing to work. It just prevents deployment. (Or in the case of a 87 | library, prevents others from depending on the version whose CI failed.) A 88 | failing CI is an inconvenience, that's for sure, as you now have to 89 | investigate the failure, fix whatever is necessary, go through code review 90 | again, etc. But (in most cases at LinkedIn) other people can still submit to 91 | the repository, your change is still _in_ the repository (this is good and 92 | bad, both), and you can still continue to do work. So measuring "how long was 93 | CI failing for" isn't as valuable at LinkedIn as it would be in a place where 94 | "failing CI" means "nobody on the whole team can continue to work." 95 | 96 | ### Code Reviewer Response Time (RRT) 97 | 98 | Code review is one of the most important quality processes at every large 99 | software company, and LinkedIn is no exception. While there are many things 100 | about code that can be caught and improved by machines, only human beings can 101 | tell you if code is [easy to read, easy to understand, and easy to correctly 102 | modify in the 103 | future](https://www.codesimplicity.com/post/the-definition-of-simplicity/). 104 | 105 | What you want from a code review process is that it provides continuous 106 | _improvement_ to your code base through effective feedback from code reviewers. 107 | This requires whatever level of strictness is necessary in order to achieve this 108 | goal, which is different depending on who the submitter of the code is--how much 109 | experience they have with your system, how much experience they have with the 110 | language or tools, how long they have been programming, etc. 111 | 112 | However, it can sometimes seem like when you are too strict, developers start to 113 | push back about the strictness of the code review process, or they start trying 114 | to get around it, like "let's send it to the person who always rubber stamps my 115 | changes and never provides any feedback." This breaks the process and removes 116 | its value for your company. 
117 | 
118 | There is a secret to maintaining strictness while removing complaints: [have the
119 | reviewers reply
120 | faster](https://google.github.io/eng-practices/review/reviewer/speed.html). As
121 | wild as it may sound, nearly _all_ complaints about strictness are actually
122 | complaints about _speed_.
123 | 
124 | Think about it this way: you submit a change, wait three days, the reviewer asks
125 | you to make major changes. Then you submit those changes, wait three days, and
126 | the reviewer again asks for major changes. At this point you're extremely
127 | frustrated. "I've been waiting for a week just to submit this!" Developers in
128 | this situation often say, "Stop being so strict!" After all, it's too late to
129 | say, "Stop being so slow!" And the slowness has increased the tension so much
130 | that the author can't stand it anymore, so they revolt.
131 | 
132 | On the other hand, if a developer posts a change and the reviewer sends back
133 | comments within 30 minutes that ask for major changes, the developer might sigh
134 | and be a little annoyed, but will do it. Then when more major changes are
135 | requested, there might be some pushback, but at least it's all happened within
136 | the same day or two, not after waiting for more than a week. It makes people
137 | stop complaining about the code review process _as a whole_, and also
138 | significantly reduces (but does not eliminate) author pushback on valid code
139 | review comments.
140 | 
141 | #### Why not "code review volume?"
142 | 
143 | There is an [entire document that explains why any metric measuring the volume
144 | of output of a developer is dangerous](metrics-and-performance-reviews.md).
145 | 
146 | That said, on a very large scale, PR Volume actually is an interesting metric.
147 | Looking at this _in the aggregate_ for thousands of engineers can show us
148 | patterns that are very interesting. Usually they tell us about things that
149 | happened in the past more than they tell us about what we can or should _do_
150 | about something--it's hard to take action items from it, but it can warn us
151 | about bad situations.
152 | 
153 | We didn't want to make this a company-wide metric because (a) it's hard to make
154 | managerial decisions based on it, (b) it's only really valid on the level of the
155 | whole company or parts of it that have 500+ engineers, and (c) we were concerned
156 | it would be misused by people as a measurement of engineer performance as
157 | opposed to something we use to understand our tools, processes, or large-scale
158 | managerial decisions.
159 | 
160 | All that said, it is possible to see PR Volume in our detailed dashboards that
161 | track PR metrics, and it's useful _informationally_ (that is, as an interesting
162 | input to help understand a few types of management decisions or their impact,
163 | such as time off, large-scale changes to our engineering systems, etc.). We
164 | include a large red disclaimer that the data should not be used for performance
165 | management, along with a link to [our explanation about why output metrics
166 | should not be used for individual performance ratings](metrics-and-performance-reviews.md).
167 | 
168 | #### Why not "PR Creation to Merge Time?"
169 | 
170 | As noted above, code review is a quality process. When you make people focus on
171 | the length of the whole process, you end up unintentionally driving behaviors
172 | that you don't want.
173 | 174 | As an absurd example, I could make every code review fast by simply making 175 | everybody approve them without looking at them. This metric would look beautiful 176 | and perhaps I would be rewarded for my efforts at optimizing code review times. 177 | 178 | As a less absurd example, let's say that a manager has instructed their team, 179 | "Let's look into how we can make code reviews take less total time." Now imagine 180 | that on some particular code review, a reviewer leaves a round of legitimate 181 | comments, and the author pushes back saying, "You are making this code review 182 | take longer," or something of that essence, and manages to talk their way out of 183 | improvements that really should have been made, simply because of _how long_ the 184 | review has taken in real-world days. This is a specific example, but there is a 185 | broad general class of bad behaviors you encourage when you tell people "make 186 | code reviews take less overall time," instead of "let's speed up how fast 187 | individual responses come in." 188 | 189 | When you focus on the speed of individual responses, you still get what you want 190 | (faster code reviews), but you don't sacrifice the power of the review process 191 | to produce high-quality code. 192 | 193 | There actually _is_ a point at which "too many iterations" (like, too many 194 | back-and-forth discussions between a developer and reviewer) becomes a problem 195 | in code review, but it's super-rare that this is the problem on a team. It's 196 | more common that people are too lax than that people are too strict. And it's 197 | more common that people are slow to respond than that they respond too many 198 | times. If "too many iterations" was a widespread problem at LinkedIn, we would 199 | think about tracking a "number of code review iterations" metric just for the 200 | duration of a project focused on solving the problem, but it's not something we 201 | want to go to zero. We just don't want it to be absurdly high (like ten 202 | iterations). 203 | 204 | ## Metrics for Developer Platform Team 205 | 206 | ### CI Reliability (CIR) 207 | 208 | The CI system itself absolutely must return reliable results, or developers will 209 | start to mistrust it. In the past, we saw some teams where nearly _every_ CI 210 | failure was actually a failure of the CI infrastructure. 211 | 212 | Instead of focusing on the uptime of the individual pieces, we focus on the 213 | reliability of the system as it is _experienced by its users_. Making this the 214 | focus of our reliability efforts dramatically improved the actual reliability of 215 | our system, compared to previous efforts focused around only availability. 216 | 217 | ### Deployment Reliability (DR) 218 | 219 | This has similar reasoning to CI Reliability, above. 220 | 221 | The only point worth noting is that there are sometimes discussions about 222 | whether the team that owns the deployment platform should be responsible for the 223 | full _success_ of all deployments, and not just the reliability of the 224 | deployment platform itself. Over time, we have come to the conclusion that this 225 | should _not_ be the responsibility of the deployment platform team, because too 226 | many of the issues that block successful deployments are out of the hands of the 227 | deployment platform team. 228 | 229 | There _are_ things that the platform team can do to make it more likely that 230 | people have successful deployments. 
For example, make it easier to add 231 | validation steps into a deployment. Make the configuration process for a 232 | deployment simpler. But overall, the actual success of a deployment depends a 233 | lot on the specifics of the binary being deployed and the deployment scripts 234 | that were written by the team that owns that binary. 235 | 236 | If you make the deployment platform team responsible for the success of every 237 | deployment, you tend to make them into consultants who spend too much time 238 | debugging the failing deployments of customers and not enough time developing 239 | new features that improve the deployment experience for the whole company. 240 | 241 | ### Number of Insights Metrics in SLO (NIMS) 242 | 243 | We have a team whose job it is to create data pipelines and infrastructure 244 | around developer productivity metrics. These metrics can't help anybody unless 245 | they end up in a dashboard, and that dashboard has up-to-date data. 246 | 247 | This metric doesn't measure the success of our dashboards. It measures the 248 | effectiveness of our team's data infrastructure. This metric measures that we 249 | have met the "table stakes" of being able to simply display data at all, and how 250 | effectively we are doing that as a team. We have other metrics to measure the 251 | impact our work has. 252 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Attribution 4.0 International 2 | 3 | ======================================================================= 4 | 5 | Creative Commons Corporation ("Creative Commons") is not a law firm and 6 | does not provide legal services or legal advice. Distribution of 7 | Creative Commons public licenses does not create a lawyer-client or 8 | other relationship. Creative Commons makes its licenses and related 9 | information available on an "as-is" basis. Creative Commons gives no 10 | warranties regarding its licenses, any material licensed under their 11 | terms and conditions, or any related information. Creative Commons 12 | disclaims all liability for damages resulting from their use to the 13 | fullest extent possible. 14 | 15 | Using Creative Commons Public Licenses 16 | 17 | Creative Commons public licenses provide a standard set of terms and 18 | conditions that creators and other rights holders may use to share 19 | original works of authorship and other material subject to copyright 20 | and certain other rights specified in the public license below. The 21 | following considerations are for informational purposes only, are not 22 | exhaustive, and do not form part of our licenses. 23 | 24 | Considerations for licensors: Our public licenses are 25 | intended for use by those authorized to give the public 26 | permission to use material in ways otherwise restricted by 27 | copyright and certain other rights. Our licenses are 28 | irrevocable. Licensors should read and understand the terms 29 | and conditions of the license they choose before applying it. 30 | Licensors should also secure all rights necessary before 31 | applying our licenses so that the public can reuse the 32 | material as expected. Licensors should clearly mark any 33 | material not subject to the license. This includes other CC- 34 | licensed material, or material used under an exception or 35 | limitation to copyright. 
More considerations for licensors: 36 | wiki.creativecommons.org/Considerations_for_licensors 37 | 38 | Considerations for the public: By using one of our public 39 | licenses, a licensor grants the public permission to use the 40 | licensed material under specified terms and conditions. If 41 | the licensor's permission is not necessary for any reason--for 42 | example, because of any applicable exception or limitation to 43 | copyright--then that use is not regulated by the license. Our 44 | licenses grant only permissions under copyright and certain 45 | other rights that a licensor has authority to grant. Use of 46 | the licensed material may still be restricted for other 47 | reasons, including because others have copyright or other 48 | rights in the material. A licensor may make special requests, 49 | such as asking that all changes be marked or described. 50 | Although not required by our licenses, you are encouraged to 51 | respect those requests where reasonable. More considerations 52 | for the public: 53 | wiki.creativecommons.org/Considerations_for_licensees 54 | 55 | ======================================================================= 56 | 57 | Creative Commons Attribution 4.0 International Public License 58 | 59 | By exercising the Licensed Rights (defined below), You accept and agree 60 | to be bound by the terms and conditions of this Creative Commons 61 | Attribution 4.0 International Public License ("Public License"). To the 62 | extent this Public License may be interpreted as a contract, You are 63 | granted the Licensed Rights in consideration of Your acceptance of 64 | these terms and conditions, and the Licensor grants You such rights in 65 | consideration of benefits the Licensor receives from making the 66 | Licensed Material available under these terms and conditions. 67 | 68 | 69 | Section 1 -- Definitions. 70 | 71 | a. Adapted Material means material subject to Copyright and Similar 72 | Rights that is derived from or based upon the Licensed Material 73 | and in which the Licensed Material is translated, altered, 74 | arranged, transformed, or otherwise modified in a manner requiring 75 | permission under the Copyright and Similar Rights held by the 76 | Licensor. For purposes of this Public License, where the Licensed 77 | Material is a musical work, performance, or sound recording, 78 | Adapted Material is always produced where the Licensed Material is 79 | synched in timed relation with a moving image. 80 | 81 | b. Adapter's License means the license You apply to Your Copyright 82 | and Similar Rights in Your contributions to Adapted Material in 83 | accordance with the terms and conditions of this Public License. 84 | 85 | c. Copyright and Similar Rights means copyright and/or similar rights 86 | closely related to copyright including, without limitation, 87 | performance, broadcast, sound recording, and Sui Generis Database 88 | Rights, without regard to how the rights are labeled or 89 | categorized. For purposes of this Public License, the rights 90 | specified in Section 2(b)(1)-(2) are not Copyright and Similar 91 | Rights. 92 | 93 | d. Effective Technological Measures means those measures that, in the 94 | absence of proper authority, may not be circumvented under laws 95 | fulfilling obligations under Article 11 of the WIPO Copyright 96 | Treaty adopted on December 20, 1996, and/or similar international 97 | agreements. 98 | 99 | e. 
Exceptions and Limitations means fair use, fair dealing, and/or 100 | any other exception or limitation to Copyright and Similar Rights 101 | that applies to Your use of the Licensed Material. 102 | 103 | f. Licensed Material means the artistic or literary work, database, 104 | or other material to which the Licensor applied this Public 105 | License. 106 | 107 | g. Licensed Rights means the rights granted to You subject to the 108 | terms and conditions of this Public License, which are limited to 109 | all Copyright and Similar Rights that apply to Your use of the 110 | Licensed Material and that the Licensor has authority to license. 111 | 112 | h. Licensor means the individual(s) or entity(ies) granting rights 113 | under this Public License. 114 | 115 | i. Share means to provide material to the public by any means or 116 | process that requires permission under the Licensed Rights, such 117 | as reproduction, public display, public performance, distribution, 118 | dissemination, communication, or importation, and to make material 119 | available to the public including in ways that members of the 120 | public may access the material from a place and at a time 121 | individually chosen by them. 122 | 123 | j. Sui Generis Database Rights means rights other than copyright 124 | resulting from Directive 96/9/EC of the European Parliament and of 125 | the Council of 11 March 1996 on the legal protection of databases, 126 | as amended and/or succeeded, as well as other essentially 127 | equivalent rights anywhere in the world. 128 | 129 | k. You means the individual or entity exercising the Licensed Rights 130 | under this Public License. Your has a corresponding meaning. 131 | 132 | 133 | Section 2 -- Scope. 134 | 135 | a. License grant. 136 | 137 | 1. Subject to the terms and conditions of this Public License, 138 | the Licensor hereby grants You a worldwide, royalty-free, 139 | non-sublicensable, non-exclusive, irrevocable license to 140 | exercise the Licensed Rights in the Licensed Material to: 141 | 142 | a. reproduce and Share the Licensed Material, in whole or 143 | in part; and 144 | 145 | b. produce, reproduce, and Share Adapted Material. 146 | 147 | 2. Exceptions and Limitations. For the avoidance of doubt, where 148 | Exceptions and Limitations apply to Your use, this Public 149 | License does not apply, and You do not need to comply with 150 | its terms and conditions. 151 | 152 | 3. Term. The term of this Public License is specified in Section 153 | 6(a). 154 | 155 | 4. Media and formats; technical modifications allowed. The 156 | Licensor authorizes You to exercise the Licensed Rights in 157 | all media and formats whether now known or hereafter created, 158 | and to make technical modifications necessary to do so. The 159 | Licensor waives and/or agrees not to assert any right or 160 | authority to forbid You from making technical modifications 161 | necessary to exercise the Licensed Rights, including 162 | technical modifications necessary to circumvent Effective 163 | Technological Measures. For purposes of this Public License, 164 | simply making modifications authorized by this Section 2(a) 165 | (4) never produces Adapted Material. 166 | 167 | 5. Downstream recipients. 168 | 169 | a. Offer from the Licensor -- Licensed Material. Every 170 | recipient of the Licensed Material automatically 171 | receives an offer from the Licensor to exercise the 172 | Licensed Rights under the terms and conditions of this 173 | Public License. 174 | 175 | b. No downstream restrictions. 
You may not offer or impose 176 | any additional or different terms or conditions on, or 177 | apply any Effective Technological Measures to, the 178 | Licensed Material if doing so restricts exercise of the 179 | Licensed Rights by any recipient of the Licensed 180 | Material. 181 | 182 | 6. No endorsement. Nothing in this Public License constitutes or 183 | may be construed as permission to assert or imply that You 184 | are, or that Your use of the Licensed Material is, connected 185 | with, or sponsored, endorsed, or granted official status by, 186 | the Licensor or others designated to receive attribution as 187 | provided in Section 3(a)(1)(A)(i). 188 | 189 | b. Other rights. 190 | 191 | 1. Moral rights, such as the right of integrity, are not 192 | licensed under this Public License, nor are publicity, 193 | privacy, and/or other similar personality rights; however, to 194 | the extent possible, the Licensor waives and/or agrees not to 195 | assert any such rights held by the Licensor to the limited 196 | extent necessary to allow You to exercise the Licensed 197 | Rights, but not otherwise. 198 | 199 | 2. Patent and trademark rights are not licensed under this 200 | Public License. 201 | 202 | 3. To the extent possible, the Licensor waives any right to 203 | collect royalties from You for the exercise of the Licensed 204 | Rights, whether directly or through a collecting society 205 | under any voluntary or waivable statutory or compulsory 206 | licensing scheme. In all other cases the Licensor expressly 207 | reserves any right to collect such royalties. 208 | 209 | 210 | Section 3 -- License Conditions. 211 | 212 | Your exercise of the Licensed Rights is expressly made subject to the 213 | following conditions. 214 | 215 | a. Attribution. 216 | 217 | 1. If You Share the Licensed Material (including in modified 218 | form), You must: 219 | 220 | a. retain the following if it is supplied by the Licensor 221 | with the Licensed Material: 222 | 223 | i. identification of the creator(s) of the Licensed 224 | Material and any others designated to receive 225 | attribution, in any reasonable manner requested by 226 | the Licensor (including by pseudonym if 227 | designated); 228 | 229 | ii. a copyright notice; 230 | 231 | iii. a notice that refers to this Public License; 232 | 233 | iv. a notice that refers to the disclaimer of 234 | warranties; 235 | 236 | v. a URI or hyperlink to the Licensed Material to the 237 | extent reasonably practicable; 238 | 239 | b. indicate if You modified the Licensed Material and 240 | retain an indication of any previous modifications; and 241 | 242 | c. indicate the Licensed Material is licensed under this 243 | Public License, and include the text of, or the URI or 244 | hyperlink to, this Public License. 245 | 246 | 2. You may satisfy the conditions in Section 3(a)(1) in any 247 | reasonable manner based on the medium, means, and context in 248 | which You Share the Licensed Material. For example, it may be 249 | reasonable to satisfy the conditions by providing a URI or 250 | hyperlink to a resource that includes the required 251 | information. 252 | 253 | 3. If requested by the Licensor, You must remove any of the 254 | information required by Section 3(a)(1)(A) to the extent 255 | reasonably practicable. 256 | 257 | 4. If You Share Adapted Material You produce, the Adapter's 258 | License You apply must not prevent recipients of the Adapted 259 | Material from complying with this Public License. 
260 | 261 | 262 | Section 4 -- Sui Generis Database Rights. 263 | 264 | Where the Licensed Rights include Sui Generis Database Rights that 265 | apply to Your use of the Licensed Material: 266 | 267 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right 268 | to extract, reuse, reproduce, and Share all or a substantial 269 | portion of the contents of the database; 270 | 271 | b. if You include all or a substantial portion of the database 272 | contents in a database in which You have Sui Generis Database 273 | Rights, then the database in which You have Sui Generis Database 274 | Rights (but not its individual contents) is Adapted Material; and 275 | 276 | c. You must comply with the conditions in Section 3(a) if You Share 277 | all or a substantial portion of the contents of the database. 278 | 279 | For the avoidance of doubt, this Section 4 supplements and does not 280 | replace Your obligations under this Public License where the Licensed 281 | Rights include other Copyright and Similar Rights. 282 | 283 | 284 | Section 5 -- Disclaimer of Warranties and Limitation of Liability. 285 | 286 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE 287 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS 288 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF 289 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS, 290 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION, 291 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR 292 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS, 293 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT 294 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT 295 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU. 296 | 297 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE 298 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, 299 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT, 300 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES, 301 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR 302 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN 303 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR 304 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR 305 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU. 306 | 307 | c. The disclaimer of warranties and limitation of liability provided 308 | above shall be interpreted in a manner that, to the extent 309 | possible, most closely approximates an absolute disclaimer and 310 | waiver of all liability. 311 | 312 | 313 | Section 6 -- Term and Termination. 314 | 315 | a. This Public License applies for the term of the Copyright and 316 | Similar Rights licensed here. However, if You fail to comply with 317 | this Public License, then Your rights under this Public License 318 | terminate automatically. 319 | 320 | b. Where Your right to use the Licensed Material has terminated under 321 | Section 6(a), it reinstates: 322 | 323 | 1. automatically as of the date the violation is cured, provided 324 | it is cured within 30 days of Your discovery of the 325 | violation; or 326 | 327 | 2. upon express reinstatement by the Licensor. 328 | 329 | For the avoidance of doubt, this Section 6(b) does not affect any 330 | right the Licensor may have to seek remedies for Your violations 331 | of this Public License. 332 | 333 | c. 
For the avoidance of doubt, the Licensor may also offer the 334 | Licensed Material under separate terms or conditions or stop 335 | distributing the Licensed Material at any time; however, doing so 336 | will not terminate this Public License. 337 | 338 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public 339 | License. 340 | 341 | 342 | Section 7 -- Other Terms and Conditions. 343 | 344 | a. The Licensor shall not be bound by any additional or different 345 | terms or conditions communicated by You unless expressly agreed. 346 | 347 | b. Any arrangements, understandings, or agreements regarding the 348 | Licensed Material not stated herein are separate from and 349 | independent of the terms and conditions of this Public License. 350 | 351 | 352 | Section 8 -- Interpretation. 353 | 354 | a. For the avoidance of doubt, this Public License does not, and 355 | shall not be interpreted to, reduce, limit, restrict, or impose 356 | conditions on any use of the Licensed Material that could lawfully 357 | be made without permission under this Public License. 358 | 359 | b. To the extent possible, if any provision of this Public License is 360 | deemed unenforceable, it shall be automatically reformed to the 361 | minimum extent necessary to make it enforceable. If the provision 362 | cannot be reformed, it shall be severed from this Public License 363 | without affecting the enforceability of the remaining terms and 364 | conditions. 365 | 366 | c. No term or condition of this Public License will be waived and no 367 | failure to comply consented to unless expressly agreed to by the 368 | Licensor. 369 | 370 | d. Nothing in this Public License constitutes or may be interpreted 371 | as a limitation upon, or waiver of, any privileges and immunities 372 | that apply to the Licensor or You, including from the legal 373 | processes of any jurisdiction or authority. 374 | 375 | 376 | ======================================================================= 377 | 378 | Creative Commons is not a party to its public 379 | licenses. Notwithstanding, Creative Commons may elect to apply one of 380 | its public licenses to material it publishes and in those instances 381 | will be considered the “Licensor.” The text of the Creative Commons 382 | public licenses is dedicated to the public domain under the CC0 Public 383 | Domain Dedication. Except for the limited purpose of indicating that 384 | material is shared under a Creative Commons public license or as 385 | otherwise permitted by the Creative Commons policies published at 386 | creativecommons.org/policies, Creative Commons does not authorize the 387 | use of the trademark "Creative Commons" or any other trademark or logo 388 | of Creative Commons without its prior written consent including, 389 | without limitation, in connection with any unauthorized modifications 390 | to any of its public licenses or any other arrangements, 391 | understandings, or agreements concerning use of licensed material. For 392 | the avoidance of doubt, this paragraph does not form part of the 393 | public licenses. 394 | 395 | Creative Commons may be contacted at creativecommons.org. 396 | --------------------------------------------------------------------------------