├── .gitignore
├── Gemfile
├── _config.yml
├── _includes
│   └── head-custom-google-analytics.html
├── scores.md
├── qualitative-vs-quantitative.md
├── productivity-concepts.md
├── README.md
├── persona-champions.md
├── driving-decisions.md
├── goals-signals-metrics.md
├── metric-pitfalls.md
├── data-vs-insights.md
├── audiences.md
├── metrics-and-performance-reviews.md
├── developer-personas.md
├── data-collection-principles.md
├── dph-goals-and-signals.md
├── metric-principles.md
├── example-metrics.md
├── why-our-metrics.md
└── LICENSE
/.gitignore:
--------------------------------------------------------------------------------
1 | Gemfile.lock
2 | _site/
3 |
--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
1 | source 'https://rubygems.org'
2 | gem 'github-pages', group: :jekyll_plugins
3 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | remote_theme: pages-themes/primer
2 | plugins:
3 | - jekyll-remote-theme
4 | title: The LinkedIn DPH Framework
5 |
--------------------------------------------------------------------------------
/_includes/head-custom-google-analytics.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
10 |
--------------------------------------------------------------------------------
/scores.md:
--------------------------------------------------------------------------------
1 | # What’s wrong with scores?
2 |
3 | Resist the temptation to aggregate metrics into scores. Scores are problematic
4 | for a variety of reasons.
5 |
6 | ## Issues
7 |
8 | **Added Indirection** — Creating scores is tempting because it appears to reduce
9 | the amount of information communicated to users. But, the issue is that a score
10 | adds another layer of indirection. Rather than reducing the amount of
11 | information needed to make decisions, scores add to it.
12 |
13 | **Aggregates and Weights** — Scores require additional decisions to be made
14 | about combining and weighting metrics. Agreeing to a standard set of weights
15 | distracts from the main goal of using metrics to drive priorities and
16 | improvements. It’s okay to let different teams prioritize different metrics.
17 |
18 | **Stakeholder Buy-in** — Scores also require a lot of "buy-in" from
19 | stakeholders. Getting the buy-in necessary for metrics is often difficult enough
20 | that creating scores as "meta-metrics" isn’t worth the effort. Another option is
21 | to have a strong sponsor for the score. But these top-down efforts are rare.
22 |
23 | **Expensive to Develop** — When creating a metric, lots of decisions need to be
24 | made about the telemetry and data pipeline. What is included or excluded from
25 | the metric? When does the duration start or end? Prioritize and iterate on these
26 | decisions before considering an aggregate score.
27 |
28 | **Unactionable Changes** — Scores are less directly actionable and meaningful
29 | than metrics. Creating an alert on a score is error prone as one metric’s
30 | improvement may conceal another metric’s worsening.
31 |
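To make this concrete, here is a minimal sketch (the metric names, weights, and numbers are all hypothetical) of how a weighted score can stay flat while one of its inputs regresses:

```python
# Hypothetical "health score" built from two normalized metrics, each weighted
# 50%. All names and numbers are invented for illustration.
weights = {"build_success_rate": 0.5, "build_speed_score": 0.5}

last_quarter = {"build_success_rate": 0.90, "build_speed_score": 0.70}
this_quarter = {"build_success_rate": 0.98, "build_speed_score": 0.62}

def score(metrics):
    return sum(weights[name] * value for name, value in metrics.items())

print(round(score(last_quarter), 2))  # 0.8
print(round(score(this_quarter), 2))  # 0.8 -- flat, even though build speed regressed
```

An alert on the score never fires here; an alert on the build-speed metric alone
would have.
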
32 | **Increased Noise** — Combining metrics also makes variance harder to reason
33 | about. Each contributing metric fluctuates on its own, and those independent
34 | fluctuations can compound, increasing the noise in the score.
35 |
36 | **Missing Units** — Scores often have no units and therefore are not meaningful
37 | or intuitive. To say that a score went from 50 to 60 is meaningless without more
38 | context. Is the overall score out of 100 or 1,000? Is a 20% improvement typical
39 | or remarkable?
40 |
41 | **Potential for Gaming** — When a set of metrics is reduced to a single number,
42 | there might be an incentive to "game" the system to achieve higher scores,
43 | potentially at the expense of other important factors.
44 |
45 | **Difficulty in Comparison** — Due to inherent differences in services or
46 | systems, it may be difficult to meaningfully compare scores. Improving the score
47 | by 10% in one scenario may take ten times the effort that the same improvement
48 | requires in another.
49 |
50 | Next: [Metrics and Performance Reviews](metrics-and-performance-reviews.md)
--------------------------------------------------------------------------------
/qualitative-vs-quantitative.md:
--------------------------------------------------------------------------------
1 | # Qualitative vs Quantitative Measurements
2 |
3 | We collect both quantitative data and qualitative data. They have different uses
4 | and trade-offs.
5 |
6 | - [Qualitative Data](#qualitative-data)
7 | - [Quantitative Data](#quantitative-data)
8 | - [Conflicts](#conflicts)
9 |
10 | ## Qualitative Data
11 |
12 | Usually qualitative data is in the form of personal opinions or sentiments about
13 | things. It is collected by asking people questions, usually through surveys or
14 | by interviewing people.
15 |
16 | Qualitative data is our most "full-coverage" data. It can tell us about almost
17 | anything that can be framed as a survey question. However, it sometimes is hard
18 | to take direct action on, depending on how the survey questions are constructed.
19 | Also, you usually can't survey people _frequently_ (they don't want to be
20 | surveyed that often), which means that your data points are coming in every few
21 | months or quarters, rather than every day. So you can't see, for example, if a
22 | change you made yesterday caused an improvement today.
23 |
24 | Qualitative data usually ends up being best for discovering general areas for
25 | action, but often requires specific follow-up analysis to discover specifically
26 | what action needs to be taken in those areas. It is most often appropriate when
27 | you are measuring something abstract or social (like “happiness,”
28 | “productivity,” or “how do people feel about my product”), as opposed to
29 | something concrete or technical (like “how many requests do I get per second” or
30 | “how long is the median incremental build time of this specific binary?”).
31 |
32 | That said, it is usually easier to survey/interview users than to instrument
33 | systems for quantitative data. So when first analyzing an area, doing surveys or
34 | interviews is usually the best way to get started. However, eventually you do
35 | want to have quantitative data, because it's very useful when figuring out what
36 | specific actions you need to take to fix productivity issues.
37 |
38 | ## Quantitative Data
39 |
40 | Quantitative data is easier to take direct action on, but usually harder to
41 | implement. It's important to know that [somebody will _use_ the
42 | data](driving-decisions.md) before you go through the work to put numbers on a
43 | graph.
44 |
45 | Quantitative data is especially helpful when an engineer needs to do an
46 | investigation that will lead to an engineering team taking direct action. For
47 | example, "What are the slowest tests in my codebase?" An engineer asking that
48 | question likely is about to go fix those specific tests, and then wants to see
49 | the improvement reflected in the same metric.
50 |
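As a sketch of what answering that kind of question can look like in practice (assuming test timings have already been exported to a hypothetical `test_runs.csv` with `test_name` and `duration_seconds` columns; real pipelines and column names will differ):

```python
# A minimal sketch: find the slowest tests from an exported timing file.
import csv
from collections import defaultdict

durations = defaultdict(list)
with open("test_runs.csv") as f:
    for row in csv.DictReader(f):
        durations[row["test_name"]].append(float(row["duration_seconds"]))

# Average each test's runtime and list the ten slowest.
averages = {name: sum(times) / len(times) for name, times in durations.items()}
for name, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(f"{avg:8.2f}s  {name}")
```
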
51 | It is dangerous to use quantitative methods on things that should be
52 | qualitative. For example, **there is no general quantitative metric for
53 | "developer productivity."** "Productivity" is inherently a _quality_ that
54 | you’re measuring (“how easily are people able to do their jobs?”), and there’s
55 | no quantitative measure of it. "Code complexity" is similar---that is a quality
56 | that is experienced by human beings; you can't measure it with a number.
57 |
58 | ## Conflicts
59 |
60 | When the qualitative data and the quantitative data disagree, usually there is
61 | something wrong with the quantitative data. For example, if developers all say
62 | they are unhappy with build times, and our build time metrics all look fine, our
63 | build time metrics are probably wrong or missing data.
64 |
65 | Next: [Audiences: Always Know Who Your Data is For](audiences.md)
--------------------------------------------------------------------------------
/productivity-concepts.md:
--------------------------------------------------------------------------------
1 | # Productivity Concepts for Software Developers
2 |
3 | There are some common concepts used when discussing the productivity of software
4 | developers. These concepts aren't specific to designing metrics, but they
5 | frequently come up when we think about which metrics to choose.
6 |
7 | - [Iteration Time (aka Cycle Time)](#iteration-time-aka-cycle-time)
8 | - [Context Switching](#context-switching)
9 |
10 | ## Iteration Time (aka Cycle Time)
11 |
12 | One of the most important things to optimize about software engineering is
13 | "iteration time," which is the time it takes for an engineer to make an
14 | observation, decide what to do about that observation, and act on it. There are
15 | small iterations, like when a person is coding, they might write a line of code,
16 | see the IDE give it a squiggly red underline, and fix their typo. And there are
17 | huge iterations, like the time between having an idea for a whole product and
18 | then finally releasing that product to its users, getting feedback, and making
19 | improvements to the product. There are many types and sizes of iterations, like
20 | doing a build and seeing if your code compiles, posting a change and waiting for
21 | a code review comment, deploying an experiment and analyzing feedback from
22 | users, etc.
23 |
24 | Sometimes, we don’t have a set expectation for how many iterations a process
25 | should take, but we know that if we speed up the iterations, the process as a
26 | whole will get faster. It’s a safe general assumption to make in almost all
27 | cases.
28 |
29 | Also, there are times when you fundamentally change the nature of the work by
30 | reducing iteration time. For example, if it takes 2 seconds to run all my tests,
31 | I can run them every time I save a file. But if it takes 10 minutes, I might not
32 | even run them before submitting my code for review. In particular, work can
33 | dramatically change when iterations become short enough to eliminate context
34 | switching, as described below.
35 |
36 | ## Context Switching
37 |
38 | Developers will "context switch" to another activity if they have to wait a
39 | certain amount of time for something. For example, they might go read their
40 | email or work on something else if they have to wait longer than 30 - 60 seconds
41 | for a build to complete. The specific time depends on various factors, such as a
42 | person's expectations for how long the task _should_ take, how the task informs
43 | the developer of its progress, etc.
44 |
45 | When you interrupt a developer for long enough, they could take up to 10 - 15
46 | minutes to mentally "reload" the "context" that they had for the change they
47 | were working on. Basically, it can take some time to figure out what you were
48 | working on, what you were thinking about it, and what you intended to do.
49 |
50 | So if you combine those two, there’s a chance that every time you make somebody
51 | wait longer than a certain amount (let's say 30 seconds, on average), you
52 | actually could lose fifteen minutes of their time. Now, this isn’t an absolute
53 | thing--you might only lose two or three minutes for a 30-second wait, but you
54 | might lose 5 - 10 minutes for a 3-minute wait.
55 |
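As a back-of-the-envelope illustration -- every number here is an assumption, not a measurement -- the cost of those waits adds up quickly:

```python
# Rough, hypothetical arithmetic: how much time might be lost per developer per
# day if long waits trigger context switches? None of these numbers are measured.
builds_per_day = 20          # assumed number of waits (builds, test runs, etc.)
wait_minutes = 3.0           # assumed length of each wait
switch_probability = 0.7     # assumed chance a wait this long causes a switch
refocus_minutes = 10.0       # assumed time to reload context after a switch

lost = builds_per_day * (wait_minutes + switch_probability * refocus_minutes)
print(f"~{lost:.0f} minutes lost per developer per day")  # ~200 minutes
```

Even if the real numbers are half of these, shaving a wait below the
context-switch threshold is worth far more than the seconds saved.
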
56 | Also, sometimes important context is entirely lost from the developer’s mind
57 | when you make them context switch, especially when the context switch comes from
58 | a sudden interruption (like having some service go down that they depend on for
59 | development). They come back to what they are doing and have forgotten something
60 | important, leading to bugs or missed opportunities in your actual production
61 | systems.
62 |
63 | So it’s important to avoid making developers context switch.
64 |
65 | Next: [Example Metrics](example-metrics.md)
66 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Welcome to the [LinkedIn Developer Productivity and Happiness
2 | Framework](https://linkedin.github.io/dph-framework/)!
3 |
4 | At LinkedIn, we have a fairly advanced system for understanding our developers,
5 | the success of our engineering work, and where we should focus our
6 | infrastructure efforts to be most effective.
7 |
8 | This repository contains the documents that describe this system. They are
9 | mostly direct copies of the internal documents that our own engineers read to
10 | understand how this works.
11 |
12 | This document set explains how to define metrics and feedback systems for
13 | software developers, how to get action taken on that data, and provides examples
14 | of a few internal metrics that we use.
15 |
16 | You can read the documents in any order. Each one is designed to be able to be
17 | read and referenced independently. However, we provide a suggested sequence and
18 | hierarchy here:
19 |
20 | * **Goals, Signals, and Metrics: A Framework for Defining Metrics**
21 | * [Goals, Signals, and Metrics](goals-signals-metrics.md)
22 | * [Developer Productivity & Happiness Goals and Signals](dph-goals-and-signals.md)
23 | * **Developer Personas: A system for categorizing and understanding developers**
24 | * [Developer Personas](developer-personas.md)
25 | * [Persona Champions](persona-champions.md)
26 | * **Guidelines for Teams Who Create Metrics and Feedback Systems**
27 | * [Data vs Insights](data-vs-insights.md)
28 | * [Qualitative vs Quantitative Measurements](qualitative-vs-quantitative.md)
29 | * [Audiences: Always Know Who Your Data is For](audiences.md)
30 | * [Driving Decisions With Data](driving-decisions.md)
31 | * [Data Collection Principles](data-collection-principles.md)
32 | * **Quantitative Metrics: General Tips and Guidelines**
33 | * [Principles and Guidelines for Metric Design](metric-principles.md)
34 | * [Common Pitfalls When Designing Metrics](metric-pitfalls.md)
35 | * [What's wrong with "scores?"](scores.md)
36 | * [Metrics and Performance Reviews](metrics-and-performance-reviews.md)
37 | * **Example Metrics**
38 | * [Productivity Concepts for Software Developers](productivity-concepts.md)
39 | * [Example Metrics](example-metrics.md)
40 | * [Why Did We Choose Our Metrics?](why-our-metrics.md)
41 |
42 | ## Forking, Modifying, and Contributing
43 |
44 | We have made the DPH Framework open-source so that you can fork, modify, and
45 | re-use these documents however you wish, as long as you respect the license that
46 | is on the repository. You can see the source in our [GitHub
47 | repo](https://github.com/linkedin/dph-framework/).
48 |
49 | We welcome community contributions that help move forward the state of the art
50 | in understanding software developers across the entire software industry. If
51 | there’s something missing in the documents that you’d like to see added, feel
52 | free to file an issue via [GitHub
53 | Issues](https://github.com/linkedin/dph-framework/issues)! If you just have
54 | questions or a discussion you’d like to have, participate in our [GitHub
55 | Discussions](https://github.com/linkedin/dph-framework/discussions).
56 |
57 | And of course, if you want to contribute new text or improvements to the
58 | existing text, we welcome your contributions! Keep in mind that we hold this
59 | framework to a very high standard---we want it to be validated by real
60 | experience in the software industry, generally applicable across a wide range of
61 | software development environments, and we want any additions to be both
62 | interesting and accessible to a broad audience. If you think you have content
63 | that meets that bar and fits in with these documents, we would love to have your
64 | contribution! If you’re not sure, start a
65 | [discussion](https://github.com/linkedin/dph-framework/discussions) or just send us a
66 | PR and we can discuss it.
67 |
68 | ## License
69 |
70 | The LinkedIn Developer Productivity & Happiness Framework is licensed under [CC
71 | BY 4.0](http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1).
72 |
73 | 
75 |
76 | Copyright © 2023 LinkedIn Corporation
77 |
--------------------------------------------------------------------------------
/persona-champions.md:
--------------------------------------------------------------------------------
1 | # Persona Champions
2 |
3 | There is a person or small group assigned as the "champion" for each of our
4 | [Developer Personas](developer-personas.md). They are a _member_ of that persona.
5 | For example, the persona champion for backend developers _is_ a backend
6 | developer or a manager of backend developers.
7 |
8 | Any time somebody at the company has a question about a particular persona that
9 | isn't answered by our automated systems, the persona champion is their point of
10 | contact. If you want some custom data analysis done about a Persona, want to
11 | know what their requirements are, want to understand the best way to engage with
12 | a persona, etc. then the persona champion is your best point of contact.
13 |
14 | - [Duties of Persona Champions](#duties-of-persona-champions)
15 | - [Workflow Mapping](#workflow-mapping)
16 | - [Feedback Analysis](#feedback-analysis)
17 | - [Point of Contact](#point-of-contact)
18 |
19 | ## Duties of Persona Champions
20 |
21 | ### Workflow Mapping
22 |
23 | When first creating the [Developer Personas](developer-personas.md), it is often
24 | helpful to provide infrastructure owners with some basic information about how
25 | that persona does work. A persona champion produces and maintains a document
26 | that answers the questions, "What is the most common workflow that a developer
27 | has, in this persona? What tools do they use as part of this workflow, and how
28 | do they use them?"
29 |
30 | For most traditional software engineers, this would include the tools they use
31 | for:
32 |
33 | * Design and Planning
34 | * Source Control
35 | * Editing Code
36 | * Building/Compiling
37 | * Debugging
38 | * Dependency Management
39 | * Testing
40 | * Code Review
41 | * Release / Deployment
42 | * Experimentation
43 | * Bug Tracking / Fixing
44 | * Monitoring and Alerting
45 |
46 | Each persona will have custom steps in addition to (or instead of) those. Some
47 | personas, like data scientists, will have completely different workflow steps.
48 |
49 | This isn't just a list of tools, but sentences that describe how those tools are
50 | used in each "phase" of development above.
51 |
52 | Sometimes you will find that a persona has different ways of working for each
53 | person or team. In that case, you should document the broadest things the
54 | persona has in common, and note that otherwise there are a lot of differences
55 | between individuals. If there are _large_ sub-categories (essentially
56 | "sub-personas" who have unique workflows _within_ a larger persona) you can call
57 | out how those large sub-personas work, too. The trick is to keep the document
58 | small enough that people can read it and you can easily maintain it in the
59 | future, while still being useful for its readers.
60 |
61 | These documents are useful for infrastructure owners who are engaging in new
62 | strategic initiatives and need to understand if their plans are going to fit in
63 | with how developers across the company work. For example, if I'm going to work
64 | on a new web framework, does it fit in with how web developers work? Are there
65 | any other developers who might benefit from it, at the company?
66 |
67 | ### Feedback Analysis
68 |
69 | Persona champions are the primary people responsible for processing the feedback
70 | from our surveys and real-time feedback systems. They get access to the raw
71 | feedback data in order to process and categorize it into actionable
72 | [insights](data-vs-insights.md).
73 |
74 | After each survey, this process occurs:
75 |
76 | 1. A persona champion produces an analysis document listing out the top pain
77 | points for that persona. This involves categorizing the free-text feedback,
78 | looking at the 1 - 5 satisfaction scores produced by the survey questions,
79 | and doing direct follow-up interviews with developers as necessary (when we
80 | need more clarity on what the specific pain points are).
81 | 2. We share the analysis document with the relevant stakeholders. This always
82 | includes specifically notifying any team that might be involved in fixing one
83 | of the pain points. They should get an immediate "heads up" that their tool
84 | is being named as a top pain point for a persona.
85 | 3. We schedule a meeting that includes all those relevant stakeholders. The
86 | attendees should include people who have priority control over the work of
87 | the relevant infrastructure teams (i.e., the managers of the stuff named in
88 | the pain points). We go over the analysis document, answer questions, and
89 | agree on next steps. Somebody (such as a Technical Program Manager) is
90 | responsible for following up and making sure those next steps get executed.
91 |
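To make step 1 above more concrete, the categorization work often reduces to counting categorized comments and averaging the 1 - 5 scores per category. A minimal sketch, with hypothetical categories and data:

```python
# Rank pain points from categorized survey feedback. The categories, scores,
# and data layout are hypothetical examples.
from collections import defaultdict
from statistics import mean

# (category assigned by the persona champion, 1-5 satisfaction score)
responses = [
    ("code review latency", 2), ("build speed", 3), ("code review latency", 1),
    ("build speed", 2), ("docs", 4), ("code review latency", 2),
]

by_category = defaultdict(list)
for category, score in responses:
    by_category[category].append(score)

# Lowest average satisfaction floats to the top of the pain-point list.
for category, scores in sorted(by_category.items(), key=lambda kv: mean(kv[1])):
    print(f"{category:22} n={len(scores):2}  avg={mean(scores):.1f}")
```
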
92 | It is important that this whole process happens _before_ the planning cycle of
93 | the infrastructure teams, so that there is enough time for them to consider
94 | developer feedback in their planning.
95 |
96 | Persona champions also help curate the questions that go into surveys for their
97 | persona. They get lists of questions to review, and can also propose new
98 | questions for their persona when necessary.
99 |
100 | ### Point of Contact
101 |
102 | We expect persona champions to be available to answer questions about a persona
103 | from various parts of the organization. People might want more specifics about a
104 | particular pain point noted in the feedback analysis, or they might have
105 | questions about behavior or data not included in any of our documents.
106 |
107 | We expect persona champions to put in reasonable effort to answer reasonable
108 | questions posed to them, as long as those questions will actually assist in
109 | developer productivity if answered. If the work to answer a question would be
110 | too extensive, it is fine for the persona champion to decline to answer it,
111 | citing the amount of work that would be involved in getting the data.
112 |
113 | Next: [Data vs. Insights](data-vs-insights.md)
114 |
--------------------------------------------------------------------------------
/driving-decisions.md:
--------------------------------------------------------------------------------
1 | # Driving Decisions With Data
2 |
3 | It is important when designing metrics, systems, or processes involving data
4 | that we understand how they will be used, which decisions they will drive, and by whom. The
5 | data needs to be something that [engineering leadership, senior ICs, front-line
6 | managers, or front-line engineers](audiences.md) can _do_ something effective
7 | with.
8 |
9 | **When you propose a metric, system, or process involving data, you should
10 | always explain how it will drive decisions.**
11 |
12 | - [Bad Examples](#bad-examples)
13 | - [Misleading Numbers](#misleading-numbers)
14 | - [No Audience](#no-audience)
15 | - [Good Example](#good-example)
16 | - [Driving Decisions Differently for Different Groups](#driving-decisions-differently-for-different-groups)
17 | - [Focus on Causing Action](#focus-on-causing-action)
18 |
19 | ## Bad Examples
20 |
21 | Let's look at some bad examples to demonstrate this.
22 |
23 | ### Misleading Numbers
24 |
25 | Most of us know that "lines of code written per software engineer" is a bad
26 | metric. But it's especially bad in this context of effectively driving the right
27 | decisions.
28 |
29 | _Who_ does _what_ with that number? It tells nothing to engineering leadership
30 | and senior engineers--there is no specific action they can take to fix it. The
31 | front-line engineers don't care that much, other than to look at their own stats
32 | and feel proud of themselves for having written a lot of code. It's only the
33 | front-line managers who will be able to _do_ something with it, by trying to
34 | talk to their engineers or figure out why some of their engineers are more
35 | "productive" than others.
36 |
37 | Now, to be clear, it's fine to make a metric, system, or process that only one
38 | of these groups can use. The problem with _this_ metric is that when you give it
39 | to a front-line manager, it often either (a) misleads them or (b) is useless.
40 | They may see that one engineer writes fewer lines of code than another, but the
41 | engineer writing fewer lines actually has more impact on the users of the
42 | product, or is doing more research, or is writing better-crafted code.
43 |
44 | Basically, what you've actually done is sent the front-line manager on an
45 | unreliable wild-goose chase that wasted their time and taught them to distrust
46 | your metrics and you! You drove no decision for that manager--in fact, you
47 | instead [created a problem](data-vs-insights.md) that they have to solve:
48 | constantly double-checking your metric and doing their own investigations
49 | against it, since it's so often wrong. Worse yet, if the manager trusts this
50 | metric without any question, it could lead to bad code, rewarding the wrong
51 | behavior, a bad engineering culture, and low morale.
52 |
53 | ### No Audience
54 |
55 | Another way to do this wrong is to have some metric but not have any
56 | [_individual_ who will make a decision based on it](audiences.md). For example,
57 | measuring test coverage on a codebase that nobody works on would not be very
58 | useful.
59 |
60 | ## Good Example
61 |
62 | 1. You discover, through surveys or other valid feedback mechanisms, that mobile
63 | developers feel their releases are too slow.
64 | 2. You properly instrument the release pipeline to accurately measure the length
65 | of each part of the pipeline.
66 | 3. You do some basic investigation to see what the major pain points are that
67 | cause this slowness along each part of the pipeline, to get an idea of how
68 | much work would be involved in actually fixing each piece, and what would be
69 | the most important pieces to fix first.
70 | 4. You boil this information down into an understandable report that concisely:
71 | * Proves the problem exists
72 | * Explains why it's important
73 | * Gives a birds-eye view of where time is being spent in the release
74 | pipeline
75 | * Provides very rough estimates of how much work it is to solve each piece
76 | 5. This report is presented to [engineering leadership](audiences.md), who
77 | decides who to assign, at what priority, to fix the most important parts of
78 | the release pipeline to address this issue.
79 | 6. Front-line engineers use the instrumentation we developed in order to
80 | understand and solve the specifics of the problem.
81 | 7. After the work is done, the same data can be referenced to show how
82 | successful the optimizations were.
83 |
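Step 2 above is typically where the engineering work is; a sketch of what analyzing that instrumentation could look like (the stage names and timestamps are hypothetical):

```python
# A minimal sketch: given timestamped events from a release pipeline, compute how
# long each stage takes and surface the slowest one. Stage names are hypothetical.
from datetime import datetime

events = [  # (stage, start, end) -- in reality these come from pipeline telemetry
    ("build",         "2023-06-01T10:00", "2023-06-01T10:25"),
    ("test",          "2023-06-01T10:25", "2023-06-01T11:40"),
    ("app review",    "2023-06-01T11:40", "2023-06-02T09:00"),
    ("store rollout", "2023-06-02T09:00", "2023-06-02T15:00"),
]

durations = {
    stage: datetime.fromisoformat(end) - datetime.fromisoformat(start)
    for stage, start, end in events
}
for stage, duration in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:14} {duration}")
```
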
84 | ## Driving Decisions Differently for Different Groups
85 |
86 | At the level of front-line managers and front-line engineers, it is sufficient
87 | to provide information that allows people to figure out _where_ a problem is, so
88 | that front-line engineers can track down the actual problem and solve it. This
89 | can sometimes be [more "data" and less "insights."](data-vs-insights.md)
90 |
91 | At more senior levels, [more "insights" are required as opposed to raw
92 | data](data-vs-insights.md). In general, the more senior a person you are
93 | presenting to, the more work you should have done up front to provide insights
94 | gathered from data, as opposed to just showing raw data.
95 |
96 | ## Focus on Causing Action
97 |
98 | When you deliver data or insights to people, it should actually be something
99 | that will influence their decisions. They should be interested in having data
100 | for their decisions, and be willing to act based on the data. It's important to
101 | check this _up front_ before doing an analysis--otherwise, you can end up doing
102 | a lot of analysis work that ends up having no impact, because the recipient was
103 | never going to take action on it in the first place.
104 |
105 | For example, sometimes teams have mandates of things they _must_ do, such as
106 | legal or policy compliance, where it doesn't matter what you say about the work
107 | they do, they still have to do it in exactly the way they plan to do it. It's
108 | not useful to try to change their mind with data--your work will not result in
109 | any different decision.
110 |
111 | In general, if a person has already made up their mind and no data or insight
112 | will realistically sway them, it's not worth doing the work to provide them data
113 | or insights. We might have some opinion about the way the world should be, but
114 | that doesn't matter, because a _person has to change their mind_ in order for
115 | action to happen. If we won't even _potentially_ change somebody's mind, we
116 | should not do the work.
117 |
118 | It can be useful to ask people who request data:
119 |
120 | 1. What do you think the data will say? (That is, get their prediction before
121 |    they look at the data.)
122 | 2. If it is different than what you think, will that change your decision in any
123 | way?
124 |
125 | If the answer to question #2 is "no," then it's not worth working on that
126 | analysis.
127 |
128 | Next: [Data Collection Principles](data-collection-principles.md)
--------------------------------------------------------------------------------
/goals-signals-metrics.md:
--------------------------------------------------------------------------------
1 | # Goals, Signals, and Metrics
2 |
3 | There is a framework we use for picking metrics called “Goals-Signals-Metrics.”
4 |
5 | Basically, first you decide what the **goals** are that you want to achieve for
6 | your product or system. Then you decide on the **signals** you want to examine
7 | that tell you if you’re achieving your goal--essentially, these are what you
8 | would measure if you had perfect knowledge of everything. Then you choose
9 | **metrics** that give you some proxy or idea of that signal, since few signals
10 | can be measured perfectly.
11 |
12 | - [Goals](#goals)
13 | - [Example Goal](#example-goal)
14 | - [Uncertainty](#uncertainty)
15 | - [Signals](#signals)
16 | - [Metrics](#metrics)
17 |
18 | ## Goals
19 |
20 | A goal should be framed as the thing that you want your team, product, or
21 | project to accomplish. It should not be framed as "I want to measure X."
22 |
23 | **Most of the trouble that teams have in defining metrics comes from defining
24 | unclear goals.**
25 |
26 | _Webster's Third New International Dictionary_ defines a "goal" as:
27 |
28 | > The end toward which effort or ambition is directed; a condition or state to
29 | > be brought about through a course of action.
30 |
31 | This needs to be stated in a fashion specific enough that it _could be
32 | measured_. Not that you have to know in advance what all the metrics will be,
33 | but that conceptually, a person could know whether you were getting closer to
34 | (or further from) your goal.
35 |
36 | ### Example Goal
37 |
38 | In developing a goal, you can start with a vague statement of your desire. For
39 | example:
40 |
41 | **LinkedIn's systems should be reliable.**
42 |
43 | However, that's not measurable. So the first thing you do is **clarify
44 | definitions**.
45 |
46 | First off, what does "LinkedIn's systems" mean? What does "reliable" mean? How
47 | do we choose which of our systems we want to measure?
48 |
49 | Well, to figure this out, we have to ask ourselves **why** we have this goal.
50 | The answer could be "We are the team that assures reliability of all the
51 | products that are used by LinkedIn's users." In that case, that would clarify
52 | our goal to be:
53 |
54 | **The products that are used by LinkedIn's users should be reliable.**
55 |
56 | That's still not measurable. Nobody could tell you, concretely, if you were
57 | accomplishing that goal. Here's the remaining problem: what does "reliable"
58 | mean?
59 |
60 | Once again, we have to ask ourselves **why** we have this goal. To do this, we
61 | might look at the larger goals of the company. At LinkedIn, our [top-level
62 | vision](https://about.linkedin.com/) is: "Create economic opportunity for every
63 | member of the global workforce." This gives us some context to define
64 | reliability: we somehow want to look at things that prevent us from
65 | accomplishing that vision. Of course, in a broad sense, there are _many_ factors
66 | that could prevent us: social, cultural, human, economic, etc. So we say, well,
67 | our scope is what we can do _technically_ with our software development and
68 | production-management processes, systems, and tools. What sort of technical
69 | issues would members experience as "unreliability" in that context? Probably
70 | bugs, performance issues, and downtime.
71 |
72 | So we could update our goal to be:
73 |
74 | **LinkedIn's users have an experience of our products that is free from bugs,
75 | performance issues, and downtime.**
76 |
77 | We could get more specific and define "bug," "performance issue," and
78 | "downtime," if it's not clear to the team what those specifically mean. The
79 | trick here is to get something that's clear and measurable, without it being
80 | super long. What I would recommend, if you wanted to clarify those terms, is to
81 | create _sub-goals_ for each of those terms. That is, keep this as the overall
82 | goal, and then state three more goals, one for bugs, one for performance issues,
83 | and one for downtime, which do a better job of spelling out what each of those
84 | means.
85 |
86 | One of the things that you'll notice about this exercise is that not only does
87 | it help us define our metrics, it actually helps clarify what is the most
88 | important work we should be doing. For example, this goal tells us that we
89 | should be paying _more_ attention to the experience of our users than the
90 | specific availability numbers of low-level services (even though we might care
91 | about those, too).
92 |
93 | ### Uncertainty
94 |
95 | **If you aren’t sure how to measure something, it’s very likely that you haven’t
96 | defined what it is that you are measuring.** This was the problem with
97 | "developer productivity" measurements from the past--they didn’t define what
98 | “developer productivity” actually meant, concretely, exactly, in the physical
99 | universe. They attempted to measure an abstract nothing, so they had no real
100 | metrics. This is why it is so important to understand and clarify your goals
101 | before you start to think about metrics.
102 |
103 | ## Signals
104 |
105 | Signals are what you would measure if you had perfect knowledge---if you knew
106 | everything in the world, including everything that was inside of everybody
107 | else's mind. These do not have to actually be measurable. They are a useful
108 | mental tool to help understand the areas one wants to measure.
109 |
110 | Signals are the answer to the question, "How would you know you were achieving
111 | your goal(s)?"
112 |
113 | For example, some signals around reliability might be:
114 |
115 | * Human effort spent resolving production incidents.
116 | * Human time spent debugging deployment failures.
117 | * Number of users who experienced a failure in their workflow due to a bug
118 | * Amount of time lost by users trying to work around failures.
119 | * How reliable LinkedIn's products are according to the _belief_ of our users.
120 | * User-perceived latency for each action taken by users.
121 |
122 | The concept of "signals" can also be useful to differentiate them from "metrics"
123 | (things that can actually be measured). Sometimes people will write down a
124 | signal in a doc and then claim it is a metric, and this distinction in terms can
125 | help clarify that.
126 |
127 | ## Metrics
128 |
129 | Metrics are numbers over time that can actually be measured. A metric has the
130 | following qualities:
131 |
132 | * It can actually be implemented, in the real world, and produce a concrete
133 | number.
134 | * The number can be trended over time.
135 | * It is meaningful when the number goes up or goes down, and that meaning is
136 | clear.
137 |
138 | All metrics are _proxies_ for your signal. There are no perfect metrics. It is a
139 | waste of time to try to find the "one true metric" for anything. Instead, create
140 | multiple metrics and triangulate the truth from looking at different metrics.
141 | All metrics will have flaws, but _sets_ of metrics can collectively provide
142 | insight.
143 |
144 | Example metrics for our "reliability" goal from above might be something like:
145 |
146 | * The percentage of user sessions that do not experience an error as determined
147 | by our product telemetry.
148 | * Percentage of deployments that do not experience a failure requiring human
149 | intervention.
150 | * Conduct a survey of a cross-section of users to ask them their opinion about
151 | our reliability, where they give us a score, and aggregate the scores into a
152 | metric.
153 | * Define what the "acceptable" highest latency would be for each UI action, and
154 | then count how often UI interactions happen _under_ those thresholds. (Display
155 | a percentage of how many UI interactions have an "acceptable" latency.)
156 |
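As a sketch of how the last example metric above might be implemented (the action names, thresholds, and latencies are hypothetical):

```python
# A minimal sketch of an "acceptable latency" metric: the percentage of UI
# interactions that finish under a per-action threshold. All numbers are made up.
thresholds_ms = {"load_feed": 1500, "send_message": 800, "open_profile": 1200}

observed = [  # (action, latency in ms) -- in reality this comes from telemetry
    ("load_feed", 1200), ("load_feed", 2100), ("send_message", 450),
    ("send_message", 950), ("open_profile", 900), ("open_profile", 1100),
]

acceptable = sum(1 for action, ms in observed if ms <= thresholds_ms[action])
print(f"{100 * acceptable / len(observed):.0f}% of interactions were acceptable")
```
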
157 | If you look at the signals above, you will see that some of these metrics map back
158 | to those signals.
159 |
160 | Overall, there is a _lot_ to know about metrics, including what makes metrics
161 | good or bad, how to take action on them, what types of metrics to use in what
162 | situation, etc. It would be impossible to cover all of it in this doc, but we
163 | attempt to cover some of it in other docs on this site.
164 |
165 | Next: [Developer Productivity & Happiness Goals and Signals](dph-goals-and-signals.md)
166 |
--------------------------------------------------------------------------------
/metric-pitfalls.md:
--------------------------------------------------------------------------------
1 | # Common Pitfalls When Designing Metrics
2 |
3 | There are some common pitfalls that teams fall into when they start trying to
4 | measure developer productivity, the success of their engineering efforts, etc.
5 | It would be impossible to list every way to "do it wrong," but here are some of
6 | the most common ones we've seen.
7 |
8 | - [Measuring Whatever You Can Measure](#measuring-whatever-you-can-measure)
9 | - [Measuring Too Many Things](#measuring-too-many-things)
10 | - [Measuring Work Instead of Impact](#measuring-work-instead-of-impact)
11 | - [Measuring Something That is Not Acted On](#measuring-something-that-is-not-acted-on)
12 | - [Defining Signals and Calling them "Metrics"](#defining-signals-and-calling-them-metrics)
13 | - [If Everybody Does This, We Can Have a Metric!](#if-everybody-does-this-we-can-have-a-metric)
14 | - [Metrics That Create a Mystery](#metrics-that-create-a-mystery)
15 | - [Vanity Metrics](#vanity-metrics)
16 |
17 | ## Measuring Whatever You Can Measure
18 |
19 | When a team first starts looking into metrics, often they just decide to measure
20 | whatever is easiest to measure---the data they currently have available. Over
21 | time, this metric can even become entrenched as "the metric" that the team uses.
22 |
23 | There are lots of problems that occur from measuring whatever number is handy.
24 |
25 | For example, it's easy for a company to measure only revenue. Does that tell you
26 | that you're actually accomplishing the goal of your company, though? Does it
27 | tell you how happy your users are with your product? Does it tell you that your
28 | company is going to succeed in the long term? No.
29 |
30 | Or let's say you're a team that manages servers in production. What if you only
31 | measured "hits per second" on those servers? After all, that's usually a number
32 | that's easy to get. But what about the reliability of those servers? What about
33 | their performance, and how that impacts the user experience? Overall, what does
34 | this have to do with the goals of your team?
35 |
36 | The solution here is to [think about your goals](goals-signals-metrics.md) and
37 | derive a set of metrics from those, instead of just measuring whatever you can
38 | measure.
39 |
40 | ## Measuring Too Many Things
41 |
42 | If you had a dashboard with 100 graphs that were all the same size and color, it
43 | would be very hard for you to determine which of those numbers was important. It
44 | would be hard to focus a team around improving specific metrics. Everybody would
45 | get lost in the data, and very likely, no action would be taken at all.
46 |
47 | This is even worse if those 100 graphs are spread across 20 different
48 | dashboards. Everybody gets confused about where to look, which dashboard is
49 | correct (when they disagree), etc.
50 |
51 | The solution here is to have fewer dashboards that have a small set of relevant
52 | metrics on them, by understanding your [audience](audiences.md) and what [level
53 | of metrics](audiences.md) you're showing to each audience.
54 |
55 | ## Measuring Work Instead of Impact
56 |
57 | Would "number of times the saw moved" be a good metric for a carpenter? No.
58 | Similarly, any metric that simply measures the _amount of work_ done is not a
59 | good metric. It gets you an organization that tries to make work _more
60 | difficult_ so they can show you that they did _more of it_.
61 |
62 | Often, people measure work when they do not understand the _job function_ of the
63 | people being measured. In our carpenter example, the job function of a carpenter
64 | who builds houses is very different from a carpenter who repairs furniture. They
65 | both work with wood and use some of the same tools, but the _intended output_ is
66 | very different. A house builder might measure "number of projects completed that
67 | passed inspection," while the furniture repairer might measure "repaired furniture
68 | delivered to customers."
69 |
70 | Instead of measuring work, you want to focus on [measuring the impact or result
71 | of the work](metric-principles.md). In order to do this, you have to understand
72 | the [goals](goals-signals-metrics.md) of the team you are measuring.
73 |
74 | ## Measuring Something That is Not Acted On
75 |
76 | Sometimes we have a great theory about some metric that would help somebody. But
77 | unless that metric has an [audience](audiences.md) that will actually [drive
78 | decisions](driving-decisions.md) based on it, you should not implement the
79 | metric.
80 |
81 | Even when it has an audience, sometimes that audience does not actually plan to
82 | act on it. You can ask, "If this metric shows you something different than you
83 | expect, or if it starts to get worse, will you change your course of action?" If
84 | the answer is no, you shouldn't implement the metric.
85 |
86 | ## Defining Signals and Calling them "Metrics"
87 |
88 | Sometimes you will see a document with a list of "metrics" that can't actually
89 | be implemented. For example, somebody writes down "documentation quality" or
90 | "developer happiness" as proposed metrics, with no further explanation.
91 |
92 | The problem is that these are not [metrics](goals-signals-metrics.md), they are
93 | [signals](goals-signals-metrics.md). They will never be implemented, because
94 | they are not concretely defined as something that can be specifically measured.
95 |
96 | If you look at a "metric" and it's not _immediately clear_ how to implement it,
97 | it's probably a signal, and not a metric.
98 |
99 | ## If Everybody Does This, We Can Have a Metric!
100 |
101 | Don't expect people to change their behavior _just_ so you can measure it.
102 |
103 | For example, don't expect that everybody will tag their bugs, PRs, etc. in some
104 | special way _just_ so that you can count them. There's no incentive for them to
105 | do that correctly, no matter how much _you think_ it would be good. What really
106 | happens is you get very spotty data with questionable accuracy. Nobody believes
107 | your metric, and thus nobody takes action on it. And probably, they shouldn't
108 | believe your metric, because the data will almost certainly be missing or wrong.
109 |
110 | Sometimes people try to solve this by saying, "we will just make it mandatory to
111 | fill out the special tag!" This does _not_ solve the problem. It creates a
112 | roadblock for users that they solve by putting random values into the field
113 | (often just whatever is easiest to put into the field). You can start to engage
114 | in an "arms race," where you keep trying to add validation to prevent bad data.
115 | But why? At some point, you are actually _harming_ productivity in the name of
116 | measuring it.
117 |
118 | If you want people to change their behavior, you need to figure out a change
119 | that would be beneficial to them or the company, not just to you. Otherwise, you
120 | need to figure out ways to measure the behavior they currently have.
121 |
122 | ## Metrics That Create a Mystery
123 |
124 | Metrics should help [solve a mystery](data-vs-insights.md), not create one.
125 |
126 | The basic problem here is when a viewer says, "I don't know [what this metric
127 | means when it goes up or down](metric-principles.md)." For example, counting the
128 | total number of bugs filed in the company creates this sort of mystery. Is it
129 | good when it goes up, because we have improved our QA processes? Is it good when
130 | it goes down, because we have a less buggy system? Or (as usually happens) does
131 | it just mean some team changed the administrative details of how they track
132 | bugs, and thus the upward or downward movement is meaningless?
133 |
134 | The most common offenders here are [metrics that aggregate multiple numbers into
135 | a score](scores.md).
136 |
137 | ## Vanity Metrics
138 |
139 | Metrics must [drive decisions](driving-decisions.md). That means they must show
140 | when things are going well and when they are not going well. It is _especially_
141 | important that they indicate when things are not going well, because those are
142 | your opportunities for improvement! If a metric hides problems, it is actually
143 | harmful to the team's goals.
144 |
145 | Imagine counting "number of happy customers" as your only metric. In January, I
146 | have 100 customers, and 10 of them are happy. In February, I have 200 customers,
147 | and 15 of them are happy. Our metric goes up, but our actual situation is
148 | getting worse!
149 |
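Spelling out the arithmetic from that example:

```python
# The vanity-metric trap from the paragraph above, as numbers.
january  = {"customers": 100, "happy": 10}
february = {"customers": 200, "happy": 15}

for month, data in (("January", january), ("February", february)):
    rate = 100 * data["happy"] / data["customers"]
    print(f'{month}: {data["happy"]} happy customers ({rate:.1f}% of all customers)')

# January: 10 happy customers (10.0% of all customers)
# February: 15 happy customers (7.5% of all customers) -- count up, rate down.
```

A "happy customer rate" would have exposed the problem that the raw count hides.
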
150 | Sometimes teams are worried about metrics that "make them look bad." But
151 | look---if things are bad, they are bad. Only by acknowledging that they are bad
152 | and working to improve them will anything actually get better. By hiding the
153 | badness behind a vanity metric, we are worsening the situation for the company
154 | and our customers.
155 |
156 | Especially with developers, it is very hard to fool them. They _know_ when
157 | things are bad. If you show them some beautiful number when they are all
158 | suffering, all that will happen is you will damage your
159 | [credibility](https://www.codesimplicity.com/post/effective-engineering-productivity/)
160 | and nobody will ever believe you again in the future.
161 |
162 | Next: [What's Wrong with "Scores?"](scores.md)
163 |
--------------------------------------------------------------------------------
/data-vs-insights.md:
--------------------------------------------------------------------------------
1 | # Data vs Insights
2 |
3 | What is the difference between "insights" and "data?"
4 |
5 | Data is simply raw information: numbers, graphs, survey responses, etc. A viewer
6 | has to _analyze_ data in order to come to a conclusion or solve a mystery.
7 |
8 | An "insight" is: **a presentation of information that solves a mystery for the
9 | viewer**. Now, one can do this _more_ or _less_--not all insights will fully
10 | answer _every_ question. Some might just help clarify, and the viewer has to do
11 | the rest of the analysis to fully answer their question.
12 |
13 | **When you propose a system or process that provides insights, it is a good idea
14 | to explain:**
15 |
16 | 1. **What mysteries does this solve?**
17 | 2. **Are there mysteries it intentionally doesn't solve?**
18 | 3. **How can you ensure that this system currently provides (and continues to
19 | provide) trustworthy insights to its users?**
20 | 4. **How will you change this system in the future if you discover a flaw in the
21 | metric, system, or process you’ve proposed?**
22 | 5. **How do you ensure the accuracy of the data now and in the future?**
23 |
24 | Let's explain this a bit more, with some examples.
25 |
26 | - [Example: Graphs](#example-graphs)
27 | - [Example: Free-text Feedback](#example-free-text-feedback)
28 | - [Categorization](#categorization)
29 | - [Algorithmic Analysis](#algorithmic-analysis)
30 | - [Trustworthiness of Insights](#trustworthiness-of-insights)
31 | - [Accuracy of Data](#accuracy-of-data)
32 |
33 | ## Example: Graphs
34 |
35 | If you have ever heard somebody say, "But what does this graph _mean_?", you
36 | will understand the difference between data and insights. For example, let's say
37 | you have a graph that shows the average page-load latency of all pages on
38 | linkedin.com. This graph creates a huge mystery when it changes, for several
39 | reasons:
40 |
41 | 1. Because it's an average, outliers can drastically affect it, meaning people
42 | who look at it have no idea if the spikes and dips are from outliers or from
43 | actual, relevant changes.
44 | 2. Because it covers so many different areas of the site, it's hard to figure
45 | out _who_ should even be digging into it.
46 | 3. It, by itself, provides no avenue for further investigation--you have to go
47 | look at a _lot_ of other data to be able to actually understand what you
48 | should do.
49 |
50 | The _most_ that this graph could do is allow an [engineering
51 | leader](audiences.md) to say, "One person should go investigate this and tell us
52 | what is up." That's not a very impactful decision or a good use of anybody's
53 | time. Basically, this graph, if it's all we have, _creates a problem_.
54 |
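Point 1 above is easy to demonstrate with a handful of hypothetical latencies:

```python
# Why averages make poor latency graphs: one outlier moves the mean dramatically
# while the median barely notices. Latency values are hypothetical.
from statistics import mean, median

latencies_ms = [220, 240, 250, 260, 270, 280, 300]
with_outlier = latencies_ms + [9000]  # a single stuck request

print(mean(latencies_ms), median(latencies_ms))  # 260.0 260
print(mean(with_outlier), median(with_outlier))  # 1352.5 265.0
```
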
55 | But what if, instead, we had a system that _accurately_ informed front-line
56 | engineers and front-line managers when there was a significant difference in
57 | median (or 90th percentile) page-load latency between one release of a server
58 | and another? Or even better, if it could analyze changes between those two
59 | releases and tell you specifically which change introduced the latency? That's a
60 | much harder engineering problem, and may or may not be realistically possible.
61 | The point, though, is that that is an _insight_. It does its best to point
62 | specific individuals (in this case, the front-line engineers who own the system)
63 | toward specific work.
64 |
65 | To be clear, if you want to develop a system that provides insights into
66 | latency, there are many levels at which you should do this and many different
67 | stakeholders who want to know many different things. These are just two examples
68 | to compare the difference between mystery and insight.
69 |
70 | ## Example: Free-text Feedback
71 |
72 | One of the biggest sources of mystery is free-text answers in surveys. If there
73 | are only a few answers that are relevant to your work, it's possible to read all
74 | of them. But once you have thousands of free-text answers, it's hard to process
75 | them all and make a decision based off of them. If you just give an [engineering
76 | leader](audiences.md) a spreadsheet with 1000 free-text answers in it, you have
77 | created a problem and a mystery for them. You have to process them _somehow_ in
78 | order for the data to become _understandable_.
79 |
80 | That's a reasonable way to think about the job of our team, by the way: make
81 | data understandable.
82 |
83 | In this specific instance, there are lots of ways to make it understandable.
84 | Each of these leaves behind different types of mysteries. For example:
85 |
86 | ### Categorization
87 |
88 | One common way of understanding free-text feedback is to have a person go
89 | through the negative comments and categorize them according to categories that
90 | they determine while reading the comments. Then, you count up the number of
91 | comments in each category and display them to [engineering
92 | leadership](audiences.md) as a way to decide where to assign engineers.
93 |
94 | However, this leaves behind mysteries like, "_What_ were the people complaining
95 | about, specifically? What was the actual problem with the system that we need to
96 | address?" For example, it might say that "code review" was the problem. But what
97 | _about_ code review? If you're an engineer working on the code review system,
98 | you need those answers in order to do effective work. Engineering leaders also
99 | might want to know some specifics, so they understand how much work is involved
100 | in fixing the problem, who needs to be assigned to it, etc.
101 |
102 | This could be solved by further analysis that summarizes the free-text feedback.
103 | Also, usually the team that works on the tool itself (the code review tool, in
104 | our example here) wants to see all of the raw free-text comments that relate to
105 | their tool.
106 |
107 | ### Algorithmic Analysis
108 |
109 | There are various programmatic ways to analyze free-text feedback. You could use
110 | a Machine Learning system to do "sentiment analysis" that attempts to determine
111 | how people feel about various things and pull out the relevant data. You could
112 | use an LLM to summarize the feedback.
113 |
114 | Each of these leaves behind some degree of mystery. Readers usually wonder how
115 | accurate the analysis is. Summaries and sentiment analysis often leave out
116 | specifics that teams need in order to fully understand the feedback.
117 |
118 | That said, these methods can be sufficient for certain situations and for
119 | certain [audiences](audiences.md), like when you just want to know the general
120 | area of a problem and can accept some inaccuracy or lack of detail.
121 |
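For example, a first pass at sentiment analysis over survey comments might be only a few lines (a sketch assuming the Hugging Face `transformers` package; the comments are invented, and a real deployment would need to validate the model against your own feedback before anyone trusts its output):

```python
# A minimal sentiment-analysis sketch over free-text survey comments.
# Assumes the `transformers` package is installed; comments are hypothetical.
from transformers import pipeline

comments = [
    "The code review tool is painfully slow on large diffs.",
    "CI has been rock solid this quarter.",
    "I spend half my day waiting for builds.",
]

classifier = pipeline("sentiment-analysis")  # downloads a default English model
for comment, result in zip(comments, classifier(comments)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {comment}")
```
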
122 | ## Trustworthiness of Insights
123 |
124 | The insights that you provide **must be _trustworthy_**. You do _not_ want to
125 | train your users to ignore the insights that you provide. If the insights you
126 | provide are wrong often enough, your users will learn to distrust them. Avoiding
127 | false insights is one of the _most important_ duties of any system that
128 | generates insights and data, because providing too much false insight for too
129 | long can destroy all the usefulness of your system.
130 |
131 | To be clear: _if your system frequently provides false insights to its users, it
132 | would have been better if you hadn't made the system at all_, because you will
133 | have spent a lot of effort to give people a system that confuses them,
134 | frustrates them, takes up their time, and which they eventually want to abandon
135 | and just "do it themselves."
136 |
137 | This isn't just a one-time thing you have to think about when you first write a
138 | system for generating insights. Your own monitoring, testing, and maintenance of
139 | the system should ensure that it continues to provide trustworthy insights to
140 | its users throughout its life.
141 |
142 | ## Accuracy of Data
143 |
144 | The most important attribute of **data** is that it needs to be as accurate as
145 | reasonably possible. We often combine data from many different sources in order
146 | to create insights. If each of these data sources is inaccurate in different,
147 | significant ways, then it’s impossible to trust the insights we produce from
148 | them. Inaccuracy (and compensating for it) also makes life very difficult and
149 | complex for people doing data analysis--the data that you are providing now
150 | creates a problem for its consumers rather than solving a problem for them.
151 |
152 | The degree of accuracy required depends on the
153 | [purpose](data-collection-principles.md) for which the
154 | data will be used. If you don’t know how it’s going to be used, then you should
155 | make the data as accurate as you can reasonably accomplish with the engineering
156 | resources that you have.
157 |
158 | If there is anything about your data that would not be **obvious to a casual
159 | viewer** (such as low accuracy in some areas) then you should publish that fact
160 | and make it known to your users somehow. For example, if you have a system that
161 | is accurate for large sample sizes but inaccurate for small sample sizes, it
162 | should say so on the page that presents the data, or it should print a warning
163 | (one that actually makes sense and explains things to the user) any time it's
164 | displaying information about small sample sizes.
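
As a sketch of what such a warning might look like in practice (the threshold of 30 data points below is an arbitrary illustration, not a recommendation):

```python
# A minimal sketch of surfacing an accuracy caveat alongside the data itself.
MIN_SAMPLE_SIZE = 30  # arbitrary illustrative threshold

def format_metric(name: str, value: float, sample_size: int) -> str:
    line = f"{name}: {value:.1f} (n={sample_size})"
    if sample_size < MIN_SAMPLE_SIZE:
        line += (
            f"  [Warning: based on fewer than {MIN_SAMPLE_SIZE} data points;"
            " this number may be inaccurate and shouldn't be used on its own.]"
        )
    return line

print(format_metric("Median build time (minutes)", 12.4, sample_size=7))
```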
165 |
166 | Next: [Qualitative vs Quantitative Measurements](qualitative-vs-quantitative.md)
167 |
--------------------------------------------------------------------------------
/audiences.md:
--------------------------------------------------------------------------------
1 | # Audiences: Always Know Who Your Data Is For
2 |
3 | When designing any system for metrics or data collection, you need to know who
4 | your _audience_ is. Otherwise, it is hard to get _action_ to be taken on the
5 | data.
6 |
7 | For example, let's take the metric "number of pastries consumed by employees."
8 | If you're the pastry chef at the company cafeteria, that's an interesting
9 | metric. If you're the head of Engineering, it's not an interesting metric.
10 |
11 | Ultimately, all our work serves _people_. Any organization is basically just an
12 | agreement between people. The pieces of the organization that actually exist in
13 | the physical universe are material objects (buildings, computers, etc.) and
14 | _individuals_. We serve individuals.
15 |
16 | For engineering metrics and feedback systems, there are a few broad categories
17 | of individuals that we serve, who have very different requirements. How we
18 | provide [data and insights](data-vs-insights.md) to these audiences is very
19 | different for each audience.
20 |
21 | **When you propose a metric, system, or process, you should always say which of
22 | the groups below you are serving and how you are serving them.**
23 |
24 | - [Front-Line Developers](#front-line-developers)
25 | - [Front-Line Managers](#front-line-managers)
26 | - [Engineering Leadership](#engineering-leadership)
27 | - [Tool Owners](#tool-owners)
28 | - [Productivity Champions](#productivity-champions)
29 | - [Levels of Metrics](#levels-of-metrics)
30 |
31 | ## Front-Line Developers
32 |
33 | A "front-line developer" is any person who directly writes or reviews code.
34 |
35 | Developers are best served by delivering [insights](data-vs-insights.md) to them
36 | within their natural workflow. For example, insights that directly help them
37 | during a code review, displayed in the code review tool. Imagine if we could
38 | tell a developer, "This change you are about to make will increase the build
39 | time of your codebase by 50%." Those are the sort of insights that most help
40 | developers---actionable information displayed right when they can act on it.
41 |
42 | Developers also need [data](data-vs-insights.md) when they are making decisions
43 | about how to do specific work (such as "what's the most important performance
44 | problem to tackle for my users?" or "how many users are being affected by this
45 | bug?"). Developers do not usually need [metrics](goals-signals-metrics.md) in
46 | dashboards. Instead, they need analytical tools---systems that allow them to
47 | dive into data or debug some specific problem.
48 |
49 | Note that this is actually the _largest_ group of people that we serve, and
50 | often the place where we can make the most impact. It's easy to assume
51 | that because Directors and VPs are "important people," it is more
52 | important to serve them, but **we move more of the organization and make more
53 | change by providing actionable insights to developers who are doing the actual
54 | work of writing our systems**.
55 |
56 | ## Front-Line Managers
57 |
58 | By this, we mean managers who directly manage teams of developers. Often,
59 | managers of managers have similar requirements to front-line managers, and so
60 | could also be covered as part of this audience.
61 |
62 | We provide [data](data-vs-insights.md) that front-line managers can use to form
63 | their own insights about what their team should be working on or how they should
64 | prioritize work on their team. Front-Line Managers usually have the time (and
65 | desire) to process the data relevant to their team and turn it into insights
66 | themselves. We do also provide _some_ insights that front-line managers can use
67 | directly to inform their decisions.
68 |
69 | Managers tend to have a regular cadence of meetings where they can look at
70 | dashboards, so putting [metrics](goals-signals-metrics.md) in dashboards is
71 | helpful for them.
72 |
73 | The information we provide for front-line managers should feed into decisions as
74 | small as "what should we work on for the next two weeks?"
75 |
76 | ## Engineering Leadership
77 |
78 | By "engineering leadership," we usually mean SVPs, VPs, Sr. Directors, and
79 | Directors in Engineering. This can also include very senior engineers at the
80 | company who will have similar requirements (though they also end up fitting into
81 | multiple other audiences, as well). Essentially, this category includes anybody
82 | who is distant from the day-to-day details of the thing they are in charge of.
83 |
84 | We provide [insights](data-vs-insights.md) to engineering leadership that
85 | either:
86 |
87 | 1. Allow them to choose what direction the organization should go in, or
88 | 2. Convince them what direction to go in based on sound reasoning we provide
89 | (which usually would mean an argument based on data).
90 |
91 | The result here should be that an engineering leader does one of these three
92 | things:
93 |
94 | 1. Tells actual people to go do actual work.
95 | 2. Decides that no work needs to be done.
96 | 3. Decides on the _prioritization_ of work--deciding _when_ work will be done,
97 |    by whom.
98 |
99 | Engineering Leaders often just need a system that shows them that a problem
100 | _exists_, so that they can ask for a more detailed investigation to be done by
101 | somebody who reports to them.
102 |
103 | The decisions made at this level are usually _strategic_ decisions--at the
104 | shortest, multi-week, at the longest, multi-year, and so the insights we provide
105 | to engineering leadership should guide decisions at that level.
106 |
107 | ## Tool Owners
108 |
109 | This is a manager or senior developer whose team works on developer tools or
110 | infrastructure.
111 |
112 | Tool Owners need [data and insights](data-vs-insights.md) that help them
113 | understand their users and how to make the most impact with their
114 | infrastructure. When in doubt, err on the side of data instead of insights,
115 | because the requirements of Tool Owners are complex, and they often can spend
116 | the time to dive into the data themselves.
117 |
118 | Tool Owners need analytical systems that allow them to dive into data to
119 | understand the specifics of tool usage, workflows, problems developers are
120 | having, etc. While Front-Line Developers need such tools to analyze their own
121 | codebases, Tool Owners need these tools for analyzing _the whole company_.
122 |
123 | For example, an owner of the build tool might need to ask, "Which team is having
124 | the worst build experience?" They would need to be able to define what "worst
125 | build experience" means themselves (so basically, just slicing the data any way
126 | they need to). Then, they would need to understand the detailed specifics of
127 | what is happening with individual builds, so their teams can write code to solve
128 | the problem.
129 |
130 | Tool Owners also benefit from having [metrics](goals-signals-metrics.md) in
131 | dashboards. They may need many metrics (maybe 10 or more) to be able to
132 | understand if the experience of their users is getting better or worse. At any
133 | given time, their teams might be focused on one or two metrics, but usually the
134 | front-line developers _within_ the Tool Owner team will need to look at other
135 | metrics to do complete investigations.
136 |
137 | ## Productivity Champions
138 |
139 | This is a person who cares about the developer productivity of a team or a set
140 | of teams, and takes it as their responsibility to do something about it
141 | directly, advise a senior executive what to do about it, or encourage other
142 | teams to take action on it.
143 |
144 | We don’t think of this person as a Tool Owner (even though one person may on
145 | rare occasion be both a Tool Owner and a Productivity Champion).
146 |
147 | This audience has the most complex set of requirements. Essentially, they need
148 | all the tools of all the other audiences, combined. They need to look at broad
149 | overviews of data to understand where there are problems, and then they need
150 | detailed dashboards and analytical tools to dive into the specifics of the
151 | problem, themselves.
152 |
153 | ## Levels of Metrics
154 |
155 | When designing [metrics](goals-signals-metrics.md), there is another thing to
156 | think about besides the above audiences, which is "what level of the org chart
157 | is this audience at?" For example, the entire developer tools org might have a
158 | set of metrics that measure the overall success of that org. However, individual
159 | teams within that org would have more detailed, lower-level metrics.
160 |
161 | In general, teams should know what their top-level metrics are---the most
162 | important metrics that they are driving, which measure how well they are
163 | achieving their [goals](goals-signals-metrics.md). There can be many top-level
164 | metrics, so long as they are a good representation of the aggregate
165 | accomplishments of the team toward the team's goals.
166 |
167 | There can be many other metrics that a team has besides their top-level metrics.
168 | Front-line developers might need a large set of metrics to be able to judge the
169 | effectiveness of their changes or what area should be worked on next. The
170 | reasons to have top-level metrics are:
171 |
172 | 1. So that a team can focus on specific numbers that they are driving---focusing
173 |    a team around "these are the numbers we are driving" is one of the most
174 |    effective ways to get action taken on metrics.
175 | 2. So that people don't have to look at _so many_ graphs that each graph
176 | individually becomes meaningless. No matter how smart a leader is, they can't
177 | look at 100 graphs _simultaneously_ and make any sensible decision about
178 | them. They _could_ look at 10 graphs, though.
179 |
180 | We call the top-level metrics of a team the "Key Impact Metrics."
181 |
182 | Next: [Driving Decisions With Data](driving-decisions.md)
183 |
--------------------------------------------------------------------------------
/metrics-and-performance-reviews.md:
--------------------------------------------------------------------------------
1 | # Metrics and Performance Reviews
2 |
3 | It is dangerous to use numbers representing the _volume of output_ of a software
4 | engineer to determine their job performance—numbers like "lines of code
5 | produced," "number of changes submitted to the repository," "numbers of bugs
6 | fixed," etc. This doc explains why and offers some alternative suggestions of
7 | how to understand and manage the performance of software engineers.
8 |
9 | - [Why Is This Dangerous?](#why-is-this-dangerous)
10 | - [Perverse Incentives](#perverse-incentives)
11 | - [Doesn’t Actually Do What You Want](#doesnt-actually-do-what-you-want)
12 | - [Clouding the Metrics](#clouding-the-metrics)
13 | - [What Do You Do Instead?](#what-do-you-do-instead)
14 | - [Examples of Good Metrics](#examples-of-good-metrics)
15 |
16 | ## Why Is This Dangerous?
17 |
18 | ### Perverse Incentives
19 |
20 | When you measure the output of a software engineer instead of their impact, you
21 | create perverse incentives for software engineers to behave in ways that
22 | ultimately damage your business.
23 |
24 | Let’s say that you see that a software engineer has submitted very few pieces of
25 | code to the repository in this quarter. If you use this to manage their
26 | performance, you might say, "Let’s figure out how you can submit more code."
27 | Sometimes, this will result in good behaviors—perhaps they were spending too
28 | long on one large change that they will now submit as three smaller changes,
29 | making everybody happier. Nice.
30 |
31 | However, there’s absolutely no guarantee that this _will_ result in good
32 | behavior, because all you’re measuring is this absolute number that doesn’t
33 | actually tie back to the business impact that the developer is having. The same
34 | person, at a different time, could look at a change they have that really
35 | _should_ be submitted all at once, and split it into 100 different changes just
36 | because it will cause their numbers to go up. That causes a ton of unnecessary
37 | work for them and their code reviewer, but it looks great on their performance
38 | review. Then they may even sit back and delay the rest of their work until
39 | _next_ quarter because they "already got their change count up." This may sound
40 | remarkable but is actually very rational behavior for a person whose primary
41 | concern is their performance review and who knows specifically that this metric is
42 | being used as an important factor in that review.
43 |
44 | ### Doesn’t Actually Do What You Want
45 |
46 | It’s assumed that we are managing the performance of software engineers because
47 | we have some goal as a business, and software engineers are here to contribute
48 | to that goal. So we are managing their performance to make sure that everybody
49 | contributes to that goal as much as possible and that the business succeeds.
50 |
51 | The goal of the business is not "to write code." Thus, when you measure how
52 | _much_ code somebody is writing, you don’t actually get a sense of how much they
53 | are contributing to the business. Sometimes, you do get a general idea—person X
54 | writes 100 changes a month and person Y writes 4. But honestly, that’s not even
55 | meaningful. What if those 4 changes required a ton of background research and
56 | made the company a million dollars, while those 100 changes were all sloppily
57 | done and cost the company a million dollars in lost productivity and lost users?
58 |
59 | The same happens with measuring "how many bugs got fixed." It’s possible that
60 | somebody spent a lot of time fixing one bug that was very valuable, and somebody
61 | else spent the same time fixing 10 bugs that didn’t actually matter at all, but
62 | were easy to handle. The point here isn’t how much time got spent—it’s about
63 | how valuable the work ended up being. If you measure _how much effort_ people
64 | are putting into things, you will get an organization whose focus is on _making
65 | things more difficult_ so that they can show you _how hard it was to solve their
66 | problems_.
67 |
68 | ### Clouding the Metrics
69 |
70 | All of the metrics that we have around our developer tools, such as volume of
71 | code reviews, speed of code reviews, speed of builds, etc. have definite
72 | purposes that have nothing to do with individual performance reviews. Taken in
73 | aggregate across a large number of developers, these numbers show trends that
74 | allow business leaders and tool developers to make intelligent decisions about
75 | how best to serve the company. When you look at a group of 1000 developers, the
76 | differences between "I worked on one important thing for 100 hours" and "I
77 | worked on 100 unimportant things for 1 hour each" all even out and fade away,
78 | because you’re looking at such a large sample. So you can actually make
79 | intelligent statements and analysis about what’s happening, at that scale. If
80 | the volume of code reviews drops _for the whole company_ for a significant
81 | period of time, that’s something we need to investigate.
82 |
83 | However, if you make people behave in unusual ways because their _individual
84 | performance_ is being measured by the same metrics, then suddenly it’s hard for
85 | business leaders to know if the numbers they are looking at are even _valid_.
86 | How do we know if code review volume is going up for some good reason related to
87 | our tooling improvements, or just because suddenly everybody started behaving in
88 | some weird way due to their performance being measured on these numbers? Maybe
89 | code review times went down on Team A because their performance was measured on
90 | that, so they all started rubber-stamping all code reviews and not really doing
91 | code review. But then some executive comes along and says, "Hey, Team A has much
92 | lower code review times than our other teams, can we find out what they are
93 | doing and bring that practice to other teams?" Obviously, in this situation what
94 | would really happen is that we would find out what Team A was doing and would
95 | correct it. But ideally this confusion and investigation from leadership would
96 | never have to happen in the first place, because nobody should be measuring the
97 | performance of individual software engineers based on such a metric.
98 |
99 | ## What Do You Do Instead?
100 |
101 | There are two types of measurements that you can use: qualitative (subjective)
102 | measurements, like surveys and talking with your reports, and quantitative
103 | (objective) measurements, like numbers on graphs. This document is mostly about
104 | the quantitative side of things, because we’ve found a particular problem with
105 | that. We won’t cover the qualitative aspects here.
106 |
107 | If you really want quantitative measurements for your team or developers, the
108 | best thing to do is to figure out the goal of the projects that they are working
109 | on, and determine a metric that measures the success of that goal. This will be
110 | different for every team, because every team works on something different.
111 |
112 | The truth is, "programming" is a _skill_, not a _job._ We wouldn’t measure the
113 | _performance of a skill_ to understand the _success of a job._ Let me give an
114 | analogy. Let’s say you are a carpenter. What’s the metric of a carpenter? You
115 | can think about that for a second, but I’ll tell you, you won’t come up with a
116 | good answer, because it’s a fundamentally bad question. I have told you that a
117 | person has a _skill_ (carpentry) but I haven’t actually told you what their
118 | _job_ is. If their job is that they own a furniture shop that produces custom
119 | furniture, then the success there is probably measured by "furniture delivered"
120 | and "income produced greater than expenses." But what if they are a carpenter on
121 | a construction job site? Then their success is probably measured by "projects
122 | that are complete and have passed inspection." As you can see, a skill is hard
123 | to measure, but a _job_ is something that you can actually understand.
124 |
125 | So to measure the success of a programmer, you have to understand what their job
126 | is. Hopefully, it’s tied to some purpose that your team has, which is part of
127 | accomplishing the larger purpose of the whole company somehow. That purpose
128 | results in some product, that product has some effect, and you can measure
129 | something about that product or that effect.
130 |
131 | Yes, sometimes it’s hard to tie that back to an individual software engineer.
132 | That’s where your individual judgment, understanding, communication, and skills
133 | as a manager come into play. It’s up to you to understand the work that’s
134 | actually being done and how that work affected the metrics you’ve defined.
135 |
136 | And yes, it’s also possible to design success metrics for your work that are
137 | hard to understand, difficult to take action on, or that don’t really convince
138 | anybody. There is a [whole system of designing and using
139 | metrics](goals-signals-metrics.md) that can help get around those problems.
140 |
141 | ### Examples of Good Metrics
142 |
143 | These are just examples. There could be as many metrics in the world as there
144 | are projects.
145 |
146 | **User-Facing Project:** Usually, your work is intended to impact some business
147 | metric. For example, maybe you’re trying to improve user engagement. You
148 | can do an experiment to prove how much your work affects that business metric,
149 | and then use that impact as the metric for your work, as long as you can see
150 | that same impact after you actually release the new feature. Bugs could be
151 | thought of as impacting some metric negatively, even if it’s just user
152 | sentiment, and thus one can figure out a metric for bug fixing that way.
153 |
154 | **Refactoring Projects:** Let’s say that you have an engineer who has to
155 | refactor 100 files across 25 different codebases. You could measure how many of
156 | those refactorings are done. You could count it by file, by codebase, or by
157 | whatever makes sense. You could also get qualitative feedback from developers
158 | about how much easier the code was to use or read afterward. Some refactoring
159 | projects improve reliability or other metrics, and can be tracked that way. It
160 | just depends on what the intent is behind the project.
161 |
162 | Next: [Productivity Concepts for Software Developers](productivity-concepts.md)
--------------------------------------------------------------------------------
/developer-personas.md:
--------------------------------------------------------------------------------
1 | # Developer Personas
2 |
3 | We segment developers into "personas" based on their development workflow.
4 |
5 | - [Why Personas?](#why-personas)
6 | - [How to Define Personas](#how-to-define-personas)
7 | - [Define the Categories](#define-the-categories)
8 | - [Sub-Personas](#sub-personas)
9 | - [Initial Research](#initial-research)
10 | - [Categorizing Developers Into Personas](#categorizing-developers-into-personas)
11 | - [Example Personas](#example-personas)
12 | - [What To Do With Personas](#what-to-do-with-personas)
13 | - [What About Using Personas For Quantitative Metrics?](#what-about-using-personas-for-quantitative-metrics)
14 |
15 | ## Why Personas?
16 |
17 | It's very easy to assume, as somebody who works on developer productivity, that
18 | one knows all about software development—after all, one is a software developer!
19 | However, it turns out that different types of development require very different
20 | workflows. If you've never done mobile development, web development, or ML
21 | development, for example, you might be very surprised to learn how different the
22 | workflows are!
23 |
24 | One of the most common mistakes that developer productivity teams make is only
25 | focusing on the largest group of developers at the company. For example, many
26 | companies have _far_ more backend server engineers than they have mobile frontend
27 | engineers, and so they assume that most (or all) of the developer productivity
28 | work should go toward those backend engineers.
29 |
30 | What this misses out on is the _importance to the business_ of the various
31 | different types of developers at the company. You might only employ a few
32 | mobile engineers, but how much impact does their work have for your customers?
33 | Similarly, Machine Learning Engineers have a _very_ different workflow and set
34 | of pain points from backend server developers—in fact, there's often no overlap
35 | at all in their pain points and commonly-used tools. But in many companies
36 | today, machine learning and AI are key to the success of their business.
37 |
38 | If your developer productivity team has been focusing on only one type of
39 | developer, and it seems like some parts of the company are very upset with you,
40 | this might be why.
41 |
42 | ## How to Define Personas
43 |
44 | ### Define the Categories
45 |
46 | We segment developers into _broad_ categories by their _workflow_. Obviously,
47 | each developer works in a slightly different way. But you will find groups that
48 | have large things in common in terms of the *type* of work that they do.
49 |
50 | For example, you may see that there is a large group of developers who all work
51 | in Java making "backend" servers. They might use different frameworks in Java,
52 | different editors, different CI systems, or even different deployment platforms.
53 | But you'll find that a _lot_ of their tooling is common within the group, or
54 | fits into 2-3 categories (like one team uses The New CI System and another team uses
55 | The Old CI System).
56 |
57 | Since we use this data for survey analysis (as described below), we try to have
58 | our Persona groups be large—at least 200 people, so that if only a small
59 | percentage respond to the survey, we can still get statistically significant
60 | data about the persona as a whole. Of course, if your company is smaller, you
61 | might be doing interviews instead of surveys, in which case the personas should
62 | be whatever size makes sense for you. The key is: don't have _too many_
63 | personas, because that makes survey analysis hard. We currently have ten
64 | developer personas for an engineering org with thousands of people in it, and
65 | that number has seemed manageable.
66 |
67 | Of course, if you have a small or medium-sized engineering team, you won't be
68 | able to get groups of 200 people (that might be larger than your whole
69 | engineering team!). If so, segment them out by the workflows that are most
70 | important to the business.
71 |
72 | #### Sub-Personas
73 |
74 | People often ask if they can split the personas down even further and have
75 | "sub-personas." Sure, you can totally do that. For example, you might have one
76 | overall persona for people whose job it is to maintain production systems
77 | (called SREs, DevOps Engineers, or System Administrators). However, you might
78 | have one large sub-group that works on creating _tools_ for production
79 | management, and another large sub-group that works on handling major incidents
80 | in production. Those could be two sub-personas, because their workflows and
81 | needs are very different, even though they have _some_ things in common.
82 |
83 | The key here is to not make your survey analysis too complex. If the person
84 | doing the survey analysis thinks there is value in separating out feedback and
85 | scores for the SRE persona into these two categories, that's fine. However, you
86 | might want to still review both of these sub-personas in the same meeting or
87 | document or whatever process you decide on for doing your survey analysis.
88 |
89 | ### Initial Research
90 |
91 | Once you have some idea of what categories exist, you'll likely want to do some
92 | initial research into the current workflows of those engineers. Don't go too
93 | overboard with this. It often is enough just to interview a set of engineers
94 | (not just managers or executives) who are members of that persona. You'll want
95 | to ask them about their workflows and what parts of that workflow are the most
96 | frustrating. This should be a live dialogue, not an email or a survey, so that
97 | you can ask follow-up questions to clarify. You will also learn what questions
98 | you should ask this persona in future surveys.
99 |
100 | Then you can synthesize the collected information into a document, and present
101 | this document to the relevant stakeholders.
102 |
103 | One thing that can be useful here is simply describing what the usual workflow
104 | is for this type of developer. This is useful because when a tool developer
105 | wants to support all the personas at the company, they can start off by reading
106 | the descriptions that you've written, instead of having to figure out those
107 | workflows themselves all over again.
108 |
109 | You'll also want to try to get a count of how many people are in each persona,
110 | even just an approximation, as this question comes up frequently, in our
111 | experience.
112 |
113 | ### Categorizing Developers Into Personas
114 |
115 | Now you will need some sort of system that categorizes developers into personas.
116 | That is, some system that will tell you what personas a person is part of. Don't
117 | create a system where managers or engineers have to fill out this data manually.
118 | It will get out of date or be missing data. Instead, there are multiple data
119 | sources you can combine to figure this out:
120 |
121 | * A person's title.
122 | * A person's management chain.
123 | * The "cost center" that they belong to (usually tracked by your Finance team).
124 | * Which systems / tools they use (possibly taking into account _how much_ they
125 | use them).
126 | * What codebases or file types they contribute to (or review).
127 |
128 | When you're starting out, it is simplest to begin with whatever data is
129 | easiest to get, and accept some percentage of inaccuracy in the system. (For
130 | example, we accepted a 10% inaccuracy in the early days of our Personas system,
131 | and it didn't cause any real problems.)
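
As a rough sketch of what such a rule-based categorization might look like (all of the field names, tool names, and rules below are hypothetical illustrations, not our actual system):

```python
# A minimal sketch of combining several data sources into persona assignments.
from dataclasses import dataclass, field

@dataclass
class Developer:
    username: str
    title: str
    tools_used: set[str] = field(default_factory=set)
    languages_committed: set[str] = field(default_factory=set)

def personas_for(dev: Developer) -> set[str]:
    personas = set()
    if "ios" in dev.title.lower() or "Swift" in dev.languages_committed:
        personas.add("iOS Developer")
    if "Java" in dev.languages_committed and "deploy-tool" in dev.tools_used:
        personas.add("Backend Developer")
    if "sre" in dev.title.lower():
        personas.add("SRE")
    return personas

dev = Developer("alice", "Senior Software Engineer",
                tools_used={"deploy-tool"}, languages_committed={"Java"})
print(personas_for(dev))  # {'Backend Developer'}; one person can have several
```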
132 |
133 | Note: One developer can have multiple personas; that's fine. Sometimes people
134 | ask if we should have a "master" persona for each person, but _usually_ when we
135 | investigate, we discover the requester has misunderstood the purpose of
136 | personas, or they are trying to work around a limitation in some other system.
137 | Usually when a person has multiple personas, it's because they genuinely work in
138 | all those workflows. Of course, it's always possible that a valid use case for
139 | the "master persona" concept will come up at some point in the future.
140 |
141 | #### Example Personas
142 |
143 | Here are some of the personas we have defined at LinkedIn:
144 |
145 | * Backend Developer
146 | * Data Scientist
147 | * Machine Learning / AI Engineer
148 | * Android Developer
149 | * iOS Developer
150 | * SRE
151 | * Tools Developer
152 | * Web Developer
153 |
154 | Plus we have two other personas that are unique to how our internal systems
155 | work (they are not included here because that would require too much explanation
156 | for too little value).
157 |
158 | ## What To Do With Personas
159 |
160 | For us, the Developer Personas system is used primarily as part of our survey
161 | analysis, to split out pain points by developer persona.
162 |
163 | We have a person called a "Persona Champion" who is a member of that persona
164 | (for example, for the "Backend Developer" persona, the Persona Champion is a
165 | backend developer). They do the analysis of the comments and the survey scores,
166 | and work with infrastructure teams to help them understand the needs of that
167 | Persona.
168 |
169 | It is helpful to have a person who is familiar with the tools and workflow of
170 | the persona doing the analysis, because they can pick up important details that
171 | others will miss, and they have the context to "fill in the gaps" of vague or
172 | incomplete comments. (Or, they know who they can go talk to to fill in those
173 | gaps, because they know other developers who share their workflow.)
174 |
175 | For more about persona champions, see the [detailed description of their
176 | duties](persona-champions.md) (which go beyond just survey analysis).
177 |
178 | ### What About Using Personas For Quantitative Metrics?
179 |
180 | We have found minimal value in splitting our quantitative metrics by persona.
181 | Even when it seems like you would want to do that, there is usually another
182 | dimension that would be better for doing analysis of the quantitative metrics.
183 |
184 | For example, let's say you want to analyze build times for Java Backend
185 | Developers vs JavaScript Web Developers. Splitting the data by persona would
186 | get you confusing overlaps—you might have some full-stack engineers who write
187 | both Java and JavaScript. It makes the data confusing and hard to analyze.
188 |
189 | What you would want to do instead in that situation is analyze the build speed
190 | based on what language is being compiled. Then you would have actionable data
191 | that the owners of the build tools could use to speed things up, as opposed to
192 | a confusing mash of mixed signals.
193 |
194 | Next: [Persona Champions](persona-champions.md)
195 |
--------------------------------------------------------------------------------
/data-collection-principles.md:
--------------------------------------------------------------------------------
1 | # Data Collection Principles
2 |
3 | Teams who make metrics and do data analysis often wonder: how much data should I
4 | collect? How long should I retain it for? Are there important aspects of how I
5 | should structure the data?
6 |
7 | In general, the concepts here are:
8 |
9 | 1. Collect as much data as you possibly can. This becomes the "lowest layer" of
10 | your data system.
11 | 2. Refine that, at higher "layers," into data stores designed for specific
12 | purposes.
13 | 3. Always be able to understand how two data points connect to each other,
14 | across any data sets. (That is, always know what your "join key" is.)
15 |
16 | For example, imagine that you are collecting data about your code review tool.
17 |
18 | Ideally, you record every interaction with the tool, every important touchpoint,
19 | etc. into one data set that knows the exact time of each event, the type of
20 | event, all the context around that event, and important "join keys" that you
21 | might need to connect this data with other data sources (for example, the ID of
22 | commits, the IDs of code review requests, the username of the person taking
23 | actions, the username of the author, etc.).
24 |
25 | Then you figure out specific business requirements you have around the data. For
26 | example, you want to know how long it takes reviewers to respond to requests for
27 | review. So you create a higher-level data source, derived from this "master"
28 | data source, which contains only the events relevant for understanding review
29 | responses, with the fields structured in such a way that makes answering
30 | questions about code review response time really easy.
31 |
32 | That's a basic example, but there's more to know about all of this.
33 |
34 | - [When In Doubt, Collect Everything](#when-in-doubt-collect-everything)
35 | - [Purpose](#purpose)
36 | - [Boundaries, Intentions, and Join Keys](#boundaries-intentions-and-join-keys)
37 |
38 | ## When In Doubt, Collect Everything
39 |
40 | It is impossible to go back in time and instrument systems to answer a question
41 | that you have in the present. It is also impossible to predict every question
42 | you will want to answer. You must _already have_ the data.
43 |
44 | As a result, you should strive to collect every piece of data you can collect
45 | about every system you are instrumenting. The only exceptions to this are:
46 |
47 | 1. Don't collect so much data that it becomes extremely expensive to store, or
48 | _very_ slow to query. Sometimes there are low-value events that you can
49 | pre-aggregate, only store a sample of, or simply skip tracking at all. Note
50 | that you have to actually prove that storage would be expensive or
51 | slow---often, people _believe_ something will be expensive or slow when it's
52 | really not. Storage is cheap and query systems can be faster than you expect.
53 | 2. Some data has security or privacy restrictions. You need to work with the
54 | appropriate people inside the company to determine how this data is supposed
55 | to be treated.
56 |
57 | As an extreme example, you could imagine a web-logging system that stored the
58 | entirety of every request and response. After all, that's "everything!" But it
59 | would be impossible to search, impossible to store, and an extremely complex
60 | privacy nightmare.
61 |
62 | The only other danger of "collecting everything" is storing the data in such a
63 | disorganized or complicated way that you can't make any sense of it. You can
64 | solve that by keeping in mind that no matter what you're doing, you always want
65 | to produce [insights](data-vs-insights.md) from the data at some point. Keep in
66 | mind a few questions that you know people want to answer, and make sure that
67 | it's at least theoretically possible to answer those questions with the data
68 | you're collecting, with the fields you have, and with the format you're storing
69 | the data in.
70 |
71 | If your data layout is well-thought-out and provides sufficient coverage to
72 | answer almost any question that you could imagine about the system (even if it
73 | would take some future work to actually understand the answers to those
74 | questions) then you should be at least somewhat future-proof.
75 |
76 | ## Purpose
77 |
78 | In general, you always want to have some idea of why you are collecting data. At
79 | the "lowest level" of your data system, the telemetry that "collects
80 | everything," this is less important. But as you derive higher-level tables from
81 | that raw data, you want to ask yourself things like:
82 |
83 | 1. What questions are people going to want to answer with this data?
84 | 2. _Why_ are people going to ask those questions? How does answering those
85 | questions help people? (This gives you insight into how to present the data.)
86 | 3. What dimensions might people want to use to slice the data?
87 |
88 | This is where you take the underlying raw data and massage it into a format that
89 | is designed to solve specific problems. In general, you don't want to expose the
90 | underlying complex "collect everything" data store to the world. You don't even
91 | want to expose it directly to your dashboards. You want to have simpler tables
92 | _derived from_ the "everything" data store---tables that are designed for some
93 | specific purpose.
94 |
95 | You can have a hierarchy of these tables. Taking our code review example:
96 |
97 | 1. You start off with the "everything" table that contains time series events of
98 | every action taken with the tool.
99 | 2. From that, you derive a set of tables that let you view just comments,
100 | approvals, and new code pushes, with the "primary key" being the Code Review
101 | ID (so it's easy to group these into actions that happened during a
102 | particular code review process).
103 | 3. Then you want to make a dashboard that shows how quickly reviewers responded
104 | on each code review. You could actually now make one table that just contains
105 | the specific derived information the dashboard needs.
106 |
107 | You'll find that people rarely ever want to directly query the table from Step 1
108 | (because it's hard to do so), sometimes want to query the table from Step 2, and
109 | the table from Step 3 becomes a useful tool in and of itself, even beyond just
110 | the dashboard. That is, the act of creating a table specifically for the
111 | dashboard makes a useful data source that people sometimes want to query
112 | directly.
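
Here is a minimal sketch of that layering using pandas. The column names and event types are hypothetical; the point is deriving a purpose-built table from the raw events, not the exact schema.

```python
# Deriving a purpose-built "review response time" table from raw events.
import pandas as pd

# Step 1 (stand-in): the "everything" table of raw, timestamped events.
raw_events = pd.DataFrame([
    {"review_id": 42, "event": "review_requested", "actor": "alice",
     "timestamp": "2024-01-08T10:00:00"},
    {"review_id": 42, "event": "first_comment", "actor": "bob",
     "timestamp": "2024-01-08T13:30:00"},
])
raw_events["timestamp"] = pd.to_datetime(raw_events["timestamp"])

# Step 2: keep only the events relevant to review responses.
responses = raw_events[raw_events["event"].isin(
    ["review_requested", "first_comment"])]

# Step 3: one row per code review, shaped for the dashboard's question:
# "how long did reviewers take to respond?"
per_review = responses.pivot(index="review_id", columns="event",
                             values="timestamp")
per_review["response_time"] = (per_review["first_comment"]
                               - per_review["review_requested"])
print(per_review[["response_time"]])
```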
113 |
114 | Sometimes, trying to build one of these purpose-built tables will also show you
115 | gaps in your data-collection systems. It can be a good idea to have one of these
116 | purpose-built tables or use cases in mind even when you're designing your
117 | systems for "collecting everything," because they can make you realize that you
118 | missed some important data.
119 |
120 | In general, the better you know the requirements of your consumers, the better
121 | job you can do at designing these purpose-built tables. It's important to
122 | understand the current and potential requirements of your consumers when you
123 | design data-gathering systems. This should be accomplished by actual research
124 | into requirements, not just by guessing.
125 |
126 | ## Boundaries, Intentions, and Join Keys
127 |
128 | It should be possible to know when a large workflow starts, and when it ends. We
129 | should know that a developer intended something to happen, the steps involved in
130 | accomplishing that intention, when that whole workflow started, and when it
131 | ended. We should not have to develop a complex algorithm to determine these
132 | things from looking at the stored data. The stored data should contain
133 | sufficient information to make it very easy to answer these questions.
134 |
135 | We need to be able to connect every event within a workflow as being part of
136 | that workflow, and we need to know its boundaries---its start point and end
137 | point.
138 |
139 | For example, imagine that we have a deployment system. Here's a set of events
140 | that represent a _bad_ data layout:
141 |
142 | 1. User Alice requested that we start a deployment workflow named "Deploy It" at
143 | 10:00.
144 | 2. Binary Foo was started on Host Bar at 10:05.
145 | 3. Binary Baz was started on Host Bar at 10:10.
146 | 4. Binary Baz responded "OK" to a health check at 10:15.
147 | 5. Binary Foo responded "OK" to a health check at 10:20.
148 |
149 | We have no idea that "Deploy It" means to deploy those two binaries. What if
150 | there are a hundred simultaneous workflows going on? What if "Deploy It" has
151 | been run more than once in the last five minutes? We have no idea that those
152 | health checks signal the end of the deployment. In fact, _do_ they signal the
153 | end of the deployment? Are there other actions that "Deploy It" is supposed to
154 | do? I'm sure the author of "Deploy It" knows the answers to that, but we, a
155 | central data team, have _no way_ of knowing that, because it's not recorded in
156 | the data store.
157 |
158 | A better data layout would look like:
159 |
160 | 1. User Alice started the deployment workflow "Deploy It" at 10:00. This
161 | indicates an intent to deploy Binary Foo and Binary Baz to 200 machines. We
162 | give this specific instance of the workflow an ID: "Deploy It 2543."
163 | 2. We record all of the actions taken by "Deploy It 2543" and tag them in our
164 |    data store with that ID.
165 | 3. The deployment workflow itself contains a configuration variable that allows
166 |    users to specify how many hosts must be successfully deployed to
167 | before we consider the deployment "successful." For example, let's say this
168 | one requires only 175 hosts to be deployed to be "successful." Once 175 hosts
169 | are deployed, we record an event in the data store indicating successful
170 | completion of the deployment workflow. (We also record failure in a similar
171 | way, if the workflow fails, but noting that it's a failure along with any
172 | necessary details about the failure.)
173 |
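As a sketch, the difference shows up directly in the shape of the event records themselves (the field names and values below are hypothetical):

```python
# Every event carries the workflow's join key, and the workflow's boundaries
# (start, intent, and completion) are explicit events rather than something
# you have to infer later.
events = [
    {"workflow_id": "deploy-it-2543", "type": "workflow_started",
     "actor": "alice", "intent": {"binaries": ["foo", "baz"], "hosts": 200},
     "time": "10:00"},
    {"workflow_id": "deploy-it-2543", "type": "binary_started",
     "binary": "foo", "host": "bar", "time": "10:05"},
    {"workflow_id": "deploy-it-2543", "type": "health_check_ok",
     "binary": "foo", "host": "bar", "time": "10:20"},
    {"workflow_id": "deploy-it-2543", "type": "workflow_succeeded",
     "hosts_deployed": 175, "time": "10:45"},
]

# Reconstructing one workflow is a simple filter, not a guessing game.
deploy_it = [e for e in events if e["workflow_id"] == "deploy-it-2543"]
started = next(e for e in deploy_it if e["type"] == "workflow_started")
finished = next(e for e in deploy_it if e["type"] == "workflow_succeeded")
print(f"'Deploy It 2543' ran from {started['time']} to {finished['time']}")
```
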
174 | You don't have to figure out every workflow in advance that you might want to
175 | measure. When you know a workflow exists, record its start and end point. But
176 | even when you don't know that a workflow exists, make sure that you can always
177 | see _in the data store_ that two data points are related when they are related.
178 | For example, record that a merge is related to a particular PR. Record that a
179 | particular PR was part of a deployment. Record that an alert was fired against a
180 | binary that was part of a particular deployment. And so forth. Any two objects
181 | that _are_ related should be able to be easily connected by querying your data
182 | store.
183 |
184 | Next: [Principles and Guidelines for Metric Design](metric-principles.md)
--------------------------------------------------------------------------------
/dph-goals-and-signals.md:
--------------------------------------------------------------------------------
1 | # Developer Productivity and Happiness: Goals and Signals
2 |
3 | This is a proposal, at the highest level, of what we try to measure to improve
4 | the productivity of engineering teams. This should help guide the creation of
5 | metrics and analysis systems. This is not a complete listing of every single
6 | thing that we want to measure regarding developer productivity. It is an
7 | aspirational description of _how_ we would ideally be measuring things and _how_
8 | we should come up with new metrics.
9 |
10 | It is based on the [Goals, Signals, Metrics system](goals-signals-metrics.md).
11 | This document contains our primary Goals and Signals.
12 |
13 | - [Goals](#goals)
14 | - [Productive](#productive)
15 | - [Happy](#happy)
16 | - [Not Limited to Tools and Infrastructure](#not-limited-to-tools-and-infrastructure)
17 | - [Signals](#signals)
18 | - [Productive](#productive-1)
19 | - [Efficient](#efficient)
20 | - [Caution: Don't Compare Individual Efficiency](#caution-dont-compare-individual-efficiency)
21 | - [Effective](#effective)
22 | - [Over-Indexing On A Single Effectiveness Metric](#over-indexing-on-a-single-effectiveness-metric)
23 | - [Happy](#happy-1)
24 |
25 | ## Goals
26 |
27 | The simplest statement of our goal would be:
28 |
29 | Developers at LinkedIn are **productive** and **happy**.
30 |
31 | That could use some clarification, though.
32 |
33 | ### Productive
34 |
35 | What "productive" means isn't very clear, though—what are we actually talking
36 | about? So, we can refine "Developers at LinkedIn are productive" to:
37 |
38 | **Developers at LinkedIn are able to _effectively_ and _efficiently_ accomplish
39 | their _intentions_ regarding LinkedIn's _software systems_**.
40 |
41 | This is a more precise way of stating exactly what we mean by "productive." A
42 | person is productive, by definition, if they produce products efficiently.
43 | "Efficiently" implies that you want to measure something about how long it takes
44 | or how often a product can be produced. So that means that we have to have a
45 | time that we start measuring from, and a time that we stop measuring at. The
46 | earliest point we can think of wanting to measure is the moment when the
47 | developer directly _intends_ to do something (as in, the intention they have
48 | right before they start to take action, not the first time they have some idle
49 | thought), and the last moment is when that action is totally complete.
50 |
51 | There are many different intentions a developer could have: gathering
52 | requirements, writing code, running tests, releasing software, creating a whole
53 | feature end-to-end, etc. They exist on different levels and different scopes.
54 | It's "fractal," basically—there are larger intentions (like "build a whole
55 | software system") that have smaller intentions within them (like "write this
56 | code change"), which themselves have smaller intentions within them ("run a
57 | build"), etc.
58 |
59 | The goal also states "effectively," because we don't just care that a result was
60 | _fast_, we care that the intention was _actually carried out_ in the most
61 | complete sense. For example, if I can release my software quickly but my service
62 | has 3 major production incidents a day, then I'm not as effective as I could be,
63 | even if I'm efficient. It's reasonable to assume that no developer intends to
64 | release a service that fails catastrophically multiple times a day, so that
65 | system isn't accomplishing a developer's intention.
66 |
67 | ### Happy
68 |
69 | Happy about what? Well, here's a more precise statement:
70 |
71 | **Developers at LinkedIn are happy with the tools, systems, processes,
72 | facilities, and activities involved in software development at LinkedIn.**
73 |
74 | ### Not Limited to Tools and Infrastructure
75 |
76 | Note that although the primary focus of our team is on Tools & Infrastructure,
77 | neither of these goals absolutely limit us to Tools & Infrastructure. If we
78 | discover that something outside of our area is impacting developer productivity
79 | in a significant way, like a facilities issue or process issue, we should feel
80 | empowered to raise that issue to the group that handles that problem.
81 |
82 | ## Signals
83 |
84 | Let's break down each goal by its parts.
85 |
86 | ### Productive
87 |
88 | We'll break this down into signals for "effective" and signals for "efficient."
89 |
90 | #### Efficient
91 |
92 | Essentially, we want to measure the time between when a developer has an
93 | intention and the time when they accomplish that intention. This is probably
94 | best framed as:
95 |
96 | **The time from when a developer starts taking an action to when they accomplish
97 | what they were intending to accomplish, with that action.**
98 |
99 | It's worth remembering that there are many different types of intentions and
100 | actions that a developer takes, some large, some small. Here are some examples
101 | of more specific signals that you might want to examine:
102 |
103 | * The time between when a developer encounters a problem or confusion and when
104 | they get the answer they are looking for (such as via docs, support, etc.).
105 | * How long it takes from when developers start writing a piece of code to when
106 | it is released in production.
107 |
108 | There are many other signals that you could come up with—those are just
109 | examples.
110 |
111 | It's also worth keeping in mind:
112 |
113 | **The most important resource we are measuring, in terms of efficiency, is the
114 | amount of time that software engineers have to spend on something.**
115 |
116 | That's _sort of_ a restatement of the above signal, but it's a clearer way to
117 | think about some types of specific signals. In particular, it considers the
118 | whole sum of time spent on things. For example, these could be specific signals
119 | that you want to measure:
120 |
121 | - How much time developers actually spend waiting for builds each day.
122 | - How much time developers spend on non-coding activities.
123 |
124 | The general point is that we care more about the time that human beings have to
125 | spend (or wait) to do a task, and less about the time that machines have to
126 | spend.
127 |
128 | And there are many, many more signals that you could come up with in this
129 | category, too.
130 |
131 | #### Caution: Don't Compare Individual Efficiency
132 |
133 | Don't propose metrics, systems, or processes that are intended to rate the
134 | efficiency of individual engineers. We [aren't trying to create systems for
135 | performance ratings of employees](metrics-and-performance-reviews.md). We are
136 | creating systems that drive our developer productivity work.
137 |
138 | It's even worth thinking about whether or not your system _could_ be used this
139 | way, and either prevent it from being used that way in how the system is
140 | designed, or forbid that usage via cautions in the UI/docs.
141 |
142 | #### Effective
143 |
144 | Essentially, we want to know that when a developer tries to do something, they
145 | actually accomplish the thing they were intending to accomplish. Things like
146 | crashes, bugs, difficulty gathering information—these are all things that
147 | prevent an engineer from being effective.
148 |
149 | When problems occur in a developer's workflow, we want to know how often they
150 | occur and how much time they waste. Probably the best high-level signal for this
151 | would be phrased like:
152 |
153 | **The probability that an individual developer will be able to accomplish their
154 | intention successfully. (Or on the inverse, the frequency with which developers
155 | experience a failure.)**
156 |
157 | You have to take into account the definition of "success" appropriately for the
158 | thing you're measuring. For example, let's say a developer, for some reason, has
159 | the intention "I run all my tests and they pass." In that case, if a test fails,
160 | they've failed to accomplish their intention. But most of the time, a developer
161 | runs tests to know if the code they are working on is broken. So success would
162 | be, "I ran the tests and they only failed if I broke them." Thus, flaky tests,
163 | infrastructure failures, or being broken by a dependency would _not_ be the
164 | developer accomplishing their intention successfully. Defining the intention
165 | here is important.
166 |
167 | We use _probability_ here because what we care about is how individual engineers
168 | are actually impacted by a problem. For example, let's say you have a piece of
169 | testing infrastructure that's flaky 10% of the time. What does that actually
170 | mean for engineers? How often is an engineer impacted by that flakiness? Maybe
171 | the system mostly just runs tests in the background that affect very few people.
172 |
173 | In order to define this probability appropriately, you have to know what group
174 | of engineers you're looking at. You could be looking at the whole company, or
175 | some specific [persona](developer-personas.md), area, org, or team.
176 |
177 | Some specific examples of possible signals here would be:
178 |
179 | * The probability that when a developer runs a test in CI, it will produce valid
180 | results (that is, if it fails, it's not because of flakiness).
181 | * The probability that a developer will run a build without the build tool
182 |   crashing.
183 | * How often (such as the median number of times per day) a developer experiences
184 |   build tool crashes.
185 |
186 | It's important to keep in mind here that what we care about most is what
187 | developers actually experience, not just what's happening with the tool.
188 |
189 | Very often, just knowing _how often_ something happens isn't enough to
190 | understand its impact, though. You also might want to tie things back to
191 | efficiency here, by noting:
192 |
193 | **How much extra human time was spent as a result of failures.**
194 |
195 | For example, it would be good to know that 15 engineers were caught up for three
196 | days handling a production incident, wouldn't it? That changes the importance of
197 | addressing the root causes. Some signals here could be:
198 |
199 | - How much time was spent re-running tests after they had flaky failures.
200 | - How much time a developer had to spend shepherding a change through the
201 | release pipeline due to release infrastructure failures.
202 |
203 | Not all failures are black and white, though. Very often, an intention
204 | _partially_ succeeds. For example, I might release a feature with a bug that
205 | affects only 0.01% of my users. Thus, it is sometimes also useful to know:
206 |
207 | **The percentage _degree_ of success of each action that a developer takes.**
208 |
209 | For example, one way to look at flakiness is how many tests per run actually
210 | flake. That is, if I have 10 tests in my test suite but only one fails due to
211 | flakiness, then that specific run of that specific test suite had a 90% success
212 | rate.
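
As a small worked example, here is a sketch of computing both of these signals from hypothetical test-run records:

```python
# Two effectiveness signals from invented test-run data: the probability that
# a run produces valid results, and each run's percentage degree of success.
runs = [
    {"tests": 10, "flaky_failures": 0},
    {"tests": 10, "flaky_failures": 1},  # the "90% success rate" run above
    {"tests": 10, "flaky_failures": 0},
]

clean_runs = sum(1 for r in runs if r["flaky_failures"] == 0)
probability_valid = clean_runs / len(runs)

degrees = [(r["tests"] - r["flaky_failures"]) / r["tests"] for r in runs]

print(f"P(run produces valid results) = {probability_valid:.0%}")
print("Per-run degree of success:", [f"{d:.0%}" for d in degrees])
```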
213 |
214 | #### Over-Indexing On A Single Effectiveness Metric
215 |
216 | It's important not to over-index on any one of these "effectiveness" signals.
217 | Sometimes, the probability of a failure is low, but it is very impactful in
218 | terms of human time when the failure occurs. It's helpful to have data for _all_
219 | of the signals, as appropriate for the thing you're measuring.
220 |
221 | ### Happy
222 |
223 | Usually, we rate happiness via subjective signals. Basically we want to know:
224 |
225 | 1. **The percentage of software engineers that are happy with the tools,
226 | systems, processes, facilities, and activities involved in software
227 | engineering at LinkedIn.**
228 | 2. **A score for _how_ happy those engineers are.**
229 |
230 | And you can break it down for specific tools, systems, processes, and
231 | facilities. Very often we say "satisfied" instead of "happy," as it is a more
232 | concrete thing for people to respond to.
233 |
234 | One of our assumptions is that if you improve the quantitative metrics in the
235 | "Productive" section, you should increase developer happiness. If happiness
236 | _doesn't_ increase, then that probably means there's something wrong with your
237 | quantitative signals for productivity—either you've picked the wrong
238 | signal/metric, or there's some inaccuracy in the data.
239 |
240 | Next: [Developer Personas](developer-personas.md)
--------------------------------------------------------------------------------
/metric-principles.md:
--------------------------------------------------------------------------------
1 | # Principles and Guidelines for Metrics Design
2 |
3 | This document covers some of our general guidelines that we use when designing
4 | metrics.
5 |
6 | This document covers only the principles for quantitative metrics covered under
7 | 'Productivity' in [Developer Productivity and Happiness: Goals &
8 | Signals](dph-goals-and-signals.md).
9 |
10 | These principles are not about the operational metrics of various tools and
11 | services. "Operational metrics" are the ones that a team owning a tool uses to
12 | understand if it's currently working, how it's broken, its current performance,
13 | etc. Think of Operational metrics as something you would use in monitoring and
14 | alerting of a production service.
15 |
16 | This doc also doesn't cover the business metrics for tools—the ones that measure
17 | their business goals/success—like how many users they have.
18 |
19 | There might be some overlap of operational, business, and productivity metrics,
20 | though. That is, some productivity metrics might also be business metrics, and a
21 | very few of them could also be used as operational metrics.
22 |
23 | - [Measuring Results vs. Measuring Work](#measuring-results-vs-measuring-work)
24 | - [Time Series](#time-series)
25 | - [Exclude Weekends](#exclude-weekends)
26 | - [Defined Weeks](#defined-weeks)
27 | - [Timestamps](#timestamps)
28 | - [Use the End Time, not the Start Time](#use-the-end-time-not-the-start-time)
29 | - [Developers vs. Machines](#developers-vs-machines)
30 | - [Common Dimensions](#common-dimensions)
31 | - [Going Up or Down Must Be Meaningful](#going-up-or-down-must-be-meaningful)
32 | - [Prefer Graphs That Are Good When They Go Up](#prefer-graphs-that-are-good-when-they-go-up)
33 | - [Retention](#retention)
34 | - [Business Hours](#business-hours)
35 | - [Fallback System](#fallback-system)
36 |
37 | ## Measuring Results vs. Measuring Work
38 |
39 | We want to measure the impact, results, or effects of our work, rather than just
40 | measuring how much work gets done. Otherwise, our metrics just tell us we're
41 | doing work—they don't tell us if we're doing the _right_ work.
42 |
43 | As an analogy, let's say we owned a packing plant (like, a place where they put
44 | things in boxes) and we're installing new conveyor belts on the assembly line.
45 | We could measure the progress of installing the belts (i.e., "how many belts
46 | have been fully installed?") or we could measure the _effect_ of the belts on
47 | the quantity and quality of our packing ("how many boxes get packed?" and "what
48 | percentage of boxes pass quality inspection?"). We would rather measure the
49 | latter.
50 |
51 | ## Time Series
52 |
53 | Unless stated otherwise, all of our metrics are a time series. That is, they are
54 | plotted on a graph against time—we count them per day or per week as
55 | appropriate. Most commonly, we count them per week, as that eliminates
56 | confusion around weekends and minor fluctuations between days. In particular,
57 | with some metrics, Mondays and Fridays tend to have lower or higher values than
58 | other days, and aggregating the data over a week eliminates those fluctuations.
59 |
60 | ### Exclude Weekends
61 |
62 | If you show daily graphs, you should exclude weekends from all metrics unless
63 | they seem very relevant for your particular metric. We exclude weekends because
64 | they make the graphs very hard to read. (We don't exclude holidays—those are
65 | _usually_ understandable anomalies that we actually _want_ to be able to see in
66 | our graphs. Also, not all offices have the same holidays.) You can include
67 | weekends in all types of graphs other than daily graphs.
68 |
69 | ### Defined Weeks
70 |
71 | Unless you have good reason for another requirement, weeks should be measured
72 | from **12:00am Friday to the end of Thursday**. Cutting off our metrics at the
73 | end of Thursday gives executives enough time to do reviews and investigations
74 | for Monday or Tuesday meetings. When we cannot specify a time zone, we should
75 | assume we are measuring things in the `America/Los_Angeles` time zone. However,
76 | all data should be _stored_ in the **UTC** time zone.
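
As a rough illustration of this bucketing rule, here is a hypothetical Python
sketch (the function name and input shape are ours, not from any real pipeline)
that assigns a stored UTC timestamp to its Friday-to-Thursday week:

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

LOCAL_TZ = ZoneInfo("America/Los_Angeles")  # assumed when no other time zone is specified


def week_start(stored_utc: datetime) -> date:
    """Return the Friday that starts the Friday-to-Thursday week containing this event.

    `stored_utc` is a timezone-aware datetime, stored in UTC per the rule above.
    """
    local = stored_utc.astimezone(LOCAL_TZ)
    days_since_friday = (local.weekday() - 4) % 7  # Monday=0 ... Friday=4
    return (local - timedelta(days=days_since_friday)).date()


# A build finishing late Thursday night (Pacific) lands in the week that started
# the previous Friday; one finishing early Friday morning starts a new week.
```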
77 |
78 | ### Timestamps
79 |
80 | All times should be stored ideally as microseconds, but if that's not possible,
81 | then as milliseconds. As engineering systems grow, there can be more and more
82 | events that happen _very_ close in time, but which we need to put in order for
83 | certain metrics to work. (For example, a metric might need to know when each
84 | test in a test suite ended, and if all the tests are running in parallel, these
85 | end times could be very close together.)
86 |
87 | ### Use the End Time, not the Start Time
88 |
89 | When you display a metric on a graph, the event you are measuring should be
90 | assigned to a date or time on the graph by the _end time_ of the event. For
91 | example, let's imagine we have a daily graph that shows how long our CI jobs
92 | take. If a CI job takes two days to finish, it should show up on the graph on
93 | the day that it _finished_, not the day that it started.
94 |
95 | This prevents having to "go back in time" and update previous data points on the
96 | graph in a confusing way. (This is especially bad if it changes whether the
97 | graph is going up or going down multiple days _after_ a manager has already
98 | looked at the graph and made plans based on its trend.)
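
A minimal sketch of that rule, with hypothetical names:

```python
from datetime import date, datetime


def graph_day(job_start: datetime, job_end: datetime) -> date:
    """Assign a CI job to the day on which it finished."""
    # The start time is deliberately unused: placement on the graph depends
    # only on when the event ended.
    return job_end.date()


# A CI job that starts on Monday and finishes on Wednesday shows up on
# Wednesday's data point; Monday's point never needs to be revised.
```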
99 |
100 | ## Developers vs. Machines
101 |
102 | Almost all our productivity metrics are defined by saying "a developer" does
103 | something. **For each metric,** we need to have two different versions of the
104 | same metric: one for when machines do the task, and one for when people do the
105 | task. For example, we need to measure human beings doing `git clone` and
106 | the CI system (or any other automated system) doing `git clone` separately.
107 |
108 | **The most important metrics are the ones where people do the task, not the ones
109 | where machines do the task.** But we still need to be able to measure the
110 | machines doing the task, as that is sometimes relevant for answering questions
111 | about developer productivity.
112 |
113 | Sometimes, a question comes up on whether to count something as being machine
114 | time or developer time. For example, let's say a developer types a command on
115 | their machine that causes the CI system to run a bunch of tests. The question to
116 | ask is, "Most of the time, is this action costing us developer time?" A
117 | developer doesn't have to just be sitting there and waiting for the action to
118 | finish, for it to be costing us developer time. Maybe it takes so long that they
119 | switch contexts, go read email, go eat lunch, etc. That's still developer cost.
120 | On the other hand, let's say there is a daily automated process that runs a test
121 | and then reports the test result to a developer. We would count the time spent
122 | running that test as "machine time," because it wasn't actively spending
123 | developer time.
124 |
125 | This is particularly relevant when looking at things that happen in parallel.
126 | When we measure "developer time" for parallel actions, we only care about how
127 | much wall-clock time was taken up by the action, not the sum total of machine
128 | hours used to execute it. For example, if I run 10 tests in parallel that each
129 | take 10 minutes, I only spent 10 minutes of developer time, even though I spent
130 | 100 minutes of machine time.
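
A toy sketch of this distinction, assuming the tests genuinely run fully in
parallel (the function name and input shape are hypothetical):

```python
def developer_and_machine_minutes(parallel_test_minutes: list[float]) -> tuple[float, float]:
    """For a batch of tests run fully in parallel: developer time is the
    wall-clock time (the longest test), machine time is the sum of all runs."""
    developer_minutes = max(parallel_test_minutes, default=0.0)
    machine_minutes = sum(parallel_test_minutes)
    return developer_minutes, machine_minutes


# Ten 10-minute tests in parallel: 10 developer minutes, 100 machine minutes.
assert developer_and_machine_minutes([10.0] * 10) == (10.0, 100.0)
```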
131 |
132 | ## Common Dimensions
133 |
134 | There are a few common dimensions that we have found useful for slicing most of
135 | our metrics:
136 |
137 | 1. By Team (i.e., the team that the affected developer is on)
138 | 2. By code repository (or codebase, when considering a large repo)
139 | 3. By location the developer sits in
140 |
141 | Most metrics should also support being sliced by the above dimensions. There are
142 | many other dimensions that we slice metrics by, some of which are special to
143 | each metric; these are just the _common_ dimensions we have found to be the most
144 | useful.
145 |
146 | ## Going Up or Down Must Be Meaningful
147 |
148 | It should be clearly good or bad when a metric goes up, and clearly good or bad
149 | when it goes down. One direction should be good, and the other direction should
150 | be bad.
151 |
152 | This should not change at a certain threshold. For example, let’s say you have a
153 | metric that’s "bad" when it goes down until it reaches 90%, and then above 90%
154 | it’s either meaningless or actually good when it goes down (because you want it
155 | always to be at 90%). A metric like that is confusing to read.
156 |
157 | Consider redesigning such metrics so that they are always good when they go up
158 | and always bad when they go down. For example, in the case of the metric that’s
159 | supposed to always be at 90%, perhaps measure how far away teams are from being
160 | at 90%.
161 |
162 | ### Prefer Graphs That Are Good When They Go Up
163 |
164 | We should strive to make metrics that are "good" when they go up and "bad" when
165 | they go down. This makes it consistent to read graphs, which makes life easier
166 | for executives and other people looking at dashboards. This isn't always
167 | possible, but when we have the choice, this is the way we should make the
168 | choice.
169 |
170 | ## Retention
171 |
172 | Prefer to retain data about developer productivity metrics forever. Each of
173 | these metrics could potentially need to be analyzed several years back into the
174 | past, especially when questions come up about how we have affected productivity
175 | over the long term. The actual cost to LinkedIn of storing these forever is
176 | _extremely_ small, whereas the potential benefit to the business from being able
177 | to do long-term analysis is huge.
178 |
179 | Of course, if there are legitimate legal or regulatory constraints on retention,
180 | make sure to follow those. However, there are rarely such constraints on the type
181 | of data we want to retain.
182 |
183 | ## Business Hours
184 |
185 | Some metrics are defined in terms of "business hours."
186 |
187 | Basically, we are measuring the _perceived_ wait time experienced by a
188 | developer. For example, imagine Alice sends off a code review request at 5pm
189 | and then goes home. Bob, in another time zone, reviews that code while Alice is
190 | sleeping. Alice comes back to work the next day and experiences a code review
191 | that, from her perspective, took zero hours.
192 |
193 | The ideal way to measure business hours would be to do an automated analysis for
194 | each individual to determine what their normal working hours are (making sure to
195 | keep this information confidential, only use it as an input to our productivity
196 | metrics, and never expose it outside of our team). Once you have established
197 | this baseline for each individual (which you will have to update on some regular
198 | basis), you only count time spent by individuals during those hours.
199 |
200 | For any calendar day in the developer's local time zone, do not count more than
201 | 8 hours a day. It is confusing to have "business days" that are longer than 8
202 | hours. So if a task takes two days to complete, and a developer was somehow
203 | working on it full time for each day, it would show up as 16 business hours
204 | (even if they worked 9 hours each day). This helps normalize out the differences
205 | in schedules between engineers. (For some metrics we may need to relax this
206 | restriction---we would determine that on a case-by-case basis when we develop
207 | the metric.)
208 |
209 | The reason that we are picking 8 hours is that what we care the most about here
210 | is the cost to the company, and even though most software engineers are
211 | salaried, we are measuring their cost to the company in 8-hour-a-day increments.
212 | It would be unusual for any salaried engineer anywhere in the world to have
213 | _contractual obligations_ to work more than 8 hours a day.
214 |
215 | ### Fallback System
216 |
217 | If you cannot generate an automated analysis of working hours for each
218 | individual, then set the "business day" as being 7am to 7pm in their local time
219 | zone. We don't use 9am to 5pm because the schedule of developers varies, and you
220 | don't want to count 0 for a lot of business hour metrics when a person regularly
221 | starts working at 8am or regularly works until 6pm. We have found that setting
222 | the range as 7am to 7pm eliminates most of those anomalies.
223 |
224 | One still needs to limit the maximum "business hours" counted in a day to eight
225 | hours, though, for most metrics.
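
Here is a hypothetical sketch of this fallback calculation (names and input shape
are ours; it also ignores weekends, holidays, and per-person schedules, which a
real implementation would need to handle):

```python
from datetime import datetime, time, timedelta

WORKDAY_START = time(7, 0)   # fallback business day starts at 7am local time
WORKDAY_END = time(19, 0)    # ...and ends at 7pm local time
DAILY_CAP_HOURS = 8.0        # never count more than 8 business hours per calendar day


def business_hours(start_local: datetime, end_local: datetime) -> float:
    """Business hours between two naive local-time datetimes, using the fallback window."""
    total = 0.0
    day = start_local.date()
    while day <= end_local.date():
        window_open = datetime.combine(day, WORKDAY_START)
        window_close = datetime.combine(day, WORKDAY_END)
        # Overlap of [start, end] with this day's 7am-7pm window, capped at 8 hours.
        overlap = (min(window_close, end_local) - max(window_open, start_local)).total_seconds()
        total += min(max(overlap, 0.0) / 3600.0, DAILY_CAP_HOURS)
        day += timedelta(days=1)
    return total


# Alice requests a review at 5pm and Bob responds at 9am the next day:
# 2 hours are counted on day one (5pm-7pm) and 2 hours on day two (7am-9am).
```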
226 |
227 | Next: [Common Pitfalls When Designing Metrics](metric-pitfalls.md)
228 |
--------------------------------------------------------------------------------
/example-metrics.md:
--------------------------------------------------------------------------------
1 | # Example Metrics
2 |
3 | Here are some actual metrics we have used inside of LinkedIn, along with their
4 | complete definitions. We are not holding these up as the "right" metrics to
5 | use--only as examples of how to precisely define a metric. There is also another
6 | document that explains [why we chose each metric](why-our-metrics.md), if you'd
7 | like to get some insight into the thinking process behind each one.
8 |
9 | These are not all the metrics we use internally; only a subset that we think are
10 | good examples for showing how to define metrics, and which we have found useful.
11 |
12 | We give each metric an abbreviation that we can use to refer to it. For example,
13 | internally, when we talk about "Developer Build Time," most people familiar with
14 | the metric just call it "DBT."
15 |
16 | Every metric gets a "TL;DR" summary, which most casual readers appreciate.
17 |
18 | These definitions are written so that implementers can implement them, but
19 | _also_ so that people who have questions about "What specifically is this
20 | actually measuring?" can have a place to get that question answered.
21 |
22 | - [Company-Wide Engineering Metrics](#company-wide-engineering-metrics)
23 | - [Developer Build Time (DBT)](#developer-build-time-dbt)
24 | - [Post-Merge CI Duration (PMCID)](#post-merge-ci-duration-pmcid)
25 | - [CI Determinism (CID)](#ci-determinism-cid)
26 | - [Code Reviewer Response Time (RRT)](#code-reviewer-response-time-rrt)
27 | - [Definitions](#definitions)
28 | - [Metric](#metric)
29 | - [Metrics for the Developer Platform Team](#metrics-for-the-developer-platform-team)
30 | - [CI Reliability (CIR)](#ci-reliability-cir)
31 | - [Deployment Reliability (DR)](#deployment-reliability-dr)
32 | - [Number of Insights Metrics in SLO (NIMS)](#number-of-insights-metrics-in-slo-nims)
33 |
34 | ## Company-Wide Engineering Metrics
35 |
36 | These are some examples of metrics we measure for the whole company.
37 |
38 | ### Developer Build Time (DBT)
39 |
40 | **TL;DR: How much time developers spend waiting for their build tool to finish
41 | its work.**
42 |
43 | The intention of this metric is to measure the amount of time that human beings
44 | spend waiting for their build tool to complete. This is measured as the
45 | wall-clock time from when the build tool starts a "build" to when it completes.
46 |
47 | This measures all human-triggered builds. At present, this includes all
48 | human-triggered builds with the following tools:
49 |
50 | - Gradle
51 | - Bazel
52 | - Ember
53 | - Xcode
54 |
55 | This duration is measured and reported in _seconds_.
56 |
57 | We count this only for builds invoked by human beings--builds that we can
58 | reasonably assume the developer is waiting on. To be clear, this means we
59 | exclude all builds run on the CI infrastructure.
60 |
61 | We report this as P50 (median) and P90, so we have “DBT P50” and “DBT P90.”
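
A minimal sketch of how the reported numbers could be computed with Python's
standard library (the function name and input are hypothetical):

```python
import statistics


def dbt_report(build_seconds: list[float]) -> tuple[float, float]:
    """Return (DBT P50, DBT P90) in seconds from the wall-clock durations of
    human-triggered builds collected over the reporting period.

    Assumes at least two build durations were collected.
    """
    p50 = statistics.median(build_seconds)
    p90 = statistics.quantiles(build_seconds, n=10)[-1]  # last cut point = 90th percentile
    return p50, p90
```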
62 |
63 | ### Post-Merge CI Duration (PMCID)
64 |
65 | **TL;DR: How long is it between when I say I want to submit a change and when
66 | its post-merge CI job fully completes?**
67 |
68 | The time it takes for each PR merge to get through CI, during post-commit.
69 | Counted whether the CI job passes or fails.
70 |
71 | The start point is when the user expresses the intent to merge a PR, and the end
72 | point is when the CI job delivers its final signal (passing or failing) to the
73 | developer who authored the PR.
74 |
75 | Reported in minutes, with one decimal place.
76 |
77 | We report the P50 and P90 of this, so we have “PMCID P50” and “PMCID P90.”
78 |
79 | ### CI Determinism (CID)
80 |
81 | **TL;DR: Test flakiness (just the inverse, so it’s a number that’s good when it
82 | goes up).**
83 |
84 | Each codebase at LinkedIn has a CI job that blocks merges if it fails. Each
85 | week, at some time during the week while CI machines are otherwise idle, we run
86 | each of these CI jobs many times at the same version of the repository (keeping
87 | everything in the environment identical between runs, as much as possible). We
88 | are looking to see if any of the runs returns a different result from the others
89 | (passes when the others fail, or fails when the others pass).
90 |
91 | The system that runs these jobs is called **CID**.
92 |
93 | Each CI job gets a **Determinism Score**, which is a percentage. The
94 | **Denominator** is the total number of builds run for that CI job by CID during
95 | the week. The **Numerator** is the number of times the CI job passed. So for
96 | example, let's say we run the CI job 10 times, 3 of them fail, and 7 of them
97 | pass. The score would be 7/10 (70%).
98 |
99 | However, if _all_ runs of a job fail, then its Determinism Score is 100%, and
100 | its Numerator should be set equal to its Denominator.
101 |
102 | When we aggregate this metric (for example, to show a Determinism score for a
103 | whole team that owns multiple MPs) we average all the Determinism Scores to get
104 | an overall Determinism Score. Taking the average means that codebases that run
105 | less frequently but are still flaky have their flakiness represented in the
106 | metric just as strongly as codebases that run frequently.
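
A hypothetical sketch of the score and the aggregation described above (the
names and input shape are ours, not from the real CID system):

```python
def determinism_score(runs_passed: int, runs_total: int) -> float:
    """Determinism Score for one CI job for one week, as a percentage."""
    if runs_total == 0:
        raise ValueError("CID produced no runs for this job this week")
    if runs_passed == 0:
        # All runs failed consistently: the job is deterministic (just broken),
        # so the Numerator is set equal to the Denominator and the score is 100%.
        runs_passed = runs_total
    return 100.0 * runs_passed / runs_total


def team_determinism_score(job_scores: list[float]) -> float:
    """Aggregate by averaging per-job scores, so rarely-run but flaky codebases
    count just as much as frequently-run ones."""
    return sum(job_scores) / len(job_scores)


assert determinism_score(7, 10) == 70.0   # 7 of 10 runs passed
assert determinism_score(0, 10) == 100.0  # all runs failed -> deterministic
```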
107 |
108 | Note that this is intended to only count CI jobs that block deployments or
109 | library publishing. It’s not intended to count flakiness in other things that
110 | might coincidentally run on the CI infrastructure.
111 |
112 | ### Code Reviewer Response Time (RRT)
113 |
114 | **TL;DR: How quickly do reviewers respond to each update from a developer,
115 | during a code review?**
116 |
117 | One of the most important qualities of any code review process is that
118 | [reviewers respond _quickly_](why-our-metrics.md). Often, [any complaints about
119 | the code review process (such as complaints about strictness) vanish when you
120 | just make it fast
121 | enough](https://google.github.io/eng-practices/review/reviewer/speed.html). This
122 | metric measures how quickly reviewers respond to each update that a developer
123 | posts.
124 |
125 | #### Definitions
126 |
127 | **Author**: The person who wrote the code that is being reviewed. If there are
128 | multiple contributors, all of them count as an Author.
129 |
130 | **Reviewer**: A person listed as an assigned reviewer on a PR.
131 |
132 | **Code Owner**: A person listed in the ACL files of a code repository. A person
133 | who has the power to approve changes to at least one of the files in a PR.
134 |
135 | **Request**: When a Reviewer gets a notification that an Author has taken some
136 | action, and now the Author is blocked while they are waiting for a response.
137 | Usually this means an Author has pushed a set of changes to the repository and
138 | the Reviewer has been sent a notification to review those changes. (Note: The
139 | Request Time is tracked as when a notification was sent, not when the changes
140 | were pushed. If a PR has no assigned Reviewer and changes are pushed, the
141 | Request Time is only tracked once a Reviewer gets assigned to the PR.)
142 |
143 | **Response**: The first time after a Request that a Reviewer or Code Owner
144 | responds on the PR and sends that response to the author. This could be a
145 | comment in the conversation, a comment on a line of code, or even just an
146 | approval with no comments.
147 |
148 | #### Metric
149 |
150 | Measure the [business hours](metric-principles.md) between each Request and
151 | Response. Note that in the process of a code review, there are many Requests and
152 | Responses. We count this metric only once for each Request within that code
153 | review.
154 |
155 | For example, imagine this sequence of events:
156 |
157 | 1. 10:00: Author Alice posts a new change and requests review.
158 | 2. 10:20: Reviewer Ravi posts a comment.
159 | 3. 10:25: Reviewer Rob posts a comment.
160 | 4. 10:45: Author Alice posts an updated set of changes in response to the
161 | Reviewers' comments.
162 | 5. 10:55: Reviewer Rob approves the PR with no comments.
163 |
164 | That creates two events. The first is 20 minutes long (Step 1 to Step 2), and
165 | the second is 10 minutes long (Step 4 to Step 5). The response at 10:25 (Step 3)
166 | is ignored and has no effect on this metric.
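
A hypothetical sketch of how these Request/Response waits could be paired up for
the worked example above (names and input shape are ours; the real metric counts
business hours rather than raw hours):

```python
def rrt_waits(events):
    """Given time-ordered (timestamp, kind) pairs for one code review, where kind
    is "request" (the Author pushes changes and a Reviewer is notified) or
    "response" (a Reviewer or Code Owner replies), return one wait, in hours,
    per Request."""
    waits = []
    open_request = None
    for ts, kind in events:
        if kind == "request":
            open_request = ts
        elif kind == "response" and open_request is not None:
            waits.append((ts - open_request).total_seconds() / 3600.0)
            open_request = None  # later responses to this Request are ignored
    return waits


# For the sequence above, this yields two waits: 20 minutes (Step 1 to Step 2)
# and 10 minutes (Step 4 to Step 5); the response at Step 3 is ignored.
```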
167 |
168 | We then take every single event like this and take the overall P50 (median) and
169 | P90 of them, so we have “RRT P50” and “RRT P90” that we report.
170 |
171 | This metric is reported in [business hours](metric-principles.md) as the unit,
172 | with one decimal place.
173 |
174 | When displaying this metric, show it to the managers of the _reviewers_, not the
175 | managers of the _authors_. That is, show people how long they took to review
176 | code, not how long they waited for a review. We have found it to be more
177 | actionable for managers this way.
178 |
179 | ## Metrics for the Developer Platform Team
180 |
181 | These are a few example metrics that the Developer Platform team uses to measure
182 | their success. Unlike the company-wide metrics, these metrics tend to be things
183 | that the developer platform team has more direct control over and that more
184 | accurately represent the specific work they do.
185 |
186 | We would like to reiterate: we have many more metrics for our Developer Platform
187 | team besides these. These are just a few examples to show how one might define
188 | metrics for an individual team, as opposed to the whole company.
189 |
190 | ### CI Reliability (CIR)
191 |
192 | **TL;DR: How often does CI fail because of a problem with CI infrastructure?**
193 |
194 | How often do users experience a CI failure that is not due to an error they
195 | made, but instead was due to a breakage or bug in our infrastructure?
196 |
197 | The only failures that are counted are infrastructure failures. So if a job
198 | succeeds, or fails only due to a user error (like a test failure--something the
199 | system is supposed to do), then that doesn’t count as a failure. Any failure that
200 | we don’t know is a user failure is assumed to be an infrastructure failure.
201 |
202 | Reported as the success percentage. (That is, a number that is good when it goes
203 | up.)
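
A hypothetical sketch of this classification rule (the input shape is assumed,
not taken from our real systems):

```python
def ci_reliability(jobs: list[dict]) -> float:
    """CIR as a success percentage.

    Each job dict is assumed to have a boolean "failed" and, when we could
    classify the failure, a "failure_kind" of "user" or "infrastructure".
    Unclassified failures count as infrastructure failures.
    """
    if not jobs:
        return 100.0
    infra_failures = sum(
        1 for job in jobs
        if job["failed"] and job.get("failure_kind", "infrastructure") != "user"
    )
    return 100.0 * (len(jobs) - infra_failures) / len(jobs)
```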
204 |
205 | ### Deployment Reliability (DR)
206 |
207 | **TL;DR: How often do deployments fail because of a problem with the deployment
208 | infrastructure?**
209 |
210 | How often do users experience a deployment failure that is not due to an error
211 | they made, but instead was due to a breakage or bug in our infrastructure?
212 |
213 | Measured as how often a deployment job experiences no deployment infrastructure
214 | failures. A “successful” deployment job is one where one version of one
215 | deployable has been deployed to all machines (or locations) it was intended to
216 | be deployed to per its deployment configuration. However, we only count failures
217 | of the deployment infrastructure as failures, for this particular metric.
218 |
219 | Deployments can partially fail. For example, one machine might deploy
220 | successfully while another fails due to infrastructure-related reasons. In our
221 | deployment infrastructure, individual teams can define what constitutes a
222 | “successful” deployment in terms of the percentage of machines that successfully
223 | deployed. We count a deployment as successful if it deployed to that percentage
224 | of machines.
225 |
226 | Infrastructure failures do include machine-based failures, like bad hosts that
227 | won’t accept deployments. They also include failures of the config
228 | infrastructure (but not config syntax errors or things like that). (This is not
229 | a complete list of what is included--we just note these specific things that are
230 | included so there isn’t confusion about those specific things.)
231 |
232 | Reported as the success percentage. (That is, a number that is good when it goes
233 | up.)
234 |
235 | ### Number of Insights Metrics in SLO (NIMS)
236 |
237 | **TL;DR: How many metrics are presented in our central developer productivity
238 | dashboard, and are they updating regularly on time every day?**
239 |
240 | To calculate this metric, first we count up all the metrics in our central
241 | developer productivity dashboard that have been fully implemented. "Implemented"
242 | means, for each metric:
243 |
244 | 1. We have an automated pipeline that does not require human interaction, which
245 | dumps data to some standard data store.
246 | 2. The data is processed by data pipelines to create metrics which are then
247 | displayed in our central developer productivity dashboard.
248 | 3. The metric displayed in our dashboard has passed User Acceptance Testing
249 | (UAT) and is viewable in production.
250 |
251 | We count the number of metrics in our central dashboard, not the number of
252 | pipelines. One pipeline might produce multiple metrics.
253 |
254 | The current SLO for our metrics is: each workflow for the metrics will compute
255 | data that is no older than **30 hours**. The definition of “compute” here is
256 | “calculate and provide in a format that is consumed by the dashboard.”
257 |
258 | For every metric where the latency is within SLO, we count "1" for this metric.
259 |
260 | When displaying this metric for a time period longer than a day, we average out
261 | the latency of each update over the time period we want to measure. For example,
262 | if we are measuring this for a quarter, we average them out over the whole
263 | quarter. By "each update" we mean every time the pipeline updated itself such
264 | that the dashboard would show an updated number. By "latency" we mean the time
265 | between each update, or the time between the last update and the present time.
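
A hypothetical sketch of the daily count (the input shape is assumed):

```python
SLO_HOURS = 30  # each metric's pipeline must provide data no older than 30 hours


def nims(data_age_hours: dict[str, float]) -> int:
    """Count the implemented dashboard metrics whose freshest data is within SLO.

    `data_age_hours` maps each implemented metric's name to the age, in hours,
    of the newest data its pipeline has delivered to the dashboard.
    """
    return sum(1 for age in data_age_hours.values() if age <= SLO_HOURS)
```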
266 |
267 | Next: [Why Did We Choose Our Metrics?](why-our-metrics.md)
268 |
--------------------------------------------------------------------------------
/why-our-metrics.md:
--------------------------------------------------------------------------------
1 | # Why Did We Choose Our Metrics?
2 |
3 | This document explains why we chose the metrics described in the [example
4 | metrics](example-metrics.md). It is here to give you an idea of the reasoning
5 | process used to pick a metric, and why we consider some metrics good and other
6 | metrics bad. It also contains some insights around the value (or lack thereof)
7 | of certain common developer productivity metrics.
8 |
9 | - [Company-Wide Engineering Metrics](#company-wide-engineering-metrics)
10 | - [Developer Build Time (DBT)](#developer-build-time-dbt)
11 | - [Post-Merge CI Duration (PMCID)](#post-merge-ci-duration-pmcid)
12 | - [Why Not "CI Success Rate?"](#why-not-ci-success-rate)
13 | - [Code Reviewer Response Time (RRT)](#code-reviewer-response-time-rrt)
14 | - [Why not "code review volume?"](#why-not-code-review-volume)
15 | - [Why not "PR Creation to Merge Time?"](#why-not-pr-creation-to-merge-time)
16 | - [Metrics for Developer Platform Team](#metrics-for-developer-platform-team)
17 | - [CI Reliability (CIR)](#ci-reliability-cir)
18 | - [Deployment Reliability (DR)](#deployment-reliability-dr)
19 | - [Number of Insights Metrics in SLO (NIMS)](#number-of-insights-metrics-in-slo-nims)
20 |
21 | ## Company-Wide Engineering Metrics
22 |
23 | ### Developer Build Time (DBT)
24 |
25 | One thing that slows down [iterations](productivity-concepts.md) for software
26 | developers is them having to wait for the build tool to finish. They make a
27 | change and they want to see if it compiles, which is a very important and
28 | frequent iteration. Because it happens so frequently, you can potentially save a
29 | ton of engineering time and make engineers much more efficient by improving
30 | build time.
31 |
32 | You can also fundamentally change how they work if you get build time low
33 | enough. For example, wouldn't it be nice if you set up a system that
34 | automatically built your code every time you saved a file, and returned instant
35 | feedback to you about whether or not your code compiles and passes the basic
36 | build checks?
37 |
38 | ### Post-Merge CI Duration (PMCID)
39 |
40 | This is, to a large degree, another [iteration time](productivity-concepts.md)
41 | issue. It's less common that a developer is actively _waiting_ for a post-merge
42 | to finish, so this isn't about [context switching](productivity-concepts.md).
43 | However, a developer may want to know that there is some problem in CI (if there
44 | is). In particular, if the signal from the CI system comes _after_ the
45 | developer's working hours, that means you've potentially lost a lot of
46 | real-world time in terms of getting your feature or deployment done.
47 |
48 | Also, just imagine the frustration and difficulty of submitting something and
49 | only finding out that there is something wrong with the change two or three
50 | _hours_ later (if the CI pipeline is that long), when you're already working on
51 | something totally different. Granted, this will happen even if the CI pipeline
52 | takes five or ten minutes; it's just not as drastically bad. If it were
53 | just ten minutes and I _really_ wanted to wait for it to finish because I was
54 | sending out something _really_ important that I was worried about, I could just
55 | go get a coffee and come back. But if it takes two hours, I'm _definitely_
56 | working on something else, or the work day has ended.
57 |
58 | Also, long CI times mean that deployments take longer for people who do
59 | automated continuous deployment. This means experimentation takes longer, it
60 | takes longer to get feedback from users, etc.
61 |
62 | This metric is not as important as Developer Build Time, because that more
63 | directly impacts iteration time and context switching. But we do care about
64 | making CI faster, for the reasons specified above. There are probably also other
65 | benefits that we don't see, which only show up in specific cases.
66 |
67 | #### Why Not "CI Success Rate?"
68 |
69 | The whole purpose of a CI system is that it's supposed to fail sometimes. The
70 | problem with making this a metric is that it's [not really clear what it means
71 | when it goes up or down](metric-principles.md). Like, what number is good?
72 | Should it be 100% all the time? If it should be, then why does the system exist
73 | at all?
74 |
75 | There _is_ a good metric that can be used here, called "CI Greenness," where you
76 | measure what percentage of the time a CI build is "green," or passing. We aren't
77 | sure this makes sense at LinkedIn, though, for a few reasons:
78 |
79 | 1. We don't actually run our builds _continuously_, but only when somebody
80 | merges a PR. So just one flaky test suddenly can make your "greenness" very
81 | bad, because you have to wait for somebody to submit a new change to fix it.
82 | Sometimes that might happen quickly, but it's not like there's an automated
83 | system that is just going to re-run the tests in a few hours anyway, to
84 | guarantee that flakiness has minimal impact.
85 | 2. At LinkedIn, our post-commit CI system doesn't actually _block_ developers
86 | from continuing to work. It just prevents deployment. (Or in the case of a
87 | library, prevents others from depending on the version whose CI failed.) A
88 | failing CI is an inconvenience, that's for sure, as you now have to
89 | investigate the failure, fix whatever is necessary, go through code review
90 | again, etc. But (in most cases at LinkedIn) other people can still submit to
91 | the repository, your change is still _in_ the repository (this is good and
92 | bad, both), and you can still continue to do work. So measuring "how long was
93 | CI failing for" isn't as valuable at LinkedIn as it would be in a place where
94 | "failing CI" means "nobody on the whole team can continue to work."
95 |
96 | ### Code Reviewer Response Time (RRT)
97 |
98 | Code review is one of the most important quality processes at every large
99 | software company, and LinkedIn is no exception. While there are many things
100 | about code that can be caught and improved by machines, only human beings can
101 | tell you if code is [easy to read, easy to understand, and easy to correctly
102 | modify in the
103 | future](https://www.codesimplicity.com/post/the-definition-of-simplicity/).
104 |
105 | What you want from a code review process is that it provides continuous
106 | _improvement_ to your code base through effective feedback from code reviewers.
107 | This requires whatever level of strictness is necessary in order to achieve this
108 | goal, which is different depending on who the submitter of the code is--how much
109 | experience they have with your system, how much experience they have with the
110 | language or tools, how long they have been programming, etc.
111 |
112 | However, it can sometimes seem that when you are too strict, developers start to
113 | push back about the strictness of the code review process, or they start trying
114 | to get around it, like "let's send it to the person who always rubber stamps my
115 | changes and never provides any feedback." This breaks the process and removes
116 | its value for your company.
117 |
118 | There is a secret to maintaining strictness while removing complaints: [have the
119 | reviewers reply
120 | faster](https://google.github.io/eng-practices/review/reviewer/speed.html). As
121 | wild as it may sound, nearly _all_ complaints about strictness are actually
122 | complaints about _speed_.
123 |
124 | Think about it this way: you submit a change, wait three days, the reviewer asks
125 | you to make major changes. Then you submit those changes, wait three days, and
126 | the reviewer again asks for major changes. At this point you're extremely
127 | frustrated. "I've been waiting for a week just to submit this!" Developers in
128 | this situation often say, "Stop being so strict!" After all, it's too late to
129 | say, "Stop being so slow!" And the slowness has increased the tension so much
130 | that the author can't stand it anymore, so they revolt.
131 |
132 | On the other hand, if a developer posts a change and the reviewer sends back
133 | comments within 30 minutes that ask for major changes, the developer might sigh
134 | and be a little annoyed, but will do it. Then when more major changes are
135 | requested, there might be some pushback, but at least it's all happened within
136 | the same day or two, not after waiting for more than a week. It makes people
137 | stop complaining about the code review process _as a whole_, and also
138 | significantly reduces (but does not eliminate) author pushback on valid code
139 | review comments.
140 |
141 | #### Why not "code review volume?"
142 |
143 | There is an [entire document that explains why any metric measuring the volume
144 | of output of a developer is dangerous](metrics-and-performance-reviews.md).
145 |
146 | That said, on a very large scale, PR Volume actually is an interesting metric.
147 | Looking at this _in the aggregate_ for thousands of engineers can show us
148 | patterns that are very interesting. Usually they tell us about things that
149 | happened in the past more than they tell us about what we can or should _do_
150 | about something--it's hard to take action items from it, but it can warn us
151 | about bad situations.
152 |
153 | We didn't want to make this a company-wide metric because (a) it's hard to make
154 | managerial decisions based on it, (b) it's only really valid at the level of the
155 | whole company or parts of it that have 500+ engineers, and (c) we were concerned
156 | it would be misused by people as a measurement of engineer performance as opposed
157 | to something we use to understand our tools, processes, or large-scale
158 | managerial decisions.
159 |
160 | All that said, it is possible to see PR Volume in our detailed dashboards that
161 | track PR metrics, and it's useful _informationally_ (that is, as an interesting
162 | input to help understand a few types of management decisions or their impact,
163 | such as time off, large-scale changes to our engineering systems, etc.). We
164 | include a large red disclaimer that the data should not be used for performance
165 | management, along with a link to [our explanation about why output metrics
166 | should not be used for individual performance ratings](metrics-and-performance-reviews.md).
167 |
168 | #### Why not "PR Creation to Merge Time?"
169 |
170 | As noted above, code review is a quality process. When you make people focus on
171 | the length of the whole process, you end up unintentionally driving behaviors
172 | that you don't want.
173 |
174 | As an absurd example, I could make every code review fast by simply making
175 | everybody approve them without looking at them. This metric would look beautiful
176 | and perhaps I would be rewarded for my efforts at optimizing code review times.
177 |
178 | As a less absurd example, let's say that a manager has instructed their team,
179 | "Let's look into how we can make code reviews take less total time." Now imagine
180 | that on some particular code review, a reviewer leaves a round of legitimate
181 | comments, and the author pushes back saying, "You are making this code review
182 | take longer," or something to that effect, and manages to talk their way out of
183 | improvements that really should have been made, simply because of _how long_ the
184 | review has taken in real-world days. This is a specific example, but there is a
185 | broad general class of bad behaviors you encourage when you tell people "make
186 | code reviews take less overall time," instead of "let's speed up how fast
187 | individual responses come in."
188 |
189 | When you focus on the speed of individual responses, you still get what you want
190 | (faster code reviews), but you don't sacrifice the power of the review process
191 | to produce high-quality code.
192 |
193 | There actually _is_ a point at which "too many iterations" (like, too many
194 | back-and-forth discussions between a developer and reviewer) becomes a problem
195 | in code review, but it's super-rare that this is the problem on a team. It's
196 | more common that people are too lax than that people are too strict. And it's
197 | more common that people are slow to respond than that they respond too many
198 | times. If "too many iterations" were a widespread problem at LinkedIn, we would
199 | think about tracking a "number of code review iterations" metric just for the
200 | duration of a project focused on solving the problem, but it's not something we
201 | want to go to zero. We just don't want it to be absurdly high (like ten
202 | iterations).
203 |
204 | ## Metrics for Developer Platform Team
205 |
206 | ### CI Reliability (CIR)
207 |
208 | The CI system itself absolutely must return reliable results, or developers will
209 | start to mistrust it. In the past, we saw some teams where nearly _every_ CI
210 | failure was actually a failure of the CI infrastructure.
211 |
212 | Instead of focusing on the uptime of the individual pieces, we focus on the
213 | reliability of the system as it is _experienced by its users_. Making this the
214 | focus of our reliability efforts dramatically improved the actual reliability of
215 | our system, compared to previous efforts focused around only availability.
216 |
217 | ### Deployment Reliability (DR)
218 |
219 | This has similar reasoning to CI Reliability, above.
220 |
221 | The only point worth noting is that there are sometimes discussions about
222 | whether the team that owns the deployment platform should be responsible for the
223 | full _success_ of all deployments, and not just the reliability of the
224 | deployment platform itself. Over time, we have come to the conclusion that this
225 | should _not_ be the responsibility of the deployment platform team, because too
226 | many of the issues that block successful deployments are out of the hands of the
227 | deployment platform team.
228 |
229 | There _are_ things that the platform team can do to make it more likely that
230 | people have successful deployments. For example, make it easier to add
231 | validation steps into a deployment. Make the configuration process for a
232 | deployment simpler. But overall, the actual success of a deployment depends a
233 | lot on the specifics of the binary being deployed and the deployment scripts
234 | that were written by the team that owns that binary.
235 |
236 | If you make the deployment platform team responsible for the success of every
237 | deployment, you tend to make them into consultants who spend too much time
238 | debugging the failing deployments of customers and not enough time developing
239 | new features that improve the deployment experience for the whole company.
240 |
241 | ### Number of Insights Metrics in SLO (NIMS)
242 |
243 | We have a team whose job it is to create data pipelines and infrastructure
244 | around developer productivity metrics. These metrics can't help anybody unless
245 | they end up in a dashboard, and that dashboard has up-to-date data.
246 |
247 | This metric doesn't measure the success of our dashboards. It measures the
248 | effectiveness of our team's data infrastructure. This metric measures that we
249 | have met the "table stakes" of being able to simply display data at all, and how
250 | effectively we are doing that as a team. We have other metrics to measure the
251 | impact our work has.
252 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Attribution 4.0 International
2 |
3 | =======================================================================
4 |
5 | Creative Commons Corporation ("Creative Commons") is not a law firm and
6 | does not provide legal services or legal advice. Distribution of
7 | Creative Commons public licenses does not create a lawyer-client or
8 | other relationship. Creative Commons makes its licenses and related
9 | information available on an "as-is" basis. Creative Commons gives no
10 | warranties regarding its licenses, any material licensed under their
11 | terms and conditions, or any related information. Creative Commons
12 | disclaims all liability for damages resulting from their use to the
13 | fullest extent possible.
14 |
15 | Using Creative Commons Public Licenses
16 |
17 | Creative Commons public licenses provide a standard set of terms and
18 | conditions that creators and other rights holders may use to share
19 | original works of authorship and other material subject to copyright
20 | and certain other rights specified in the public license below. The
21 | following considerations are for informational purposes only, are not
22 | exhaustive, and do not form part of our licenses.
23 |
24 | Considerations for licensors: Our public licenses are
25 | intended for use by those authorized to give the public
26 | permission to use material in ways otherwise restricted by
27 | copyright and certain other rights. Our licenses are
28 | irrevocable. Licensors should read and understand the terms
29 | and conditions of the license they choose before applying it.
30 | Licensors should also secure all rights necessary before
31 | applying our licenses so that the public can reuse the
32 | material as expected. Licensors should clearly mark any
33 | material not subject to the license. This includes other CC-
34 | licensed material, or material used under an exception or
35 | limitation to copyright. More considerations for licensors:
36 | wiki.creativecommons.org/Considerations_for_licensors
37 |
38 | Considerations for the public: By using one of our public
39 | licenses, a licensor grants the public permission to use the
40 | licensed material under specified terms and conditions. If
41 | the licensor's permission is not necessary for any reason--for
42 | example, because of any applicable exception or limitation to
43 | copyright--then that use is not regulated by the license. Our
44 | licenses grant only permissions under copyright and certain
45 | other rights that a licensor has authority to grant. Use of
46 | the licensed material may still be restricted for other
47 | reasons, including because others have copyright or other
48 | rights in the material. A licensor may make special requests,
49 | such as asking that all changes be marked or described.
50 | Although not required by our licenses, you are encouraged to
51 | respect those requests where reasonable. More considerations
52 | for the public:
53 | wiki.creativecommons.org/Considerations_for_licensees
54 |
55 | =======================================================================
56 |
57 | Creative Commons Attribution 4.0 International Public License
58 |
59 | By exercising the Licensed Rights (defined below), You accept and agree
60 | to be bound by the terms and conditions of this Creative Commons
61 | Attribution 4.0 International Public License ("Public License"). To the
62 | extent this Public License may be interpreted as a contract, You are
63 | granted the Licensed Rights in consideration of Your acceptance of
64 | these terms and conditions, and the Licensor grants You such rights in
65 | consideration of benefits the Licensor receives from making the
66 | Licensed Material available under these terms and conditions.
67 |
68 |
69 | Section 1 -- Definitions.
70 |
71 | a. Adapted Material means material subject to Copyright and Similar
72 | Rights that is derived from or based upon the Licensed Material
73 | and in which the Licensed Material is translated, altered,
74 | arranged, transformed, or otherwise modified in a manner requiring
75 | permission under the Copyright and Similar Rights held by the
76 | Licensor. For purposes of this Public License, where the Licensed
77 | Material is a musical work, performance, or sound recording,
78 | Adapted Material is always produced where the Licensed Material is
79 | synched in timed relation with a moving image.
80 |
81 | b. Adapter's License means the license You apply to Your Copyright
82 | and Similar Rights in Your contributions to Adapted Material in
83 | accordance with the terms and conditions of this Public License.
84 |
85 | c. Copyright and Similar Rights means copyright and/or similar rights
86 | closely related to copyright including, without limitation,
87 | performance, broadcast, sound recording, and Sui Generis Database
88 | Rights, without regard to how the rights are labeled or
89 | categorized. For purposes of this Public License, the rights
90 | specified in Section 2(b)(1)-(2) are not Copyright and Similar
91 | Rights.
92 |
93 | d. Effective Technological Measures means those measures that, in the
94 | absence of proper authority, may not be circumvented under laws
95 | fulfilling obligations under Article 11 of the WIPO Copyright
96 | Treaty adopted on December 20, 1996, and/or similar international
97 | agreements.
98 |
99 | e. Exceptions and Limitations means fair use, fair dealing, and/or
100 | any other exception or limitation to Copyright and Similar Rights
101 | that applies to Your use of the Licensed Material.
102 |
103 | f. Licensed Material means the artistic or literary work, database,
104 | or other material to which the Licensor applied this Public
105 | License.
106 |
107 | g. Licensed Rights means the rights granted to You subject to the
108 | terms and conditions of this Public License, which are limited to
109 | all Copyright and Similar Rights that apply to Your use of the
110 | Licensed Material and that the Licensor has authority to license.
111 |
112 | h. Licensor means the individual(s) or entity(ies) granting rights
113 | under this Public License.
114 |
115 | i. Share means to provide material to the public by any means or
116 | process that requires permission under the Licensed Rights, such
117 | as reproduction, public display, public performance, distribution,
118 | dissemination, communication, or importation, and to make material
119 | available to the public including in ways that members of the
120 | public may access the material from a place and at a time
121 | individually chosen by them.
122 |
123 | j. Sui Generis Database Rights means rights other than copyright
124 | resulting from Directive 96/9/EC of the European Parliament and of
125 | the Council of 11 March 1996 on the legal protection of databases,
126 | as amended and/or succeeded, as well as other essentially
127 | equivalent rights anywhere in the world.
128 |
129 | k. You means the individual or entity exercising the Licensed Rights
130 | under this Public License. Your has a corresponding meaning.
131 |
132 |
133 | Section 2 -- Scope.
134 |
135 | a. License grant.
136 |
137 | 1. Subject to the terms and conditions of this Public License,
138 | the Licensor hereby grants You a worldwide, royalty-free,
139 | non-sublicensable, non-exclusive, irrevocable license to
140 | exercise the Licensed Rights in the Licensed Material to:
141 |
142 | a. reproduce and Share the Licensed Material, in whole or
143 | in part; and
144 |
145 | b. produce, reproduce, and Share Adapted Material.
146 |
147 | 2. Exceptions and Limitations. For the avoidance of doubt, where
148 | Exceptions and Limitations apply to Your use, this Public
149 | License does not apply, and You do not need to comply with
150 | its terms and conditions.
151 |
152 | 3. Term. The term of this Public License is specified in Section
153 | 6(a).
154 |
155 | 4. Media and formats; technical modifications allowed. The
156 | Licensor authorizes You to exercise the Licensed Rights in
157 | all media and formats whether now known or hereafter created,
158 | and to make technical modifications necessary to do so. The
159 | Licensor waives and/or agrees not to assert any right or
160 | authority to forbid You from making technical modifications
161 | necessary to exercise the Licensed Rights, including
162 | technical modifications necessary to circumvent Effective
163 | Technological Measures. For purposes of this Public License,
164 | simply making modifications authorized by this Section 2(a)
165 | (4) never produces Adapted Material.
166 |
167 | 5. Downstream recipients.
168 |
169 | a. Offer from the Licensor -- Licensed Material. Every
170 | recipient of the Licensed Material automatically
171 | receives an offer from the Licensor to exercise the
172 | Licensed Rights under the terms and conditions of this
173 | Public License.
174 |
175 | b. No downstream restrictions. You may not offer or impose
176 | any additional or different terms or conditions on, or
177 | apply any Effective Technological Measures to, the
178 | Licensed Material if doing so restricts exercise of the
179 | Licensed Rights by any recipient of the Licensed
180 | Material.
181 |
182 | 6. No endorsement. Nothing in this Public License constitutes or
183 | may be construed as permission to assert or imply that You
184 | are, or that Your use of the Licensed Material is, connected
185 | with, or sponsored, endorsed, or granted official status by,
186 | the Licensor or others designated to receive attribution as
187 | provided in Section 3(a)(1)(A)(i).
188 |
189 | b. Other rights.
190 |
191 | 1. Moral rights, such as the right of integrity, are not
192 | licensed under this Public License, nor are publicity,
193 | privacy, and/or other similar personality rights; however, to
194 | the extent possible, the Licensor waives and/or agrees not to
195 | assert any such rights held by the Licensor to the limited
196 | extent necessary to allow You to exercise the Licensed
197 | Rights, but not otherwise.
198 |
199 | 2. Patent and trademark rights are not licensed under this
200 | Public License.
201 |
202 | 3. To the extent possible, the Licensor waives any right to
203 | collect royalties from You for the exercise of the Licensed
204 | Rights, whether directly or through a collecting society
205 | under any voluntary or waivable statutory or compulsory
206 | licensing scheme. In all other cases the Licensor expressly
207 | reserves any right to collect such royalties.
208 |
209 |
210 | Section 3 -- License Conditions.
211 |
212 | Your exercise of the Licensed Rights is expressly made subject to the
213 | following conditions.
214 |
215 | a. Attribution.
216 |
217 | 1. If You Share the Licensed Material (including in modified
218 | form), You must:
219 |
220 | a. retain the following if it is supplied by the Licensor
221 | with the Licensed Material:
222 |
223 | i. identification of the creator(s) of the Licensed
224 | Material and any others designated to receive
225 | attribution, in any reasonable manner requested by
226 | the Licensor (including by pseudonym if
227 | designated);
228 |
229 | ii. a copyright notice;
230 |
231 | iii. a notice that refers to this Public License;
232 |
233 | iv. a notice that refers to the disclaimer of
234 | warranties;
235 |
236 | v. a URI or hyperlink to the Licensed Material to the
237 | extent reasonably practicable;
238 |
239 | b. indicate if You modified the Licensed Material and
240 | retain an indication of any previous modifications; and
241 |
242 | c. indicate the Licensed Material is licensed under this
243 | Public License, and include the text of, or the URI or
244 | hyperlink to, this Public License.
245 |
246 | 2. You may satisfy the conditions in Section 3(a)(1) in any
247 | reasonable manner based on the medium, means, and context in
248 | which You Share the Licensed Material. For example, it may be
249 | reasonable to satisfy the conditions by providing a URI or
250 | hyperlink to a resource that includes the required
251 | information.
252 |
253 | 3. If requested by the Licensor, You must remove any of the
254 | information required by Section 3(a)(1)(A) to the extent
255 | reasonably practicable.
256 |
257 | 4. If You Share Adapted Material You produce, the Adapter's
258 | License You apply must not prevent recipients of the Adapted
259 | Material from complying with this Public License.
260 |
261 |
262 | Section 4 -- Sui Generis Database Rights.
263 |
264 | Where the Licensed Rights include Sui Generis Database Rights that
265 | apply to Your use of the Licensed Material:
266 |
267 | a. for the avoidance of doubt, Section 2(a)(1) grants You the right
268 | to extract, reuse, reproduce, and Share all or a substantial
269 | portion of the contents of the database;
270 |
271 | b. if You include all or a substantial portion of the database
272 | contents in a database in which You have Sui Generis Database
273 | Rights, then the database in which You have Sui Generis Database
274 | Rights (but not its individual contents) is Adapted Material; and
275 |
276 | c. You must comply with the conditions in Section 3(a) if You Share
277 | all or a substantial portion of the contents of the database.
278 |
279 | For the avoidance of doubt, this Section 4 supplements and does not
280 | replace Your obligations under this Public License where the Licensed
281 | Rights include other Copyright and Similar Rights.
282 |
283 |
284 | Section 5 -- Disclaimer of Warranties and Limitation of Liability.
285 |
286 | a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
287 | EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
288 | AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
289 | ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
290 | IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
291 | WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
292 | PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
293 | ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
294 | KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
295 | ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
296 |
297 | b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
298 | TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
299 | NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
300 | INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
301 | COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
302 | USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
303 | ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
304 | DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
305 | IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
306 |
307 | c. The disclaimer of warranties and limitation of liability provided
308 | above shall be interpreted in a manner that, to the extent
309 | possible, most closely approximates an absolute disclaimer and
310 | waiver of all liability.
311 |
312 |
313 | Section 6 -- Term and Termination.
314 |
315 | a. This Public License applies for the term of the Copyright and
316 | Similar Rights licensed here. However, if You fail to comply with
317 | this Public License, then Your rights under this Public License
318 | terminate automatically.
319 |
320 | b. Where Your right to use the Licensed Material has terminated under
321 | Section 6(a), it reinstates:
322 |
323 | 1. automatically as of the date the violation is cured, provided
324 | it is cured within 30 days of Your discovery of the
325 | violation; or
326 |
327 | 2. upon express reinstatement by the Licensor.
328 |
329 | For the avoidance of doubt, this Section 6(b) does not affect any
330 | right the Licensor may have to seek remedies for Your violations
331 | of this Public License.
332 |
333 | c. For the avoidance of doubt, the Licensor may also offer the
334 | Licensed Material under separate terms or conditions or stop
335 | distributing the Licensed Material at any time; however, doing so
336 | will not terminate this Public License.
337 |
338 | d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
339 | License.
340 |
341 |
342 | Section 7 -- Other Terms and Conditions.
343 |
344 | a. The Licensor shall not be bound by any additional or different
345 | terms or conditions communicated by You unless expressly agreed.
346 |
347 | b. Any arrangements, understandings, or agreements regarding the
348 | Licensed Material not stated herein are separate from and
349 | independent of the terms and conditions of this Public License.
350 |
351 |
352 | Section 8 -- Interpretation.
353 |
354 | a. For the avoidance of doubt, this Public License does not, and
355 | shall not be interpreted to, reduce, limit, restrict, or impose
356 | conditions on any use of the Licensed Material that could lawfully
357 | be made without permission under this Public License.
358 |
359 | b. To the extent possible, if any provision of this Public License is
360 | deemed unenforceable, it shall be automatically reformed to the
361 | minimum extent necessary to make it enforceable. If the provision
362 | cannot be reformed, it shall be severed from this Public License
363 | without affecting the enforceability of the remaining terms and
364 | conditions.
365 |
366 | c. No term or condition of this Public License will be waived and no
367 | failure to comply consented to unless expressly agreed to by the
368 | Licensor.
369 |
370 | d. Nothing in this Public License constitutes or may be interpreted
371 | as a limitation upon, or waiver of, any privileges and immunities
372 | that apply to the Licensor or You, including from the legal
373 | processes of any jurisdiction or authority.
374 |
375 |
376 | =======================================================================
377 |
378 | Creative Commons is not a party to its public
379 | licenses. Notwithstanding, Creative Commons may elect to apply one of
380 | its public licenses to material it publishes and in those instances
381 | will be considered the "Licensor." The text of the Creative Commons
382 | public licenses is dedicated to the public domain under the CC0 Public
383 | Domain Dedication. Except for the limited purpose of indicating that
384 | material is shared under a Creative Commons public license or as
385 | otherwise permitted by the Creative Commons policies published at
386 | creativecommons.org/policies, Creative Commons does not authorize the
387 | use of the trademark "Creative Commons" or any other trademark or logo
388 | of Creative Commons without its prior written consent including,
389 | without limitation, in connection with any unauthorized modifications
390 | to any of its public licenses or any other arrangements,
391 | understandings, or agreements concerning use of licensed material. For
392 | the avoidance of doubt, this paragraph does not form part of the
393 | public licenses.
394 |
395 | Creative Commons may be contacted at creativecommons.org.
396 |
--------------------------------------------------------------------------------