├── examples
│   ├── index.md
│   ├── environment-variable-configuration
│   │   └── index.md
│   ├── monorepo-vs-multirepo
│   │   └── index.md
│   ├── timestamp-format
│   │   └── index.md
│   ├── css-framework
│   │   └── index.md
│   ├── programming-languages
│   │   └── index.md
│   ├── microsoft-azure-devops
│   │   └── index.md
│   ├── secrets-storage
│   │   └── index.md
│   └── metrics-monitors-alerts
│       └── index.md
├── templates
│   ├── index.md
│   ├── decision-record-template-by-michael-nygard
│   │   └── index.md
│   ├── decision-record-template-for-alexandrian-pattern
│   │   └── index.md
│   ├── decision-record-template-using-planguage
│   │   └── index.md
│   ├── decision-record-template-madr
│   │   └── index.md
│   ├── decision-record-template-for-business-case
│   │   └── index.md
│   └── decision-record-template-by-jeff-tyree-and-art-akerman
│       └── index.md
└── README.md
/examples/index.md:
--------------------------------------------------------------------------------
1 | # Decision record examples
2 |
3 | * [CSS framework](css-framework/index.md)
4 | * [Environment variable configuration](environment-variable-configuration/index.md)
5 | * [Metrics, monitors, alerts](metrics-monitors-alerts/index.md)
6 | * [Microsoft Azure DevOps](microsoft-azure-devops/index.md)
7 | * [Monorepo vs multirepo](monorepo-vs-multirepo/index.md)
8 | * [Programming languages](programming-languages/index.md)
9 | * [Secrets storage](secrets-storage/index.md)
10 | * [Timestamp format](timestamp-format/index.md)
11 |
--------------------------------------------------------------------------------
/templates/index.md:
--------------------------------------------------------------------------------
1 | # Decision record templates
2 |
3 | * [Decision record template by Jeff Tyree and Art Akerman](decision-record-template-by-jeff-tyree-and-art-akerman/index.md)
4 | * [Decision record template by Michael Nygard](decision-record-template-by-michael-nygard/index.md)
6 | * [Decision record template for Alexandrian pattern](decision-record-template-for-alexandrian-pattern/index.md)
7 | * [Decision record template for business case](decision-record-template-for-business-case/index.md)
8 | * [Decision record template for MADR](decision-record-template-madr/index.md)
9 | * [Decision record template using Planguage](decision-record-template-using-planguage/index.md)
10 |
--------------------------------------------------------------------------------
/templates/decision-record-template-by-michael-nygard/index.md:
--------------------------------------------------------------------------------
1 | # Decision record template by Michael Nygard
2 |
3 | This is the template in [Documenting architecture decisions - Michael Nygard](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions).
4 | You can use [adr-tools](https://github.com/npryce/adr-tools) for managing the ADR files.
5 |
6 | In each ADR file, write these sections:
7 |
8 | # Title
9 |
10 | ## Status
11 |
12 | What is the status, such as proposed, accepted, rejected, deprecated, superseded, etc.?
13 |
14 | ## Context
15 |
16 | What is the issue that we're seeing that is motivating this decision or change?
17 |
18 | ## Decision
19 |
20 | What is the change that we're proposing and/or doing?
21 |
22 | ## Consequences
23 |
24 | What becomes easier or more difficult to do because of this change?
25 |
--------------------------------------------------------------------------------
/templates/decision-record-template-for-alexandrian-pattern/index.md:
--------------------------------------------------------------------------------
1 | # Decision record template for Alexandrian pattern
2 |
3 | ## Introduction
4 |
5 | * Prologue (Summary)
6 | * Discussion (Context)
7 | * Solution (Decision)
8 | * Consequences (Results)
9 |
10 | ## Specifics
11 |
12 | * Prologue (Summary)
13 | * Statement to summarize:
14 | * In the context of (use case)
15 | facing (concern)
16 | we decided for (option)
17 | to achieve (quality)
18 | accepting (downside).
19 | * Discussion (Context)
20 | * Explains the forces at play (technical, political, social, project).
21 | * This is the story explaining the problem we are looking to resolve.
22 | * Solution
23 | * Explains how the decision will solve the problem.
24 | * Consequences
25 | * Explains the results of the decision over the long term.
26 | * Did it work, not work, was changed, upgraded, etc.
27 |
--------------------------------------------------------------------------------
/templates/decision-record-template-using-planguage/index.md:
--------------------------------------------------------------------------------
1 | # Decision record template using Planguage
2 |
3 | See [Specifying Effective Non-functional Requirements (tutorial PDF)](http://www.iaria.org/conferences2012/filesICCGI12/Tutorial%20Specifying%20Effective%20Non-func.pdf).
4 |
5 | ## What is Planguage?
6 |
7 | Planguage is a planning language that uses these keywords:
8 |
9 | * Tag: A unique, persistent identifier
10 | * Gist: A brief summary of the requirement or area addressed
11 | * Requirement: The text that details the requirement itself
12 | * Rationale: The reasoning that justifies the requirement
13 | * Priority: A statement of priority and claim on resources
14 | * Stakeholders: Parties materially affected by the requirement
15 | * Status: The status of the requirement (draft, reviewed, committed, etc.)
16 | * Owner: The person responsible for implementing the requirement
17 | * Author: The person who wrote the requirement
18 | * Revision: A version number for the statement
19 | * Date: The date of the most recent revision
20 | * Assumptions: Anything that could cause problems if untrue now or later
21 | * Risks: Anything that could cause malfunction, delay, or other negative impacts
22 | * Defined: The definition of a term (better to use a glossary)
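
A hypothetical requirement sketched with these keywords (all values here are illustrative, not taken from the tutorial):

```
Tag: Availability.Login
Gist: Keep the login service available.
Requirement: The login service shall be reachable 99.9% of each calendar month.
Rationale: Login outages block every downstream workflow.
Priority: High; funded ahead of new feature work.
Stakeholders: Customers, support team, operations team.
Status: Draft
Owner: [name]
Author: [name]
Revision: 0.1
Date: [YYYY-MM-DD]
Assumptions: Monitoring can measure availability per calendar month.
Risks: Third-party identity provider outages count against the budget.
```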
23 |
24 |
--------------------------------------------------------------------------------
/templates/decision-record-template-madr/index.md:
--------------------------------------------------------------------------------
1 | # [short title of solved problem and solution]
2 |
3 | * Status: [proposed | rejected | accepted | deprecated | … | superseded by [ADR-0005](0005-example.md)]
4 | * Deciders: [list everyone involved in the decision]
5 | * Date: [YYYY-MM-DD when the decision was last updated]
6 |
7 | Technical Story: [description | ticket/issue URL]
8 |
9 | ## Context and Problem Statement
10 |
11 | [Describe the context and problem statement, e.g., in free form using two to three sentences. You may want to articulate the problem in form of a question.]
12 |
13 | ## Decision Drivers
14 |
15 | * [driver 1, e.g., a force, facing concern, …]
16 | * [driver 2, e.g., a force, facing concern, …]
17 | * …
18 |
19 | ## Considered Options
20 |
21 | * [option 1]
22 | * [option 2]
23 | * [option 3]
24 | * …
25 |
26 | ## Decision Outcome
27 |
28 | Chosen option: "[option 1]", because [justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force force | … | comes out best (see below)].
29 |
30 | ### Positive Consequences
31 |
32 | * [e.g., improvement of quality attribute satisfaction, follow-up decisions required, …]
33 | * …
34 |
35 | ### Negative Consequences
36 |
37 | * [e.g., compromising quality attribute, follow-up decisions required, …]
38 | * …
39 |
40 | ## Pros and Cons of the Options
41 |
42 | ### [option 1]
43 |
44 | [example | description | pointer to more information | …]
45 |
46 | * Good, because [argument a]
47 | * Good, because [argument b]
48 | * Bad, because [argument c]
49 | * …
50 |
51 | ### [option 2]
52 |
53 | [example | description | pointer to more information | …]
54 |
55 | * Good, because [argument a]
56 | * Good, because [argument b]
57 | * Bad, because [argument c]
58 | * …
59 |
60 | ### [option 3]
61 |
62 | [example | description | pointer to more information | …]
63 |
64 | * Good, because [argument a]
65 | * Good, because [argument b]
66 | * Bad, because [argument c]
67 | * …
68 |
69 | ## Links
70 |
71 | * [Link type] [Link to ADR]
72 | * …
73 |
--------------------------------------------------------------------------------
/templates/decision-record-template-for-business-case/index.md:
--------------------------------------------------------------------------------
1 | # Decision record template for business case
2 |
3 | This ADR template emphasizes creating a business case for a decision, including criteria, candidates, and costs.
4 |
5 |
6 | ## Top-level
7 |
8 | * Title
9 | * Status
10 | * Evaluation criteria
11 | * Candidates to consider
12 | * Research and analysis of each candidate
13 | * Does/doesn't meet criteria and why
14 | * Cost analysis
15 | * SWOT analysis
16 | * Opinions and feedback
17 | * Recommendation
18 |
19 |
20 | ## Low-level deep dive
21 |
22 | **Title**:
23 |
24 | * A short present tense imperative phrase, less than 50 characters, like a git commit message.
25 |
26 | **Status**:
27 |
28 | * One of proposed, accepted, rejected, deprecated, superseded, etc.
29 |
30 | **Evaluation criteria**:
31 |
32 | * Summary: explain briefly what we seek to discover and why.
33 |
34 | * Specifics
35 |
36 | **Candidates to consider**:
37 |
38 | * Summary: explain briefly how we discovered candidates, and draw attention to any outliers.
39 |
40 | * List all candidates and related options; what are we evaluating as potential solutions?
41 |
42 | * Specifics
43 |
44 | **Research and analysis of each candidate**:
45 |
46 | * Summary: explain briefly the research methods, and draw attention to patterns, clusters, and outliers.
47 |
48 | * Does/doesn't meet criteria and why
49 |
50 | * Summary
51 |
52 | * Specifics
53 |
54 | * Cost analysis
55 |
56 | * Summary
57 |
58 | * Examples
59 |
60 | * Licensing, such as contract agreements and legal commitments
61 |
62 | * Training, such as upskilling and change management
63 |
64 | * Operating, such as support and maintenance
65 |
66 | * Metering, such as bandwidth and CPU usage
67 |
68 | * SWOT analysis
69 |
70 | * Summary
71 |
72 | * Strengths
73 |
74 | * Weaknesses
75 |
76 | * Opportunities
77 |
78 | * Threats
79 |
80 | * Internal opinions and feedback
81 |
82 | * Summary
83 |
84 | * Examples
85 |
86 | * By the team, ideally written by the actual person
87 |
88 | * From other stakeholders
89 |
90 | * Quality attributes a.k.a. cross-functional requirements
91 |
92 | * External opinions and feedback
93 |
94 | * Summary
95 |
96 | * Who is providing the opinion?
97 |
98 | * What are other candidates you considered?
99 |
100 | * What are you creating?
101 |
102 | * Examples
103 |
104 | * B2B or B2C
105 |
106 | * external-facing or employee-only
107 |
108 | * desktop or mobile
109 |
110 | * pilot or production
111 |
112 | * monolith or microservices
113 |
114 | * How did you evaluate the candidates?
115 |
116 | * Why did you choose the winner?
117 |
118 | * What is happening since then?
119 |
120 | * Examples
121 |
122 | * How is the winner performing?
123 |
124 | * What % of real-world production user traffic is flowing through the winner?
125 |
126 | * What kinds of integrations are involved, such as with continuous delivery pipelines, content management systems, analytics and metrics, etc.?
127 |
128 | * Knowing what you know now, what would you advise people to do differently?
129 |
130 | * Anecdotes
131 |
132 | **Recommendation**:
133 |
134 | * Summary
135 |
136 | * Specifics
137 |
138 |
--------------------------------------------------------------------------------
/templates/decision-record-template-by-jeff-tyree-and-art-akerman/index.md:
--------------------------------------------------------------------------------
1 | # Decision record template by Jeff Tyree and Art Akerman
2 |
3 | This is the architecture decision description template published in ["Architecture Decisions: Demystifying Architecture" by Jeff Tyree and Art Akerman, Capital One Financial](https://www.utdallas.edu/~chung/SA/zz-Impreso-architecture_decisions-tyree-05.pdf).
4 |
5 | * **Issue**: Describe the architectural design issue you’re addressing, leaving no questions about why you’re addressing this issue now. Following a minimalist approach, address and document only the issues that need addressing at various points in the life cycle.
6 |
7 | * **Decision**: Clearly state the architecture’s direction—that is, the position you’ve selected.
8 |
9 | * **Status**: The decision’s status, such as pending, decided, or approved.
10 |
11 | * **Group**: You can use a simple grouping—such as integration, presentation, data, and so on—to help organize the set of decisions. You could also use a more sophisticated architecture ontology, such as John Kyaruzi and Jan van Katwijk’s, which includes more abstract categories such as event, calendar, and location. For example, using this ontology, you’d group decisions that deal with occurrences where the system requires information under event.
12 |
13 | * **Assumptions**: Clearly describe the underlying assumptions in the environment in which you’re making the decision—cost, schedule, technology, and so on. Note that environmental constraints (such as accepted technology standards, enterprise architecture, commonly employed patterns, and so on) might limit the alternatives you consider.
14 |
15 | * **Constraints**: Capture any additional constraints to the environment that the chosen alternative (the decision) might pose.
16 |
17 | * **Positions**: List the positions (viable options or alternatives) you considered. These often require long explanations, sometimes even models and diagrams. This isn’t an exhaustive list. However, you don’t want to hear the question "Did you think about...?" during a final review; this leads to loss of credibility and questioning of other architectural decisions. This section also helps ensure that you heard others’ opinions; explicitly stating other opinions helps enroll their advocates in your decision.
18 |
19 | * **Argument**: Outline why you selected a position, including items such as implementation cost, total ownership cost, time to market, and required development resources’ availability. This is probably as important as the decision itself.
20 |
21 | * **Implications**: A decision comes with many implications, as the REMAP metamodel denotes. For example, a decision might introduce a need to make other decisions, create new requirements, or modify existing requirements; pose additional constraints to the environment; require renegotiating scope or schedule with customers; or require additional staff training. Clearly understanding and stating your decision’s implications can be very effective in gaining buy-in and creating a roadmap for architecture execution.
22 |
23 | * **Related decisions**: It’s obvious that many decisions are related; you can list them here. However, we’ve found that in practice, a traceability matrix, decision trees, or metamodels are more useful. Metamodels are useful for showing complex relationships diagrammatically (such as Rose models).
24 |
25 | * **Related requirements**: Decisions should be business driven. To show accountability, explicitly map your decisions to the objectives or requirements. You can enumerate these related requirements here, but we’ve found it more convenient to reference a traceability matrix. You can assess each architecture decision’s contribution to meeting each requirement, and then assess how well the requirement is met across all decisions. If a decision doesn’t contribute to meeting a requirement, don’t make that decision.
26 |
27 | * **Related artifacts**: List the related architecture, design, or scope documents that this decision impacts.
28 |
29 | * **Related principles**: If the enterprise has an agreed-upon set of principles, make sure the decision is consistent with one or more of them. This helps ensure alignment along domains or systems.
30 |
31 | * **Notes**: Because the decision-making process can take weeks, we’ve found it useful to capture notes and issues that the team discusses during the socialization process.
32 |
33 |
--------------------------------------------------------------------------------
/examples/environment-variable-configuration/index.md:
--------------------------------------------------------------------------------
1 | # Environment variable configuration
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 |
22 |
23 | ## Summary
24 |
25 |
26 | ### Issue
27 |
28 | We want our applications to be configurable beyond artifacts/binaries/source, such that one build can behave differently depending on its deployment environment.
29 |
30 | * To accomplish this, we want to use environment variable configuration.
31 |
32 | * We want to manage the configuration by using files that we can version control.
33 |
34 | * We want to provide some developer experience ergonomics, such as knowing what can be configured and any relevant defaults.
35 |
36 |
37 | ### Decision
38 |
39 | Decided on .env files with related default file and schema file.
40 |
41 |
42 | ### Status
43 |
44 | Decided. Open to considering new capabilities as they come up.
45 |
46 |
47 | ## Details
48 |
49 |
50 | ### Assumptions
51 |
52 | We favor separating the application code and environment code. We assume the app needs to work differently in different environments, such as in a development environment, test environment, demo environment, production environment, etc.
53 |
54 | We favor the industry practice of "12 factor app" and even more the related practice of "15 factor app".
55 |
56 | Many of our previous projects have used the convention of a `.env` file or similar `.env` directory. There's a typical practice of keeping these out of version control, and instead using some other way to deploy them, version them, and manage them.
57 |
58 |
59 | ### Constraints
60 |
61 | We want to keep secrets out of our source code management (SCM) version control system (VCS).
62 |
63 | We want to aim for compatibility with popular software frameworks and libraries. For example, Node has a module "dotenv" for reading environment variable configuration.
64 |
65 |
66 | ### Positions
67 |
68 | We considered a few approaches:
69 |
70 | * Store config in the app such as in a file `config.js` file.
71 |
72 | * Store config in the environment such as in a file `.env`.
73 |
74 | * Fetch config from a known location such as a license server.
75 |
76 |
77 | ### Argument
78 |
79 | We selected the `.env` file approach because:
80 |
81 | * It is popular including among experts.
82 |
83 | * It follows the pattern of `.env` files which our teams have successfully used many times on many projects.
84 |
85 | * It is simple. Notably, we are fine for now with the significant trade-offs that we see, such as a lack of audit capabilities compared to a license server approach.
86 |
87 |
88 | ### Implications
89 |
90 | We need to figure out a way to separate environment variable configuration that is public from any secrets management.
91 |
92 |
93 | ## Related
94 |
95 |
96 | ### Related decisions
97 |
98 | We expect all our applications to use this approach.
99 |
100 | We will plan to upgrade any of our applications that use a less-capable approach, such as hardcoding in a binary or in source code.
101 |
102 | We will keep as-is any of our applications that use a more-capable approach, such as a licensing server.
103 |
104 |
105 | ### Related requirements
106 |
107 | We will add devops capabilities for the files, including hooks, tests, and continuous integration.
108 |
109 | We need to train all developer teammates on this decision.
110 |
111 |
112 |
113 | ### Related artifacts
114 |
115 | Each area where we deploy will need its own `.env` file and related files.
116 |
117 |
118 | ### Related principles
119 |
120 | Easily reversible.
121 |
122 |
123 | ## Notes
124 |
125 |
126 | Example file `.env`:
127 |
128 | ```env
129 | NAME=Alice Anderson
130 | EMAIL=alice@example.com
131 | ```
132 |
133 | Example file `.env.defaults`:
134 |
135 | ```env
136 | NAME=Joe Doe
137 | EMAIL=joe@example.com
138 | ```
139 |
140 | Example file `.env.schema` with just the keys:
141 |
142 | ```env
143 | NAME
144 | EMAIL
145 | ```
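
A minimal sketch, in JavaScript, of how a loader such as Node's dotenv module could merge `.env` values over `.env.defaults` (the function names are ours for illustration; real libraries also handle quoting, comments, and variable expansion):

```javascript
// Parse "KEY=value" lines into an object, skipping blanks and # comments.
function parseEnv(text) {
  const config = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue; // blank or comment
    const eq = trimmed.indexOf('=');
    if (eq === -1) continue; // schema-style lines list bare keys; ignore here
    config[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return config;
}

// Values from .env override values from .env.defaults.
function loadConfig(envText, defaultsText) {
  return { ...parseEnv(defaultsText), ...parseEnv(envText) };
}
```

With the example files above, `loadConfig` would yield `NAME` from `.env` and fall back to `.env.defaults` for any key `.env` omits.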
146 |
--------------------------------------------------------------------------------
/examples/monorepo-vs-multirepo/index.md:
--------------------------------------------------------------------------------
1 | # Monorepo vs multirepo
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 |
22 |
23 | ## Summary
24 |
25 |
26 | ### Issue
27 |
28 | Our project involves developing three major categories of software:
29 |
30 | * Front-end GUIs
31 | * Middleware services
32 | * Back-end servers
33 |
34 | When we develop, our source code management (SCM) version control system (VCS) is git.
35 |
36 | We need to choose how we use git to organize our code.
37 |
38 | The top-level choice is to organize as a "monorepo" or "polyrepo" or "hybrid":
39 |
40 | * Monorepo means we put all pieces into one big repo
41 | * Polyrepo means we put each piece in its own repo
42 | * Hybrid means some mix of monorepo and polyrepo
43 |
44 | For more please see https://github.com/joelparkerhenderson/monorepo-vs-polyrepo
45 |
46 |
47 | ### Decision
48 |
49 | Monorepo when an organization/team/project is relatively small, and rapid iteration is higher priority than sustaining stability.
50 |
51 | Polyrepo when an organization/team/project is relatively large, and sustaining stability is higher priority than rapid iteration.
52 |
53 |
54 | ### Status
55 |
56 | Decided. Open to revisiting if/when new tooling becomes available for managing monorepos and/or polyrepos.
57 |
58 |
59 | ## Details
60 |
61 |
62 | ### Assumptions
63 |
64 | All the code that we're developing is for one organization's offerings, and not for the general public. I.e. the Broker-Dealer isn't aiming to have anything like general public volunteer developers.
65 |
66 |
67 | ### Constraints
68 |
69 | Constraints are well-documented at https://github.com/joelparkerhenderson/monorepo-vs-polyrepo
70 |
71 |
72 | ### Positions
73 |
74 | We considered monorepos in the style of Google, Facebook, etc. We think any monorepo scaling issues are so far in the future that we will be able to leverage the same practices as Google and Facebook, by the time we need them.
75 |
76 | We considered polyrepos in the style of typical Git open source projects, such as Google Android, Facebook React, etc. We think these are the best choice for general public participation (e.g. anyone in the world can work on the code) and individual availability (e.g. the project is used on its own, without any other pieces).
77 |
78 |
79 | ### Argument
80 |
81 | When an organization/team/project is relatively small, we choose monorepo, because rapid iteration is significantly higher in priority than sustaining stability.
82 |
83 | When an organization/team/project is relatively large, we choose polyrepo, because sustaining stability is significantly higher in priority than rapid iteration.
84 |
85 |
86 | ### Implications
87 |
88 | If there's an existing pipeline for CI+CD, then we may need to adjust it for testing multiple projects within one repo.
89 |
90 | CI+CD could take more time for a full build for a monorepo, because CI+CD could build all the projects in the monorepo.
91 |
92 | If an organization/team/project grows, then a monorepo will have scaling issues.
93 |
94 | Monorepo scaling issues may make it increasingly valuable to transition to a polyrepo.
95 |
96 | Transitioning from monorepo to polyrepo is a significant devops task, and will need to be planned, managed, and programmed.
97 |
98 |
99 | ## Related
100 |
101 |
102 | ### Related decisions
103 |
104 | We will create decisions for related tooling to manage monorepos (e.g. Google Bazel) and polyrepos (e.g. Lyft Refactorator).
105 |
106 |
107 | ### Related requirements
108 |
109 | We need to develop the CI+CD pipeline to work well with git.
110 |
111 |
112 | ### Related artifacts
113 |
114 | We expect the repo organization to have related artifacts for provisioning, configuration management, testing, and similar devops areas.
115 |
116 |
117 | ### Related principles
118 |
119 | Easily reversible. If the monorepo doesn't work in practice, or isn't wanted by leadership, it's simple to change to polyrepo.
120 |
121 | Customer Obsession. We value getting the project in the hands of customers, and we believe that a monorepo can get us there faster than a polyrepo, and also help us iterate faster.
122 |
123 | Think big. Google and Facebook are very strong advocates for monorepos over polyrepos, because all the core offerings can be developed/tested/deployed in concert.
124 |
125 |
126 | ## Notes
127 |
128 | Add any notes here.
129 |
--------------------------------------------------------------------------------
/examples/timestamp-format/index.md:
--------------------------------------------------------------------------------
1 | # Timestamp format
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 |
22 |
23 | ## Summary
24 |
25 |
26 | ### Issue
27 |
28 | We want to be able to track when things happen by using timestamps and by using a consistent timestamp format that works well across all our systems and third-party systems.
29 |
30 | We interact with systems that have different timestamp formats:
31 |
32 | * JSON messages do not have a native timestamp format, so we need to choose how to convert a timestamp to a string, and convert a string to a timestamp, i.e. how to serialize/deserialize.
33 |
34 | * Some applications are set to use local time, rather than UTC time. This can be convenient for projects that must adjust to local time, such as projects that trigger events that are based on local time.
35 |
36 | * Some systems have different time precision needs and capabilities, such as using a time resolution of seconds vs. milliseconds vs. nanoseconds. For example, the Linux operating system `date` command uses a default time precision of seconds, whereas the Nasdaq stock exchange wants a default time precision of nanoseconds.
37 |
38 |
39 | ### Decision
40 |
41 | We choose the timestamp standard format ISO 8601 with nanosecond precision, specifically "YYYY-MM-DDTHH:MM:SS.NNNNNNNNNZ".
42 |
43 | The format shows the year, month, day, hour, minute, second, nanoseconds, and Zulu time zone a.k.a. UTC, GMT.
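A sketch in JavaScript of serializing and deserializing this format. Note that JavaScript's `Date` carries only millisecond precision, so the last six nanosecond digits are zero-filled here; a system with a true nanosecond clock would supply them:

```javascript
// Serialize a millisecond epoch time as "YYYY-MM-DDTHH:MM:SS.NNNNNNNNNZ".
// Date.prototype.toISOString already yields "YYYY-MM-DDTHH:MM:SS.mmmZ" in UTC,
// so we only pad the fractional seconds from 3 digits to 9.
function toNanoTimestamp(epochMillis) {
  return new Date(epochMillis).toISOString().replace('Z', '000000Z');
}

// Deserialize by truncating the nanoseconds back to milliseconds.
function fromNanoTimestamp(text) {
  return Date.parse(text.replace(/(\.\d{3})\d{6}Z$/, '$1Z'));
}
```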
44 |
45 |
46 | ### Status
47 |
48 | Decided.
49 |
50 |
51 | ## Details
52 |
53 |
54 | ### Assumptions
55 |
56 | We need to handle these timestamp text strings, to convert from a timestamp to a string (a.k.a. serialize) and convert from a string to a timestamp (a.k.a. deserialize).
57 |
58 | We want a format that is generally easy to use, easy to convert, and easy for a person to read.
59 |
60 | We want compatibility with a wide range of external systems that we cannot control, such as analytics systems, database systems, financial systems.
61 |
62 |
63 | ### Constraints
64 |
65 | Some systems have time precision limitations. For example, the macOS operating system `date` command can print time precision in seconds, but not in nanoseconds.
66 |
67 |
68 | ### Positions
69 |
70 | We considered a range of options:
71 |
72 | * Unix epoch i.e. one incrementing number.
73 |
74 | * Terse text format "YYYYMMDDTHHMMSSNNNNNNNNN".
75 |
76 | * Using a local time zone vs. the UTC time zone.
77 |
78 |
79 | ### Argument
80 |
81 | For typical use, we value easy to read/write by humans, more than raw speed/size.
82 |
83 | For typical use, we want a format that works fine in machine systems, and also works well manually, such as writing sample data, reading JSON output, grepping a log file, etc.
84 |
85 | For atypical use, such as high performance computing, we expect we'll want to optimize any text format we choose by converting the text to a faster format, such as a programming language's built-in date object type. So the text format doesn't matter much for HPC.
86 |
87 |
88 | ### Implications
89 |
90 | Our various text systems and time systems will converge on this format.
91 |
92 |
93 | ## Related
94 |
95 |
96 | ### Related decisions
97 |
98 | We may want a fast/easy way to also track time deltas a.k.a. durations. These are easy with Unix epoch timestamps.
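For example, with epoch timestamps a duration is plain subtraction (illustrative values):

```javascript
// Durations fall out of simple arithmetic on Unix epoch timestamps (ms here).
const startMs = Date.parse('2024-01-01T00:00:00.000Z');
const endMs = Date.parse('2024-01-01T00:00:01.500Z');
const durationMs = endMs - startMs; // 1500 milliseconds
```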
99 |
100 |
101 | ### Related requirements
102 |
103 | We may want to adjust our decision e.g. if we have a related requirement for a specific kind of logging message stamp, such as for Splunk, Sumo, ELK, etc.
104 |
105 |
106 | ### Related artifacts
107 |
108 | Language formatters and parsers:
109 |
110 | * [date-fns: Modern JavaScript date utility library](https://date-fns.org/)
111 | * [Chrono: date and time library for Rust](https://github.com/chronotope/chrono)
112 |
113 | Rosetta Code examples:
114 |
115 | * [System time](https://www.rosettacode.org/wiki/System_time)
116 | * [Date format](https://www.rosettacode.org/wiki/Date_format)
117 | * [Show the epoch](https://www.rosettacode.org/wiki/Show_the_epoch)
118 |
119 | SixArm examples:
120 |
121 | * [now_string](https://github.com/SixArm/rosetta_code/tree/master/tasks/now_string)
122 |
123 |
124 | ### Related principles
125 |
126 | Easily reversible. We can change pretty easily to a different format, such as Unix epoch.
127 |
128 | Defer premature optimization. For typical use we don't care much about a handful of extra characters such as a format that uses dashes and colons.
129 |
130 |
131 | ## Notes
132 |
133 | Add notes here.
134 |
--------------------------------------------------------------------------------
/examples/css-framework/index.md:
--------------------------------------------------------------------------------
1 | # CSS framework
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 |
22 |
23 | ## Summary
24 |
25 |
26 | ### Issue
27 |
28 | We want to use a CSS framework to create our web applications:
29 |
30 | * We want user experience to be fast and reliable, on all popular browsers and screen sizes.
31 |
32 | * We want rapid iteration on design, layout, UI/UX, etc.
33 |
34 | * We want responsive applications, especially for smaller screens such as on mobile devices, larger screens such as on 4K widescreens, and dynamic screens such as rotatable displays.
35 |
36 |
37 | ### Decision
38 |
39 | Decided on Bulma.
40 |
41 |
42 | ### Status
43 |
44 | Decided on Bulma. Open to new CSS framework choices as they arrive.
45 |
46 |
47 | ## Details
48 |
49 |
50 | ### Assumptions
51 |
52 | We want to create web apps that are modern, fast, reliable, responsive, etc.
53 |
54 | Typical modern web apps are reducing/eliminating the use of jQuery because of multiple reasons:
55 |
56 | * Modern JavaScript is phasing in many capabilities that jQuery has provided, so jQuery is less needed, and there are better/faster/smaller modules that provide specific implementations
57 |
58 | * jQuery's broad approach is to do direct DOM manipulation, which is an anti-pattern for modern JavaScript frameworks (e.g. React, Vue, Svelte)
59 |
60 | * jQuery interferes with itself if it's loaded twice, etc.
61 |
62 |
63 | ### Constraints
64 |
65 | If we choose a CSS framework that uses jQuery, then we're stuck importing jQuery. For example, Semantic UI uses jQuery, and Tachyons does not.
66 |
67 | If we choose a CSS framework that is minimal, then we forego framework components that we may want now or soon. For example, Semantic UI provides an image carousel, and Tachyons does not.
68 |
69 |
70 | ### Positions
71 |
72 | We considered using no framework. This still seems viable, especially because CSS grid provides much of what we need for our project.
73 |
74 | We considered many CSS frameworks using a quick shortlist triage: Bootstrap, Bulma, Foundation, Materialize, Semantic UI, Tachyons, etc. Our two selections for deeper review are Semantic UI (because it has the most-semantic approach) and Bulma (because it has the lightest-weight approach that provides the components we want now).
75 |
76 | We considered Semantic UI. It provides many components, including ones we want for our project: tabs, grids, buttons, etc. We piloted Semantic UI two ways: using typical CDN files, and using NPM repos. We succeeded with Semantic UI in a static HTML page, but did not succeed within our timebox at building a JavaScript SPA (primarily because of jQuery load issues). We discovered that other coders have been asking the Semantic UI developers for a jQuery-free version for many years, for the same reasons we have, yet the developers have said no, stating that any jQuery-free version would be too hard to write, e.g. ~"the Semantic UI project has more than 22,000 touchpoints that use jQuery".
77 |
78 | Example with Semantic:
79 |
80 | ```html
81 |
85 | ```
86 |
87 | We considered Bulma. Bulma has many capabilities similar to Semantic UI, although not as many sophisticated components. Bulma is built with modern techniques, such as no jQuery. Bulma has some third-party components, some of which we may want to use.
88 |
89 |
90 | Example with Bulma:
91 | ```html
92 |
98 | ```
99 |
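As a minimal sketch of typical Bulma markup (class names follow Bulma's conventions; the CDN URL version and the content are illustrative, not project code):

```html
<!-- Illustrative only: a Bulma-styled notification and button.
     No jQuery or any JavaScript is required for styling. -->
<link rel="stylesheet"
      href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css">
<section class="section">
  <div class="notification is-primary">
    Bulma components are styled purely with CSS classes.
  </div>
  <button class="button is-link">Save</button>
</section>
```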
100 |
101 | ### Argument
102 |
103 | As above.
104 |
105 | Specifically, Semantic UI seems to have a caution flag both in terms of technology (i.e. so many jQuery touchpoints) and also in terms of leadership (i.e. jQuery-free was a hard no, rather than attempting a roadmap, or continuous improvement, or donation fundraising, etc.).
106 |
107 |
108 | ### Implications
109 |
110 | If we find a good non-jQuery CSS framework, this is generally helpful and good overall.
111 |
112 |
113 | ## Related
114 |
115 |
116 | ### Related decisions
117 |
118 | The CSS framework we choose may affect testability.
119 |
120 |
121 | ### Related requirements
122 |
123 | We want to ship a purely-modern app fast.
124 |
125 | We do not want to spend time working on older frameworks (esp. Semantic UI) using older dependencies (esp. jQuery).
126 |
127 |
128 | ### Related artifacts
129 |
130 | Affects all the typical HTML that will use the CSS.
131 |
132 |
133 | ### Related principles
134 |
135 | Easily reversible.
136 |
137 | Need for speed.
138 |
139 |
140 | ## Notes
141 |
142 | Any notes here.
143 |
--------------------------------------------------------------------------------
/examples/programming-languages/index.md:
--------------------------------------------------------------------------------
1 | # Programming languages
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 |
22 |
23 | ## Summary
24 |
25 |
26 | ### Issue
27 |
28 | We need to choose programming languages for our software. We have two major needs: a front-end programming language suitable for web applications, and a back-end programming language suitable for server applications.
29 |
30 |
31 | ### Decision
32 |
33 | We are choosing TypeScript for the front-end.
34 |
35 | We are choosing Rust for the back-end.
36 |
37 |
38 | ### Status
39 |
40 | Decided. We are open to new alternatives as they arise.
41 |
42 |
43 | ## Details
44 |
45 |
46 | ### Assumptions
47 |
48 | The front-end applications are typical:
49 |
50 | * Typical users and interactions
51 |
52 | * Typical browsers and systems
53 |
54 | * Typical developments and deployments
55 |
56 | The front-end applications are likely to evolve quickly:
57 |
58 | * We want to ensure fast easy developments, deployments, iterations, etc.
59 |
60 | * We value provability, such as type safety, and we are fine doing a bit more work to achieve it.
61 |
62 | * We do not need legacy compatibility.
63 |
64 | The back-end applications have higher-than-typical needs:
65 |
66 | * Higher-than-typical goals for quality, especially provability, reliability, security, etc.
67 |
68 | * Higher-than-typical goals for near-real-time, i.e. we do not want pauses due to virtual machine garbage collection.
69 |
70 | * Higher-than-typical goals for functional programming, especially for parallelization, multi-core processing, and memory safety.
71 |
72 | We accept slower compile times in exchange for compile-time safety and faster runtimes.
73 |
74 |
75 | ### Constraints
76 |
77 | We have a strong constraint on languages that are usable with major cloud provider services for functions, such as AWS Lambda.
78 |
79 |
80 | ### Positions
81 |
82 | We considered these languages:
83 |
84 | * C
85 |
86 | * C++
87 |
88 | * Clojure
89 |
90 | * Elixir
91 |
92 | * Erlang
93 |
94 | * Elm
95 |
96 | * Flow
97 |
98 | * Go
99 |
100 | * Haskell
101 |
102 | * Java
103 |
104 | * JavaScript
105 |
106 | * Kotlin
107 |
108 | * Python
109 |
110 | * Ruby
111 |
112 | * Rust
113 |
114 | * TypeScript
115 |
116 |
117 |
118 | ### Argument
119 |
120 | Summary per language:
121 |
122 | * C: rejected because of low safety; Rust can do nearly everything better.
123 |
124 | * C++: rejected because it's a mess; Rust can do nearly everything better.
125 |
126 | * Clojure: excellent modeling; best Lisp approximation; great runtime on the JVM.
127 |
128 | * Elixir: excellent runtime including deployability and concurrency; excellent developer experience; relatively small ecosystem.
129 |
130 | * Erlang: excellent runtime including deployability and concurrency; challenging developer experience; relatively small ecosystem.
131 |
132 | * Elm: looks very promising; IBM is publishing major case studies with good results; smaller ecosystem.
133 |
134 | * Flow: interesting improvement over JavaScript; however, developers are moving away from it.
135 |
136 | * Go: excellent developer experience; excellent concurrency; but a track record of bad decisions that cripple the language.
137 |
138 | * Haskell: best functional language; smaller developer community; hasn't achieved enough published production successes.
139 |
140 | * Java: excellent runtime; excellent ecosystem; sub-par developer experience.
141 |
142 | * JavaScript: most popular language ever; most widespread ecosystem.
143 |
144 | * Kotlin: fixes so much of Java; excellent backing by JetBrains; good published cases of porting from Java to Kotlin.
145 |
146 | * Python: most popular language for systems administration; great analytics tooling; good web frameworks; but abandoned by Google in favor of Go.
147 |
148 | * Ruby: best developer experience ever; best web frameworks; nicest community; but very slow; somewhat hard to package.
149 |
150 | * Rust: best new language; zero-cost-abstraction emphasis; concurrency emphasis; however relatively small ecosystem; and deliberate limits on some kinds of compiler optimizations, e.g. direct memory access must be explicitly marked unsafe.
151 |
152 | * TypeScript: adds types to JavaScript; great transpiler; growing developer emphasis on porting from JavaScript to TypeScript; strong backing from Microsoft.
153 |
154 | We decided that VM-based languages carry tradeoffs we do not need right now, such as the additional complexity of a managed runtime.
155 |
156 | We believe that our core decision is driven by two cross-cutting concerns:
157 |
158 | * For fastest runtime speed and tightest system access, we would choose JavaScript and C.
159 |
160 | * For close-to-fastest runtime speed and close-to-tightest system access, we choose TypeScript and Rust.
161 |
162 | Honorable mentions go to the VM languages and web frameworks that we would choose if we wanted a VM language:
163 |
164 | * Clojure and Luminus
165 |
166 | * Java and Spring
167 |
168 | * Elixir and Phoenix
169 |
170 |
171 | ### Implications
172 |
173 | Front-end developers will need to learn TypeScript. This is likely an easy learning curve if the developer's primary experience is using JavaScript.
174 |
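The learning curve is gentle because TypeScript is, roughly, JavaScript plus type annotations. A hypothetical sketch (the function and names are illustrative, not project code):

```typescript
// Plain JavaScript would be: function greet(name, times) { ... }
// TypeScript adds parameter and return types, checked at compile time.
function greet(name: string, times: number): string {
  return Array.from({ length: times }, () => `Hello, ${name}!`).join(" ");
}

// Passing a string for `times` would now be a compile-time error.
console.log(greet("team", 2));
```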
175 | Back-end developers will need to learn Rust. This is likely a moderate learning curve if the developer's primary experience is using C/C++, and a hard learning curve if the developer's primary experience is using Java, Python, Ruby, or similar memory-managed languages.
176 |
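The hard part of that curve is ownership: Rust enforces memory safety at compile time instead of with a garbage collector. A minimal hypothetical sketch:

```rust
// Passing a String by value moves ownership into the function;
// the compiler then forbids using the moved value in the caller.
fn consume(s: String) -> usize {
    s.len()
} // `s` is dropped here, deterministically, with no GC pause

fn main() {
    let message = String::from("hello");
    let n = consume(message);
    // println!("{}", message); // would not compile: value was moved
    println!("length = {}", n);
}
```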
177 | TypeScript and Rust are both relatively new. This means that many tools do not yet have documentation for these languages. For example, the devops pipeline will need to be set up for these languages, and so far, none of the devops tools that we are evaluating have default examples for these languages.
178 |
179 | Compile times for TypeScript and Rust are quite slow. Some of this may be due to the newness of the languages. We may want to look at how to mitigate slow compile times, such as by compile-on-demand, compile-concurrency, etc.
180 |
181 | IDE support for these languages is not yet ubiquitous and not yet first-class. For example, JetBrains sells the PyCharm IDE for first-class support for Python, but does not sell an IDE with first-class support for Rust; instead, JetBrains offers a Rust plug-in that provides perhaps 80% of Rust language support vis-à-vis Python language support.
182 |
183 |
184 | ## Related
185 |
186 |
187 | ### Related decisions
188 |
189 | We will aim toward ecosystem choices that align with these languages.
190 |
191 | For example, we want to choose an IDE that has good capabilities for these languages.
192 |
193 | For example, for our front-end web framework, we are more likely to decide on a framework that tends to aim toward TypeScript (e.g. Vue) than a framework that tends to aim toward plain JavaScript (e.g. React).
194 |
195 |
196 | ### Related requirements
197 |
198 | Our entire toolchain must support these languages.
199 |
200 |
201 | ### Related artifacts
202 |
203 | We expect we may export some secrets to environment variables.
204 |
205 |
206 | ### Related principles
207 |
208 | Measure twice, build once. We are prioritizing some safety over some speed.
209 |
210 | Runtime is more valuable than compile time. We are prioritizing customer usage over developer usage.
211 |
212 |
213 | ## Notes
214 |
215 | Any notes here.
216 |
--------------------------------------------------------------------------------
/examples/microsoft-azure-devops/index.md:
--------------------------------------------------------------------------------
1 | # Microsoft Azure DevOps
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 | * [Microsoft Devops CI: An Unsatisfying Adventure](#microsoft-devops-ci-an-unsatisfying-adventure)
22 | * [Hacker News discussion highlights](#hacker-news-discussion-highlights)
23 | * [Windows Development MVP](#windows-development-mvp)
24 | * [Edward Thomson (Azure PM) summary](#edward-thomson-azure-pm-summary)
25 |
26 |
27 | ## Summary
28 |
29 |
30 | ### Issue
31 |
32 | We want to use devops to build, integrate, deploy, and host our projects. We are considering Microsoft Azure DevOps.
33 |
34 | * We want developer experience to be fast and reliable, for the setup of the devops e.g. configuring as well as ongoing use e.g. fast build times.
35 |
36 | * We want to consider using Microsoft Azure as whole, for hosting the project apps, databases, etc.
37 |
38 |
39 | ### Decision
40 |
41 | Decided against Microsoft Azure DevOps.
42 |
43 |
44 | ### Status
45 |
46 | Decided. Open to revisiting if/when new significant info arrives.
47 |
48 |
49 | ## Details
50 |
51 |
52 | ### Assumptions
53 |
54 | All the usual devops assumptions, such as in the book Accelerate.
55 |
56 | * Fast builds are a significant help. This accelerates the feedback loops.
57 |
58 | * We can swap in/out pieces from alternate vendors, i.e. we may want to bring our own higher-speed build servers, or use our own choice of version control system, or coordinate with a self-hosted continuous integration server.
59 |
60 | * Streamlined usability is a significant help for developer experience, and in turn for subtle areas such as consistency, clarity, security, and ease of learning curve.
61 |
62 | * When anything is broken or problematic, we want an effective way to report the issue. This is especially important for any security-related issues.
63 |
64 |
65 | ### Constraints
66 |
67 | None known. Azure has a published commitment to playing well with external tools.
68 |
69 |
70 | ### Positions
71 |
72 | We considered using Microsoft Azure DevOps vs. AWS, which is the incumbent.
73 |
74 | We experimented with Azure DevOps, Azure Pipelines, Azure Repos, and Azure spin-up of a new server via Terraform.
75 |
76 | We experimented with getting support from Microsoft representatives.
77 |
78 | We gathered information from peers on blogs and Hacker News.
79 |
80 |
81 | ### Argument
82 |
83 | Azure DevOps advertises an excellent set of offerings, but the offerings do not hold up, they do not work well together, and support is poor.
84 |
85 | Our firsthand experience:
86 |
87 | * Azure setup is a mess of UIs, some of which overlap with Microsoft accounts, some of which don't. E.g. there's an Azure sign in, a Microsoft.com sign in, a Live.com sign in, etc. and all are simultaneously in play.
88 |
89 | * We encountered a minor security issue during setup, and found no resolution. We tried many ways to report it, to many Microsoft reps, with no success. We successfully reported it to Microsoft security, which replied with "won't fix".
90 |
91 | * Documentation is often either wrong or outdated. At least some of this is due to Microsoft's poor search engine, and some of this is due to sub-par SEO.
92 |
93 | * Terraform setup is well documented, and works. However, Terraform support is weak compared to AWS, because Microsoft is building business relationships with vendors to do chain-through Terraform setup examples.
94 |
95 | Our peer experiences:
96 |
97 | * After we did our own blind assessment, we looked for peer experiences. What we found confirmed our experiences.
98 |
99 | * Peers reported additional problems with build times, and problems with bring-your-own build servers. These problems are significantly more severe than UI problems, because doing builds is the core purpose of a build pipeline, and we expect to do many per day.
100 |
101 | * We found excellent participation by Azure teammates in the discussion areas. Kudos to Microsoft for this. We are especially impressed with Edward Thomson, Azure PM and coder, because of his participation, directness, and technical explanations.
102 |
103 |
104 | ### Implications
105 |
106 | Choosing Microsoft Azure DevOps looks likely to be more expensive (~3x) in time and cost than not choosing Azure.
107 |
108 |
109 | ## Related
110 |
111 |
112 | ### Related decisions
113 |
114 | If we choose Azure DevOps, there are many related offerings, including Azure Repos, Azure Pipelines, etc. We believe that if we choose Azure DevOps, this may make it easier to use more Azure capabilities, or may make it harder to use other vendors' capabilities.
115 |
116 | We believe that Microsoft is making great strides in developer experience, and we see Microsoft making large acquisitions of developer tools (e.g. GitHub) and dependencies (e.g. Citus).
117 |
118 | If we choose Azure DevOps, then we may want to emphasize choosing the Microsoft acquisition offerings, and we may also want to approach the acquisition offerings with more care/assessment because of potential tissue rejection, e.g. staff turnover risk.
119 |
120 |
121 | ### Related requirements
122 |
123 | We want build times to be very fast. We accept paying a high premium for this. This is because we want to iterate very fast.
124 |
125 | We want reliability to be very high. We accept paying a high premium for this. This is because we are testing high-value use cases, including financial transactions, confidential transactions, etc.
126 |
127 | Our top 4 devops KPIs include mean time to recovery, which necessitates fast builds and high reliability.
128 |
129 |
130 | ### Related artifacts
131 |
132 | We want the build system to output artifacts suitable for use in other systems, such as Artifactory.
133 |
134 |
135 | ### Related principles
136 |
137 | Easily reversible. We can evaluate Azure DevOps in parallel with AWS incumbent.
138 |
139 |
140 | ## Notes
141 |
142 |
143 | ### Microsoft Devops CI: An Unsatisfying Adventure
144 |
145 | https://toxicbakery.github.io/vsts-devops/microsoft-devops-ci/
146 |
147 | Blog post.
148 |
149 | "As a software developer, I know first-hand how difficult it is to build quality products quickly and cheaply. It’s an art form that we sometimes get right, and other times devolves into something akin to the Obama era healthcare government site. Our level of control over the resulting product varies, and blame for failure often falls on the wrong people in the decision-making hierarchy. Microsoft’s Azure DevOps (formerly known as Visual Studio Team Services), despite clearly good intentions, is a perfect storm of bad decisions and poor execution."
150 |
151 |
152 | ### Hacker News discussion highlights
153 |
154 | https://news.ycombinator.com/item?id=18983586
155 |
156 | "We use Azure DevOps extensively at my work and, after having used GitHub, Gitlab, self hosted solutions, Jenkins, TeamCity... Azure DevOps ranks dead last."
157 |
158 | "The UI is terribly clunky everywhere. The worst for me are pull requests. Incredibly tough to work with people on a pull request. I can't even point you to "a" particular problem - for us it's broken everywhere."
159 |
160 | "Azure Devops is something I want to love. The UI keeps changing, but doesn't fix underlying bugs that have been around for ages."
161 |
162 | "The tools are not well integrated, the UI is really slow, there’s no dashboard view of active pull requests, builds, releases, etc for my favorite repos. Build/Deploy times are insanely slow."
163 |
164 | "We tried to also use Azure Boards (Work Items, Boards, Backlogs, etc). Ouch. It is a complete UI mess of disjointed ideas. Instead of implementing one thing well, they implemented two dozen things terribly."
165 |
166 |
167 | ### Windows Development MVP
168 |
169 | Windows Development MVP here. I feel like I must shoulder some of the responsibility here for not being louder about these issues. But must say, I'm disappointed to hear you're "surprised" about the UX issues. I've been telling your folks the UX is dreadful (e.g. as far back as pre-launch) and kept hearing back "we know, we're fixing it". I'll start formalizing the feedback and push it through the pipes, stay tuned. I'm also local (Bellevue), would love to come in and try to pipeline our relatively simple oss .net/wpf/uwp app. I suspect it'll be an eye opener for the both of us.
170 |
171 | Some examples:
172 |
173 | * You can't build a pipeline with a git repo that contains submodules
174 |
175 | * Found it impossible to edit the PATH for some custom tooling
176 |
177 | * The New Pipeline experience just doesn't make a lot of sense, new users clicking around will eventually end up at the wrong Docs.
178 |
179 |
180 | ### Edward Thomson (Azure PM) summary
181 |
182 | I wrote the code that merges your pull requests. Program Manager at Microsoft for Azure DevOps; formerly a software engineer on version control tools at GitHub, Microsoft, SourceGear.
183 |
184 | https://www.edwardthomson.com/
185 |
186 | Co-maintainer of libgit2. https://libgit2.github.io
187 |
188 | Co-host of All Things Git, the Podcast about Git. https://www.allthingsgit.com/
189 |
190 | Curator of Developer Tools Weekly, a newsletter about development tools. https://developertoolsweekly.com/
191 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
9 |
10 | # Architecture decision record (ADR)
11 |
12 | An architecture decision record (ADR) is a document that captures an important architecture decision made along with its context and consequences.
13 |
14 | Contents:
15 |
16 | * [What is an architecture decision record?](#what-is-an-architecture-decision-record)
17 | * [How to start using ADRs](#how-to-start-using-adrs)
18 | * [How to start using ADRs with tools](#how-to-start-using-adrs-with-tools)
19 | * [How to start using ADRs with git](#how-to-start-using-adrs-with-git)
20 | * [ADR file name conventions](#adr-file-name-conventions)
21 | * [Suggestions for writing good ADRs](#suggestions-for-writing-good-adrs)
22 | * [ADR example templates](#adr-example-templates)
23 | * [Teamwork advice](#teamwork-advice)
24 | * [For more information](#for-more-information)
25 |
26 | Templates:
27 |
28 | * [Decision record template by Jeff Tyree and Art Akerman](templates/decision-record-template-by-jeff-tyree-and-art-akerman/index.md)
29 | * [Decision record template by Michael Nygard](templates/decision-record-template-by-michael-nygard/index.md)
30 | * [Decision record template for Alexandrian pattern](templates/decision-record-template-for-alexandrian-pattern/index.md)
31 | * [Decision record template for business case](templates/decision-record-template-for-business-case/index.md)
32 | * [Decision record template for MADR](templates/decision-record-template-madr/index.md)
33 | * [Decision record template using Planguage](templates/decision-record-template-using-planguage/index.md)
34 | * [Decision record template by Paulo Merson](https://github.com/pmerson/ADR-template)
35 |
36 | Examples:
37 |
38 | * [CSS framework](examples/css-framework/index.md)
39 | * [Environment variable configuration](examples/environment-variable-configuration/index.md)
40 | * [Metrics, monitors, alerts](examples/metrics-monitors-alerts/index.md)
41 | * [Microsoft Azure DevOps](examples/microsoft-azure-devops/index.md)
42 | * [Monorepo vs multirepo](examples/monorepo-vs-multirepo/index.md)
43 | * [Programming languages](examples/programming-languages/index.md)
44 | * [Secrets storage](examples/secrets-storage/index.md)
45 | * [Timestamp format](examples/timestamp-format/index.md)
46 |
47 |
48 | ## What is an architecture decision record?
49 |
50 | An **architecture decision record** (ADR) is a document that captures an important architectural decision made along with its context and consequences.
51 |
52 | An **architecture decision** (AD) is a software design choice that addresses a significant requirement.
53 |
54 | An **architecture decision log** (ADL) is the collection of all ADRs created and maintained for a particular project (or organization).
55 |
56 | An **architecturally-significant requirement** (ASR) is a requirement that has a measurable effect on a software system’s architecture.
57 |
58 | All these are within the topic of **architecture knowledge management** (AKM).
59 |
60 | The goal of this document is to provide a fast overview of ADRs, how to create them, and where to look for more information.
61 |
62 | Abbreviations:
63 |
64 | * **AD**: architecture decision
65 |
66 | * **ADL**: architecture decision log
67 |
68 | * **ADR**: architecture decision record
69 |
70 | * **AKM**: architecture knowledge management
71 |
72 | * **ASR**: architecturally-significant requirement
73 |
74 |
75 | ## How to start using ADRs
76 |
77 | To start using ADRs, talk with your teammates about these areas.
78 |
79 | Decision identification:
80 |
81 | * How urgent and how important is the AD?
82 |
83 | * Does it have to be made now, or can it wait until more is known?
84 |
85 | * Both personal and collective experience, as well as recognized design methods and practices, can assist with decision identification.
86 |
87 | * Ideally maintain a decision todo list that complements the product todo list.
88 |
89 | Decision making:
90 |
91 | * A number of decision making techniques exist, both general ones and software architecture specific ones, for instance, dialogue mapping.
92 |
93 | * Group decision making is an active research topic.
94 |
95 | Decision enactment and enforcement:
96 |
97 | * ADs are used in software design; hence they have to be communicated to, and accepted by, the stakeholders of the system that fund, develop, and operate it.
98 |
99 | * Architecturally evident coding styles and code reviews that focus on architectural concerns and decisions are two related practices.
100 |
101 | * ADs also have to be (re-)considered when modernizing a software system in software evolution.
102 |
103 | Decision sharing (optional):
104 |
105 | * Many ADs recur across projects.
106 |
107 | * Hence, experiences with past decisions, both good and bad, can be valuable reusable assets when employing an explicit knowledge management strategy.
108 |
109 | * Group decision making is an active research topic.
110 |
111 | Decision documentation:
112 |
113 | * Many templates and tools for decision capturing exist.
114 |
115 | * See agile communities, e.g. M. Nygard's ADRs.
116 |
117 | * See traditional software engineering and architecture design processes, e.g. table layouts suggested by IBM UMF and by Tyree and Akerman from CapitalOne.
118 |
119 | Decision guidance:
120 |
121 | * The steps above are adopted from the Wikipedia entry on [Architectural Decision](https://en.wikipedia.org/wiki/Architectural_decision)
124 |
125 |
126 | ## How to start using ADRs with tools
127 |
128 | You can start using ADRs with tools any way you want.
129 |
130 | For example:
131 |
132 | * If you like using Google Drive and online editing, then you can create a Google Doc, or Google Sheet.
133 |
134 | * If you like using source code version control, such as git, then you can create a file for each ADR.
135 |
136 | * If you like using project planning tools, such as Atlassian Jira, then you can use the tool's planning tracker.
137 |
138 | * If you like using wikis, such as MediaWiki, then you can create an ADR wiki.
139 |
140 |
141 | ## How to start using ADRs with git
142 |
143 | If you like using git version control, then here is how we like to start using ADRs with git for a typical software project with source code.
144 |
145 | Create a directory for ADR files:
146 |
147 | ```sh
148 | $ mkdir adr
149 | ```
150 |
151 | For each ADR, create a text file, such as `database.txt`:
152 |
153 | ```sh
154 | $ vi database.txt
155 | ```
156 |
157 | Write anything you want in the ADR. See the templates in this repository for ideas.
158 |
159 | Commit the ADR to your git repo.
160 |
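Putting those steps together, a session might look like this (directory and file names are illustrative; this sketch initializes a throwaway repo so it runs standalone):

```shell
# Start a throwaway repo for demonstration purposes
mkdir -p adr-demo && cd adr-demo
git init -q

# Create the ADR directory and a first decision record
mkdir -p adr
printf '# Choose database\n\n## Status\n\nAccepted\n' > adr/choose-database.md

# Commit the ADR (inline user config keeps the demo self-contained)
git add adr/choose-database.md
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Add ADR: choose database"
```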
161 |
162 | ## ADR file name conventions
163 |
164 | If you choose to create your ADRs using typical text files, then you may want to come up with your own ADR file name convention.
165 |
166 | We prefer to use a file name convention that has a specific format.
167 |
168 | Examples:
169 |
170 | * choose-database.md
171 |
172 | * format-timestamps.md
173 |
174 | * manage-passwords.md
175 |
176 | * handle-exceptions.md
177 |
178 | Our file name convention:
179 |
180 | * The name has a present tense imperative verb phrase. This helps readability and matches our commit message format.
181 |
182 | * The name uses lowercase and dashes (same as this repo). This is a balance of readability and system usability.
183 |
184 | * The extension is markdown. This can be useful for easy formatting.
185 |
186 |
187 | ## Suggestions for writing good ADRs
188 |
189 | Characteristics of a good ADR:
190 |
191 | * Rationale: Explain the reasons for making the particular AD. This can include the context (see below), pros and cons of various potential choices, feature comparisons, cost/benefit discussions, and more.
192 |
193 | * Specific: Each ADR should be about one AD, not multiple ADs.
194 |
195 | * Timestamps: Identify when each item in the ADR is written. This is especially important for aspects that may change over time, such as costs, schedules, scaling, and the like.
196 |
197 | * Immutable: Don't alter existing information in an ADR. Instead, amend the ADR by adding new information, or supersede the ADR by creating a new ADR.
198 |
199 | Characteristics of a good "Context" section in an ADR:
200 |
201 | * Explain your organization's situation and business priorities.
202 |
203 | * Include rationale and considerations based on social and skills makeups of your teams.
204 |
205 | * Include pros and cons that are relevant, and describe them in terms that align with your needs and goals.
206 |
207 | Characteristics of a good "Consequences" section in an ADR:
208 |
209 | * Explain what follows from making the decision. This can include the effects, outcomes, outputs, follow ups, and more.
210 |
211 | * Include information about any subsequent ADRs. It's relatively common for one ADR to trigger the need for more ADRs, such as when one ADR makes a big overarching choice, which in turn creates the need for more, smaller decisions.
212 |
213 | * Include any after-action review processes. It's typical for teams to review each ADR one month later, to compare the ADR information with what's happened in actual practice, in order to learn and grow.
214 |
215 | A new ADR may take the place of a previous ADR:
216 |
217 | * When an AD is made that replaces or invalidates a previous ADR, then a new ADR should be created.
218 |
219 |
220 | ## ADR example templates
221 |
222 | ADR example templates that we have collected on the net:
223 |
224 | * [ADR template by Michael Nygard](templates/decision-record-template-by-michael-nygard/index.md) (simple and popular)
225 |
226 | * [ADR template by Jeff Tyree and Art Akerman](templates/decision-record-template-by-jeff-tyree-and-art-akerman/index.md) (more sophisticated)
227 |
228 | * [ADR template for Alexandrian pattern](templates/decision-record-template-for-alexandrian-pattern/index.md) (simple with context specifics)
229 |
230 | * [ADR template for business case](templates/decision-record-template-for-business-case/index.md) (more MBA-oriented, with costs, SWOT, and more opinions)
231 |
232 | * [ADR template MADR](templates/decision-record-template-madr/index.md) (more Markdown)
233 |
234 | * [ADR template using Planguage](templates/decision-record-template-using-planguage/index.md) (more quality assurance oriented)
235 |
236 |
237 | ## Teamwork advice
238 |
239 | If you're considering using decision records with your team, then here's some advice that we've learned by working with many teams.
240 |
241 | You have an opportunity to lead your teammates, by talking together about the "why", rather than mandating the "what". For example, decision records are a way for teams to think smarter and communicate better; decision records are not valuable if they're just an after-the-fact forced paperwork requirement.
242 |
243 | Some teams much prefer the name "decisions" over the abbreviation "ADRs". When some teams use the directory name "decisions", then it's as if a light bulb turns on, and the team starts putting more information into the directory, such as vendor decisions, planning decisions, scheduling decisions, etc. All of these kinds of information can use the same template. We hypothesize that people learn faster with words ("decisions") than with abbreviations ("ADRs"), that people are more motivated to write work-in-progress docs when the word "record" is removed, and that some developers and some managers dislike the word "architecture".
244 |
245 | In theory, immutability is ideal. In practice, mutability has worked better for our teams. We insert the new info into the existing ADR, with a date stamp, and a note that the info arrived after the decision. This kind of approach leads to a "living document" that we all can update. Typical updates happen when we get information thanks to new teammates, or new offerings, or real-world results of our usage, or after-the-fact third-party changes such as vendor capabilities, pricing plans, license agreements, etc.
246 |
247 |
248 | ## For more information
249 |
250 | Introduction:
251 |
252 | * [Architectural decision (wikipedia.org)](https://wikipedia.org/wiki/Architectural_decision)
253 |
254 | * [Architecturally significant requirements (wikipedia.org)](https://wikipedia.org/wiki/Architecturally_significant_requirements)
255 |
256 | Templates:
257 |
258 | * [Documenting architecture decisions - Michael Nygard (thinkrelevance.com)](http://thinkrelevance.com/blog/2011/11/15/documenting-architecture-decisions)
259 |
260 | * [Markdown Architectural Decision Records (adr.github.io)](https://adr.github.io/madr/) - provided by the [adr GitHub organization](https://adr.github.io/)
261 |
262 | * [Template for documenting architecture alternatives and decisions (stackoverflow.com)](http://stackoverflow.com/questions/7104735/template-for-documenting-architecture-alternatives-and-decisions)
263 |
264 | In-depth:
265 |
266 | * [ADMentor XML project (github.com)](https://github.com/IFS-HSR/ADMentor)
267 |
268 | * [Architectural Decision Guidance across Projects: Problem Space Modeling, Decision Backlog Management and Cloud Computing Knowledge (ifs.hsr.ch)](https://www.ifs.hsr.ch/fileadmin/user_upload/customers/ifs.hsr.ch/Home/projekte/ADMentor-WICSA2015ubmissionv11nc.pdf)
269 |
270 | * [The Decision View's Role in Software Architecture Practice (computer.org)](https://www.computer.org/csdl/mags/so/2009/02/mso2009020036-abs.html)
271 |
272 | * [Documenting Software Architectures: Views and Beyond (resources.sei.cmu.edu)](http://resources.sei.cmu.edu/library/asset-view.cfm?assetID=30386)
273 |
274 | * [Architecture Decisions: Demystifying Architecture (utdallas.edu)](https://www.utdallas.edu/~chung/SA/zz-Impreso-architecture_decisions-tyree-05.pdf)
275 |
276 | * [ThoughtWorks Technology Radar: Lightweight Architecture Decision Records (thoughtworks.com)](https://www.thoughtworks.com/radar/techniques/lightweight-architecture-decision-records)
277 |
278 | Tools:
279 |
280 | * [Command-line tools for working with Architecture Decision Records](https://github.com/npryce/adr-tools)
281 |
282 | * [Command line tools with python by Victor Sluiter](https://bitbucket.org/tinkerer_/adr-tools-python/src/master/)
283 |
284 | Examples:
285 |
286 | * [Repository of Architecture Decision Records made for the Arachne Framework](https://github.com/arachne-framework/architecture)
287 |
288 | See also:
289 |
290 | * REMAP (Representation and Maintenance of Process Knowledge)
291 |
292 | * DRL (Decision Representation Language)
293 |
294 | * IBIS (Issue-Based Information System)
295 |
296 | * QOC (Questions, Options, and Criteria)
297 |
298 | * IBM’s e-Business Reference Architecture Framework
299 |
--------------------------------------------------------------------------------
/examples/secrets-storage/index.md:
--------------------------------------------------------------------------------
1 | # Secrets storage
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 | * [Vault by HashiCorp](#vault-by-hashicorp)
22 | * [LastPass](#lastpass)
23 | * [Bitwarden](#bitwarden)
24 | * [EnvKey](#envkey)
25 | * [Confidant by Lyft](#confidant-by-lyft)
26 | * [Devolutions Password Server](#devolutions-password-server)
27 | * [Secret Server by Thycotic](#secret-server-by-thycotic)
28 |
29 |
30 | ## Summary
31 |
32 |
33 | ### Issue
34 |
35 | We need to store secrets, such as passwords, private keys, authentication tokens, etc.
36 |
37 | Some of the secrets are user-oriented. For example, our developer wants to be able to use their mobile phone to look up a password to a service.
38 |
39 | Some of the secrets are system-oriented. For example, our continuous delivery pipeline needs to be able to look up the credentials for our cloud hosting.
40 |
41 |
42 | ### Decision
43 |
44 | Bitwarden for user-oriented secrets
45 |
46 | Vault by HashiCorp for system-oriented secrets.
47 |
48 |
49 | ### Status
50 |
51 | Decided. We are open to new alternatives as they arise.
52 |
53 |
54 | ## Details
55 |
56 |
57 | ### Assumptions
58 |
59 | For this purpose, and our current state, we value user-oriented convenience, such as usable mobile apps.
60 |
61 | * We want to ensure fast easy access on the go, such as for a developer doing on-call system reliability engineering.
62 |
63 | * We want to be able to share some secrets among selected people, such as a team.
64 |
65 | We are not trying to solve for single-provider, such as storing all secrets exclusively on Amazon or Azure or Google.
66 |
67 | We do not want ad-hoc approaches such as "remember it" or "write it on a note" or "figure out your own way to store it".
68 |
69 | Our security model for this purpose is fine with using well-respected COTS vendors, such as SaaS password management tools.
70 |
71 |
72 | ### Constraints
73 |
74 | Right now we want something that is easy, i.e. no need to write code, no need to install servers, no need to make a major commitment, no need to standardize everyone.
75 |
76 |
77 | ### Positions
78 |
79 | We considered:
80 |
81 | 1. User-oriented off-the-shelf password managers: LastPass, 1Password, Bitwarden, Dashlane, KeePass, pass, GPG, etc.
82 |
83 | 2. System-oriented COTS password managers: AWS KMS, Vault by HashiCorp, EnvKey, Secret Server by Thycotic, Devolutions Password Server, Confidant by Lyft.
84 |
85 | 3. Sharing-oriented approaches: using a shared Google document, or shared Slack channel, or shared network folder, etc.
86 |
87 | 4. Low-tech ad-hoc approaches, such as remembering, writing a note, or relying on each user to figure out their own approach.
88 |
89 |
90 | ### Argument
91 |
92 | Bitwarden, LastPass, 1Password, and Dashlane all are commercial off-the-shelf products.
93 |
94 | * Similar kinds of features for users, teams, organizations, etc.
95 |
96 | * Desktop capability for Windows and Mac, and mobile capability for Android and iOS.
97 |
98 | * Browser extensions for Chrome and Firefox, for automatic form fill in, etc.
99 |
100 | Bitwarden has two advantages over the others:
101 |
102 | * Bitwarden is open source, which means the security can be peer reviewed, and the company is widely appreciated by security-oriented developers.
103 |
104 | * Anecdotes by software workers describe a significant preference for Bitwarden over the others.
105 |
106 | A typical good example writeup: https://jcs.org/2017/11/17/bitwarden
107 |
108 | A typical side-by-side voting site: https://stackshare.io/stackups/bitwarden-vs-dashlane
109 |
110 | We defer KeePass, pass, GPG, etc. because of their additional complexity. All of these look like fine solutions for technical users. GPG looks especially good for technical users who want cross-system command-oriented capabilities.
111 |
112 | We defer KMS because it has single-provider lock-in.
113 |
114 | We choose Vault for system-oriented needs, because the reviews are amazingly positive, and because HashiCorp has an excellent track record of top-quality software and support.
115 |
116 | We veto the sharing-oriented approaches, such as shared documents, shared channels, shared network folders, etc. These do not provide the security qualities that we want.
117 |
118 | We veto the ad-hoc low-tech approaches, because we all agree it's not a long-term path forward.
119 |
120 |
121 | ### Implications
122 |
123 | Developers may need to track secrets in two places: Bitwarden for user-oriented access, and Vault for system-oriented access.
124 |
125 |
126 | ## Related
127 |
128 |
129 | ### Related decisions
130 |
131 | The decision of which CI/CD server must include proof of capability for accessing secrets.
132 |
133 | We will need to decide how to manage the secrets, in terms of policies, rotations, organizations, etc.
134 |
135 |
136 | ### Related requirements
137 |
138 | The secrets will have related requirements for compliance, auditing, and HR onboarding/offboarding.
139 |
140 |
141 | ### Related artifacts
142 |
143 | We expect we may export some secrets to environment variables.
144 |
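For example, with the Vault CLI a secret can be pulled into an environment variable at deploy time. A sketch, where the secret path and field name are hypothetical:

```console
$ export DB_PASSWORD="$(vault kv get -field=password secret/myapp/db)"
```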
145 |
146 | ### Related principles
147 |
148 | Easily reversible.
149 |
150 | Easily parallel i.e. it's easy to use a variety of password managers.
151 |
152 | Cheap to try i.e. there's a free trial and no commitment.
153 |
154 |
155 | ## Notes
156 |
157 | Evaluation notes here. The notes are all public comments on various devops discussion boards.
158 |
159 |
160 | ### Vault by HashiCorp
161 |
162 | Vault is exactly what you want here.
163 |
164 | Don't just throw Vault into production though, stand it up in a test environment first, because HashiCorp's documentation can be pretty lacking even if their products are amazing.
165 |
166 | Very steep learning curve and is not trivial to stand up.
167 |
168 | The initial setup is a bit of a pain. It's well worth it though, and the community will support it well enough for you to get by.
169 |
170 | Horrid docs but there are lots of guides online of people setting it up and if you put a few of them together you will have a working setup.
171 |
172 | Initial setup took fiddling with their helm charts (vault and consul). While technically you can use lots of other back-ends, I really really don't recommend it. The back-end/consul can be teeny tiny if you don't have a ton of data to store.
173 |
174 | Definitely get comfortable/familiar with using the CLI, because the GUI is more like a proof-of-concept/advertisement portal for their enterprise edition.
175 |
176 | The fact that you cannot just "fill it up" is a pain. For example if you have 5 fields you need to manually add each field for each item. So it's not like you pre-define fields for a specific category, and fill those fields for all the items in that category, it's more like "you generate everything every time", which (in my mind) is a pain in the ass.
177 |
178 | You may want to also look at goldfish as a UI on top of vault. Makes it rather nice to get your team on board with it. They also have a demo. 1. Set up consul. 2. Set up vault pointing to consul. 3. Set up goldfish pointing to vault. 4. Set up some cron job to run consul snapshot for backups.
179 |
180 |
181 |
182 | ### LastPass
183 |
184 | LastPass Teams. We use it, has custom templates, ACL, nothing missing IMO.
185 |
186 | I implemented LastPass at my org and give it a C+/B-. The biggest issue lately is a lack of reliability. In the past 90 days there have been multiple hours where vaults were forced into offline mode. This isn't ideal for my org due to having, literally, 4,000+ passwords stored across 20+ shared folders. As you can imagine with that many passwords at least a few get updated or added daily. We have a DR plan if issues last more than an hour or two: a script signs and encrypts a CSV dump of vault every night that can be imported into keepass.
187 |
188 | LastPass has had unreported blips of degraded service: login 'works' but doesn't pull sites, random features broken in the admin panel, and not properly sharing keys for new top level shared folders. I have a specific 'key push'/backup user that is in every group. Usually logging in as that user will fix any key sharing issues but not when the service is degraded despite what the status page says...
189 |
190 | For integration it can be easy if you have proper ACLs with a least privileged model, e.g. if a user has read & write and read only on an entry or folder, they only get read only permissions. Unfortunately my org's ACLs are not the best, so I ended up using the JSON provisioning API and ~500 lines of python due to the dependent nature of our hundreds of ACLs not mapping well to the least privileged model. I ended up getting all ACLs a user was in and doing a dependency walk of sorts.
191 |
192 | If your ACL or group structure is already built with a least privileged structure in mind the AD/LDAP sync tool for Windows will work well.
193 |
194 | Reach out to their sales team and they can hook you up with a longer Enterprise trial. Be sure you fully understand its limitations before pulling the trigger. We had a good number of growing pains but aside from outages or impairments on the server side it's been incredibly smooth.
195 |
196 |
197 | ### Bitwarden
198 |
199 | Bitwarden has nice tooling around it (WebUI, CLI, Mobile, Desktop). Self-hosted and fairly easy to set up. Fairly good documentation, and it's a recommended tool by PrivacyTools.
200 |
201 |
202 | ### EnvKey
203 |
204 | https://www.envkey.com/ is a SaaS. Really easy to implement, integrate, and manage.
205 |
206 | Features:
207 |
208 | * Protect API keys and credentials.
209 |
210 | * Keep configuration in sync everywhere.
211 |
212 | * Smart, end-to-end encrypted configuration and secrets management.
213 |
214 | * Prevent insecure sharing and config sprawl.
215 |
216 | * Integrate in minutes.
217 |
218 | Capabilities:
219 |
220 | * Manage configuration and access levels for all your apps, environments, and teams in one place.
221 |
222 | * Configure any development or server environment with just a single environment variable.
223 |
224 | Pros:
225 |
226 | * Good home page.
227 |
228 | * Clear value prop.
229 |
230 | * Visually excellent web app.
231 |
232 | * Superior example data e.g. Algolia, AWS, Datadog, GitHub, Stripe, etc.
233 |
234 | * Spoke with the founder for 30m about the company, UI, etc. Dane sounds well-informed, honest about the pros/cons, and a viable partner.
235 |
236 | * The company is essentially a typical Y Combinator company, with 1 founder. Raised $120K in 2018-01.
237 |
238 | * Focus is on getting to enterprise features, esp. moving from EnvKey cloud-hosting to either on-prem or BYOC.
239 |
240 | * Potential path forward starting with EnvKey for ease of use, then later (or in parallel) adding Vault.
241 |
242 |
243 | ### Confidant by Lyft
244 |
245 | https://lyft.github.io/confidant/
246 |
247 | Confidant is an open source secret management service that provides user-friendly storage and access to secrets in a secure way, from the developers at Lyft.
248 |
249 | KMS Authentication: Confidant solves the authentication chicken and egg problem by using AWS KMS and IAM to allow IAM roles to generate secure authentication tokens that can be verified by Confidant. Confidant also manages KMS grants for your IAM roles, which allows the IAM roles to generate tokens that can be used for service-to-service authentication, or to pass encrypted messages between services.
250 |
251 | At-rest encryption of versioned secrets: Confidant stores secrets in an append-only way in DynamoDB, generating a unique KMS data key for every revision of every secret, using Fernet symmetric authenticated cryptography.
252 |
253 | A user-friendly web interface for managing secrets: Confidant provides an AngularJS web interface that allows end-users to easily manage secrets, the mappings of secrets to services and the history of changes.
254 |
255 |
256 | ### Devolutions Password Server
257 |
258 | https://server.devolutions.net/
259 |
260 | Secure, manage, and monitor access to privileged accounts and sessions.
261 |
262 | A comprehensive, highly-secured password vault that lets you control access to your privileged accounts, while also improving overall network visibility for sysadmins and providing a seamless experience for end users.
263 |
264 | Features: centralized organization password vault, user-specific private vault, password manager, credential injection,
265 | Active Directory integration, role-based access control, two-factor authentication, enterprise ready, IP restrictions, management capabilities, automated password generator, mobile app access, password history, access reports, email alerts.
266 |
267 | * supports data encryption
268 |
269 | * supports multiple Authentication schemes including LDAP, O365, and Local users WITH support for MFA from multiple sources
270 |
271 | * multiple repositories/vaults with fine-grained access controls for multiple teams
272 |
273 | * modern Web UI
274 |
275 | * private credential and connection vaults for personal creds/connections
276 |
277 | * mobile apps for iOS/Android
278 |
279 | * audit logs for each entry, who/what/when with an optional prompt for why they're accessing
280 |
281 | * customizable templates (though they support hundreds of connection types natively)
282 |
283 | * tons more features and a Windows/Mac thick client (Remote Desktop Manager) that you can sync to that greatly expands the options...one-click connections
284 |
285 | * pricing isn't all that bad - up to 15 users is $500 a year for the password server
286 |
287 |
288 | ### Secret Server by Thycotic
289 |
290 | https://thycotic.com/products/secret-server/
291 |
292 | On-premise version features:
293 |
294 | * Total control over your end-to-end security systems and infrastructure
295 |
296 | * Deploy software within your on-premise data center or your own virtual private cloud instance
297 |
298 | * Meet legal and regulatory obligations that require all data and systems to reside on premise
299 |
300 | Cloud version features:
301 |
302 | * Software-as-a-service model lets you sign up and start right away
303 |
304 | * Elastic scalability as you grow
305 |
306 | * Controls and redundancy delivered by Azure with 99.9% uptime SLA
307 |
308 | User feedback:
309 |
310 | * We used to use that product. It was so easily bypassed and the rules only work for smart people. Lazy or dumb users can easily screw it up in a team area. Prices are negotiable when you talk to them.
311 |
312 | * You can run it using SQL Express and a Win 7 box.
313 |
314 | * Cheap.
315 |
--------------------------------------------------------------------------------
/examples/metrics-monitors-alerts/index.md:
--------------------------------------------------------------------------------
1 | # Metrics, monitors, alerts
2 |
3 | Contents:
4 |
5 | * [Summary](#summary)
6 | * [Issue](#issue)
7 | * [Decision](#decision)
8 | * [Status](#status)
9 | * [Details](#details)
10 | * [Assumptions](#assumptions)
11 | * [Constraints](#constraints)
12 | * [Positions](#positions)
13 | * [Argument](#argument)
14 | * [Implications](#implications)
15 | * [Related](#related)
16 | * [Related decisions](#related-decisions)
17 | * [Related requirements](#related-requirements)
18 | * [Related artifacts](#related-artifacts)
19 | * [Related principles](#related-principles)
20 | * [Notes](#notes)
21 | * [Freeform text messages vs structured event messages](#freeform-text-messages-vs-structured-event-messages)
22 | * [Graylog is easier](#graylog-is-easier)
23 | * [Prometheus take some tuning](#prometheus-take-some-tuning)
24 | * [AWS services are mixed](#aws-services-are-mixed)
25 | * [Kafka](#kafka)
26 | * [Loki](#loki)
27 | * [Prometheus + alertmanager + Rollbar + Graylog + Grafana](#prometheus-alertmanager-rollbar-graylog-grafana)
28 | * [Thanos](#thanos)
29 | * [Prometheus HA](#prometheus-ha)
30 | * [Datadog + PagerDuty + Threat Stack](#datadog-pagerduty-threat-stack)
31 | * [Zabbix](#zabbix)
32 | * [Outlyer](#outlyer)
33 | * [Nagios + Nagiosgraph](#nagios-nagiosgraph)
34 | * [Prometheus + Grafana + AlertManager](#prometheus-grafana-alertmanager)
35 | * [DataDog + Sentry + PagerDuty.](#datadog-sentry-pagerduty)
36 | * [Sensu + Graphite + ELK](#sensu-graphite-elk)
37 | * [Prometheus + Alertmanager](#prometheus-alertmanager)
38 | * [Sensu + Grafana + Graylog + Kibana + NewRelic.](#sensu-grafana-graylog-kibana-newrelic)
39 | * [Prometheus + Circonus](#prometheus-circonus)
40 | * [icinga2 + VictorOps + NewRelic + Sentry + Slack](#icinga2-victorops-newrelic-sentry-slack)
41 | * [AppDynamics + Papertrail + PagerDuty + Healthchecks.io + Stackdriver](#appdynamics-papertrail-pagerduty-healthchecks-io-stackdriver)
42 | * [icinga2 + elasticsearch](#icinga2-elasticsearch)
43 | * [DataDog + New Relic + ELK + EFK + Sentry + Alertmanager + VictorOps](#datadog-new-relic-elk-efk-sentry-alertmanager-victorops)
44 | * [Wavefront + Scalyr + PagerDuty + Stackstorm + Slack](#wavefront-scalyr-pagerduty-stackstorm-slack)
45 | * [Telegraf + Prometheus + InfluxDB + Grafana](#telegraf-prometheus-influxdb-grafana)
46 | * [Sematext + Logagent + Experience](#sematext-logagent-experience)
47 | * [Azure Monitor/Analytics + OpsGenie](#azure-monitor-analytics-opsgenie)
48 | * [Prometheus + Alertmanager + Grafana + Splunk + PagerDuty](#prometheus-alertmanager-grafana-splunk-pagerduty)
49 | * [Telegraf + Prometheus + Grafana + Alertmanager](#telegraf-prometheus-grafana-alertmanager)
50 | * [Prometheus + Grafana + Cloudwatch + sentry + kibana + elasticsearch](#prometheus-grafana-cloudwatch-sentry-kibana-elasticsearch)
51 | * [PagerDuty + Monitis](#pagerduty-monitis)
52 | * [Prometheus + Grafana + Bosun](#prometheus-grafana-bosun)
53 | * [Azure Monitor/Analytics/Insights/Dashboards](#azure-monitor-analytics-insights-dashboards)
54 | * [Grafana + Monitis + OpsGenie + Slack](#grafana-monitis-opsgenie-slack)
55 | * [Checkly + AppOptics + Cloudwatch + Heroku + Pagerduty + Papertrail](#checkly-appoptics-cloudwatch-heroku-pagerduty-papertrail)
56 | * [Instana + Logz.io + slack](#instana-logz-io-slack)
57 | * [SignalFX + Splunk + PagerDuty + Slack](#signalfx-splunk-pagerduty-slack)
58 | * [ELK + Prometheus + Grafana](#elk-prometheus-grafana)
59 | * [Datadog + Prometheus + Grafana](#datadog-prometheus-grafana)
60 | * [Nagios + ELK](#nagios-elk)
61 | * [Datadog vs. Site24x7 + StatusCake + PagerDuty + SumoLogic + Slack](#datadog-vs-site24x7-statuscake-pagerduty-sumologic-slack)
62 | * [Prometheus + AlertManager + Grafana + Stackdriver](#prometheus-alertmanager-grafana-stackdriver)
63 | * [Zabbix](#zabbix)
64 |
65 |
66 | ## Summary
67 |
68 |
69 | ### Issue
70 |
71 | We want to use metrics, monitors, and alerts, because we want to know how well our applications are working, and to know when there's a problem.
72 |
73 |
74 | ### Decision
75 |
76 | WIP.
77 |
78 |
79 | ### Status
80 |
81 | Gathering information. We are starting with the plausible ends of the spectrum: the most-recommended older free tool (Nagios) and the most-recommended newer paid tool (New Relic).
82 |
83 |
84 | ## Details
85 |
86 |
87 | ### Assumptions
88 |
89 | We want to create web apps that are modern, fast, reliable, responsive, etc.
90 |
91 | We want to buy rather than build.
92 |
93 |
94 | ### Constraints
95 |
96 | We want tooling that works well with our devops pipeline and with our deployment clouds.
97 |
98 |
99 | ### Positions
100 |
101 | We are researching positions now.
102 |
103 |
104 | * AlertManager
105 |
106 | * AppDynamics
107 |
108 | * AppOptics
109 |
110 | * Azure Monitor/Analytics/Insights/Dashboards
111 |
112 | * Bosun
113 |
114 | * Checkly
115 |
116 | * Circonus
117 |
118 | * Cloudwatch
119 |
120 | * EFK
121 |
122 | * ELK
123 |
124 | * Grafana
125 |
128 | * Graphite
129 |
130 | * Graylog
131 |
132 | * Healthchecks.io
133 |
134 | * Heroku
135 |
136 | * icinga2
137 |
138 | * InfluxDB
139 |
140 | * Instana
141 |
142 | * Kafka
143 |
144 | * Logagent
145 |
146 | * Logz.io
147 |
148 | * Loki
149 |
150 | * Monitis
151 |
152 | * Nagios
153 |
156 | * Nagiosgraph
157 |
158 | * NewRelic
159 |
160 | * OpsGenie
161 |
162 | * Outlyer
163 |
164 | * PagerDuty
165 |
170 | * Papertrail
171 |
172 | * Prometheus
173 |
174 | * Rollbar
175 |
176 | * Scalyr
177 |
178 | * Sematext Metrics, Logs, Experience, Tracing
179 |
180 | * Sensu
181 |
182 | * SignalFX
183 |
184 | * Slack
185 |
186 | * Splunk
187 |
188 | * Stackdriver
189 |
190 | * Stackstorm
191 |
192 | * Telegraf
193 |
196 | * Thanos
197 |
198 | * VictorOps
199 |
200 | * Wavefront
201 |
202 | * Zabbix
203 |
204 |
205 | ### Argument
206 |
207 | So far, Nagios and New Relic are the ends of the spectrum. Nagios is the oldest, simplest, free, viable tool. New Relic is the newest-featured, most-complete, paid, viable tool. We will start with evaluations of these. As needed, we will evaluate the tools between these two ends of the spectrum.
208 |
209 | So far, Zabbix has the best recommendations, and also offers the most complete capabilities.
210 |
211 | So far, ELK has the best open-source build-over-buy popularity.
212 |
213 | So far, Prometheus + Grafana have the best popularity.
214 |
215 |
216 | ### Implications
217 |
218 | TODO.
219 |
220 |
221 | ## Related
222 |
223 |
224 | ### Related decisions
225 |
226 | The choices will affect testability, telemetry, and likely other systems such as for customer service, site reliability engineering, etc.
227 |
228 |
229 | ### Related requirements
230 |
231 | TODO.
232 |
233 |
234 | ### Related artifacts
235 |
236 | TODO.
237 |
238 |
239 | ### Related principles
240 |
241 | Easily reversible.
242 |
243 | Need for speed.
244 |
245 |
246 | ## Notes
247 |
248 |
249 | A pretty good open source stack is:
250 |
251 | * Prometheus for metrics and alerting based on metrics
252 |
253 | * Grafana to display metrics
254 |
255 | * Elasticsearch/Logstash/Kibana (ELK) for logs and structured events
256 |
257 | * Pushover for mobile notifications
258 |
259 |
260 | ### Freeform text messages vs structured event messages
261 |
262 | Freeform text messages: for example, the kind of random stuff you would find in /var/log/messages, and something generated intentionally by the application. The messages are useful for identifying other things that are happening on the box like out of memory or hardware errors, but have a lot of junk.
263 |
264 | Structured event messages: generated by the application, with a fixed or dynamic set of attributes, e.g. an HTTP request log, an accounting log, a user login.
265 |
266 | Generally speaking, it's nice to log details about every request in a way that you can drill down based on attributes. So adding e.g. a userid or sessionid to everything lets you trace. Explicit tracing is also good, of course. Using ELK for this is kind of a poor man's https://www.honeycomb.io/
267 |
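As a minimal sketch of that idea, with illustrative field names: each structured event is rendered as a single JSON line, so a log pipeline (ELK, Honeycomb, etc.) can filter and drill down on any attribute, including a userid or sessionid for tracing:

```python
import json
import time

def format_event(event, **attrs):
    """Render one structured event as a single JSON line, so a log
    pipeline (ELK, Honeycomb, etc.) can drill down on any attribute."""
    record = {"ts": time.time(), "event": event, **attrs}
    return json.dumps(record, sort_keys=True)

# Attaching userid/sessionid to every event makes tracing possible later.
print(format_event("http_request", method="GET", path="/login",
                   status=200, userid="u42", sessionid="s-9001"))
```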
268 |
269 | ### Graylog is easier
270 |
271 | Graylog is easier to spin up from my experience.
272 |
273 |
274 |
275 | ### Prometheus take some tuning
276 |
277 |
278 | I am generally happy with Prometheus for metrics. The alerting takes some tuning, but is pretty good. It depends on your application. I think it's best to alert on end-user visible conditions, not underlying causes. For example, page load time is good, number of requests per second is not. Though zero requests per second indicates that something is wrong.
279 |
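That advice maps directly onto alerting rules. A sketch in Prometheus rule syntax, where the metric names and thresholds are assumptions rather than recommendations:

```yaml
groups:
  - name: user-facing
    rules:
      # Alert on what users experience: slow page loads...
      - alert: SlowPageLoads
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 10m
      # ...and total silence, which usually means something upstream is broken.
      - alert: NoTraffic
        expr: sum(rate(http_requests_total[5m])) == 0
        for: 10m
```
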
280 | The advantage of a service is that they offer additional intelligence out of the box. I generally like Datadog. The services can be frighteningly expensive if you have a lot of data, and sometimes have pricing models which are not cloud friendly, e.g. charging per instance, when instances are dynamic. There is also a difference between services where every request is from a paid user and ones that are advertising related, so only some small percentage of the requests make you money. You can end up with a lot of data and not that much budget.
281 |
282 | I work on some services that get 1B requests a day, so it makes sense to host our own monitoring and logging. If your volumes are lower, then hosted services are easier.
283 |
284 |
285 | ### AWS services are mixed
286 |
287 | My experience with AWS services has been mixed. Their Elasticsearch service has been flaky, so we run our own instances for that. CloudWatch metrics are expensive, so we generally only use them for "infrastructure" level metrics rather than the application, i.e. health-related metrics where AWS can know better what's going on than software running on the instance. CloudWatch Logs can be slow to update and don't have that much metadata. Running ELK helps with that. If I really want real time data, then using Kafka as the transport for logs is better. That's pretty well supported by Logstash. Managing a Kafka cluster is not for the faint hearted, though, there is a lot of exposed plumbing.
288 |
289 |
290 | ### Kafka
291 |
292 | Comment: Kafka can be super tricky at times, or Kafka can be rock solid and you almost forget about it being there tying everything together.
293 |
294 |
295 | Comment: Kafka has been solid, but it was a surprising amount of work getting it going. I think of it like a relational database but you are only working at the "physical" layer, e.g. tablespaces, files and partitions. There were some times early on where the management utilities were lacking, and we had to write programs to e.g. reset a consumer group. http://howfuckedismydatabase.com/nosql/
296 |
297 | Comment: We use Kafka as a "buffer" for log messages and a place where we can do real time stream processing on data that is coming from multiple servers. If we get a DDOS attack, then we need a way of analyzing data across multiple instances. If we are logging directly from the servers to ELK, the load can blow up the Elasticsearch cluster.
298 |
301 |
302 | Comment: Kafka does less work and is more efficient, so it can handle the load better. And we queue the Kafka work and retry. And having Kafka be overloaded doesn't affect users who are trying to do interactive work with Kibana, the way it would if Elasticsearch is struggling.
303 |
304 | Comment: Stream processing is mostly looking for abuse, e.g. too much traffic from a single IP across the whole cluster, and then sharing the block across the whole cluster.
305 |
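The per-IP threshold check described above can be sketched roughly like this (the class, window, and limit are hypothetical; in the real setup the events would be consumed from Kafka and the resulting block shared with every node):

```python
import time
from collections import defaultdict, deque

class AbuseDetector:
    """Counts requests per source IP over a sliding time window and
    flags IPs that exceed the threshold so a block can be shared."""

    def __init__(self, window=60, limit=1000):
        self.window = window  # seconds
        self.limit = limit    # max requests per IP within the window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, now=None):
        """Record one request; return True if this IP should now be blocked."""
        now = time.time() if now is None else now
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that fell out of the sliding window.
        while q and q[0] <= now - self.window:
            q.popleft()
        return len(q) > self.limit
```

In practice `record` would be called by a consumer reading the request-log topic, and a `True` result would publish the offending IP to a "blocks" topic that every node subscribes to.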
306 | Comment: The logstash-output-kafka plugin is quite unreliable at the moment, though. I've been bitten by several of the issues on its GitHub issues page that never seem to get fixed. I want to move away from using it, to sending directly from our apps to Kafka.
307 |
308 | Comment: We are now sending structured events directly from the app to Kafka. The main motivation was touching the log data fewer times and avoiding reading and writing the disk multiple times. In high volume systems, logging can take more work than the app itself. I am thinking about making journald send logs directly as well, from a C program.
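
As a sketch, a structured event like the one described above might be built as follows (the field names are made up; the actual send depends on whichever Kafka client library is in use):

```python
import json
import socket
import time

def make_event(level, message, **fields):
    """Build a structured log event as JSON bytes, ready to hand to a
    Kafka producer (e.g. producer.send(topic, value=event_bytes))."""
    event = {
        "ts": time.time(),           # event timestamp
        "host": socket.gethostname(),  # originating host
        "level": level,
        "message": message,
        **fields,                    # arbitrary structured context
    }
    return json.dumps(event).encode("utf-8")
```

Serializing once in the app and shipping straight to Kafka is what avoids the extra disk reads and writes the comment mentions.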
309 |
310 |
311 | ### Loki
312 |
313 | Keep a close eye on Loki. It’s not ready yet, but when it is I’d expect it to be a better fit in this stack. Loki is a log aggregator created by Grafana Labs; it uses a similar scraping and label syntax to Prometheus.
314 |
315 |
316 | ### Prometheus + alertmanager + Rollbar + Graylog + Grafana
317 |
318 | Prometheus + alertmanager for metrics. <3 Prometheus.
319 |
320 | Rollbar/Graylog for logging/error reporting (there's some overlap here; a small service probably doesn't need both).
321 |
322 | Currently, alerts just go to one of a few Slack channels that interested parties have notifications turned on for. If we were more serious about on-call they would go to PagerDuty/VictorOps/etc.
323 |
324 | Grafana for graphs and dashboarding. Also eagerly looking forward to seeing if their upcoming logging facilities will obviate Graylog.
325 |
326 |
327 | ### Thanos
328 |
329 | We use Thanos as a front-end for our HA setup. It knows how to de-duplicate HA pairs.
330 |
331 | We're currently keeping 6 months of local Prometheus data. This works reasonably well for us. But I'm just in the middle of rolling out bucket storage to our Thanos setup for long-term data storage. In theory, GCS storage will be about 30% cheaper than the GCE standard persistent disk we use right now.
332 |
333 | We don't back up Prometheus data right now. The data just isn't really important to us beyond having enough for alerting. Our overall fleet deployment changes so much year to year that historical data older than a few months just isn't that interesting. It might be interesting to have a few core stats year-over-year, so I may set up a core-stats set of recording rules and store those with federation, or just let Thanos take care of it.
334 |
335 | EDIT: Minor disclaimer, I'm a Prometheus developer.
336 |
337 |
338 | ### Prometheus HA
339 |
340 | HA in Prometheus is done by duplication: you run multiple pollers, and there are ways to poll multiple instances and deduplicate the data.
341 |
342 | Scaling is done by dividing the network and having different Prometheus instances poll different parts of it.
343 |
344 | Long-term storage isn't Prometheus's strong point, but it can be offloaded to something like InfluxDB or TimescaleDB (which technically ticks the HA checkbox as well). An article I read on it: https://blog.timescale.com/prometheus-ha-postgresql-8de68d19b6f5?gi=7df160f10e07
345 |
346 | I haven't tried the long-term stuff yet, as I'm still only experimenting and using Prometheus for short-term graphs while LibreNMS monitors my network for the long term.
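
For reference, offloading to a remote store is a small change in `prometheus.yml` (the endpoint URL below is a placeholder; the exact path depends on the backend you choose):

```yaml
# Hypothetical remote-write endpoint; InfluxDB, TimescaleDB (via an
# adapter), and others accept Prometheus remote_write traffic.
remote_write:
  - url: "http://remote-storage.example.com:9201/write"
```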
347 |
348 |
349 | ### Datadog + PagerDuty + Threat Stack
350 |
351 | We use Datadog (w/ PagerDuty) and Threat Stack and couldn't be happier. My only gripe with DD is the relatively high cost of metric storage.
352 |
353 |
354 | ### Zabbix
355 |
356 | Zabbix with custom scripts to monitor almost everything. Works like a charm.
357 |
358 |
359 | ### Outlyer
360 |
361 | I'm using Outlyer, but then I must disclose that I work there, and dogfooding is a must.
362 |
363 | We still need Graylog, Sentry, and StatusCake to round things out.
364 |
365 | Sounds biased, but having happily run Nagios and other monitoring systems internally, I would buy a hosted solution in any new gig and offload that pain.
366 |
367 |
368 | ### Nagios + Nagiosgraph
369 |
370 | We run Nagios for all monitoring and alerting. Alerts happen through e-mail (warnings and critical notifications) and audible app notifications (for critical alerts).
371 |
372 | Nagiosgraph is used for visualisations.
373 |
374 | This set-up has been very effective in keeping us comprehensively informed on what's happening in our environment. We run and monitor about 110 mission-critical servers and about 760 data points, and have had this monitoring system in place for over seven years.
375 |
376 | I would like to also aggregate logs with Graylog or ELK at some point.
377 |
378 |
379 | ### Prometheus + Grafana + AlertManager
380 |
381 | Prometheus + Grafana + AlertManager via the awesome Prometheus Operator Helm chart. Logs still go to the LogDNA Birch plan, as we noticed ELK is too heavyweight for our humble min-3/max-5-node cluster on GKE.
382 |
383 |
384 | ### DataDog + Sentry + PagerDuty
385 |
386 | I used to run all my own monitoring solutions using all kinds of software, including Nagios, Icinga, Zabbix, ELK, Graylog2, Influx, and many other tools, but the truth is that there's just too much effort involved in running your own monitoring infrastructure, especially when you can pay someone else such low rates to do it for you!
387 |
388 | Paying others to run the monitoring infra frees my clients up to focus on running their platforms instead of monitoring the monitoring, meaning that the value they gain from the stability of their platform far outweighs any costs of Monitoring as a Service.
389 |
390 |
391 | ### Sensu + Graphite + ELK
392 |
393 | My company is real big on self-hosted stuff.
394 |
395 | Sensu -> PagerDuty
396 |
397 | Graphite/Grafana
398 |
399 | ELK (Elasticsearch, Logstash, Kibana)
400 |
401 |
402 | ### Prometheus + Alertmanager
403 |
404 | Prometheus + Alertmanager for alerting; my team believes that simple monitoring is good monitoring.
405 |
406 | Other systems like logging and tracing will provide rich context for diagnosing when the on-call engineer receives an alert, but we never build alerting on these.
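
In that spirit, a minimal Prometheus alerting rule routed through Alertmanager might look like this (a generic illustration, not this team's actual config):

```yaml
# prometheus rule file: alert if any scraped target has been down 5 minutes.
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} has been down for 5 minutes"
```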
407 |
408 |
409 | ### Sensu + Grafana + Graylog + Kibana + NewRelic
410 |
411 | Sensu, Grafana, Graylog, Kibana, NewRelic.
412 |
413 |
414 | ### Prometheus + Circonus
415 |
416 | Prometheus instrumented services => Circonus analytics and visualization
417 |
418 |
419 | ### icinga2 + VictorOps + NewRelic + Sentry + Slack
420 |
421 | We are using the services below:
422 |
423 | icinga2 for monitoring and VictorOps for alerting
424 |
425 | NewRelic for detailed monitoring of the service
426 |
427 | Sentry for error tracking in the service
428 |
429 | Slack/email alerts, triggered from NewRelic or icinga2
430 |
431 |
432 | ### AppDynamics + Papertrail + PagerDuty + Healthchecks.io + Stackdriver
433 |
434 | AppDynamics
435 |
436 | Papertrail
437 |
438 | PagerDuty
439 |
440 | Healthchecks.io
441 |
442 | Stackdriver
443 |
444 |
445 | ### icinga2 + Elasticsearch
446 |
447 | icinga2 with Elasticsearch integration for analysis, and Graphite+Grafana integration for graphs.
448 |
449 | Thanks to the flexibility of apply rules in icinga2, the developers see only the services they receive notifications for.
450 |
451 | And through Icinga Director, programmers can easily define their own checks (which they do every few days: 100 checks go out, 100 other checks go in) at big scale with zero hassle.
452 |
453 |
454 | ### DataDog + New Relic + ELK + EFK + Sentry + Alertmanager + VictorOps
455 |
456 | What do we have now:
457 |
458 | DataDog for metrics
459 |
460 | New Relic for application monitoring
461 |
462 | ELK (Elasticsearch + Logstash + Kibana) for the logs
463 |
464 | Sentry (self-hosted) for logging exceptions
465 |
466 | Emails + Slack + VictorOps for alerting (based on severity)
467 |
468 | What do we want to have:
469 |
470 | Prometheus for metrics (Grafana for visualization)
471 |
472 | New Relic (probably Elastic APM) for application monitoring
473 |
474 | EFK (Elasticsearch + Fluentd + Kibana) for logging. Probably Loki by Grafana will be production-ready by the time we get there
475 |
476 | Sentry for the exceptions
477 |
478 | Alertmanager + email + VictorOps for the alerts
479 |
480 |
481 | ### Wavefront + Scalyr + PagerDuty + Stackstorm + Slack
482 |
483 | Wavefront + Scalyr + PagerDuty + Stackstorm + Slack (Disclaimer: works at VMware)
484 |
485 |
486 | ### Telegraf + Prometheus + InfluxDB + Grafana
487 |
488 | Telegraf for server metrics like CPU, Disk, Memory and Network. We also use Telegraf for SNMP monitoring of our network devices.
489 |
490 | Prometheus for application metrics. We code health checks into our application that Prometheus scrapes.
491 |
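The comment above doesn't show how those health checks are coded, but a stdlib-only sketch of a `/metrics` endpoint in the Prometheus text exposition format looks something like this (the metric names are made up; in practice you'd use the official `prometheus_client` library instead of hand-rolling the format):

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical application counter; a real service would track this properly.
REQUESTS_TOTAL = 0

def render_metrics():
    """Render app metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled.",
        "# TYPE app_requests_total counter",
        f"app_requests_total {REQUESTS_TOTAL}",
        "# HELP app_healthy Whether the app considers itself healthy.",
        "# TYPE app_healthy gauge",
        "app_healthy 1",
    ]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Serves /metrics for Prometheus to scrape."""

    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()
```

Prometheus is then pointed at the port this handler serves on via a scrape config entry.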
492 | InfluxDB for time-series storage. This is where our Telegraf data gets sent.
493 |
494 | Grafana for dashboards and alerts. The alerting engine isn't super robust, but it does the job. We also fire alerts into Slack.
495 |
496 | What I don't have right now is a centralized logging solution. ELK is powerful but difficult to set up and manage, and I don't know of any free alternatives that come close.
497 |
498 |
499 | ### Sematext + Logagent + Experience
500 |
501 | Sematext for metrics, for logs, for traces, soon for real user monitoring, too. Simpler/cheaper than using N different tools/services, IMHO.
502 |
503 | For log shipping we used to use rsyslog and then we switched to Logagent.
504 |
505 | For frontend crash reporting we use Sentry, but we'll switch to Experience shortly.
506 |
507 | Disclaimer: I'm a Sematextan.
508 |
509 |
510 | ### Azure Monitor/Analytics + OpsGenie
511 |
512 | I wish Log Analytics had a better interface. We’re moving away from Splunk, which was much easier to navigate.
513 |
514 |
515 | ### Prometheus + Alertmanager + Grafana + Splunk + PagerDuty
516 |
517 | Prometheus, Alertmanager, Grafana, Splunk, PagerDuty
518 |
519 | You really do not want to run your own notification system. You can replace Splunk with ELK unless your Security team prefers Splunk.
520 |
521 |
522 | ### Telegraf + Prometheus + Grafana + Alertmanager
523 |
524 | Telegraf as a collector, Prometheus + Alertmanager for monitoring and alerting, integrated with slack channels and pagerduty for critical alerts. Grafana for host metrics visualization.
525 |
526 |
527 | ### Prometheus + Grafana + CloudWatch + Sentry + Kibana + Elasticsearch
528 |
529 | Prometheus for metrics + alerts
530 |
531 | Grafana for Prometheus dashboards
532 |
533 | Cloudwatch monitoring Prometheus instances
534 |
535 | Sentry for exception tracking
536 |
537 | Kibana + Elasticsearch
538 |
539 | Graylog
540 |
541 | Prometheus Pushgateway for batch/cron jobs
542 |
543 | SOP https://github.com/rapidloop/sop to "push/forward" metrics from one Prometheus instance to another
544 |
545 | Clients use the Prometheus client libraries; we try to use opencensus.io on the client side
546 |
547 |
548 | ### PagerDuty + Monitis
549 |
550 | PagerDuty + Monitis. Also some bespoke Azure Functions for testing the health of some services.
551 |
552 | Looking to introduce Prometheus and Grafana this year
553 |
554 |
555 | ### Prometheus + Grafana + Bosun
556 |
557 | Prometheus for storing the time series data. Grafana for visualization. Bosun for alert management.
558 |
559 |
560 | ### Azure Monitor/Analytics/Insights/Dashboards
561 |
562 | We're an Azure-only shop: Azure Monitor, Log Analytics, App Insights, Azure Dashboards + PagerDuty
563 |
564 |
565 | ### Grafana + Monitis + OpsGenie + Slack
566 |
567 | Grafana for monitoring container services in Kubernetes via Prometheus
568 |
569 | Monitis for end-to-end service monitoring mainly for web APIs and web applications
570 |
571 | OpsGenie for alert management
572 |
573 | Slack for getting status information from our systems
574 |
575 |
576 | ### Checkly + AppOptics + Cloudwatch + Heroku + Pagerduty + Papertrail
577 |
578 | Long-time (dev)ops engineer here. Grew up on Nagios. Would love opinions on my bootstrapped SaaS https://checklyhq.com. We do API monitoring & site transaction monitoring with pretty in-depth alerting.
579 |
580 | I started Checkly because active/synthetic monitoring in the API space was a bit limited (and pricey). Browser-based/scripted monitoring is even more proprietary and expensive. We use Puppeteer and keep pricing as low as possible.
581 |
582 | Our monitoring stack:
583 |
584 | Checkly (dog fooding...)
585 |
586 | AppOptics (custom graphing)
587 |
588 | AWS Cloudwatch & SNS for SMS messages.
589 |
590 | built-in Heroku alerting.
591 |
592 | Pagerduty
593 |
594 | Papertrail
595 |
596 |
597 | ### Instana + Logz.io + slack
598 |
599 | Instana alerts us in Slack about infrastructure problems or performance degradation, and we’ve configured logz.io to alert in Slack on a certain volume of error-level logs from the application layer.
600 |
601 |
602 | ### SignalFX + Splunk + PagerDuty + Slack
603 |
604 | Currently using: SignalFX, Splunk, PagerDuty and Slack. I'm not a huge fan of SignalFX although their support team is super friendly and responsive. I like Splunk (worth it if you can pay for it), PagerDuty and Slack.
605 |
606 | I used to use the TICK stack, where most of the C was really a G (that is, Grafana), although I did use Chronograf a little bit. That was awesome, but it was a pain to manage. The classic SaaS vs. self-hosting dilemma.
607 |
608 | I have used DataDog, New Relic, Graylog, ELK, and BugSnag. I like DataDog and New Relic a lot, and Graylog is pretty good. I'm not a big ELK fan, though. BugSnag is nice; I actually feel like tracking errors/exceptions is a pretty good stand-in for full log monitoring in many cases.
609 |
610 |
611 | ### ELK + Prometheus + Grafana
612 |
613 | Just like others, we use ELK for logs and Prometheus+Grafana for everything else.
614 |
615 | Maintaining this setup is easy if you give yourself permission to occasionally lose data. For example, if our Elasticsearch database gets into a funk (which happens every 2-3 months for us, unfortunately) we don't bother with HA and instead dump the data and get on with our lives. If you absolutely must have HA or long-term retention, good luck.
616 |
617 |
618 | ### Datadog + Prometheus + Grafana
619 |
620 | I set up Datadog on a month-to-month plan because when I got here, there was no monitoring and no alerting. Only a couple of our sites were being monitored every 5m for uptime. Datadog is hands down the easiest to set up. When I finish addressing all the other issues, I will change to Prometheus+Grafana. Not 100% decided on log management yet.
621 |
622 | ### Nagios + ELK
623 |
624 | We have over 100 products that we support.
625 |
626 | For on-prem, it's mostly Nagios and ELK. For the cloud, we're migrating from DataDog to NewRelic.
627 |
628 |
629 | ### Datadog vs. Site24x7 + StatusCake + PagerDuty + SumoLogic + Slack
630 |
631 | We used to use Datadog but found it to be way too expensive for our needs. Don't get me wrong, it's amazing, but it does have a huge cost. We were able to set up site24x7.com with an annual subscription for about 2-3 months of the cost from DD.
632 |
633 | Our monitoring stack:
634 |
635 | Site24x7 - APM, external URL monitoring, SMTP mail-flow monitoring, SSL expirations, and process monitoring.
636 |
637 | StatusCake - for URL monitoring and confirmation. It's our backup in case Site24x7 misses something (it does not), but StatusCake is more flexible for external port and service monitoring for our needs.
638 |
639 | Both tools escalate to PagerDuty, and then we get our escalations in Slack.
640 |
641 | SumoLogic - for log monitoring (it’s a great tool but a little complicated for our needs)
642 |
643 | From Slack we can ack or remedy the alert.
644 |
645 | We then have a lot of Site24x7 automations that connect to commando.io for "BedOps", as we call it: an alert gets triggered, and we kick off a few scripts or automations as an attempt to remedy the situation (99% of the time the automation + our scripts keep us out of trouble).
646 |
647 | We have internal runbooks in our KB for when the automations fail or if there is something out of scope that needs to be fixed.
648 |
649 |
650 | ### Prometheus + AlertManager + Grafana + Stackdriver
651 |
652 | Prometheus (Operator) / AlertManager / Grafana for metrics in our GKE clusters and VMs.
653 |
654 | Google Stackdriver for Logs (as it’s included and active by default and is currently enough for our needs).
655 |
656 |
657 | ### Zabbix
658 |
659 | Zabbix for everything. No additional software needed.
660 |
661 |
662 |
--------------------------------------------------------------------------------