├── .circleci └── config.yml ├── .gitignore ├── 0000-template.md ├── LICENSE ├── README.md └── text ├── 0001-rfc_process.md ├── 0002-qri_dataset_definition.md ├── 0003-content_addressed_file_system.md ├── 0004-structured_data_io.md ├── 0006-dataset_naming.md ├── 0011-html_viz.md ├── 0012-CLI-commands.md ├── 0013-api.md ├── 0014-export.md ├── 0015-add-geo-processing-abilities-to-skylark.md ├── 0016-revise_transform_processing.md ├── 0017-define_dataset_creation.md ├── 0018-publish-update.md ├── 0019-manifests.md ├── 0020-distingush_manual_vs_scripted_transforms.md ├── 0021-export_behavior.md ├── 0022-remotes.md ├── 0023-starlark_load_dataset.md ├── 0024-scheduled-updates.md ├── 0025-filesystem-integration.md ├── 0026-starlark_expose.md ├── 0027-assets ├── dataset_01.png ├── dataset_02_patch_application.png ├── dataset_03_xform_diff_shape.png ├── dataset_04_xform_same_shape.png ├── dataset_05_conflict.png └── dataset_06_working_dir.png ├── 0027-transform_application.md ├── 0028-externalize_private_keys.md ├── 0029-config_revision.md ├── 0030-replace_publish_clone_with_push_pull.md ├── 0031-expanded_remove.md ├── 0032-access_command.md └── 0033-storage_command.md /.circleci/config.yml: -------------------------------------------------------------------------------- 1 | version: '2' 2 | jobs: 3 | build: 4 | working_directory: "~/rfcs" 5 | docker: 6 | - image: circleci/golang:1.11 7 | steps: 8 | - checkout 9 | - run: 10 | name: install misspell 11 | command: curl -L -o ./install-misspell.sh https://git.io/misspell && sh ./install-misspell.sh 12 | - run: 13 | name: Check Spelling 14 | command: ./bin/misspell -error ~/rfcs/text/ 15 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | book/ 3 | src/ 4 | .DS_Store 5 | -------------------------------------------------------------------------------- /0000-template.md: 
-------------------------------------------------------------------------------- 1 | - Feature Name: 2 | - Start Date: 3 | - RFC PR: 4 | - Issue: 5 | 6 | # Summary 7 | [summary]: #summary 8 | 9 | 10 | 11 | # Motivation 12 | [motivation]: #motivation 13 | 14 | 15 | 16 | # Guide-level explanation 17 | [guide-level-explanation]: #guide-level-explanation 18 | 19 | 28 | 29 | # Reference-level explanation 30 | [reference-level-explanation]: #reference-level-explanation 31 | 32 | 39 | 40 | # Drawbacks 41 | [drawbacks]: #drawbacks 42 | 43 | Why should we *not* do this? 44 | 45 | # Rationale and alternatives 46 | [rationale-and-alternatives]: #rationale-and-alternatives 47 | 48 | 51 | 52 | # Prior art 53 | [prior-art]: #prior-art 54 | 55 | 68 | 69 | # Unresolved questions 70 | [unresolved-questions]: #unresolved-questions 71 | 72 | 75 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2018 QRI, Inc. 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Qri RFCs 2 | 3 | [Qri RFCs]: #qri-rfcs 4 | 5 | Many changes can be implemented and reviewed via the normal GitHub pull request 6 | workflow. Some changes, though, are "substantial", and we ask that these be put 7 | through a bit of a design process and produce a consensus among the qri 8 | community and core team. 9 | 10 | The "RFC" (request for comments) process is intended to provide a consistent 11 | and controlled path for charting the roadmap of qri. We've seen a number of 12 | projects in the distributed space suffer from under-considered design choices 13 | and unclear roadmapping. We're hoping strong adherence to a lightweight RFC 14 | process can help mitigate these problems. You should be able to get a sense of 15 | where qri is going by reading through the accepted proposals. 16 | 17 | We openly acknowledge this may seem premature for such an early-stage project. 18 | We're intending to put this RFC process in place now to develop a design-driven 19 | culture so that others have a clear path to contribute to the future of 20 | the project. 21 | 22 | This process is _deeply_ inspired by the [rust language RFC process](https://github.com/rust-lang/rfcs), 23 | which builds on the [Python Enhancement Proposals process](https://www.python.org/dev/peps/); 24 | a big thank-you to these projects for leading the way.
25 | 26 | 27 | ## Table of Contents 28 | [Table of Contents]: #table-of-contents 29 | 30 | - [Opening](#qri-rfcs) 31 | - [Table of Contents] 32 | - [When you need to follow this process] 33 | - [Before creating an RFC] 34 | - [What the process is] 35 | - [The RFC life-cycle] 36 | - [Reviewing RFCs] 37 | - [Implementing an RFC] 38 | - [RFC Postponement] 39 | - [Help this is all too informal!] 40 | - [License] 41 | 42 | 43 | ## When you need to follow this process 44 | [When you need to follow this process]: #when-you-need-to-follow-this-process 45 | 46 | Most qri repositories follow [angular commit conventions](https://github.com/angular/angular.js/blob/master/DEVELOPERS.md#type) 47 | which designate 8 _types_ of change: 48 | 49 | - **feat:** A new feature 50 | - **fix:** A bug fix 51 | - **docs:** Documentation only changes 52 | - **style:** Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc) 53 | - **refactor:** A code change that neither fixes a bug nor adds a feature 54 | - **perf:** A code change that improves performance 55 | - **test:** Adding missing or correcting existing tests 56 | - **chore:** Changes to the build process or auxiliary tools and libraries such as documentation generation 57 | 58 | We only need RFCs for `feat` and breaking `refactor` changes to any of our novel 59 | projects, which are the [qri](https://github.com/qri-io/qri) and [dataset](https://github.com/qri-io/dataset) 60 | repositories. `feat` and `refactor` changes to the [frontend](https://github.com/qri-io/frontend) 61 | will also require heavy review, but don't require an RFC. 62 | 63 | Most of the RFCs you'll see will come from the core team, but 64 | that doesn't mean you can't chime in. Constructive one-off comments 65 | are more than welcome! 66 | 67 | 68 | ## Before creating an RFC 69 | [Before creating an RFC]: #before-creating-an-rfc 70 | 71 | A hastily-proposed RFC can hurt its chances of acceptance.
Low-quality 72 | proposals, proposals for previously-rejected features, or those that don't fit 73 | into the near-term roadmap, may be quickly rejected, which can be demotivating 74 | for the unprepared contributor. Laying some groundwork ahead of the RFC can 75 | make the process smoother. 76 | 77 | Although there is no single way to prepare for submitting an RFC, it is 78 | generally a good idea to pursue feedback from other project developers 79 | beforehand, to ascertain that the RFC may be desirable; having a consistent 80 | impact on the project requires concerted effort toward consensus-building. 81 | 82 | The best place to start is by filing an issue on this repository. The core team 83 | monitors this repo and will provide feedback on your issue, including whether 84 | it would make for a good RFC. 85 | 86 | 87 | ## What the process is 88 | [What the process is]: #what-the-process-is 89 | 90 | In short, to get a major feature added to Qri, one must first get the RFC 91 | merged into the RFC repository as a markdown file. At that point the RFC is 92 | "active" and may be implemented with the goal of eventual inclusion into Qri. 93 | 94 | - Fork the [RFC repository] 95 | - Copy `0000-template.md` to `text/0000-my-feature.md` (where "my-feature" is 96 | descriptive; don't assign an RFC number yet). 97 | - Fill in the RFC. Put care into the details: RFCs that do not present 98 | convincing motivation, demonstrate understanding of the impact of the 99 | design, or are disingenuous about the drawbacks or alternatives tend to be 100 | poorly-received. 101 | - Submit a pull request. As a pull request the RFC will receive design 102 | feedback from the larger community, and the author should be prepared to 103 | revise it in response. 104 | - Each pull request will be reviewed by the core team and assigned to a team 105 | member, who will take responsibility for guiding the RFC. 106 | - Build consensus and integrate feedback.
RFCs that have broad support are 107 | much more likely to make progress than those that don't receive any 108 | comments. Feel free to reach out to the RFC assignee in particular to get 109 | help identifying stakeholders and obstacles. 110 | - The core team will discuss the RFC pull request, as much as possible in the 111 | comment thread of the pull request itself. Offline discussion will be 112 | summarized on the pull request comment thread. 113 | - RFCs rarely go through this process unchanged, especially as alternatives 114 | and drawbacks are shown. You can make edits, big and small, to the RFC to 115 | clarify or change the design, but make changes as new commits to the pull 116 | request, and leave a comment on the pull request explaining your changes. 117 | Specifically, do not squash or rebase commits after they are visible on the 118 | pull request. 119 | - If the proposal is submitted by a core team member, it can be merged 120 | by at least 2 other core team members approving the RFC, otherwise a member 121 | of the core team will propose a "motion for final comment period" (FCP), 122 | along with a *disposition* for the RFC (merge, close, or postpone). 123 | - This step is taken when enough of the tradeoffs have been discussed that 124 | the core team is in a position to make a decision. That does not require 125 | consensus amongst all participants in the RFC thread (which is usually 126 | impossible). However, the argument supporting the disposition on the RFC 127 | needs to have already been clearly articulated, and there should not be a 128 | strong consensus *against* that position outside of the core team. Team 129 | members use their best judgment in taking this step, and the FCP itself 130 | ensures there is ample time and notification for stakeholders to push back 131 | if it is made prematurely. 
132 | - For RFCs with lengthy discussion, the motion to FCP is usually preceded by 133 | a *summary comment* trying to lay out the current state of the discussion 134 | and major tradeoffs/points of disagreement. 135 | - Before actually entering FCP, *all* members of the core team must sign off; 136 | this is often the point at which many core team members first review the RFC 137 | in full depth. 138 | - The FCP lasts ten calendar days, so that it is open for at least 5 business 139 | days. It is also advertised widely, 140 | e.g. in [This Week in Qri](https://this-week-in-rust.org/). This way all 141 | stakeholders have a chance to lodge any final objections before a decision 142 | is reached. 143 | - In most cases, the FCP period is quiet, and the RFC is either merged or 144 | closed. However, sometimes substantial new arguments or ideas are raised, 145 | the FCP is canceled, and the RFC goes back into development mode. 146 | 147 | ## The RFC life-cycle 148 | [The RFC life-cycle]: #the-rfc-life-cycle 149 | 150 | Once an RFC becomes "active" then authors may implement it and submit the 151 | feature as a pull request to the Qri repo. Being "active" is not a rubber 152 | stamp, and in particular still does not mean the feature will ultimately be 153 | merged; it does mean that in principle all the major stakeholders have agreed 154 | to the feature and are amenable to merging it. 155 | 156 | Furthermore, the fact that a given RFC has been accepted and is "active" 157 | implies nothing about what priority is assigned to its implementation, nor does 158 | it imply anything about whether a Qri developer has been assigned the task of 159 | implementing the feature. While it is not *necessary* that the author of the 160 | RFC also write the implementation, it is by far the most effective way to see 161 | an RFC through to completion: authors should not expect that other project 162 | developers will take on responsibility for implementing their accepted feature. 
163 | 164 | Modifications to "active" RFCs can be done in follow-up pull requests. We 165 | strive to write each RFC in a manner that reflects the final design of 166 | the feature; but the nature of the process means that we cannot expect every 167 | merged RFC to actually reflect what the end result will be at the time of the 168 | next major release. 169 | 170 | In general, once accepted, RFCs should not be substantially changed. Only very 171 | minor changes should be submitted as amendments. More substantial changes 172 | should be new RFCs, with a note added to the original RFC. Exactly what counts 173 | as a "very minor change" is up to the core team to decide. 174 | 175 | 176 | 177 | ## Reviewing RFCs 178 | [Reviewing RFCs]: #reviewing-rfcs 179 | 180 | While the RFC pull request is up, the core team may schedule meetings with the 181 | author and/or relevant stakeholders to discuss the issues in greater detail, 182 | and in some cases the topic may be discussed at a core team meeting. In either 183 | case a summary from the meeting will be posted back to the RFC pull request. 184 | 185 | The core team makes final decisions about RFCs after the benefits and drawbacks 186 | are well understood. These decisions can be made at any time, but the core team 187 | will regularly issue decisions. When a decision is made, the RFC pull request 188 | will either be merged or closed. In either case, if the reasoning is not clear 189 | from the discussion in thread, the core team will add a comment describing the 190 | rationale for the decision. 191 | 192 | 193 | ## Implementing an RFC 194 | [Implementing an RFC]: #implementing-an-rfc 195 | 196 | Some accepted RFCs represent vital features that need to be implemented right 197 | away. Other accepted RFCs can represent features that can wait until some 198 | arbitrary developer feels like doing the work.
Every accepted RFC has an 199 | associated issue tracking its implementation in the Qri repository; thus that 200 | associated issue can be assigned a priority via the triage process that the 201 | team uses for all issues in the Qri repository. 202 | 203 | The author of an RFC is not obligated to implement it. Of course, the RFC 204 | author (like any other developer) is welcome to post an implementation for 205 | review after the RFC has been accepted. 206 | 207 | If you are interested in working on the implementation for an "active" RFC, but 208 | cannot determine if someone else is already working on it, feel free to ask 209 | (e.g. by leaving a comment on the associated issue). 210 | 211 | 212 | ## RFC Postponement 213 | [RFC Postponement]: #rfc-postponement 214 | 215 | Some RFC pull requests are tagged with the "postponed" label when they are 216 | closed (as part of the rejection process). An RFC closed with "postponed" is 217 | marked as such because we want neither to think about evaluating the proposal 218 | nor about implementing the described feature until some time in the future, and 219 | we believe that we can afford to wait until then to do so. Historically, 220 | "postponed" was used to postpone features until after 1.0. Postponed pull 221 | requests may be re-opened when the time is right. We don't have any formal 222 | process for that; ask members of the core team. 223 | 224 | Usually an RFC pull request marked as "postponed" has already passed an 225 | informal first round of evaluation, namely the round of "do we think we would 226 | ever possibly consider making this change, as outlined in the RFC pull request, 227 | or some semi-obvious variation of it." (When the answer to the latter question 228 | is "no", then the appropriate response is to close the RFC, not postpone it.) 229 | 230 | 231 | ### Help this is all too informal!
232 | [Help this is all too informal!]: #help-this-is-all-too-informal 233 | 234 | The process is intended to be as lightweight as reasonable for the present 235 | circumstances. As usual, we are trying to let the process be driven by 236 | consensus and community norms, not impose more structure than necessary. 237 | 238 | [RFC repository]: http://github.com/qri-io/rfcs 239 | 240 | ## License 241 | [License]: #license 242 | 243 | This repository is licensed under the MIT license ([LICENSE](LICENSE) or http://opensource.org/licenses/MIT) 244 | 245 | ### Contributions 246 | 247 | Unless you explicitly state otherwise, any contribution intentionally submitted 248 | for inclusion in the work by you shall be MIT licensed, 249 | without any additional terms or conditions. 250 | -------------------------------------------------------------------------------- /text/0001-rfc_process.md: -------------------------------------------------------------------------------- 1 | - Feature Name: rfc_process 2 | - Start Date: 2018-08-13 3 | - RFC PR: #1 4 | - Issue: N/A 5 | 6 | # Summary 7 | [summary]: #summary 8 | 9 | The "RFC" (request for comments) process is intended to provide a consistent 10 | and controlled path for charting the roadmap of qri. We've seen a number of 11 | projects in the distributed space suffer from under-considered design choices 12 | and unclear roadmapping. We're hoping strong adherence to a lightweight RFC 13 | process can help mitigate these problems. Anyone should be able to get a sense of 14 | _where qri is going_ by reading through the accepted proposals. 15 | 16 | # Motivation 17 | [motivation]: #motivation 18 | 19 | I openly acknowledge this may seem premature for such an early-stage project.
20 | I'm intending to put this RFC process in place now to develop a design-driven 21 | culture so that others have a clear path to contribute to the future of the 22 | project, and replace myself as the arbiter of "what qri is" with a codified 23 | principle of [rough consensus and working code](https://www.ietf.org/about/participate/tao/). 24 | 25 | One of the main concerns I have is that ambiguity exists around what is _expected 26 | behaviour_ when interacting with qri. To me, this is a _major_ problem. 27 | Not knowing precisely what working-as-expected looks like cuts down on what core 28 | team members can confidently reason about, which means the core can't 29 | confidently tell _others_ how qri should work, which means the whole project is 30 | both ambiguous and without a clear path for resolving that ambiguity. 31 | 32 | To me this ambiguity is a sign that we (the core team) haven't taken enough 33 | time to reach design-level consensus about how qri _should_ behave, and it's why I 34 | think it's not too early to start this process. 35 | 36 | There are a few key motivations for implementing this now: 37 | - develop a culture of rough consensus 38 | - inject more design-time thinking into the core team's process 39 | - maintain an accurate, reliable roadmap 40 | - create an on-ramp for others to contribute to the design of qri 41 | 42 | # Guide-level explanation 43 | [guide-level-explanation]: #guide-level-explanation 44 | 45 | For the core team, adopting this RFC process will feel like three things at once: 46 | - a source of truth for _how qri should behave_ 47 | - a new channel to directly impact the design of qri 48 | - lots of additional work to get new stuff into qri 49 | 50 | The first batch of RFCs should ideally come from me (@b5), outlining in detail 51 | how the various aspects of qri _should_ work (which, as we know, is different 52 | from how things do/not work).
The process of debating these initial RFCs should 53 | serve as a chance to work together to establish language, clarity & detail 54 | around expected behaviours. Accepting these initial RFCs provides a new, 55 | consensus-driven foundation for the project that should give core team members 56 | a confident, consistent source of truth to work from. 57 | By requiring sign-off from the core team, this should be a chance for the 58 | core team to achieve consensus on how things should work. 59 | 60 | Upon accepting this proposal, we'll also move all of the issues filed in 61 | [qri-io/notes](https://github.com/qri-io/notes) & deprecate the repo. If we 62 | accept this RFC process, we should commit to it fully. Notes should now 63 | start as issues on this repo with the intention that they could develop into 64 | formal RFCs. 65 | 66 | At the same time, this should also be a chance for others to start documenting 67 | ideas we've been kicking around for enhancing qri. Things we've discussed in the 68 | past like skylark modules, simulation testing, new file format support, 69 | etc. should be collected as issues on this repo so we can start working them 70 | into RFCs. 71 | 72 | Finally, this process should not get in the way. If done properly, day-to-day 73 | development should _accelerate_ once we've accepted enough RFCs to get 74 | proposals ahead of current development. 75 | It should be easier to implement a feature with confidence because implementing 76 | new things should just be coding up an already-approved design document. 77 | I want to save RFCs for design-level features, not day-to-day fixes or 78 | implementation details. It should still be perfectly acceptable to make changes 79 | on interfaces between packages, move packages around, and riff on ideas outside 80 | of the formal RFC process. RFCs should be saved for decisions that affect how we 81 | expect the user-facing edges of qri to behave.
82 | 83 | # Reference-level explanation 84 | [reference-level-explanation]: #reference-level-explanation 85 | 86 | _Most of this is copied from the Rust RFC process, except I've added a section 87 | intended to keep us moving quickly: if a core team member creates an RFC, and 88 | two other core team members approve it, it's automatically merged. We should 89 | remove this provision once qri reaches a 1.0 release._ 90 | 91 | In short, to get a major feature added to Qri, one must first get the RFC 92 | merged into the RFC repository as a markdown file. At that point the RFC is 93 | "active" and may be implemented with the goal of eventual inclusion into Qri. 94 | 95 | - Fork the [RFC repository] 96 | - Copy `0000-template.md` to `text/0000-my-feature.md` (where "my-feature" is 97 | descriptive; don't assign an RFC number yet). 98 | - Fill in the RFC. Put care into the details: RFCs that do not present 99 | convincing motivation, demonstrate understanding of the impact of the 100 | design, or are disingenuous about the drawbacks or alternatives tend to be 101 | poorly-received. 102 | - Submit a pull request. As a pull request the RFC will receive design 103 | feedback from the larger community, and the author should be prepared to 104 | revise it in response. 105 | - Each pull request will be reviewed by the core team and assigned to a team 106 | member, who will take responsibility for guiding the RFC. 107 | - Build consensus and integrate feedback. RFCs that have broad support are 108 | much more likely to make progress than those that don't receive any 109 | comments. Feel free to reach out to the RFC assignee in particular to get 110 | help identifying stakeholders and obstacles. 111 | - The core team will discuss the RFC pull request, as much as possible in the 112 | comment thread of the pull request itself. Offline discussion will be 113 | summarized on the pull request comment thread.
114 | - RFCs rarely go through this process unchanged, especially as alternatives 115 | and drawbacks are shown. You can make edits, big and small, to the RFC to 116 | clarify or change the design, but make changes as new commits to the pull 117 | request, and leave a comment on the pull request explaining your changes. 118 | Specifically, do not squash or rebase commits after they are visible on the 119 | pull request. 120 | - If the proposal is submitted by a core team member, it can be merged 121 | by at least 2 other core team members approving the RFC, otherwise a member 122 | of the core team will propose a "motion for final comment period" (FCP), 123 | along with a *disposition* for the RFC (merge, close, or postpone). 124 | - This step is taken when enough of the tradeoffs have been discussed that 125 | the core team is in a position to make a decision. That does not require 126 | consensus amongst all participants in the RFC thread (which is usually 127 | impossible). However, the argument supporting the disposition on the RFC 128 | needs to have already been clearly articulated, and there should not be a 129 | strong consensus *against* that position outside of the core team. Team 130 | members use their best judgment in taking this step, and the FCP itself 131 | ensures there is ample time and notification for stakeholders to push back 132 | if it is made prematurely. 133 | - For RFCs with lengthy discussion, the motion to FCP is usually preceded by 134 | a *summary comment* trying to lay out the current state of the discussion 135 | and major tradeoffs/points of disagreement. 136 | - Before actually entering FCP, *all* members of the core team must sign off; 137 | this is often the point at which many core team members first review the RFC 138 | in full depth. 139 | - The FCP lasts ten calendar days, so that it is open for at least 5 business 140 | days. 
This way all stakeholders have a chance to lodge any final objections 141 | before a decision is reached. 142 | - In most cases, the FCP period is quiet, and the RFC is either merged or 143 | closed. However, sometimes substantial new arguments or ideas are raised, 144 | the FCP is cancelled, and the RFC goes back into development mode. 145 | 146 | ### The RFC life-cycle 147 | [The RFC life-cycle]: #the-rfc-life-cycle 148 | 149 | Once an RFC becomes "active" then authors may implement it and submit the 150 | feature as a pull request to the Qri repo. Being "active" is not a rubber 151 | stamp, and in particular still does not mean the feature will ultimately be 152 | merged; it does mean that in principle all the major stakeholders have agreed 153 | to the feature and are amenable to merging it. 154 | 155 | Furthermore, the fact that a given RFC has been accepted and is "active" 156 | implies nothing about what priority is assigned to its implementation, nor does 157 | it imply anything about whether a Qri developer has been assigned the task of 158 | implementing the feature. While it is not *necessary* that the author of the 159 | RFC also write the implementation, it is by far the most effective way to see 160 | an RFC through to completion: authors should not expect that other project 161 | developers will take on responsibility for implementing their accepted feature. 162 | 163 | Modifications to "active" RFCs can be done in follow-up pull requests. We 164 | strive to write each RFC in a manner that reflects the final design of 165 | the feature; but the nature of the process means that we cannot expect every 166 | merged RFC to actually reflect what the end result will be at the time of the 167 | next major release. 168 | 169 | In general, once accepted, RFCs should not be substantially changed. Only very 170 | minor changes should be submitted as amendments. More substantial changes 171 | should be new RFCs, with a note added to the original RFC.
Exactly what counts 172 | as a "very minor change" is up to the core team to decide. 173 | 174 | 175 | ### Reviewing RFCs 176 | [Reviewing RFCs]: #reviewing-rfcs 177 | 178 | While the RFC pull request is up, the core team may schedule meetings with the 179 | author and/or relevant stakeholders to discuss the issues in greater detail, 180 | and in some cases the topic may be discussed at a core team meeting. In either 181 | case a summary from the meeting will be posted back to the RFC pull request. 182 | 183 | The core team makes final decisions about RFCs after the benefits and drawbacks 184 | are well understood. These decisions can be made at any time, but the core team 185 | will regularly issue decisions. When a decision is made, the RFC pull request 186 | will either be merged or closed. In either case, if the reasoning is not clear 187 | from the discussion in thread, the core team will add a comment describing the 188 | rationale for the decision. 189 | 190 | 191 | ### Implementing an RFC 192 | [Implementing an RFC]: #implementing-an-rfc 193 | 194 | Some accepted RFCs represent vital features that need to be implemented right 195 | away. Other accepted RFCs can represent features that can wait until some 196 | arbitrary developer feels like doing the work. Every accepted RFC has an 197 | associated issue tracking its implementation in the Qri repository; thus that 198 | associated issue can be assigned a priority via the triage process that the 199 | team uses for all issues in the Qri repository. 200 | 201 | The author of an RFC is not obligated to implement it. Of course, the RFC 202 | author (like any other developer) is welcome to post an implementation for 203 | review after the RFC has been accepted. 204 | 205 | If you are interested in working on the implementation for an "active" RFC, but 206 | cannot determine if someone else is already working on it, feel free to ask 207 | (e.g.
by leaving a comment on the associated issue). 208 | 209 | 210 | # Drawbacks 211 | [drawbacks]: #drawbacks 212 | 213 | This is going to slow us down and consume precious time. 214 | 215 | # Rationale and alternatives 216 | [rationale-and-alternatives]: #rationale-and-alternatives 217 | 218 | On a philosophical level, I'm tired of shitty software that doesn't work. I want 219 | to raise our standards, aspiring toward principles of 220 | ["Zero Defects Programming"](https://en.wikipedia.org/wiki/Zero_Defects): 221 | 222 | > Instill in workers the will to prevent problems during design and manufacture 223 | rather than go back and fix them later 224 | 225 | The only way to prevent problems during the design phase is to **require a 226 | design phase for things that matter**. In the context of Open Source Software, 227 | the RFC process is the best I've seen for having a strong design that yields 228 | software others can depend on. 229 | 230 | # Prior art 231 | [prior-art]: #prior-art 232 | 233 | - [Rust RFC Process](https://github.com/rust-lang/rfcs) 234 | - [IETF Standards Process](https://www.ietf.org/standards/process/) 235 | - [Python Enhancement Proposals](https://www.python.org/dev/peps/) 236 | 237 | # Unresolved questions 238 | [unresolved-questions]: #unresolved-questions 239 | 240 | - What structures need to be in place to align RFCs with the mission of the 241 | project? 242 | - Do we need to somehow distill approved RFCs into a single roadmap document? 
243 | -------------------------------------------------------------------------------- /text/0003-content_addressed_file_system.md: -------------------------------------------------------------------------------- 1 | - Feature Name: content_addressed_file_system 2 | - Start Date: 2017-08-03 3 | - RFC PR: [#3](https://github.com/qri-io/rfcs/pull/3) 4 | - Repo: https://github.com/qri-io/cafs 5 | 6 | _Note: This RFC was created as part of an initial sprint to adopt the RFC 7 | process itself, as such sections of this document are less complete than 8 | we'd hope, or less complete than we'd expect from a new RFC. 9 | -:heart: the qri core team_ 10 | 11 | # Summary 12 | [summary]: #summary 13 | 14 | Content-Addressed File System (CAFS) is a generalized interface for working with 15 | filestores that names content based on the content itself, usually 16 | through some sort of hashing function. 17 | Examples of content-addressed file systems include git, bittorrent, IPFS, 18 | the DAT project, etc. 19 | 20 | # Motivation 21 | [motivation]: #motivation 22 | 23 | The long-term goal of CAFS is to define an interface for common filestore 24 | operations between different content-addressed filestores that serves the 25 | subset of features qri needs to function. 26 | 27 | This package doesn't aim to implement everything a given filestore can do, 28 | but instead focus on basic file & directory i/o. CAFS is in its very early days, 29 | starting with a proof of concept based on IPFS and an in-memory implementation. 30 | Over time we'll work to add additional stores, which will undoubtedly affect 31 | the overall interface definition. 32 | 33 | A tacit goal of this interface is to manage the seam between graph-based 34 | storage systems, and a file interface. 35 | 36 | # Guide-level explanation 37 | [guide-level-explanation]: #guide-level-explanation 38 | 39 | There are two key interfaces to CAFS. 
The rest are built upon these two: 40 | 41 | ### File 42 | File is an interface based largely on the `os.File` interface from golang `os` 43 | package, with the exception that files can be _either a file or a directory_. 44 | This file interface will have many dependants in the qri ecosystem. 45 | 46 | ### Filestore 47 | Filestore is the interface for storing files & directories. The "content 48 | addressed" part means that `Put` operations are in charge of returning the name 49 | of the file. 50 | 51 | 52 | # Reference-level explanation 53 | [reference-level-explanation]: #reference-level-explanation 54 | 55 | There are two primary interfaces that constitute a CAFS, *File* and *Filestore*: 56 | 57 | 58 | File is an interface that provides functionality for handling files/directories 59 | as values that can be supplied to commands. For directories, child files are 60 | accessed serially by calling `NextFile()`. 61 | ```golang 62 | type File interface { 63 | // Files implement ReadCloser, but can only be read from or closed if 64 | // they are not directories 65 | io.ReadCloser 66 | 67 | // FileName returns a filename associated with this file 68 | FileName() string 69 | 70 | // FullPath returns the full path used when adding this file 71 | FullPath() string 72 | 73 | // IsDirectory returns true if the File is a directory (and therefore 74 | // supports calling `NextFile`) and false if the File is a normal file 75 | // (and therefore supports calling `Read` and `Close`) 76 | IsDirectory() bool 77 | 78 | // NextFile returns the next child file available (if the File is a 79 | // directory). It will return (nil, io.EOF) if no more files are 80 | // available. If the file is a regular file (not a directory), NextFile 81 | // will return a non-nil error. 82 | NextFile() (File, error) 83 | } 84 | ``` 85 | 86 | Filestore is an interface for working with a content-addressed file system. 87 | This interface is under active development, expect it to change lots. 
88 | It's currently form-fitting around IPFS (ipfs.io), with far-off plans to 89 | generalize toward compatibility with git (git-scm.com), then maybe other stuff, 90 | who knows. 91 | ```golang 92 | type Filestore interface { 93 | // Put places a file or a directory in the store. 94 | // The most notable difference from a standard file store is the store itself determines 95 | // the resulting key (google "content addressing" for more info ;) 96 | // keys returned by put must be prefixed with the PathPrefix, 97 | // eg. /ipfs/QmZ3KfGaSrb3cnTriJbddCzG7hwQi2j6km7Xe7hVpnsW5S 98 | // "pin" is a flag for recursively pinning this object 99 | Put(file File, pin bool) (key string, err error) 100 | 101 | // Get retrieves the object `value` named by `key`. 102 | // Get will return ErrNotFound if the key is not mapped to a value. 103 | Get(key string) (file File, err error) 104 | 105 | // Has returns whether the `key` is mapped to a `value`. 106 | // In some contexts, it may be much cheaper only to check for existence of 107 | // a value, rather than retrieving the value itself. (e.g. HTTP HEAD). 108 | // The default implementation is found in `GetBackedHas`. 109 | Has(key string) (exists bool, err error) 110 | 111 | // Delete removes the value for given `key`. 112 | Delete(key string) error 113 | 114 | // NewAdder allocates an Adder instance for adding files to the filestore 115 | // Adder gives a higher degree of control over the file adding process at the 116 | // cost of being harder to work with. 
117 | // "pin" is a flag for recursively pinning this object 118 | // "wrap" sets whether the top level should be wrapped in a directory 119 | // expect this to change to something like: 120 | // NewAdder(opt map[string]interface{}) (Adder, error) 121 | NewAdder(pin, wrap bool) (Adder, error) 122 | 123 | // PathPrefix is a top-level identifier to distinguish between filestores, 124 | // for example: the "ipfs" in /ipfs/QmZ3KfGaSrb3cnTriJbddCzG7hwQi2j6km7Xe7hVpnsW5S 125 | // a Filestore implementation should always return the same prefix 126 | PathPrefix() string 127 | } 128 | ``` 129 | 130 | ### Pinning 131 | `Filestore.Put()` accepts a `pin` flag. When `pin = true` the filestore will retain 132 | a permanent reference to the file. When `pin = false`, the filestore may remove 133 | the file at any point. Adding files without pinning is a great way to get the hash 134 | of a file without incurring the overhead of content retention. 135 | 136 | # Drawbacks 137 | [drawbacks]: #drawbacks 138 | 139 | Getting this interface right will be difficult & full of odd edge-cases that'll 140 | need to be handled carefully. 141 | 142 | # Rationale and alternatives 143 | [rationale-and-alternatives]: #rationale-and-alternatives 144 | 145 | We could skip the notion of _files_ entirely at this level, and instead choose 146 | to focus on _graph_ structures. I think we may end up maturing in this direction 147 | over time, but if so let's do that with proper design consideration. 148 | 149 | # Prior art 150 | [prior-art]: #prior-art 151 | 152 | There isn't much here in the way of prior art. 153 | 154 | # Unresolved questions 155 | [unresolved-questions]: #unresolved-questions 156 | 157 | - How do we properly handle the distinction between network & local operations?
158 | -------------------------------------------------------------------------------- /text/0004-structured_data_io.md: -------------------------------------------------------------------------------- 1 | - Feature Name: structured_io 2 | - Start Date: 2017-10-18 3 | - RFC PR: [#4](https://github.com/qri-io/rfcs/pull/4) 4 | - Repo: [dataset](https://github.com/qri-io/dataset) 5 | 6 | _Note: This RFC was created as part of an initial sprint to adopt the RFC 7 | process itself, as such sections of this document are less complete than 8 | we'd hope, or less complete than we'd expect from a new RFC. 9 | -:heart: the qri core team_ 10 | 11 | # Summary 12 | [summary]: #summary 13 | 14 | Structured I/O defines interfaces for reading & writing streams of parsed 15 | _entries_, which are elements of arbitrary-yet-structured data such as JSON, 16 | CBOR, or CSV documents. Structured reader & writer interfaces combine a byte 17 | stream, data format and schema to create entry readers & writers that produce & 18 | consume entries of parsed primitive types instead of bytes. Structured I/O 19 | streams can be composed & connected to form the basis of rich data communication 20 | capable of spanning across formats. 21 | 22 | # Motivation 23 | [motivation]: #motivation 24 | 25 | One of the prime goals of qri is to be able to make any dataset comparable to 26 | another dataset. Datasets are also intended to be an arbitrary-yet-structured 27 | document format, able to support all sorts of data with varying degrees of 28 | quality. These requirements mean datasets must be able to define their own 29 | schemas, and may include data that violates that schema. 30 | 31 | Our challenge is to declare a clear set of abstractions built on _key 32 | assumptions_ enforced by the dataset model, and leverage those assumptions 33 | to combine arbitrary data at runtime.
34 | 35 | Structured I/O is the primary means of parsing data, abstracting away the 36 | underlying data format while also delivering a set of expectations about 37 | parsed data based on those key assumptions. Those expectations are parsing 38 | to a predetermined set of types (`int`, `bool`, `string`, etc.), and delivering 39 | a _schema_ that includes a definition of valid data structuring. 40 | 41 | As concrete examples, all of the following require tools for data parsing: 42 | - Creating a dataset from a JSON file 43 | - Converting dataset data from one data format to another 44 | - Printing the first 10 rows of a dataset 45 | - Counting the number of entries in a dataset 46 | 47 | All of these tasks are basic things that we'd like to be able to do with one 48 | or more datasets. 49 | 50 | Structured I/O is intended to be a robust set of primitives that 51 | underpin these tasks. Structured _streams_ (readers & writers) wrap a 52 | raw stream of bytes with a parser that transforms raw bytes into _entries_ made 53 | of a standard set of language-native types (`int`, `bool`, `string`, etc.). 54 | Working with entries instead of bytes allows the programmer to avoid thinking 55 | about the underlying format & focus on the semantics of data instead of 56 | idiosyncrasies between encoding formats. 57 | 58 | Orienting our primitives around _streams_ helps manage concerns created by both 59 | network latency and data volume. By orienting qri around stream programming 60 | we set ourselves up for success when programming in a distributed context. 61 | 62 | Structured I/O builds on foundations set forth in the _structure_ portion of 63 | the dataset definition. For any valid dataset it must be possible to create 64 | a Structured Reader of the dataset body, and a Writer that can be used to 65 | compose an update. 66 | 67 | Structured I/O is intended to underpin lots of other functionality.
Doing new 68 | things with data should be a process of composing and enhancing Structured I/O 69 | streams. Want only a subsection of a dataset's body? Use a `LimitOffsetReader`. 70 | Want to convert JSON to CBOR? Pipe a JSON-formatted `EntryReader` to a 71 | CBOR-formatted writer. 72 | 73 | # Guide-level explanation 74 | [guide-level-explanation]: #guide-level-explanation 75 | 76 | Creating a structured I/O stream requires a minimum of three things: 77 | - a stream of raw data bytes 78 | - the _data format_ of that stream (eg: JSON) 79 | - a data schema 80 | 81 | The Format & Schema are specified in the passed-in structure; the byte stream 82 | is an io.Reader or io.Writer primitive. Here's a quick example of creating a 83 | reader from scratch & reading its values: 84 | ```golang 85 | import ( 86 | "fmt" 87 | "strings" 88 | 89 | "github.com/qri-io/dataset" 90 | "github.com/qri-io/dataset/dsio" 91 | "github.com/qri-io/jsonschema" 92 | ) 93 | 94 | // the data we want to stream, an array with two entries 95 | const JSONData = `["foo",{"name":"bar"}]` 96 | 97 | st := &dataset.Structure{ 98 | Format: dataset.JSONDataFormat, 99 | Schema: jsonschema.Must(`{"type":"array"}`), 100 | } 101 | 102 | // create a Structured I/O reader: 103 | str, err := dsio.NewEntryReader(st, strings.NewReader(JSONData)) 104 | if err != nil { 105 | panic(err) 106 | } 107 | 108 | ent, err := str.ReadEntry() 109 | if err != nil { 110 | panic(err) 111 | } 112 | fmt.Println(ent.Value) // "foo" 113 | 114 | ent, err = str.ReadEntry() 115 | if err != nil { 116 | panic(err) 117 | } 118 | fmt.Println(ent.Value) // {"name":"bar"} 119 | 120 | _, err = str.ReadEntry() 121 | fmt.Println(err.Error()) // EOF 122 | ``` 123 | 124 | ### Stream & Top Level Data 125 | A _stream_ refers to both a _reader_ and a _writer_ collectively. 126 | 127 | _Top Level_ refers to the specific type of the outermost element in a structured
piece of data. This data's top level is an _array_: 128 | ```json 129 | [ 130 | {"a": 1}, 131 | {"b": 2}, 132 | {"c": 3} 133 | ] 134 | ``` 135 | 136 | This data's top level is a _string_: 137 | ```json 138 | "foo" 139 | ``` 140 | 141 | Qri requires a top-level type of either array or object. 142 | 143 | ### Entries 144 | Traditional "unstructured" streams often use byte arrays as the basic unit that 145 | is both read and written. Structured I/O works with _entries_ instead. 146 | An _entry_ is the fundamental unit of reading & writing for a Structured stream. 147 | Entries are themselves a small abstraction that carries the `Value`, which is 148 | parsed data of an arbitrary structure, along with an `Index` and a `Key`. Only one of `Index` 149 | and `Key` will be populated for a given entry, depending on whether a top level 150 | array or object is being read. 151 | 152 | ### Value Types 153 | Qri is built around a basic set of value types, a crucial assumption that 154 | Structured I/O builds on. These types 155 | are inherited from JSON, with the addition of byte arrays. 156 | 157 | All entries will conform to one of the following types: 158 | ```golang 159 | // Scalar Types 160 | nil 161 | bool 162 | int 163 | float64 164 | string 165 | []byte 166 | 167 | // Complex Types 168 | []interface{} // array 169 | map[string]interface{} // object 170 | ``` 171 | 172 | When examining an `Entry.Value`, its type is `interface{}`; performing a 173 | [type switch](https://tour.golang.org/methods/16) that handles all of the above 174 | types will cover all possible cases for a valid entry. Using such a type switch 175 | recursively on complex types provides a robust, exhaustive method for inspecting 176 | any given entry. 177 | 178 | It's important to note that these guarantees are only enforced for basic 179 | Structured I/O streams. Abstractions on top of Structured I/O may introduce 180 | additional types during processing.
A classic example is timestamp parsing. 181 | Implementers of streams that break this type assumption are encouraged to define 182 | a more specific interface than structured I/O to indicate to consumers this 183 | assumption has been broken. 184 | 185 | ### Corrupt Vs. Invalid Data 186 | Structured I/O must distinguish between data that is _corrupt_ and data that is 187 | _invalid_. Corrupt data is data that doesn't conform to the specified format. 188 | As an example, this is corrupt JSON data (extra comma): 189 | ```json 190 | ["foooo",] 191 | ``` 192 | Structured I/O will error if it encounters any kind of corrupt data. 193 | 194 | _Invalid_ data is data that doesn't conform to the specified _schema_ of 195 | structured I/O. Structured I/O streams _are_ expected to gracefully handle 196 | invalid data by falling back to _identity schemas_, discussed below. 197 | 198 | ### Identity Schemas & Fallbacks 199 | Because schemas _must_ be defined on complex types, and the only complex types 200 | we support are objects and arrays, there are two "identity" schemas that 201 | represent the minimum possible schema definitions, requiring the top level of 202 | a data stream to be either an array or an object: 203 | 204 | **Array Identity Schema** 205 | ```json 206 | { "type" : "array" } 207 | ``` 208 | 209 | **Object Identity Schema** 210 | ```json 211 | { "type" : "object" } 212 | ``` 213 | 214 | Data whose top level does not conform to one of these two schemas is 215 | considered corrupt. 216 | 217 | These "Identity Schemas" form a _fallback_ the stream will revert to if the data 218 | it's presented with is invalid. For example, if a schema specifies a top level 219 | of "object", and the stream encounters an array, it will silently revert to the 220 | array identity schema & keep reading. 221 | 222 | The rationale for such a choice is emphasizing _parsing_ over strict adherence 223 | to schema definitions.
One of the primary use cases of a dataset version control 224 | system is to begin with data that is invalid according to a given schema, and 225 | correct toward it. 226 | 227 | For this reason, consumers of structured I/O streams are encouraged to 228 | prioritize parsing based on type switches as mentioned above, unless the codepath 229 | they are operating in presumes _strict mode_. 230 | 231 | ### Strict Mode 232 | Fallbacks are intended to keep data reading at all costs. However, many use cases 233 | will want explicit failure when a stream is misconfigured. For this purpose 234 | streams provide a "strict mode" that errors when invalid data is encountered, 235 | instead of using silent identity-schema fallbacks. 236 | 237 | When a stream operating in strict mode encounters an entry that doesn't match, it 238 | will return `ErrInvalidEntry` for that entry. In this case the stream will 239 | remain safe for continued use, so that invalid entries do not prevent access 240 | to subsequent valid reads/writes.
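The recursive type switch recommended above can be sketched as a small inspector. This is an illustrative sketch only; `CountScalars` is a hypothetical helper, not part of the dsio package, but the case list matches the value types guaranteed by basic Structured I/O streams:

```golang
package main

import "fmt"

// CountScalars recursively walks a parsed entry value using the standard
// type switch over qri's basic value types, returning how many scalar
// values it contains. Handling every listed type makes the switch
// exhaustive for any valid entry.
func CountScalars(v interface{}) int {
	switch x := v.(type) {
	case nil, bool, int, float64, string, []byte:
		return 1
	case []interface{}: // array
		n := 0
		for _, el := range x {
			n += CountScalars(el)
		}
		return n
	case map[string]interface{}: // object
		n := 0
		for _, el := range x {
			n += CountScalars(el)
		}
		return n
	default:
		// a stream that breaks the basic type guarantee lands here
		panic(fmt.Sprintf("unexpected entry value type %T", x))
	}
}

func main() {
	entryValue := map[string]interface{}{
		"name": "bar",
		"tags": []interface{}{"a", "b", true},
		"size": float64(42),
	}
	fmt.Println(CountScalars(entryValue)) // 5
}
```

The `default` branch is where a consumer would detect that an upstream abstraction has introduced extra types (e.g. parsed timestamps) beyond the basic guarantee.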
241 | 242 | # Reference-level explanation 243 | [reference-level-explanation]: #reference-level-explanation 244 | 245 | Entry is a "row" of a dataset: 246 | ```golang 247 | type Entry struct { 248 | // Index represents this entry's numeric position in a dataset 249 | // this index may not necessarily refer to the overall position within 250 | // the dataset as things like offsets affect where the index begins 251 | Index int 252 | // Key is a string key for this entry 253 | // only present when the top level structure is a map 254 | Key string 255 | // Value is information contained within the row 256 | Value interface{} 257 | } 258 | ``` 259 | 260 | EntryWriter is a generalized interface for writing structured data: 261 | ```golang 262 | type EntryWriter interface { 263 | // Structure gives the structure being written 264 | Structure() *dataset.Structure 265 | // WriteEntry writes one "row" of structured data to the Writer 266 | WriteEntry(Entry) error 267 | // Close finalizes the writer, indicating all entries 268 | // have been written 269 | Close() error 270 | } 271 | ``` 272 | 273 | EntryReader is a generalized interface for reading Ordered Structured Data: 274 | ```golang 275 | type EntryReader interface { 276 | // Structure gives the structure being read 277 | Structure() *dataset.Structure 278 | // ReadEntry reads one row of structured data from the reader 279 | ReadEntry() (Entry, error) 280 | } 281 | ``` 282 | 283 | EntryReadWriter combines EntryWriter and EntryReader behaviors: 284 | ```golang 285 | type EntryReadWriter interface { 286 | // Structure gives the structure being read and written 287 | Structure() *dataset.Structure 288 | // ReadEntry reads one row of structured data from the reader 289 | ReadEntry() (Entry, error) 290 | // WriteEntry writes one row of structured data to the ReadWriter 291 | WriteEntry(Entry) error 292 | // Close finalizes the ReadWriter, indicating all entries 293 | // have been written 294 | Close() error 295 | // Bytes gives
the raw contents of the ReadWriter 296 | Bytes() []byte 297 | } 298 | ``` 299 | 300 | # Drawbacks 301 | [drawbacks]: #drawbacks 302 | 303 | The drawback is that this is a new interface that needs to be built, 304 | rationalized & maintained, which is to say this is work, and we should avoid 305 | doing work when we could instead be doing something else. 306 | 307 | # Rationale and alternatives 308 | [rationale-and-alternatives]: #rationale-and-alternatives 309 | 310 | Given that we've already written this, the time for considering alternatives 311 | should be in a future RFC. 312 | 313 | # Prior art 314 | [prior-art]: #prior-art 315 | 316 | [golang's io package](https://godoc.org/io) is _the_ source of inspiration here. 317 | 318 | 319 | ### OpenAPI 320 | Structured I/O can be 321 | seen as a strict extension of [OpenAPI](https://swagger.io/docs/specification/about/). In fact, we use the jsonschema spec that 322 | grew out of OpenAPI! 323 | 324 | From OpenAPI's docs: 325 | > The ability of APIs to describe their own structure is the root of all 326 | awesomeness in OpenAPI. Once written, an OpenAPI specification and Swagger tools 327 | can drive your API development further in various ways... 328 | 329 | Qri datasets are analogous to self-contained OpenAPI specifications & data in 330 | one combined document. Structured I/O is kinda like the thing that turns such a 331 | document back into an "API". 332 | 333 | 334 | # Unresolved questions 335 | [unresolved-questions]: #unresolved-questions 336 | 337 | How do we handle data that _doesn't_ conform to the structure, such as invalid 338 | data? Should we implement a "strict mode" that requires data to be valid?
339 | 340 | A Structured Reader connects a single schema to a data stream; it's a common 341 | use case that entries need only the subsection of the schema that the entry -------------------------------------------------------------------------------- /text/0006-dataset_naming.md: -------------------------------------------------------------------------------- 1 | - Feature Name: dataset_naming 2 | - Start Date: 2017-08-14 3 | - RFC PR: [#6](https://github.com/qri-io/rfcs/pull/3) 4 | - Repo: https://github.com/qri-io/qri 5 | 6 | _Note: This RFC was created as part of an initial sprint to adopt the RFC 7 | process itself to help clarify our existing work. As such, sections of this 8 | document are less complete than we'd expect from a new RFC. 9 | -:heart: the qri core team_ 10 | 11 | # Summary 12 | [summary]: #summary 13 | 14 | Define the qri naming system, conventions for name resolution, and related jargon. 15 | 16 | # Motivation 17 | [motivation]: #motivation 18 | 19 | As a decentralized system, qri must confront the [Zooko's Triangle](https://en.wikipedia.org/wiki/Zooko%27s_triangle) problem, which is establishing a way of referring to datasets that is: 20 | 21 | * human readable 22 | * decentralized 23 | * unique 24 | 25 | Because Qri is assumed to be built atop a content-addressed file system as its storage layer, the properties of being decentralized & unique are already present. The Qri naming system maps a human readable name to the newest version of a dataset. 26 | 27 | # Guide-level explanation 28 | [guide-level-explanation]: #guide-level-explanation 29 | 30 | It’s possible to refer to a dataset in a number of ways. It’s easiest to look 31 | at the full definition of a dataset reference, and then show what the “defaults” are to make sense of things.
The full definition of a dataset reference is as follows: 32 | 33 | dataset_reference = handle/dataset_name@profile_id/network/version_id 34 | 35 | an example of that looks like this: 36 | 37 | b5/comics@QmYCvbfNbCwFR45HiNP45rwJgvatpiW38D961L5qAhUM5Y/ipfs/QmejvEPop4D7YUadeGqYWmZxHhLc4JBUCzJJHWMzdcMe2y 38 | 39 | In a sentence: b5 is the handle of a peer, who has a dataset named comics, and its hash on the ipfs network at a point in time was `QmejvEPop4D7YUadeGqYWmZxHhLc4JBUCzJJHWMzdcMe2y` 40 | 41 | The individual components of this reference are: 42 | 43 | * handle - The human-friendly name that the creator is using to refer to themselves, somewhat analogous to a username in other systems. We need handles so lots of people can name datasets the same thing. 44 | * dataset_name - The human-friendly name that makes it easy to remember and refer to the dataset. 45 | * profile_id - A unique identifier to let machines uniquely refer to datasets created by this peer, regardless of whether their handle is renamed. 46 | * network - Protocol name that stores distributed data, defaulting to "ipfs". 47 | * version_id - A unique identifier hash to refer to a specific version of a dataset, from an exact point in time. 48 | 49 | ### default to latest on ipfs 50 | 51 | Now, having to type `b5/comics@QmYCvbfNbCwFR45HiNP45rwJgvatpiW38D961L5qAhUM5Y/ipfs/QmejvEPop4D7YUadeGqYWmZxHhLc4JBUCzJJHWMzdcMe2y` 52 | every time you wanted a dataset would be irritating. So we have two defaults. 53 | The default network is `ipfs`, and the default hash is the latest known version of a dataset. We say latest known because sometimes things can fall out of sync. If you're only working with your own local datasets, this won’t be an issue.
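Those defaulting rules can be sketched in a few lines. This is a hypothetical illustration of the expansion, not qri's actual resolution code; the angle-bracket placeholders stand for values that would be resolved from the local repo:

```golang
package main

import (
	"fmt"
	"strings"
)

// expandRef fills in the default network ("ipfs") and the latest known
// version for a short handle/dataset_name reference. Illustrative only:
// real resolution looks the profile_id and version hash up in the repo.
func expandRef(ref string) string {
	parts := strings.Split(ref, "/")
	if len(parts) == 2 { // short handle/dataset_name form
		return fmt.Sprintf("%s/%s@<profile_id>/ipfs/<latest_version_id>", parts[0], parts[1])
	}
	return ref // already a path or fully-qualified reference
}

func main() {
	fmt.Println(expandRef("b5/comics"))
	// b5/comics@<profile_id>/ipfs/<latest_version_id>
}
```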
54 | 55 | Anyway, that means we can cut down on the typing. If we just want the latest 56 | version of b5’s comics dataset, we can just type: 57 | 58 | b5/comics 59 | 60 | In a sentence: “the latest version of b5’s dataset named comics.” 61 | 62 | ### the me keyword 63 | 64 | What if your handle is, say, `golden_pear_ginger_pointer`? First, why did you pick such a long handle? 65 | Whatever your answer, it would be irritating to have to type this every time, so we give you a special way to refer to yourself: `me`. So if you have a dataset named comics, you can just type: 66 | 67 | me/comics 68 | 69 | In a sentence: “the latest version of my dataset named comics.” Under the hood, we’ll re-write this request to use your actual handle instead. 70 | 71 | ### drop names with hashes 72 | 73 | Finally, it’s also possible to use just the hash. This is a perfectly valid dataset reference: 74 | 75 | @/ipfs/QmejvEPop4D7YUadeGqYWmZxHhLc4JBUCzJJHWMzdcMe2y 76 | 77 | In this case we’re ignoring naming altogether and simply referring to a dataset by its network and version hash. Because IPFS hashes are global, we can do this across the entire network. If you’re coming from git, this is a fun new trick. 78 | 79 | To recap: 80 | 81 | All of these are ways to refer to a dataset: 82 | 83 | * handle/dataset_name (user-friendly reference) 84 | 85 | b5/comics 86 | 87 | * network/hash (path) 88 | 89 | /ipfs/QmejvEPop4D7YUadeGqYWmZxHhLc4JBUCzJJHWMzdcMe2y 90 | 91 | * handle/dataset_name@profile_id/network/version_id (canonical reference) 92 | 93 | b5/comics@QmYCvbfNbCwFR45HiNP45rwJgvatpiW38D961L5qAhUM5Y/ipfs/QmejvEPop4D7YUadeGqYWmZxHhLc4JBUCzJJHWMzdcMe2y 94 | 95 | 96 | # Reference-level explanation 97 | [reference-level-explanation]: #reference-level-explanation 98 | 99 | The struct that stores a dataset reference is `DatasetRef`. Its fields correspond directly to the definition specified above. 100 | 101 | ``` 102 | // DatasetRef encapsulates a reference to a dataset.
103 | type DatasetRef struct { 104 | // Handle of the dataset creator 105 | Handle string `json:"handle,omitempty"` 106 | // Human-friendly name for this dataset 107 | Name string `json:"name,omitempty"` 108 | // ProfileID of dataset creator 109 | ProfileID profile.ID `json:"profileID,omitempty"` 110 | // Content-addressed path for this dataset, network and version_id 111 | Path string `json:"path,omitempty"` 112 | // Dataset is a pointer to the dataset being referenced 113 | Dataset *dataset.DatasetPod `json:"dataset,omitempty"` 114 | } 115 | ``` 116 | 117 | The Dataset pointer optionally points to a dataset itself, once it has been loaded into memory. 118 | 119 | The most important function for working with `DatasetRef`s is `CanonicalizeDatasetRef`, which converts from user-friendly references and path references to canonical references. 120 | 121 | ``` 122 | func CanonicalizeDatasetRef(r Repo, ref *DatasetRef) error { 123 | ... 124 | } 125 | ``` 126 | 127 | This function handles replacements like converting `me` to the peer's `handle`, and fills in PeerID and Path. It returns the error `repo.ErrNotFound` if the dataset does not exist in the user's repo, which means the dataset is not local, but may exist remotely. Callers of this function should respond appropriately, by contacting peers if a p2p connection exists. 128 | 129 | 130 | # Drawbacks 131 | [drawbacks]: #drawbacks 132 | 133 | Creating a naming scheme carries with it many issues around backwards compatibility, but this is somewhat mitigated by having the network name in the reference. If, for some reason, a new distributed network needs to be supported in the future, Qri can adapt without breaking old references. 134 | 135 | Renames are difficult to get right, since it means that code cannot assume that all hashes correlate to the same user-friendly string. 136 | 137 | There exist subtle differences between types of dataset references, due to the structure being stateful.
Code must handle DatasetRefs carefully, since their type alone does not reveal whether they are user-friendly, canonical, or only a path. 138 | 139 | # Rationale and alternatives 140 | [rationale-and-alternatives]: #rationale-and-alternatives 141 | 142 | Some naming scheme is absolutely necessary to refer to distributed datasets in a user-friendly way. 143 | 144 | # Prior art 145 | [prior-art]: #prior-art 146 | 147 | Similar concepts exist in Git, which uses sha1 hashes the way Qri uses IPFS hashes. Git also uses branch and remote names similar to how Qri uses dataset names. 148 | 149 | Bittorrent solves similar problems by encapsulating hash information in binary files that users load from their native interface. 150 | 151 | # Unresolved questions 152 | [unresolved-questions]: #unresolved-questions 153 | 154 | What other shortcut names are there aside from `me` that we may also want to support in the future? How might the DatasetRef structure need to change in the future if new network values are ever supported? 155 | -------------------------------------------------------------------------------- /text/0011-html_viz.md: -------------------------------------------------------------------------------- 1 | - Feature Name: html_viz 2 | - Start Date: 2018-08-14 3 | - RFC PR: [#13](https://github.com/qri-io/rfcs/pull/13) 4 | - Issue: 5 | 6 | _Note: We never properly finished the HTML rendering RFC, which we started work on in August 2018. 7 | Instead of creating a new one we just "finished what we started" in March 2019. As such this RFC 8 | contains references to RFCs developed in the period between August 2018 & March 2019. 9 | -:heart: the qri core team_ 10 | 11 | # Summary 12 | [summary]: #summary 13 | 14 | HTML visualizations are instructions embedded in a dataset for rendering it as a standard HTML document.
15 | 16 | # Motivation 17 | [motivation]: #motivation 18 | 19 | Qri has a syntax-agnostic `viz` component that encapsulates the details required to visualize a dataset. This RFC proposes the first & default visualization syntax should be rendering to a single HTML document using a template processing engine. 20 | 21 | # Guide-level explanation 22 | [guide-level-explanation]: #guide-level-explanation 23 | 24 | 25 | ### `qri render` & default template 26 | 27 | A command will be added to both Qri's HTTP API ('API' for short) & command-line interface (CLI) called `render`, which adds the capacity to execute HTML templates against datasets. When called with a specified dataset `qri render` will load the dataset, assume the viz syntax is HTML, and use a default template to write an HTML representation of a dataset to `stdout` on the CLI, or the HTTP response body on the API: 28 | `qri render me/example_dataset` 29 | 30 | The default template and output path can be overridden with the `--template` and `--output` flags respectively. The output is on the CLI only: 31 | `qri render --template template.html --output rendered.html me/example_dataset` 32 | 33 | The default template must be carefully composed to balance the size of the resulting HTML file in bytes against readability & utility of the resulting visualization. It should also include a well-constructed citation footer that details the components of a dataset in a concise, unobtrusive manner that invites users to audit the veracity of the dataset in question. These defaults should encourage easy reading and invite verification on the part of the reader. 34 | 35 | ### Vizualizations in datasets 36 | 37 | Saving a dataset will by default execute the default template to a file called `index.html` & place it in the IPFS merkle-DAG. When an IPFS HTTP gateway receives a request for a DAG that is a directory containing `index.html`, it returns the HTML file by default. 
This means that when a user visits a dataset on the d.web (completely outside the Qri system of dataset production), they are greeted with a well-formatted dataset document by default. 38 | 39 | While care will be taken to keep `index.html` files small, users may understandably want to disable them entirely. To achieve this we'll add a new flag to `qri save` and `qri update`: `--no-render`. Passing `--no-render` will prevent the execution of any viz template. This will save a few KB from version to version at the cost of usability. 40 | 41 | Users can _override_ the default template by supplying their own custom viz templates, either by specifying a `viz.scriptPath`: 42 | 43 | `dataset.yaml`: 44 | ```yaml 45 | name: example_dataset 46 | # additional fields elided ... 47 | viz: 48 | format: html 49 | scriptPath: template.html 50 | ``` 51 | 52 | and running save: 53 | ``` 54 | $ qri save --file dataset.yaml 55 | ``` 56 | 57 | Or by running `qri save` with an `.html` file argument: 58 | `qri save --file template.html me/example_dataset` 59 | 60 | Since the above example provided no additional configuration details for the `viz` component in `dataset.yaml`, the two calls will have the same effect. 61 | 62 | # Reference-level explanation 63 | [reference-level-explanation]: #reference-level-explanation 64 | 65 | ### Template API 66 | 67 | Introducing HTML template execution requires defining an API for template values. This API will need to be documented & maintained just like any other API in the Qri ecosystem. 68 | 69 | The template implementation will use the [Go html/template package](https://golang.org/pkg/html/template). 70 | 71 | #### Dataset is exposed as `ds` 72 | 73 | HTML templates should expose the dataset document as `ds`. Exposing the dataset document as `ds` matches our conventions for referring to a dataset in Starlark, and allows access to all defined parts of a dataset. `ds` should use _lower camel case_ fields for component accessors.
For example: 74 | 75 | ```html 76 | {{ ds.meta.title }} 77 | {{ ds.transform.scriptPath }} 78 | ``` 79 | 80 | Undefined components should be exposed as empty struct pointers rather than nil. For example, for a dataset _without_ a transform, the following template should print nothing: 81 | ```html 82 | {{ ds.transform }} 83 | ``` 84 | And this should error: 85 | ```html 86 | {{ ds.transform.script }} 87 | ``` 88 | 89 | Having default empty pointers prevents unnecessary `if` clauses, allowing templates to skip straight to existence checks on component fields: 90 | ```html 91 | {{ if ds.transform.script }} 92 | {{ end }} 93 | ``` 94 | 95 | ### Template functions 96 | 97 | Top-level functions should be loaded into the template `funcs` space to make rendering templates easier. The Go html/template package comes with [predefined functions](https://golang.org/pkg/text/template/#hdr-Functions). Because our implementation builds on the html/template package, this RFC introduces all of these functions into our template engine. This seems acceptable: it might be a pain if someone needs to write a non-Go `Render` implementation, but at least the universe of available template functions is known. 98 | 99 | An example that uses the predefined `len` function together with `getBody` to print the length of the body: 100 | 101 | ```html 102 |
<p>The Body has {{ len getBody }} elements</p>
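<!-- Further illustrative lines exercising the Qri-specific helpers proposed
     below. These calls are sketches only; the RFC leaves the helpers' exact
     signatures unspecified. -->
<h1>{{ title ds }}</h1>
<p>Created {{ timeFormat ds.commit.timestamp "Jan 2 2006" }}</p>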
103 | ``` 104 | 105 | In addition to the stock predefined functions, the following should be loaded for all templates to make templating a little easier: 106 | 107 | | name | description | 108 | | ----------- | ------------------------ | 109 | | timeParse | parse a timestamp string, returning a Go *time.Time struct | 110 | | timeFormat | convert the textual representation of a datetime into the specified format | 111 | | default | allows setting a default value to return when a given value is not set | 112 | | title | give the title of a dataset | 113 | | getBody | load the full dataset body | 114 | | filesize | convert a byte count to a kb/mb/etc string | 115 | 116 | 117 | #### Future dataset document API 118 | 119 | We have reserved future work for a "dataset API" that will expand the default capabilities of a dataset document to include convenience functions for doing things like loading named columns or sampling rows from the body. We've intentionally left this API undefined thus far to understand how it will work in different contexts. One such context is this template API. 120 | 121 | The one exception to this is exposing body data through a global function: `getBody`. This is because there's a _very_ high chance we'll want to export `ds.body` as an object with methods in the future. If we simply load the entire body & drop it into `ds.body`, adding methods to `ds.body` will require breaking the document API. 122 | 123 | ### The default template 124 | 125 | Our standard template should be a collection of pre-defined blocks which are also available to user-provided templates. An example default template would look something like this: 126 | 127 | ```html 128 | <!DOCTYPE html> 129 | <html> 130 | <head> 131 | {{ block "stylesheet" . }}{{ end }} 132 | </head> 133 | <body> 134 | {{ block "header" ds . }}{{ end }} 135 | {{ block "summary" ds . }}{{ end }} 136 | {{ block "stats" ds . }}{{ end }} 137 | {{ block "citations" ds .
}}{{ end }} 138 | </body> 139 | </html> 140 | ``` 141 | 142 | Users can then swap in these predefined blocks to build pseudo-custom templates, easing the transition to fully-custom visualizations through progressive customization. The pre-defined blocks are as follows: 143 | 144 | | name | purpose | 145 | | ---------- | ------- | 146 | | stylesheet | default CSS styles | 147 | | header | dataset title, date created & author in a `<header>` tag | 148 | | summary | overview of dataset details | 149 | | stats | template for the stats component that prints nothing if the stats component is undefined | 150 | | citations | `