├── README.md └── proposal ├── README.md ├── feedback-aj-towns.txt ├── feedback-david-harding.txt └── feedback-sjors-provoost.txt /README.md: -------------------------------------------------------------------------------- 1 | # assumeutxo-docs 2 | 3 | Documentation for Bitcoin's assumeutxo proposal. 4 | 5 | See [proposal/](proposal/) for more details. 6 | -------------------------------------------------------------------------------- /proposal/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Abstract 3 | 4 | An implementation and deployment plan for `assumeutxo` is proposed, which uses serialized UTXO sets to substantially reduce the amount of time needed to bootstrap a usable Bitcoin node with acceptable changes in security. 5 | 6 | #### Design goals 7 | 8 | 1. Provide a realistic avenue for non-hobbyist users to run a fully validating node, 9 | 1. avoid imposing significant added storage burden, and 10 | 1. make no significant concessions in security. 11 | 12 | #### Plan 13 | 14 | This feature could be deployed in two, possibly three, phases: 15 | 16 | 1. UTXO snapshots (~3.2GB) can be created and loaded via RPC in lieu of the normal IBD process. 17 | - They will be obtained through means outside of bitcoind. 18 | - A hardcoded `assumeutxo` hash will fix which snapshots are considered valid. 19 | - Asynchronous validation of the snapshot will be performed in the background after it is loaded. 20 | This phase includes most, if not all, of the changes needed to existing validation code. (see [the PR](https://github.com/bitcoin/bitcoin/pull/15606)) 21 | 1. Snapshots will be generated, stored, and transmitted by bitcoind. 22 | - To mitigate DoS concerns and storage consumption, nodes will store subsets of FEC-split chunks spanning three snapshots (one current, two historical) at an expected ~1.2GB storage burden. 
23 | - Nodes bootstrapping will, if assumeutxo is enabled, obtain these chunks from several distinct peers, reassemble a snapshot, and load it. 24 | - The hardcoded assumeutxo value will change from a content hash to a Merkle root committing to the set of chunks particular to a certain snapshot. 25 | - We may consider adding a rolling UTXO set hash for local node storage to make accessing expected UTXO set hashes faster, and may augment the assumeutxo commitment with its value. 26 | 1. (Far down the road) a consensus commitment to the UTXO set hash at a given height may be considered. The snapshot and background validation process may be reused as we migrate to a more compact representation of the UTXO set, e.g. [utreexo](https://www.youtube.com/watch?v=edRun-6ubCc), [UHS](https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2018-May/015967.html), or [accumulators](https://eprint.iacr.org/2018/1188). 27 | 28 | If parts of that were incomprehensible, keep reading for details. 29 | 30 | Right now you can help by reviewing this proposal and the draft code. 31 | 32 | #### Resources 33 | 34 | - Github issue: https://github.com/bitcoin/bitcoin/issues/15605 35 | - Phase 1 draft implementation: https://github.com/bitcoin/bitcoin/pull/15606 36 | 37 | #### Already familiar? 38 | 39 | If you've been following this discussion and already understand the basics of 40 | `assumeutxo`, you can probably skip down to the [*Security* section](#security). 41 | 42 | #### Acknowledgements 43 | 44 | I'd like to thank the following people for helping with this proposal, though they should not be held accountable for any dumb mistakes I may have made in the preparation of this document and related code: 45 | 46 | Suhas Daftuar, Pieter Wuille, Greg Maxwell, Matt Corallo, Alex Morcos, Dave Harding, AJ Towns, Sjors Provoost, Marco Falke, Russ Yanofsky, and Jim Posen. 47 | 48 | 49 | ## Basics 50 | 51 | ### What's a UTXO snapshot? 
52 | 53 | It's a serialized version of the unspent transaction output (UTXO) set, as seen from a certain height in the chain. The serialized UTXOs are packaged with some metadata, e.g. 54 | - the total number of coins contained in the snapshot, 55 | - the block header for the latest block encapsulated in the snapshot (its "base"), 56 | 57 | and a few other things. You can see the full data structure (in its current form) in the [assumeutxo pull request](https://github.com/jamesob/bitcoin/blob/utxo-dumpload-compressed/src/validation.h#L827-L881). 58 | 59 | ### What's `assumeutxo`? 60 | 61 | It's a piece of data embedded in the source code that commits to the hash of a serialized UTXO set which is considered valid for some chain height. The final format of this commitment is still subject to debate because generating it is computationally expensive, and its structure affects how we store and transmit the serialized UTXO set to and from peers. At the moment, though, it is a simple SHA256-based hash of the UTXO set contents generated by the existing [`GetUTXOStats()` utility](https://github.com/bitcoin/bitcoin/blob/91a25d1e711bfc0617027eee18b9777ff368d6b9/src/rpc/blockchain.cpp#L944-L981). 62 | 63 | 64 | ### Okay... what's the use of these things? 65 | 66 | We can use UTXO snapshots and the `assumeutxo` commitment to significantly reduce the amount of time it takes to bootstrap a usable Bitcoin node under a security model that seems acceptable. 67 | 68 | Right now, initial block download is a process that scales linearly with the size of the chain's history. Depending on hardware and network bandwidth, it takes anywhere from 4 hours to multiple days to complete for a new installation of bitcoind. This discourages users from running full nodes and instead incentivizes them to turn to clients with a reduced security model. 69 | 70 | ### How does snapshot loading work?
71 | 72 | When a snapshot is loaded, it is deserialized into a full chainstate data structure, which includes a representation of the block chain (`chainActive`) and the UTXO set (both on disk and cached in memory). This lives alongside the original chainstate that was extant before loading the snapshot. Before a loaded snapshot is accepted, a headers chain that includes the block hash of the snapshot's last block (its "base") must be retrieved from the peer network. 73 | 74 | Once the snapshot is loaded, the snapshot chainstate performs initial block download from the snapshot state to the network's tip. The system then allows operation as though IBD has completed, and the assumed-valid snapshot chainstate is treated as `chainActive`/`pcoinsTip`/et al. 75 | 76 | After the snapshot chainstate reaches the network tip, the original chainstate resumes, in the background, the initial block download that was underway before the snapshot was loaded. This "background validation" process happens asynchronously from use of the active (snapshot) chainstate, allowing the system to service, for example, wallet operations. The purpose of this background validation is to retrieve all block files and fully validate the chain up to the start of the snapshot. 77 | 78 | Until the background validation completes, we refuse to load wallets with a `bestblock` marker before the base of the snapshot, since we don't have the block data necessary to perform a rescan. 79 | 80 | Once the background validation completes, we throw its state away, as the snapshot chainstate has now been proven fully valid. If for some reason the background validation yields a UTXO set hash that is different from what the snapshot advertised, we warn loudly and shut down. 81 | 82 | 83 | ## Security 84 | 85 | ### Is there a change in security model by introducing `assumeutxo`?
86 | 87 | If we're talking about the degree to which developers are trusted to identify what is and isn't valid in Bitcoin: no, there is no material difference between the use of assumeutxo and the degree to which Bitcoin "trusts developers" now. 88 | 89 | Currently, Bitcoin ships with hardcoded assumevalid values. These values identify certain blocks which, if seen in the headers chain, allow signature checking for any prior blocks to be skipped as a performance optimization. 90 | 91 | > With assumevalid, you assume that the people who peer review your software are capable of running a full node that verifies every block up to and including the assumed-valid block. This is no more trust than assuming that the people who peer review your software know the programming language it's written in and understand the consensus rules; indeed, it's arguably less trust because *nobody* completely understands a complex language like C++ and *nobody* probably understands every possible nuance of the consensus rules---yet almost anyone technical can start a node with -noassumevalid, wait a few hours, and check that `bitcoin-cli getblock $assume_valid_hash` returns a `"confirmations"` field that's not -1. 92 | > 93 | > *Dave Harding* 94 | 95 | Assumeutxo is a similar idea, though it would be handled even more stringently (in that it couldn't be specified via CLI). The hardcoded assumeutxo value would be proposed and reviewed in exactly the same way as assumevalid (via pull request), and would be merged with sufficient lead time to allow anyone interested to reproduce its value for themselves. 96 | 97 | ### But isn't a hash in the code that assumes validity of a chain basically like the developers deciding what the "right" chain is?
98 | 99 | This value is no different from any other code change proposed by developers; in fact, it would be much easier to sneak some kind of backhanded logic that implements a skewed notion of validity into another, more obscure part of the code. E.g., `CCoinsViewDB` could be modified to always attest to the existence of a coin under some special condition, or the networking code (`net`) could be modified to only communicate with certain peers. 100 | 101 | The clarity around an assumevalid/assumeutxo commitment value actually makes user participation much more straightforward because it's obvious how this optimization can be reviewed. 102 | 103 | It's also worth noting that the existence of an assumevalid/utxo value doesn't preclude any other chain from being considered valid; it simply says "the software previously validated *this* particular chain." 104 | 105 | ### Okay, so there might be a theoretical equivalence, but are there any *practical* security differences with assumeutxo (vs. assumevalid)? 106 | 107 | Yes, there is one practical security difference. Currently, if I wanted to trick someone into thinking I had coins that I don't actually have on the honest network, I'd have to 108 | 109 | - get them to start bitcoind with a bad `-assumevalid=` parameter, 110 | - isolate their node from the honest network to prevent them from seeing the 111 | most-PoW headers chain, and 112 | - build a PoW-valid chain that includes the existing [checkpoints](https://github.com/bitcoin/bitcoin/blob/91a25d1e711bfc0617027eee18b9777ff368d6b9/src/chainparams.cpp#L146-L160) and includes a block with an invalid coin assignment somewhere after the last checkpoint. 113 | 114 | This obviously involves a fair bit of effort, since the attacker needs to generate a chain of blocks along with the requisite PoW. 115 | 116 | However, with assumeutxo, if I can get the user to accept a malicious `assumeutxo` commitment value, most of the work is done.
Modifying and serializing a false UTXO snapshot is quite easy -- no proof of work necessary. 117 | 118 | ### That sounds really bad - so all an attacker has to do is get a user to accept a bad assumeutxo value and feed them a poisoned snapshot? 119 | 120 | Yes, that's all it takes. 121 | 122 | As a result, the assumeutxo value will be embedded in the source code and we will not build a mechanism that enables specification of the `assumeutxo` value via commandline; the practical risk is just too high. If users prefer to specify an alternate value (not recommended), they can modify the source code and recompile. 123 | 124 | The assumeutxo value will live in the source code; recall that if an attacker has means to affect the source code used to build a binary, they can already do anything they want. 125 | 126 | ### It seems like this is an argument for including a commitment to the assumeutxo value in a place where it could be enforced by consensus, say in the block headers. Should we do that? 127 | 128 | Maybe in time, but not at the moment. Before we gain practical experience with use of UTXO snapshots, we don't know what the right commitment structure is. Making consensus changes is a very expensive process and should not be done until we're absolutely sure of what we want to commit to. 129 | 130 | Down the road we may introduce such a commitment into a consensus-critical place, but for now we should design assumeutxo to be secure without the assumption that we will. 131 | 132 | ### Do you perform any extra validation on a loaded snapshot besides comparing its hash to the `assumeutxo` value? 133 | 134 | Yes. After the UTXO snapshot is loaded and the chain syncs to the network tip, we begin an initial block download process in the background using separate data structures (i.e. a separate chainstate). This background IBD will download and validate all blocks up to the last block assumed valid by the snapshot (i.e. the "base" of the snapshot). 
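The process closes with a hash comparison: the UTXO set built up independently by background IBD must match what the snapshot claimed. A toy Python sketch of that final check (`utxo_set_hash` here is a hypothetical stand-in for `GetUTXOStats()`'s real serialization, not the actual format):

```python
import hashlib

def utxo_set_hash(utxos):
    """Toy content hash over a UTXO set, modeled loosely on the SHA256-based
    hash produced by GetUTXOStats(). `utxos` maps outpoint -> coin data;
    sorting makes the hash independent of insertion order."""
    h = hashlib.sha256()
    for outpoint, coin in sorted(utxos.items()):
        h.update(outpoint.encode())
        h.update(coin.encode())
    return h.hexdigest()

def background_validation_ok(validated_utxos, snapshot_claimed_hash):
    """When background IBD reaches the snapshot's base block, the UTXO set
    it built from full validation must hash to the value the snapshot
    advertised; on a mismatch the node would warn loudly and shut down."""
    return utxo_set_hash(validated_utxos) == snapshot_claimed_hash
```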
135 | 136 | Once the background IBD completes, we will have validated all blocks in the previously assumed-valid chain we've been using since we loaded the snapshot. We can then throw away the background validation `chainstate` data. 137 | 138 | 139 | 140 | ## Resource usage 141 | 142 | ### This extra chainstate that we're using for the background IBD -- doesn't that take up extra disk space and memory for a separate leveldb and cache? 143 | 144 | Yes. Since we have to maintain a completely separate UTXO set to support background IBD simultaneous with use of the assumed-valid chain, we need an extra `CCoinsView*` hierarchy. This means temporarily keeping an extra `chainstate` (leveldb) directory on disk, and it means splitting the memory allocated via `-dbcache` between the two in-memory coins caches. 145 | 146 | I don't think this is a huge deal because it basically means (at the moment) an extra 3.2GB on disk. We can split the specified `-dbcache` memory ~80/20 between the `CCoinsViewCache` used for background IBD and the assumed-valid `chainActive`, since a sizable dbcache only provides a noticeable performance benefit during initial block download. 147 | 148 | ### Should we even run a background validation sync? If we accept the assumeutxo security model, why even do the IBD? If IBD isn't scalable long term, what's the point? 149 | 150 | If we introduce assumeutxo with snapshots but do not perform IBD in the background, it's easy to imagine that almost anyone setting up a node will do so with a UTXO snapshot (since it's much quicker than conventional IBD), run using an assumed-valid chain, and present itself to the network as a pruned node. In the limit, that results in an absence of nodes serving historical blocks to the network. This certainly isn't what we want, so it seems prudent to keep the background IBD on as a default. 151 | 152 | Users constrained by hardware can of course use assumeutxo with pruning options.
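As a back-of-the-envelope illustration of the `-dbcache` split described in the resource-usage answer above (the function name and figures are illustrative, not taken from the implementation):

```python
def split_dbcache(dbcache_mib, background_share=0.8):
    """Sketch of dividing the -dbcache budget between the coins cache doing
    background IBD (which benefits most from a large cache) and the
    assumed-valid active chainstate. The ~80/20 ratio is the split suggested
    in this document; the draft implementation uses a different ratio."""
    background_mib = int(dbcache_mib * background_share)
    snapshot_mib = dbcache_mib - background_mib
    return background_mib, snapshot_mib
```

With a 450 MiB `-dbcache`, for example, this would give roughly 360 MiB to the background validation cache and 90 MiB to the active snapshot chainstate.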
153 | 154 | Assumeutxo is a performance optimization. If we replaced the IBD process with it outright, that would be a change in Bitcoin's security model. In the future, we may split the block download and connection/validation processes so that assumeutxo nodes can still serve blocks to the peer network without having to expend the computational resources needed to perform IBD-style validation. 155 | 156 | 157 | ### How will users and reviewers efficiently verify hashes of a given UTXO set? 158 | 159 | Right now, computing the hash of the UTXO set at a certain height can be done using the `gettxoutsetinfo` RPC command (i.e. `GetUTXOStats()`). It takes a few minutes to compute, and if you want to do it for an arbitrary height, you need to call `invalidateblock` to rewind to that point and then `reconsiderblock` afterwards to fast-forward back. Obviously, this interrupts normal operation. 160 | 161 | This is inconvenient, but in the meantime we could modify `gettxoutsetinfo` to accept a height and at least abstract away the manual chain rewind and fast-forward via `invalidateblock`/`reconsiderblock`. 162 | 163 | Longer-term, it's conceivable that we could use a node-local [rolling UTXO set hash](https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-May/014337.html) to make the hash available immediately. However, a rolling UTXO set hash is incompatible with assumeutxo commitment schemes that involve chunking snapshots (discussed below), so the resulting assumeutxo value might have to be a tuple consisting of `(rolling_set_hash, split_snapshot_chunks_merkle_root)`. 164 | 165 | 166 | ## Snapshot storage and distribution 167 | 168 | ### How will snapshot distribution work? 169 | 170 | Ultimately, users will obtain UTXO snapshots over the peer network. Before that is implemented, users may obtain UTXO snapshots that validate against the hardcoded assumeutxo hash from any individuals or CDNs offering them (see "*What are the steps to deployment?*" below).
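During that interim phase, a user fetching a snapshot out of band might at least sanity-check the download before handing it to the node. A hedged sketch (all names hypothetical; note that a plain file hash only guards transport integrity -- the authoritative check is bitcoind validating the deserialized UTXO set against the hardcoded assumeutxo value at load time):

```python
import hashlib

def sha256_file(path, bufsize=1 << 20):
    """Stream a multi-gigabyte snapshot file through SHA256 without loading
    it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def ok_to_load(path, published_hex):
    """True if the downloaded file matches the hash published alongside it.
    This is only a download-integrity check; the node itself must still
    verify the snapshot's contents against the assumeutxo value."""
    return sha256_file(path) == published_hex
```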
171 | 172 | Because snapshots are quite sizable, and because malicious peers may well lie about a large file they're offering being a valid snapshot, we need chunked storage and transmission of snapshots, where each chunk is easy to validate. 173 | 174 | Naively, one way we could do this is to split each snapshot into *n* evenly-sized chunks. We could construct a Merkle tree using the hash of each chunk, and the root of that tree would then be the `assumeutxo` commitment value embedded in the source code. Each peer would choose some random value modulo *k* (such that *k* <= *n*) that determines which subset of chunks it is responsible for storing. Upon initialization, a peer wanting to obtain a snapshot would have to find *k* distinct peers, each providing a unique "stripe" of the snapshot data, to obtain all *n* chunks. 175 | 176 | ### That sounds nice and simple, but I bet there are problems. 177 | 178 | Right. The issues with this approach are 179 | - finding a full set of peers which together offer all *k* stripes of the data is inconvenient, and 180 | - it opens a minor DoS vector: to prevent initialization, an attacker has only to target all nodes offering one particular stripe of the data. 181 | 182 | Instead, we could use [erasure coding](http://web.eecs.utk.edu/~plank/plank/classes/cs560/560/notes/Erasure/2004-ICL.pdf) to generate *n* + *m* chunks (where *m* is the number of extra coding chunks) and have the bootstrapping node retrieve any *n* + 𝛼 distinct chunks from its peers, where 𝛼 depends on the particular coding scheme used. Each node would still only store and serve a subset of the snapshot chunks. 183 | 184 | ### How many snapshots will a node have to store? 185 | 186 | A node would store more than just the latest snapshot in order to help peers running legacy versions of the software. The exact number could be subject to debate, but I'd say storing some data for two historical snapshots in addition to the latest snapshot is probably reasonable.
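To make the naive chunk-and-Merkle scheme described above concrete, here is a minimal sketch (assumptions: plain double-SHA256, evenly sized byte chunks, and Bitcoin-style duplication of the last hash on odd-width levels -- none of which has actually been decided):

```python
import hashlib

def sha256d(b):
    """Double SHA256, as used for Bitcoin's block Merkle trees."""
    return hashlib.sha256(hashlib.sha256(b).digest()).digest()

def split_chunks(snapshot_bytes, n):
    """Split a serialized snapshot into n roughly even chunks."""
    size = (len(snapshot_bytes) + n - 1) // n
    return [snapshot_bytes[i * size:(i + 1) * size] for i in range(n)]

def merkle_root(chunks):
    """Merkle root over the chunk hashes; under this scheme, this root
    (rather than a plain content hash) would become the hardcoded
    assumeutxo commitment value."""
    level = [sha256d(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-width levels
            level.append(level[-1])
        level = [sha256d(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```

A bootstrapping node could then verify each chunk it receives against a Merkle branch before storing it, rather than downloading 3.2GB on faith.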
187 | 188 | For each snapshot, each node would only store some slice of the data (perhaps 1/8th or so), based upon the exact parameters chosen for the erasure coding scheme described above. 189 | 190 | For reference, a UTXO snapshot at a recent tip is roughly 3.2GB. If we don't do anything clever and snapshots remain at this size, we might expect nodes to store 1.2GB (= 1/8 shard size * 3 snapshots * 3.2GB per snapshot) of snapshot data (assuming 8 peers are required for snapshot bootstrap and 3 total snapshots are stored). 191 | 192 | ### How will snapshots be available from peers for a new assumeutxo value that has just been released? 193 | 194 | Good question - I haven't quite figured this one out yet. Presumably, we could have the code automatically generate snapshots every 6 months or so. In order to generate snapshots during runtime without impairing normal operations like new block reception, we'll probably have to refactor state-to-disk flushing to be asynchronous. 195 | 196 | Sjors notes that we could generate snapshots periodically based on block height, which would be useful for more than transmission to peers: 197 | 198 | > At fixed block intervals, the snapshots will [immediately] be useful even if they're not referenced in a release yet. They can be used for local backups to recover from block & chainstate data corruption. Each node can store the hash in a simple text file. When calling -reindex(chainstate) it looks for that text file. For pruned nodes -reindex could even just go back to the snapshot instead of all the way to genesis. Similarly this can be helpful when a node needs to undo pruning to rescan an old wallet. 199 | 200 | 201 | 202 | ## Alternative approaches 203 | 204 | ### Instead of using UTXO snapshots that we have to store and send around, why not just have Bitcoin start in SPV mode (using e.g. [BIP 157](https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki)) and do initial block download in the background?
205 | 206 | This is an appealing idea, but there are some practical downsides. 207 | 208 | For one, much less code is required to implement the assumeutxo approach than to build an SPV mode into Bitcoin Core. At the moment, Core doesn't have any code for acting as an SPV client and many subsystems assume the presence of a chainstate (i.e. a `CChain` blockchain object and a view of the UTXO set). 209 | 210 | Maybe surprisingly, the assumeutxo approach allows us to reuse a lot of existing code within Bitcoin Core. Its implementation amounts to a series of refactorings (that should probably be done anyway to make writing tests easier) plus a few new smallish pieces of logic to handle multiple chainstates during initialization and network communication. 211 | 212 | A good deal of new code will be required to use, say, [Reed-Solomon erasure coding](http://web.eecs.utk.edu/~plank/plank/classes/cs560/560/notes/Erasure/2004-ICL.pdf) to split snapshots for storage and peer transmission, but this sort of code can easily live at the periphery of the system, whereas building in an SPV mode would require numerous modifications to the "heart" of the software. 213 | 214 | More new code is more engineering and review effort, and ultimately more risk. An SPV mode would also not allow the node to do a full validation of incoming blocks until the full chain has been downloaded and validated. 215 | 216 | ### Why not do something that doesn't require any modification to the Bitcoin Core code, like have people offer PGP-signed datadirs that you can download *a la* btcpayserver's [FastSync](https://github.com/btcpayserver/btcpayserver-docker/tree/master/contrib/FastSync)? 217 | 218 | Distributing this sort of data outside of the core software has a number of practical downsides. If an individual or group were to encourage downloading chainstate data that is not somehow validated by the software itself, a number of security risks are introduced. 
The user begins to trust not just bitcoind, but also the supplier of the data. The supplier of that data then needs to be scrutinized in addition to the software itself. 219 | 220 | Furthermore, existing schemes like FastSync use PGP signatures to attest to the validity of the data they offer. Such signatures are very often ignored, and convincing users to validate them reliably is a Sisyphean task. 221 | 222 | Usage of assumeutxo or a similar scheme should be *safe by default*; ultimately, the user should not have to perform any extra steps to benefit from the optimization in a secure way. 223 | 224 | 225 | ## Planning 226 | 227 | ### Okay, this assumeutxo stuff actually sounds pretty cool. What are the steps to deployment? 228 | 229 | 1. Implement the changes necessary to have multiple chainstates in use simultaneously. ([See PR](https://github.com/bitcoin/bitcoin/pull/15606)) 230 | 1. Implement the creation and usage of UTXO snapshots via `dumptxoutset` and `loadtxoutset` along with a hardcoded `assumeutxo` hash. ([See PR](https://github.com/bitcoin/bitcoin/pull/15606)) 231 | 1. Allow time for sophisticated end users to manually test snapshot usage via RPC. 232 | 1. Research an effective snapshot storage and distribution scheme. 233 | 1. Implement and deploy the decided-upon P2P snapshot distribution mechanism. 234 | 1. Let some time pass. 235 | 1. Consider whether or not a consensus change supporting a UTXO set hash makes any sense. 236 | 237 | ### If I want to help, what should I do next? 238 | 239 | Review the code [here](https://github.com/bitcoin/bitcoin/pull/15606). Parts of it can and may be split out given the size of the current change, and your input there would be appreciated. 240 | 241 | ### How will this work with [accumulators](https://eprint.iacr.org/2018/1188), [UHS](https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2018-May/015967.html), or [utreexo](https://www.youtube.com/watch?v=edRun-6ubCc) if those things come around?
242 | 243 | If some alternate, space-saving scheme for representing the UTXO set becomes viable, it'll dovetail with assumeutxo nicely. It's easy to imagine that the assumeutxo value could simply become a merkle root of the utreexo forest, or the hash of an accumulator value. UTXO snapshots would reduce to just a few kilobytes instead of multiple gigabytes. The background IBD/validation feature will still be quite useful as we'll still want to do full validation in the background. 244 | 245 | 246 | ## Implementation questions 247 | 248 | ### How should we allocate memory (`-dbcache`) between the assumed-valid and background validation coins views? 249 | 250 | From Sjors: 251 | 252 | > I think it should allocate most memory to catching up from snapshot to the tip. This could be more than 6 months away. Once caught up, flush and allocate most memory to syncing from genesis. If a node is restarted during the process, sync the headers, if it's more than 24 hours behind, give most resources to catching up to tip, otherwise give them to catching up to the snapshot. 253 | > 254 | > For the very first version at least we need to make sure memory usage for both UTXO sets doesn't exceed `-dbcache + -maxpool`. 255 | 256 | At the moment, the draft implementation does a [70/30 split](https://github.com/bitcoin/bitcoin/pull/15606/commits/83f13a754372579cd13a45a2052fd4e42ed24632#diff-c865a8939105e6350a50af02766291b7R1476) between background validation and snapshot. 257 | 258 | ### Why allow loading a snapshot through RPC - shouldn't it just be a startup command? 259 | 260 | Loading the snapshot as a startup parameter would allow us to simplify various things (like management of chainstate data structures), but if we're eventually going to transmit snapshots over the P2P network, we'll need to have logic in place that allows loading snapshots after startup has completed. 
I think we should be sure to test this kind of operation before this feature sees default usage, and so `loadtxoutset` seems like the right approach during the RPC phase. 261 | -------------------------------------------------------------------------------- /proposal/feedback-aj-towns.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jamesob/assumeutxo-docs/f2dfae07e8752016d1f47b84e496a7c1c2bf65b6/proposal/feedback-aj-towns.txt -------------------------------------------------------------------------------- /proposal/feedback-david-harding.txt: -------------------------------------------------------------------------------- 1 | --npejdeiydstzb5t6 2 | Content-Type: text/plain; charset=us-ascii 3 | Content-Disposition: inline 4 | Content-Transfer-Encoding: quoted-printable 5 | 6 | Hi, James. Some quick thoughts on the text are below. Sorry that some 7 | of it is a bit rambly, I don't have time to clean it up now and I 8 | figured you'd want comments sooner than later. 9 | 10 | -Dave 11 | 12 | On Tue, Apr 16, 2019 at 10:27:10AM -0400, James O'Beirne wrote: 13 | > ``` 14 | >=20 15 | >=20 16 | > _| 17 | > _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_| _|_| _| _| _|_= 18 | |_|_| _| _| _|_| 19 | > _| _| _|_| _|_| _| _| _| _| _| _|_|_|_| _| _| _= 20 | | _|_| _| _| 21 | > _| _| _|_| _|_| _| _| _| _| _| _| _| _| _= 22 | | _| _| _| _| 23 | > _|_|_| _|_|_| _|_|_| _|_|_| _| _| _| _|_|_| _|_|_| = 24 | _|_| _| _| _|_| 25 | >=20 26 | > ``` 27 | 28 | This is hard to read on both my console and on the gist. 29 | 30 | >=20 31 | > ## Abstract 32 | >=20 33 | > An implementation and deployment plan for `assumeutxo` is proposed, which= 34 | uses serialized UTXO sets to substantially reduce the amount of time neede= 35 | d to bootstrap a usable Bitcoin node with insignificant changes in security. 36 | >=20 37 | > #### Design goals 38 | >=20 39 | > 1. 
Provide a workable alternative to the unsustainable initial block 40 | > download process, 41 | > 1. provide a realistic avenue for non-hobbyist users to run a fully 42 | > validating node, 43 | 44 | I like the second point, but I think the first point might be 45 | controversial. If IBD is truly unsustainable, then Bitcoin will become 46 | a trusted system someday because no new user will be able to verify the 47 | complete previous history; if instead, it's s/unsustainable/burdensome/, 48 | then it's clearer that assumeutxo is just a UX improvement. I think 49 | this would also be just as clear if the first point is deleted and you 50 | lead with the current second point. 51 | 52 | > 1. UTXO snapshots (~3.2GB) can be created and loaded via RPC in lieu 53 | > of the normal IBD process. 54 | 55 | Wouldn't this be better as a start-up option? Back before sipa 56 | optimized IBD downloading, a contributor to the project maintained a 57 | separate file/torrent containing all blocks. This could be used by 58 | simply putting the file in the Bitcoin data directory (indeed, I just 59 | git grepped the source code, and it looks like it'd still work). For 60 | historic details, see: 61 | 62 | https://bitcoin.org/bin/block-chain/README.txt 63 | https://bitcointalk.org/index.php?topic=145386.0 64 | 65 | > 1. Snapshots will be generated, stored, and transmitted by bitcoind. 66 | > To mitigate DoS concerns and storage consumption, nodes will store 67 | > subsets of FEC-split chunks spanning three snapshots (one current, two 68 | > historical) at an expected ~1.2GB storage burden. Nodes bootstrapping 69 | > will, if assumeutxo is enabled, obtain these chunks from several 70 | > distinct peers, reassemble a snapshot, and load it. The hardcoded 71 | > assumeutxo value will change from a content hash to a Merkle root 72 | > committing to the set of chunks particular to a certain snapshot.
We 73 | > may consider adding a rolling UTXO set hash for local node storage to 74 | > make accessing expected UTXO set hashes faster, and may augment the 75 | > assumeutxo commitment with its value. 76 | 77 | I think this list of improvements to the bare-bones proposal in the 78 | previous list item might benefit from being split into a sub-list, e.g.: 79 | 80 | 1. Snapshots will be generated, stored, and transmitted by bitcoind. 81 | 2. Nodes will store subsets of snapshots 82 | 3. ... 83 | 84 | The reason is that several of these things are completely independent 85 | and so can be implemented separately (e.g. storing snapshots vs 86 | merklized root vs rolling commitment). 87 | 88 | On the specific proposals, I feel that a rolling commitment that's 89 | dumped to debug.log (or a specific db) is required for the initial 90 | version of assumeutxo. Otherwise, it'll require a considerable amount 91 | of work for most people to verify a UTXO commitment. E.g., currently if 92 | you tell me that the UTXO set hash for block 555555 is 01234...cdef, I 93 | can only verify that by running: 94 | 95 | hash=$( bitcoin-cli getblockhash ) 96 | bitcoin-cli invalidateblock $hash 97 | bitcoin-cli gettxoutsetinfo 98 | bitcoin-cli reconsiderblock $hash 99 | 100 | That would take hours even on my fast server and would make my node 101 | unusable in most practical ways until the operation finished. That's 102 | not a deal-breaker for the proposal, as many devs have spare machines 103 | they can do this on, but I think we really want to have as many people 104 | as possible being able to check utxo hashes. To that end, I think we 105 | need rolling commitments (e.g. as described by sipa) that are 106 | automatically printed to debug.log or stored in a db. I also think 107 | sipa's proposal (e.g. the ECC idea) automatically takes care of allowing 108 | verification of partial downloads; e.g., for each UTXO entry you 109 | receive, you can ensure it's part of your UTXO commitment.
This is
110 | arguably better than torrent-style merkle trees that have to choose a
111 | tradeoff between how many leaf nodes to have and how much overhead to
112 | create by distributing those leaves.
113 | 
114 | Keeping historical snapshots seems weird to me. What's the motivation
115 | for that? It seems to me that a version 0.25 node should just keep the
116 | snapshot for whatever is the highest commitment it has; if a 0.24 node
117 | tries to connect to it, 0.25 says "sorry, I can't help you". The 0.24
118 | node keeps trying different nodes until it finds a fellow 0.24 node that
119 | does have the snapshot.
120 | 
121 | > 1. (Far down the road) a consensus commitment to the UTXO set hash at
122 | > a given height will be considered.
123 | 
124 | Although I guess maybe you have to mention this, this is completely
125 | different than the trust model you otherwise discuss in this document.
126 | 
127 | > Given its linear scaling characteristics, initial block download may,
128 | > decades or centuries from now, become infeasible to complete for users
129 | > without considerable hardware resources.
130 | 
131 | That assumes hardware performance and bandwidth availability doesn't
132 | improve at least linearly. (And, further, that we run out of techniques
133 | for improving performance in software.) The problem *today* is that
134 | early linear growth can represent large percentage increases; e.g.,
135 | Bitcoin's current consensus rules could allow the chain to easily grow
136 | by 100GB a year. When the chain is currently only ~200GB, that's 50%.
137 | But in 18 years, the chain would be 2TB and it'd be only 5% growth and
138 | only 5% additional burden for IBD users. If hardware/bandwidth improves
139 | by more than 5% in that year, IBD actually becomes easier for users
140 | then.
141 | 
142 | > ### How does snapshot loading work?
143 | >
144 | > When a snapshot is loaded, it is deserialized into a full chainstate
145 | data structure, which includes a representation of the block chain and
146 | UTXO set (both on disk and cached in memory). This lives alongside the
147 | original chainstate that was extant before loading the snapshot. Before
148 | accepting a loaded snapshot, a headers chain must be retrieved from the
149 | peer network which includes the block hash of the last block encompassed
150 | by a snapshot (its "base").
151 | 
152 | Shouldn't this block header (or at least its hash) be hardcoded into the
153 | software besides the assumeutxo hash? Otherwise the snapshot could
154 | really be for block 555555 but Mallory could give me a snapshot pairing
155 | the honest utxo set with a dishonest block header (e.g. for block
156 | 555556), causing me to reject blocks that contained transactions
157 | spending from utxos created in 555556, putting me on my own consensus
158 | failure hard fork.
159 | 
160 | Although I can't think of why this is important for the other metadata,
161 | it does seem like all of it should be committed to in some way that
162 | prevents third-party modification.
163 | 
164 | > After the snapshot chainstate reaches network tip, the original
165 | > chainstate resumes the initial block download from before the snapshot
166 | > was loaded in the background. This "background validation" process
167 | > happens asynchronously from use of the active (snapshot) chainstate,
168 | > allowing the system to service, for example, wallet operations. The
169 | > purpose of this background validation is to retrieve all block files
170 | > and fully validate the chain up to the start of the snapshot.
171 | 
172 | I understand that some people want this, but I think it's not very
173 | useful.
One of the really nice benefits of assumeutxo by itself is that it
174 | eliminates the wasted bandwidth and CPU of people verifying history that
175 | they probably don't care about. I think it'd be great to have an option
176 | to verify old history (e.g. just like current -reindex-chainstate), but
177 | it doesn't seem to me like it should be the default.
178 | 
179 | > Currently, Bitcoin ships with hardcoded assumevalid values. These
180 | > values identify certain blocks
181 | 
182 | This is confusingly worded since Bitcoin Core only contains a single
183 | assumevalid value per release.
184 | 
185 | > The idea being that these blocks are buried under so much work that
186 | > validity can be assumed on the basis of blockhash alone.
187 | 
188 | Noooooooooo that's not at all the idea. What you're describing is trust
189 | in most-PoW, AKA miner trust; if that's how assumevalid worked, I'd quit
190 | Bitcoin and I think a lot of other people would too.
191 | 
192 | With assumevalid, you assume that the people who peer review your
193 | software are capable of running a full node that verifies every block up
194 | to and including the assumed-valid block. This is no more trust than
195 | assuming that the people who peer review your software know the
196 | programming language it's written in and understand the consensus rules;
197 | indeed, it's arguably less trust because *nobody* completely understands
198 | a complex language like C++ and *nobody* probably understands every
199 | possible nuance of the consensus rules---yet almost anyone technical can
200 | start a node with -noassumevalid, wait a few hours, and check that
201 | `bitcoin-cli getblock $assume_valid_hash` returns a
202 | `"confirmations"` field that's not -1.
203 | 
204 | > ### Okay, so there might be a theoretical equivalence, but are there any
205 | *practical* security differences with assumeutxo (vs. assumevalid)?
206 | >
207 | > Yes, there is one practical security difference.
Currently, if I wanted
208 | to trick someone into thinking I had coins that I don't on the honest
209 | network, I'd have to
210 | >
211 | > - get them to start bitcoind with a bad `-assumevalid=` parameter,
212 | > - isolate their node from the honest network to prevent them from getting
213 | a valid headers chain, and
214 | > - build a proof-of-work-compatible chain that includes the existing
215 | [checkpoints](https://github.com/bitcoin/bitcoin/blob/91a25d1e711bfc0617027eee18b9777ff368d6b9/src/chainparams.cpp#L146-L160).
216 | 
217 | 
218 | The first of those steps is not required, the second is oddly worded,
219 | and the third is really a DoS-prevention mechanism not a security
220 | mechanism.
221 | 
222 | 1. If Alice starts Bitcoin Core with the default assumevalid
223 | configuration and that node finds a most-PoW chain that doesn't contain
224 | the assumevalid hash, it'll use that chain and just verify all historic
225 | scripts.
226 | 
227 | 2. "prevent them from getting a valid headers chain" is confusing
228 | because Bitcoin Core will reject any headers that aren't valid by
229 | headers-only standards. I think maybe you're trying to say, "prevent
230 | them from getting the headers chain for the best block chain (most-PoW
231 | valid block chain)".
232 | 
233 | 3. Checkpoints only exist today to prevent owners of modern ASICs from
234 | creating an infinite number of long difficulty-1 headers chains and
235 | spamming them at newly-started clients, a DoS attack. Although they do
236 | protect early chain history, current developers don't like that
237 | "feature" and would prefer to get rid of them. They have proposals for
238 | that which require consensus changes, although I think they'd prefer to
239 | find a local-only solution.
240 | 
241 | > ### It seems like this is an argument for including a commitment to the
242 | assumeutxo value in a place where it could be enforced by consensus, say in
243 | the block headers.
Should we do that?
244 | 
245 | Huh, what? No matter where you put commitments, anyone who can modify
246 | your software can get you to accept a backdoored utxo set.
247 | 
248 | > ### Should we even run a background validation sync? If we accept the
249 | assumeutxo security model, why even do the IBD? If IBD isn't scalable long
250 | term, what's the point?
251 | >
252 | > If we introduce assumeutxo with snapshots but do not perform IBD in
253 | > the background, it's easy to imagine that almost anyone setting up a
254 | > node will do so with a UTXO snapshot (since it's much quicker than
255 | > conventional IBD), run using an assumed-valid chain, and will present
256 | > itself to the network as a pruned node. In the limit, that results in
257 | > an absence of nodes serving historical blocks to the network. This
258 | > certainly isn't what we want, so it seems prudent to keep the
259 | > background IBD on as a default.
260 | 
261 | I don't understand this argument. If you just want to serve historic
262 | blocks, you don't need to verify them. You can put them in a torrent
263 | (as has been done before). Or you can simply download them, verify
264 | they're part of the best headers chain, and share them without
265 | verification.
266 | 
267 | > Assumeutxo is a performance optimization. If we removed the IBD
268 | > process in lieu of it, that would be a change in Bitcoin's security
269 | > model.
270 | 
271 | That new security model is in effect from the time you finish
272 | assumeutxo+short-ibd until the time you finish genesis+full-ibd. If
273 | it's good enough for those hours/days/weeks/whatever, why isn't it good
274 | enough all the time?
275 | 
276 | > This is inconvenient, but in the meantime we could modify
277 | > `gettxoutsetinfo` to accept a height and at least abstract away the
278 | > manual chain-mangling.
279 | 
280 | I don't think that works, unless you specify the height of a future
281 | block and that triggers gettxoutsetinfo to run on that block.
282 | 
283 | The problem is that spent UTXO entries are removed from the UTXO db
284 | after each block, so the current db doesn't contain all of the data of
285 | previous versions and so a hash can't be generated without rolling back
286 | the chainstate (applying the undo files) by invalidating subsequent
287 | blocks.
288 | 
289 | ***
290 | 
291 | That's all I have. Thanks for working on this!
292 | 
293 | --------------------------------------------------------------------------------
/proposal/feedback-sjors-provoost.txt:
--------------------------------------------------------------------------------
1 | Regarding memory usage:
2 | 
3 | I think it should allocate most memory to catching up from snapshot to the tip. This could be more than 6 months away.
Once caught up, flush and allocate most memory to syncing from genesis. If a node is restarted during the process, sync the headers; if it's more than 24 hours behind, give most resources to catching up to tip, otherwise give them to catching up to the snapshot.
4 | 
5 | For the very first version at least we need to make sure memory usage for both UTXO sets doesn't exceed `-dbcache + -maxmempool`.
6 | 
7 | > Presumably, we could have the code automatically generate snapshots every 6 months or so.
8 | 
9 | At fixed block intervals, the snapshots will immediately be useful even if they're not referenced in a release yet. They can be used for local backups to recover from block & chainstate data corruption. Each node can store the hash in a simple text file. When calling -reindex(chainstate) it looks for that text file. For pruned nodes -reindex could even just go back to the snapshot instead of all the way to genesis. Similarly this can be helpful when a node needs to undo pruning to rescan an old wallet.
10 | 
11 | This also has the advantage that people can anticipate what the next release will contain. The trade-off there is lack of freshness, because it's unlikely a release will be immediately after a scheduled snapshot. However we can also release a minor version at more regular intervals.
12 | 
13 | Regular snapshots are also useful for availability:
14 | Once nodes have their maximum of 3 snapshots, they could randomly decide which one to toss out to make room for the 4th snapshot. That way old snapshots will remain at least somewhat available. Otherwise if for whatever reason Bitcoin Core stops releasing new versions for a few years, the feature would become unavailable for new users.
15 | 
16 | In your explanation of the security model, you could emphasize that _all_ new users verify the assumeutxo value, because their nodes will stop - after some time - if it's incorrect.
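That "nodes will stop" behaviour could be sketched as follows (all names here are hypothetical and this is not Bitcoin Core's actual implementation): once background validation has replayed the chain from genesis up to the snapshot's base block, the node compares the UTXO set hash it computed itself against the hardcoded value and refuses to continue on a mismatch.

```python
# Illustrative sketch only -- function and constant names are hypothetical,
# not Bitcoin Core's actual code or APIs.
ASSUMED_UTXO_HASH = "01234...cdef"  # hardcoded assumeutxo value for some height

def on_background_ibd_reached_snapshot_base(computed_utxo_hash: str) -> bool:
    """Called once background validation has replayed the chain from
    genesis up to the snapshot's base block."""
    if computed_utxo_hash != ASSUMED_UTXO_HASH:
        # The snapshot the node bootstrapped from was bogus: shut down
        # rather than keep operating on an unverified state.
        raise SystemExit("assumeutxo hash mismatch: snapshot was invalid")
    return True  # snapshot retroactively fully validated
```

The point being made is that every new user's node eventually performs this check, so a bad hardcoded value cannot go undetected for long.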
When describing the attack with a hypothetical -assumeutxo param, maybe point out that this attack also only works if you trick the user before their node finds out what happened.
17 | 
18 | Regarding backups - or at least easier recovery from chainstate corruption, and unpruning - I think that use case is important enough to find a solution for the phishing risk.
19 | 
20 | The backup / unprune feature could be abused, but if we use a file rather than a parameter, then at least the attacker has to convince the victim to download a file.
21 | 
22 | We could use a binary format, so an attacker can't tell the user "open notepad and enter this magic number in a new file".
23 | 
24 | We can check all entries in the backup file against the header chain and abort if any don't match.
25 | 
26 | We could encrypt the backup using something derived from the main wallet seed, though not all nodes have wallets.
27 | 
28 | We could encrypt using something machine specific, but this doesn't work if the user restores a backup on a new machine because the old one is broken.
29 | --------------------------------------------------------------------------------