├── blockchain1
├── blockchain2
├── blockchain3
└── README.md


/blockchain1:
--------------------------------------------------------------------------------
#!/usr/bin/env python2
#-*-python-*-

import random

class Link():
    def __init__(self, parent, data):
        self.addr = random.randint(0, 256)
        self.parent = parent
        self.data = data

    def show(self):
        print('+---------------+')
        print('| parent | addr |')
        # Compare against None so that a parent address of 0 still prints.
        if self.parent is not None:
            p = hex(self.parent)
        else:
            p = 'None'
        print('| %s | %s |' % (p, hex(self.addr)))
        print('+---------------+')
        print('|%-15s|' % (self.data))
        print('+---------------+')

l1 = Link(None, 'Amy pays Joe $5')
l2 = Link(l1.addr, 'Joe pays Amy $7')
l3 = Link(l2.addr, 'Joe pays Lou $1')

print('unhacked')
for l in [l1, l2, l3]:
    l.show()

# In a plain linked list the hacker only has to repoint one parent.
lhacker = Link(l1.addr, 'Amy -> Joe $1e6')
l2.parent = lhacker.addr

print('hacked')
for l in [l1, lhacker, l2, l3]:
    l.show()

--------------------------------------------------------------------------------
/blockchain2:
--------------------------------------------------------------------------------
#!/usr/bin/env python2
#-*-python-*-

import hashlib

class Link():
    def __init__(self, parent, data):
        self.parent = parent
        self.data = data
        # The address is a hash of the link's own contents.
        self.addr = hashlib.md5(str(self.parent) + str(self.data)).hexdigest()

    def show(self):
        print('+---------------------------------------------------------------------+')
        print('| parent                           | addr                             |')
        if self.parent:
            p = self.parent
        else:
            p = 'None'
        print('| %-32s | %s |' % (p, self.addr))
        print('+---------------------------------------------------------------------+')
        print('|%-69s|' % (self.data))
        print('+---------------------------------------------------------------------+')

l1 = Link(None, 'Amy pays Joe $5')
l2 = Link(l1.addr, 'Joe pays Amy $7')
l3 = Link(l2.addr, 'Joe pays Lou $1')

print('unhacked')
for l in [l1, l2, l3]:
    l.show()

lhacker = Link(l1.addr, 'Amy -> Joe $1e6')
# An address is only computed in __init__, so every later link has to be
# rebuilt to pick up its new parent address (and get a new address itself).
l2 = Link(lhacker.addr, l2.data)
l3 = Link(l2.addr, l3.data)

print('hacked')
for l in [l1, lhacker, l2, l3]:
    l.show()

--------------------------------------------------------------------------------
/blockchain3:
--------------------------------------------------------------------------------
#!/usr/bin/env python2
#-*-python-*-

import hashlib

class Link():
    def __init__(self, parent, data, difficulty_level):
        self.parent = parent
        self.data = data

        # "Mine" the link: bump the nonce until the hash is small enough.
        nonce = 0
        addr = hashlib.md5(str(self.parent) + str(self.data) + str(nonce)).hexdigest()
        while int(addr, 16) > difficulty_level:
            nonce += 1
            addr = hashlib.md5(str(self.parent) + str(self.data) + str(nonce)).hexdigest()

        self.nonce = nonce
        self.addr = addr

    def show(self):
        print('+---------------------------------------------------------------------+')
        print('| parent                           | addr                             |')
        if self.parent:
            p = self.parent
        else:
            p = 'None'
        print('| %-32s | %s |' % (p, self.addr))
        print('+---------------------------------------------------------------------+')
        print('| nonce                                                               |')
        print('| %-67d |' % (self.nonce))
        print('+---------------------------------------------------------------------+')
        print('|%-69s|' % (self.data))
        print('+---------------------------------------------------------------------+')

easy     = 0xa0000000000000000000000000000000
hard     = 0x0a000000000000000000000000000000
veryhard = 0x000000000000000a0000000000000000
difficulty_level = easy

l1 = Link(None, 'Amy pays Joe $5', difficulty_level)
l2 = Link(l1.addr, 'Joe pays Amy $7', difficulty_level)
l3 = Link(l2.addr, 'Joe pays Lou $1', difficulty_level)

print('unhacked')
for l in [l1, l2, l3]:
    l.show()

difficulty_level = veryhard
lhacker = Link(l1.addr, 'Amy -> Joe $1e6', difficulty_level)
# Every later link must be re-mined, because its parent address changed.
l2 = Link(lhacker.addr, l2.data, difficulty_level)
l3 = Link(l2.addr, l3.data, difficulty_level)

print('hacked')
for l in [l1, lhacker, l2, l3]:
    l.show()

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# simpleblockchain

## Linked Lists

Bitcoin and other cryptocurrencies are a huge deal, and part of what
makes them possible is the concept of a blockchain. The blockchain
seems magical, but really it's just a data structure with a few
special features. Let's look at how it works.

The first basic concept to understand is a linked list where each
child points to the parent it came from. (Note that a parent might
have more than one child, but a child can have no more than one
parent.)

    +--------+     +----------+     +----------+
    | link 0 | <---+-* link 1 | <---+-* link 2 | <---...
    +--------+     +----------+     +----------+

A blockchain is a stronger form of a linked list. In a regular linked
list, you can't tell if someone moved the pointers. With a blockchain,
you can.

Let's build up to that feature. We'll start by changing our
representation of a linked list to be a little more concrete. We want
a data payload in there. Also, instead of arrows, let's use addresses:

    +---------------+     +---------------+     +---------------+
    | parent | addr |     | parent | addr |     | parent | addr |
    |   NA   | 0xa3 |     |  0xa3  | 0x24 |     |  0x24  | 0x1b |
    +---------------+     +---------------+     +---------------+ ...
    |     DATA      |     |     DATA      |     |     DATA      |
    +---------------+     +---------------+     +---------------+

Note that the "parent" value of each node (except the first, which has
no parent) is the address value of the parent node.

The addresses are written in hexadecimal to be reminiscent of
addresses in RAM. In an instantiated linked list in a real running
program, the addresses are usually physical--they point to a
particular memory location. In our case they are merely logical. Look
at a link and read the parent address. Then find the node that has
that address.

Here is some Python code to simulate this simple linked list
situation, with some financial transactions as the data payload:

```python
import random

class Link():
    def __init__(self, parent, data):
        self.addr = random.randint(0, 256)
        self.parent = parent
        self.data = data

    def show(self):
        print('+---------------+')
        print('| parent | addr |')
        # Compare against None so that a parent address of 0 still prints.
        if self.parent is not None:
            p = hex(self.parent)
        else:
            p = 'None'
        print('| %s | %s |' % (p, hex(self.addr)))
        print('+---------------+')
        print('|%-15s|' % (self.data))
        print('+---------------+')

l1 = Link(None, 'Amy pays Joe $5')
l2 = Link(l1.addr, 'Joe pays Amy $7')
l3 = Link(l2.addr, 'Joe pays Lou $1')

for l in [l1, l2, l3]:
    l.show()
```

This implementation is fine for storing data among people you
trust. But if it's editable by the public, you have a security
problem.
Anyone can alter transaction data or insert/remove items at
any time by picking some link L, creating a fake link whose parent is
L, and making the child of L point to the fake link instead:

               +---------------+
               | parent | addr |
               |  0xa3  | 0xfa |
               +---------------+
               |    HACKER     |
               +---------------+

    +---------------+     +---------------+     +---------------+
    | parent | addr |     | parent | addr |     | parent | addr |
    |   NA   | 0xa3 |     |  0xfa  | 0x24 |     |  0x24  | 0x1b |
    +---------------+     +---------------+     +---------------+ ...
    |     DATA      |     |     DATA      |     |     DATA      |
    +---------------+     +---------------+     +---------------+

This can be accomplished with some code like this:

```python
lhacker = Link(l1.addr, 'Amy -> Joe $1e6')
l2.parent = lhacker.addr

for l in [l1, lhacker, l2, l3]:
    l.show()
```

All of the above code is in the file `blockchain1`.

## Security Measure #1

We're going to take two security measures that will work together to
ensure no one can edit the blockchain. The first security measure is
changing how the addresses work.

Right now, the address is basically a random number, which represents
some location in RAM or meatspace. This address is something *about*
the link, but is not a fundamental property *of* the link. We're going
to change that. Instead of a location, we're going to make the address
a hash of the link itself.

(If you don't know what a hash is, it's basically a one-way
function. You hand the hash function a bunch of data and it hands you
back an effectively unique fingerprint. There's no practical way to
recover the data from the fingerprint. Also, minor tweaks to the data
result in big changes in the fingerprint, so you can't "fish around"
by subtly modifying the data until you get close to the right hash.)
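
To see the "minor tweaks, big changes" property for yourself, here is
a minimal sketch using `hashlib.md5` from Python's standard library.
The helper names `fingerprint` and `differing_hex_digits` are made up
for this illustration, and the text is encoded to bytes so that the
snippet also runs on Python 3:

```python
import hashlib

def fingerprint(text):
    """Return the MD5 hex digest of a text string (encoded as UTF-8)."""
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def differing_hex_digits(a, b):
    """Count the positions at which two equal-length hex strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

original = fingerprint('Amy pays Joe $5')
tweaked = fingerprint('Amy pays Joe $6')  # a single character changed

print(original)
print(tweaked)
print('hex digits that differ: %d of %d'
      % (differing_hex_digits(original, tweaked), len(original)))
```

Running it should show two fingerprints with no obvious relationship,
even though the inputs differ by one character.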

It's easy to alter our existing linked list code to use addresses that
are hashes of the fundamental properties of the link itself:

```python
import hashlib

class Link():
    def __init__(self, parent, data):
        self.parent = parent
        self.data = data
        self.addr = hashlib.md5(str(self.parent) + str(self.data)).hexdigest()

    def show(self):
        print('+---------------------------------------------------------------------+')
        print('| parent                           | addr                             |')
        if self.parent:
            p = self.parent
        else:
            p = 'None'
        print('| %-32s | %s |' % (p, self.addr))
        print('+---------------------------------------------------------------------+')
        print('|%-69s|' % (self.data))
        print('+---------------------------------------------------------------------+')

l1 = Link(None, 'Amy pays Joe $5')
l2 = Link(l1.addr, 'Joe pays Amy $7')
l3 = Link(l2.addr, 'Joe pays Lou $1')

for l in [l1, l2, l3]:
    l.show()
```

The links have gotten wider, so let's stack them instead of lining
them up horizontally.
When I ran this, I got these values:

    +---------------------------------------------------------------------+
    | parent                           | addr                             |
    | None                             | 967436fb856f5bd2684310f5fae41773 |
    +---------------------------------------------------------------------+
    |Amy pays Joe $5                                                      |
    +---------------------------------------------------------------------+

    +---------------------------------------------------------------------+
    | parent                           | addr                             |
    | 967436fb856f5bd2684310f5fae41773 | e9b48dab1b3aac47e943eedd67f87d13 |
    +---------------------------------------------------------------------+
    |Joe pays Amy $7                                                      |
    +---------------------------------------------------------------------+

    +---------------------------------------------------------------------+
    | parent                           | addr                             |
    | e9b48dab1b3aac47e943eedd67f87d13 | a46d8ae4f32badac6f9280328e0fb954 |
    +---------------------------------------------------------------------+
    |Joe pays Lou $1                                                      |
    +---------------------------------------------------------------------+

It still works exactly like a linked list--the "parent" value of the
second link still points to the address value of the first link. But
that first address is now something fundamental to what that first
link *is*. When I rerun the above code, I no longer get random
addresses, I get **the exact same** addresses. That's because the
address is computed from the link itself, which hasn't changed.

How is this any more secure? Look above at the simple linked list
hacking case. In order to insert a link, I had to change the "parent"
attribute of a link. That was a simple edit in that case. But now if
you change a link's attributes, **you also change the link's
address**.

When someone changes anything about a link, that forces the address to
change.
That means that the child of that link has to change *its*
parent address. Which means that child's address also changes, so the
grandchild must also change. Any change anywhere in the blockchain
must ripple all the way to the end.

How can we do that in code? It's still pretty simple:

```python
lhacker = Link(l1.addr, 'Amy -> Joe $1e6')
# An address is only computed in __init__, so every later link has to be
# rebuilt to pick up its new parent address (and get a new address itself).
l2 = Link(lhacker.addr, l2.data)
l3 = Link(l2.addr, l3.data)

for l in [l1, lhacker, l2, l3]:
    l.show()
```

Only a couple of links followed the insertion point in this demo
blockchain, but by the argument above you can see that we might have
to alter hundreds, thousands, millions or billions of links in the
chain. *All* of them, from the point of alteration to the very end.

All of the code from this section is in `blockchain2`.

## Security Measure #2

Having to alter all subsequent history is a pain, but it's not
impossible. That's why we introduce security measure #2:
proof-of-work. In normal Bitcoin operation, this is also called
"mining", so let's call it that. Anyone who wants to generate an
address for a real blockchain link needs to do this mining step for
each link.

The mining works a lot like the lottery. With the lottery, any
particular person has a very, very low chance of winning, but the
chance of *someone* winning is relatively high. That's why blocks are
actually created, even though it's very unlikely a particular hacker
will be able to do it enough times to corrupt the entire chain.

The hacker has a problem. If s/he alters a single block in history,
all subsequent blocks need their addresses recomputed. But in order to
do that recomputation, the "mining" needs to happen. Succeeding at any
given mining step is phenomenally unlikely, so succeeding at
dozens/hundreds/thousands/millions is basically impossible.
This is
why Bitcoin says that once your transaction is some number of blocks
deep in the chain, it's a done deal. Re-mining that number of blocks
is, in practice, beyond any hacker's ability.

How does the blockchain make mining difficult? By forcing the hashed
address of the link to fall below a given value. Remember that the
hash function turns a pile of data into a number with a given number
of digits. There's no way to look at the data and predict what that
number is without actually computing it. And if you change the data
slightly, you get a completely different number.

Let's say that instead of that huge hex string, the hash function
just produces a 3 digit decimal number, 000 to 999. To deliberately make
mining difficult, we could require that the miner gets a hash address
of 000 to **0**99. So here's what the miner does when trying to add a
link to the blockchain:

1. hash the link
2. if the value of the hash is 000 to 099, go to 5
3. alter the link slightly so we get a different hash
4. hash the altered link and go to 2
5. success

That's the pseudocode; here's the real code:

```python
import hashlib

class Link():
    def __init__(self, parent, data, difficulty_level):
        self.parent = parent
        self.data = data

        # "Mine" the link: bump the nonce until the hash is small enough.
        nonce = 0
        addr = hashlib.md5(str(self.parent) + str(self.data) + str(nonce)).hexdigest()
        while int(addr, 16) > difficulty_level:
            nonce += 1
            addr = hashlib.md5(str(self.parent) + str(self.data) + str(nonce)).hexdigest()

        self.nonce = nonce
        self.addr = addr

    def show(self):
        print('+---------------------------------------------------------------------+')
        print('| parent                           | addr                             |')
        if self.parent:
            p = self.parent
        else:
            p = 'None'
        print('| %-32s | %s |' % (p, self.addr))
        print('+---------------------------------------------------------------------+')
        print('| nonce                                                               |')
        print('| %-67d |' % (self.nonce))
        print('+---------------------------------------------------------------------+')
        print('|%-69s|' % (self.data))
        print('+---------------------------------------------------------------------+')

easy     = 0xa0000000000000000000000000000000
hard     = 0x0a000000000000000000000000000000
veryhard = 0x000000000000000a0000000000000000
difficulty_level = easy

l1 = Link(None, 'Amy pays Joe $5', difficulty_level)
l2 = Link(l1.addr, 'Joe pays Amy $7', difficulty_level)
l3 = Link(l2.addr, 'Joe pays Lou $1', difficulty_level)

for l in [l1, l2, l3]:
    l.show()
```

The "difficulty level" is exactly analogous to the "000 to 099" range
we specified in our example above. The "nonce" is how to "alter the
link slightly" to get a different hash value. We just keep altering
and hashing, altering and hashing until we happen to get a value at or
below the difficulty level.
The lower that level is, the harder it is to
succeed.

How hard is it? The hash I'm using here has 32 hex digits, so there
are 16^32 = 2^128 possible values. The "easy" difficulty level is well
over half of that, with a chance of 62.5% that I'll get it on any
given try. The "hard" difficulty level is much smaller--around a 4%
chance on any given try. The "very hard" difficulty level will succeed
around 5.4e-17% of the time. That's pretty unlikely. Put another way,
if p is the chance of success on one try, the chance of failing n
times in a row is (1 - p)^n, so at the "very hard" difficulty level
I'd have to rehash about ln(2)/p, or roughly **1.3e+18 times**, just
to get a 50% chance of succeeding.

This code is in file `blockchain3`. Try running it with different
values for the difficulty level. Watch your CPU usage at the same
time. Now you know why it's called "proof of work". You can also see
how many attempts were made to rehash the address by looking at the
nonce value of the output. A high value here tells you it had to try a
lot of times before the mining succeeded.

It's hard to demo the hacker portion of this. For Bitcoin, you have
a huge number of miners running the mining code. The difficulty level
is set such that in that huge group, there's one success once every 10
minutes or so. The difficulty level is readjusted over time, usually
lowered as more computing power joins, to keep this rate approximately
constant.

But for the demo, there's only one miner. That means the difficulty
level has to be very easy or we'll never see a success. But that makes
things too easy for the "hacker" when that portion of the code
runs. To simulate this, let's change the difficulty level between the
"real" and "hacked" sections of code.

```python
difficulty_level = veryhard
lhacker = Link(l1.addr, 'Amy -> Joe $1e6', difficulty_level)
# Every later link must be re-mined, because its parent address changed.
l2 = Link(lhacker.addr, l2.data, difficulty_level)
l3 = Link(l2.addr, l3.data, difficulty_level)

for l in [l1, lhacker, l2, l3]:
    l.show()
```

On my machine, the hacker code basically just hangs. It can't
recompute a link in the chain in a reasonable time even once, let
alone twice or more. The blockchain is effectively safe.

## Omitted Stuff

There's a lot more to Bitcoin than I've presented here, which I won't
even mention.

There are also omitted issues that are more directly
blockchain-related. For instance, how does anyone know you actually
did the work to compute those hashes? Because they can check by
running the hash themselves. Checking if the answer is right takes
only a single try, so that's fast.

Or what happens if by chance someone does succeed in altering a chain
near the end, where there are fewer blocks to have to recompute? A
large hacker group could coordinate on this. Yes, they could, but keep
in mind that they would need a significant fraction of the world's
computing resources in order to keep up. There are also mechanisms
built into Bitcoin to help choose between alternate blockchains.

And what about the hash, am I cheating by using Python's `hashlib` for
that? No. A hash can be anything with the properties loosely described
as "one way" and "non-linear". I used `hashlib.md5` because I used a
similar library in my original code in another language that had fewer
choices.

--------------------------------------------------------------------------------
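
As a rough illustration of the "they can check by running the hash
themselves" point, here is a minimal sketch of what such a check might
look like for the toy links in this README. The function names
`link_hash`, `mine` and `verify` are hypothetical helpers, not part of
`blockchain3`, and the text is encoded to bytes so the snippet also
runs on Python 3. Mining may take many hashes; verifying takes one:

```python
import hashlib

def link_hash(parent, data, nonce):
    """Hash a link the way the toy miner does: md5(parent + data + nonce)."""
    text = str(parent) + str(data) + str(nonce)
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def mine(parent, data, difficulty_level):
    """Try nonces 0, 1, 2, ... until the hash is at or below the target."""
    nonce = 0
    while int(link_hash(parent, data, nonce), 16) > difficulty_level:
        nonce += 1
    return nonce

def verify(parent, data, nonce, addr, difficulty_level):
    """Check a claimed link with one hash: address matches and meets the target."""
    recomputed = link_hash(parent, data, nonce)
    return recomputed == addr and int(recomputed, 16) <= difficulty_level

hard = 0x0a000000000000000000000000000000
nonce = mine(None, 'Amy pays Joe $5', hard)
addr = link_hash(None, 'Amy pays Joe $5', nonce)

# The honest link passes; the same nonce and address with altered data do not.
print(verify(None, 'Amy pays Joe $5', nonce, addr, hard))
print(verify(None, 'Amy pays Joe $1e6', nonce, addr, hard))
```

Anyone holding the link's parent, data, nonce and address can run
`verify` without redoing the search that `mine` performed.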