├── .gitignore ├── Makefile ├── README.md ├── cavoasm.py ├── cavosim.c ├── count.s ├── differences.s ├── iloop.s └── method-of-differences.c /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | count.image 3 | differences.image 4 | iloop.image 5 | method-of-differences 6 | cavosim 7 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | CFLAGS=-g 2 | 3 | all: count.image iloop.image differences.image test 4 | clean: 5 | $(RM) count.image iloop.image differences.image method-of-differences cavosim 6 | 7 | %.image: %.s cavoasm.py 8 | ./cavoasm.py < $< > $@ 9 | 10 | test: method-of-differences cavosim differences.image 11 | -./method-of-differences | head -11 # it outputs the initial a5 12 | -./cavosim differences.image | grep 'mem\[8]' | head # it doesn't 13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Calculus Vaporis CPU Design 2 | =========================== 3 | 4 | This is a sketch of a design for a very small zero-operand 12-bit CPU 5 | in about 1000 NAND gates, or a smaller number of more powerful 6 | components (e.g. 70 lines of C). It’s my first CPU design, so it may 7 | be deeply flawed. 8 | 9 | Initially I called it the “Dumbass CPU”, but I thought that didn’t 10 | seem like a name Charles Babbage would have used. So I called it 11 | “Calculus Vaporis”, which means “counting-stone of steam” in Latin, I 12 | hope. 13 | 14 | This directory contains a simulator written in C, a few simple 15 | programs in a simple assembly language, and an assembler written in 16 | Python. Run `make` to try them out. 17 | 18 | Logical Organization 19 | -------------------- 20 | 21 | The machine has four 12-bit registers: P, A, X, and I. 22 | 23 | P is the program counter. 
It generally holds the address of the next 24 | instruction to fetch. It could actually be only 11 bits. 25 | 26 | A is the top-of-stack register. 27 | 28 | X is the second-on-stack register. 29 | 30 | I is the instruction-decoding register. It holds the instruction word 31 | currently being decoded. 32 | 33 | The machine additionally has an 11-bit address bus B, a 12-bit data 34 | bus D, and a bus operation O, which are treated as pseudo-registers 35 | for the purpose of the RTL description. The narrow memory bus is weird 36 | but simplifies the usage of immediate constants. 37 | 38 | Instruction Set 39 | --------------- 40 | 41 | The machine cycle has four microcycles: fetch 1, fetch 2, execute 1, 42 | execute 2. Fetch 1 and 2 are the same for all instructions. All 43 | instructions but @ do nothing during execute 2. The other RTL below 44 | explains what happens during execute 1. 45 | 46 | Following Python, I am using `R[-1]` to refer to the highest bit of 47 | register `R`, `R[:-1]` to refer to all but the highest bit, `R[-2]` to 48 | refer to the next-highest bit, and so on. 49 | 50 | There are seven instructions: `$`, `.`, `-`, `|`, `@`, `!`, and `nop`. 51 | They are stored one per machine word, which is grossly wasteful of 52 | memory space but makes for a simpler instruction decode cycle. 53 | Avoiding the gross waste of memory space would probably require adding 54 | at least one more register to the processor, because as it is defined 55 | now, roughly half the instructions in any code are `$`, which needs a 56 | whole machine word most of the time anyway. 57 | 58 | There are even simpler instruction sets, such as the subtract- 59 | indirect-and-branch-indirect-if-negative OISC, and the MOV machine. 60 | While these instruction sets are simpler to explain, the existing 61 | decoding logic is fairly small, about 20 out of the 1000 NAND gates. 62 | These simpler instruction sets call for a more complicated machine 63 | cycle. 
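To make the `R[-1]` / `R[:-1]` notation concrete, here is a small Python sketch of how a 12-bit word splits into those fields. The helper names are made up for illustration. `sign_extend_immediate` is one possible reading of what the `$` instruction described below does with I: copy the low 11 bits, then duplicate `I[-2]` into the high bit.

```python
NBITS = 12  # word width used throughout this design


def high_bit(r):
    """R[-1]: the highest bit of a 12-bit register value."""
    return (r >> (NBITS - 1)) & 1


def next_high_bit(r):
    """R[-2]: the next-highest bit."""
    return (r >> (NBITS - 2)) & 1


def all_but_high(r):
    """R[:-1]: everything except the highest bit, i.e. the low 11 bits."""
    return r & ((1 << (NBITS - 1)) - 1)


def sign_extend_immediate(i):
    """A[:-1] <- I[:-1], A[-1] <- I[-2], returned as one 12-bit value."""
    return (next_high_bit(i) << (NBITS - 1)) | all_but_high(i)


print(sign_extend_immediate(0b100000000001))  # the word for $1  -> 1
print(sign_extend_immediate(0b111111111111))  # the word for $-1 -> 4095 (0xFFF)
```

Read as a 12-bit two's-complement number, 0xFFF is -1, so an 11-bit immediate covers -1024 through 1023.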

- `$` is the “load immediate” instruction; it’s represented by having a
  single high bit set in a 12-bit word.  When written in code
  sequences, it is followed by the value of the other 11 bits.  It
  does:

  > X ← A, A[:-1] ← I[:-1], A[-1] ← I[-2]

  Note that this is the only way to change the value of X, by copying
  A’s old value into it.

- `.` is the “conditional call/jump” instruction.  It does:

  > if X[-1] == 0:
  >     P ← A[:-1], A[:-1] ← P, A[-1] ← 0;
  > else:
  >     A ← X;   # not very important, kind of arbitrary, maybe drop it

- `-` is the “subtract” instruction.  It does:

  > A ← X - A;

- `|` is the “NAND” instruction (named after the Sheffer stroke).  In C
  notation for bitwise operators, it does:

  > A ← ~(A & X);

- `@` is the “fetch” instruction.  It is the only instruction that needs
  both execute microcycles.  During execute 1, it does:

  > B ← A[:-1], O ← “read”;

  Then, during execute 2:

  > A ← D;

- `!` is the “store” instruction.  It does:

  > B ← A[:-1], D ← X, O ← “write”, A ← X;

- `nop` does nothing.

Code Snippets
-------------

These show that the instruction set is usable, just barely.

Call the subroutine at address `x`, passing the return address in A:

    $0 $x .

Store the return address passed in A at a fixed address `ra` (for
non-reentrant subroutines):

    $ra !

Return to the address thus passed, trashing both A and X:

    $0 $ra @ .

Decrement the memory location `sp`, used as, for example, a stack
pointer:

    $sp @ $1 - $sp !

Store a return address at the location pointed to by `sp` after
decrementing `sp`, using an address `tmp` for temporary storage:

    $tmp ! $sp @ $1 - $sp ! $tmp @ $sp @ !
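
Snippets like these can be checked mechanically. The following is a minimal Python sketch of the instruction semantics as the RTL above states them, limited to straight-line code (no `.`). The function name `run`, the `symbols` dictionary, and the addresses 100 and 200 are assumptions made for this example, not part of the design.

```python
MASK12 = 0xFFF  # 12-bit data words
MASK11 = 0x7FF  # 11-bit addresses and immediates


def run(snippet, symbols, mem, a=0, x=0):
    """Run a straight-line snippet (no `.`) and return the final (A, X)."""
    for word in snippet.split():
        if word.startswith("$"):
            name = word[1:]
            value = symbols[name] if name in symbols else int(name)
            low = value & MASK11
            # X gets the old A; A gets the immediate with bit 10 copied to bit 11
            x, a = a, low | ((low & 0x400) << 1)
        elif word == "-":
            a = (x - a) & MASK12
        elif word == "|":
            a = ~(a & x) & MASK12
        elif word == "@":
            a = mem[a & MASK11]
        elif word == "!":
            mem[a & MASK11] = x
            a = x  # the RTL above says store also copies X into A
        elif word != "nop":
            raise ValueError("unsupported word: %r" % word)
    return a, x


mem = [0] * 2048
symbols = {"sp": 100, "ra": 200}  # hypothetical addresses
mem[symbols["sp"]] = 500

run("$sp @ $1 - $sp !", symbols, mem)  # decrement the word at sp
print(mem[100])  # 499

run("$ra !", symbols, mem, a=42)  # save a return address of 42
print(mem[200])  # 42
```

Keeping the model at the instruction level, rather than the microcycle level, is a simplification: it ignores B, D, and O entirely.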

Fetch that return address from the stack, increment `sp`, and return
to the address:

    $sp @ @ $tmp ! $sp @ $-1 - $sp ! $0 $tmp @ .

Negate the value stored at a memory location `var`:

    $0 $var @ - $var !

Add the values stored at memory locations `a` and `b` with the aid of
a third temporary location `tmp`, leaving the sum in the A register:

    $0 $a @ - $tmp ! $b @ $tmp @ -

The rest of the RTL
-------------------

The instruction definitions above define what happens at the RTL level
during the instruction execution microcycles.  The RTL for the other
two microcycles of the machine cycle follows:

Fetch 1:

> B ← P, P ← P + 1, O ← “read”;

Fetch 2:

> I ← D;

The presumed memory interface semantics are:

When O ← “read”, on the next microcycle:

> D ← M[B];

When O ← “write”:

> M[B] ← D.

I don’t know enough about memory to know how realistic that is.

Translating the RTL design into gates
-------------------------------------

(This part contains a number of errors.  Hopefully it’s accurate
enough that the number of gates can be meaningfully estimated.)

We need a multiplexer attached to the input of most of the registers
to implement the RTL described earlier.  Here are the places each
thing can come from, and when:

    register  is set from            when
    A         I[:-1] sign-extended   ($, execute 1)
              P                      (., execute 1, if X[-1] == 0)
              X - A                  (-, execute 1)
              !(X & A)               (|, execute 1)
              D                      (@, execute 2)
              X                      (!, execute 1; or ., execute 1, when X[-1] == 1)
    X         A                      ($, execute 1)
    P         P + 1                  (fetch 1)
              A                      (., execute 1, if X[-1] == 0)
    I         D                      (fetch 2)
    B         P                      (fetch 1)
              A[:-1]                 (@ or !, execute 1)
    D         X                      (!, execute 1)
    O         “read”                 (fetch 1, or @, execute 1)
              “write”                (!, execute 1)

203 | 204 | At all other times, registers continue with their current values. 205 | 206 | The overall design, then, looks something like this. Some of the 207 | “wires” are N bits wide: 208 | 209 | machine(): 210 | register(P_output, P_input, P_write_enable) 211 | register(A_output, A_input, A_write_enable) 212 | register(X_output, X_input, X_write_enable) 213 | register(I_output, I_input, I_write_enable) 214 | instruction_decoder(I_output[-4:], fetch_enable, instruction_select) 215 | # everything up to execute_1 is the inputs; everything after 216 | # that is an output 217 | execute_1_controller(instruction_select, fetch_enable, X_output[-1], 218 | execute_1, A_select, jump, memory_write, 219 | send_A_to_B, X_write_enable, D_write_enable) 220 | 221 | # All of these are outputs 222 | microcycle_counter(execute_1, execute_2, fetch_1, fetch_2) 223 | fetch_2 = I_write_enable # that is, they’re two names for the same wire 224 | # the last argument is the output 225 | AND(fetch_enable, execute_2, get_A_from_D) 226 | OR_6input(get_A_from_D, ..., A_write_enable) 227 | A_input_mux(A_select, get_A_from_D, 228 | I_sign_extended, P_output, X_minus_A, X_nand_A, D, X_output, 229 | A_input) 230 | 231 | increment_P = fetch_1 232 | P_input_controller(jump, increment_P, A_output, B_output, P_input) 233 | OR(fetch_1, ???, memory_read) 234 | 235 | I_sign_extended = I[-2] || I[:-1] 236 | 237 | Most of these pieces are simple multiplexers or registers of one sort 238 | or another. The `microcycle_counter` is just a 4-bit ring counter; 239 | the `instruction_decoder` is just a 6-way decoder of its 4-bit input. 240 | However, the `execute_1_controller` requires a little more 241 | clarification. 
242 | 243 | execute_1_controller(instruction_select, instruction_is_fetch, X_highbit, 244 | execute_1, A_select, jump, memory_write, 245 | send_A_to_B, X_write_enable, D_write_enable): 246 | # The A_select output is 5 bits; it’s used to determine where 247 | # the A register gets read from if it gets read during the 248 | # execute_1 microcycle. We don’t care what value it has the 249 | # rest of the time. 250 | A_select = (get_A_from_I_sign_extended, get_A_from_P, 251 | get_A_from_X_minus_A, get_A_from_X_nand_A, 252 | get_A_from_X) 253 | # The instruction_select input is also 5 bits, representing 254 | # the five instructions that can affect A, other than fetch. 255 | instruction_select = (instruction_is_immediate, instruction_is_jump, 256 | instruction_is_subtract, instruction_is_nand, 257 | instruction_is_store) 258 | 259 | # Here’s how those outputs are computed. 260 | get_A_from_I_sign_extended = instruction_is_immediate 261 | NOT(X_highbit, X_not_highbit) 262 | AND(instruction_is_jump, X_not_highbit, get_A_from_P) 263 | get_A_from_X_minus_A = instruction_is_subtract 264 | get_A_from_X_nand_A = instruction_is_nand 265 | AND(instruction_is_jump, X_highbit, failed_jump) 266 | OR(failed_jump, instruction_is_store, get_A_from_X) 267 | 268 | jump = get_A_from_P 269 | AND(execute_1, instruction_is_store, memory_write) 270 | 271 | OR(instruction_is_store, instruction_is_fetch, memory_access) 272 | AND(execute_1, memory_access, send_A_to_B) 273 | 274 | AND(execute_1, instruction_is_immediate, X_write_enable) 275 | 276 | AND(execute_1, instruction_is_store, D_write_enable) 277 | 278 | The ALU simply has to compute `X_minus_A` and `X_nand_A`, which 279 | constantly feed into the `A_input_mux`. `X_minus_A` requires 12 full 280 | subtractors (analogous to full adders) and `X_nand_A` requires 12 NAND 281 | gates. 282 | 283 | Probable errors: 284 | 285 | - I wasn’t clear who was responsible for setting the write enables on 286 | the various registers. 
287 | - Some of the `execute_1_controller` outputs probably need to be zero 288 | when `execute_1` is zero. 289 | - I don’t know anything about memory interfaces and so the memory 290 | controller is omitted entirely. 291 | 292 | Size Estimate 293 | ------------- 294 | 295 | My initial estimates for number of 2-input NAND gates were: 296 | 297 |
                          bit-serial   12-bit parallel
    microcycle counter    28 NANDs     28
    rest of control       127          578
    registers             384          260 (no shifting needed)
    subtractor            25           204
    NAND                  1            12
    bit counter           70           0
    total                 635          1082

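
As a quick arithmetic check on the table, this Python sketch re-adds both columns. The dictionary just transcribes the rows above; the name `estimates` is made up for this example.

```python
# (bit-serial, 12-bit parallel) NAND-gate estimates, copied from the table
estimates = {
    "microcycle counter": (28, 28),
    "rest of control": (127, 578),
    "registers": (384, 260),
    "subtractor": (25, 204),
    "NAND": (1, 12),
    "bit counter": (70, 0),
}

serial_total = sum(serial for serial, _ in estimates.values())
parallel_total = sum(parallel for _, parallel in estimates.values())

print(serial_total, parallel_total)  # 635 1082
```

Both sums agree with the totals row.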
306 | 307 | I was estimating a 5-gate D latch per bit in the parallel case, or an 308 | 8-gate-per-bit master-slave D flip-flop per bit in the serial case. 309 | My microcycle counter was going to be a pair of D flip-flops, four AND 310 | gates to compute the output bits, and an OR on two of those outputs to 311 | compute the new high bit. 312 | 313 | I figured on a ripple-carry subtractor that would cost about 17 NAND 314 | gates per output bit, although I didn’t actually design one. 315 | 316 | Because of the relative paucity of N-bit-wide data paths, going 317 | bit-serial doesn’t actually save many gates, but it would slow the 318 | machine down by a substantial factor. 319 | 320 | Possible Improvements 321 | --------------------- 322 | 323 | Dropping the get-X-from-A path on skipped jumps would simplify the 324 | processor, probably without making it any harder to use. 325 | 326 | A third register (or more) on the stack wouldn’t affect the 327 | instruction set at all, but would simplify some code. For example, 328 | the code for “add the values stored at memory locations `a` and `b`, 329 | leaving the sum in the A register” would simplify from: 330 | 331 | $0 $a @ - $tmp ! $b @ $tmp @ - 332 | 333 | to: 334 | 335 | $a @ $0 $b @ - - 336 | 337 | Using one-hot encodings of the instructions would require using seven 338 | bits of the I register instead of four, but would almost eliminate the 339 | instruction decoder. (You'd still need to ensure that the instruction 340 | wasn’t $.) 341 | 342 | Alternatively, you could pack three, four, or five instructions into 343 | each 12-bit word of memory, as Chuck Moore’s c18 core does, instead of 344 | one. The "I" instruction, if encodable in lower-order positions, 345 | could simply sign-extend more bits, reducing the number of encodable 346 | immediate constants in such a case, but not to zero. 347 | 348 | You could try a different bit width. 
2048 instructions of code is 349 | close to a minimum to run, say, a BASIC interpreter. 350 | 351 | If N × N → N bit LUTs and registers are available as primitives, the 352 | total complexity of the device could drop by an order of magnitude. 353 | For example, with 5 × 5 → 5 bit LUTs, you could construct a 4-bit 354 | subtractor with borrow-in and borrow-out as a single LUT, and chain 355 | three of them together to get a 12-bit subtractor. Even if you have 356 | only 4 → 1 bit LUTs, like on a normal FPGA, you can implement a full 357 | subtractor in two LUTs attached to the same inputs, instead of 17 (or 358 | however many) discrete NAND gates. 359 | 360 | If the stack were a little deeper, a “dup”, “swap”, or “over” 361 | instruction might help a lot with certain code sequences by 362 | reducing the number of immediate constants. 363 | 364 | 365 | -------------------------------------------------------------------------------- /cavoasm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | """Assembler for the Calculus Vaporis CPU.""" 3 | import sys, re 4 | 5 | def words(fileobj): 6 | for line in fileobj: 7 | line = re.sub(';.*', '', line) 8 | for word in line.split(): 9 | yield word 10 | 11 | instructions = { '.': 0, '-': 1, '|': 2, '@': 3, '!': 4, 'nop': 5 } 12 | 13 | nbits = 12 14 | 15 | def bit(n): 16 | return 1 << n 17 | 18 | immediate = bit(nbits-1) 19 | 20 | def bits(n): 21 | return bit(n) -1 22 | 23 | def is_integer(word): 24 | try: 25 | int(word) 26 | return True 27 | except ValueError: 28 | return False 29 | 30 | class Relocation: 31 | def __init__(self, encoding, destination): 32 | self.encoding = encoding 33 | self.destination = destination 34 | def resolve(self, program, resolution): 35 | program.memory[self.destination] = self.encoding(resolution) 36 | 37 | def encode_immediate(value): 38 | return (value & bits(nbits-1)) | immediate 39 | 40 | def encode_normal(value): 41 | return value 42 | 43 
| class Program: 44 | def __init__(self): 45 | self.memory = [] 46 | self.labels = {} 47 | self.backpatches = {} 48 | def assemble_words(self, words): 49 | for word in words: 50 | self.assemble_word(word) 51 | def assemble_word(self, word): 52 | if word in instructions: 53 | self.assemble(instructions[word] << (nbits - 4)) 54 | elif word.startswith('$'): 55 | self.assemble_reference(encode_immediate, word[1:]) 56 | elif word.endswith(':'): 57 | self.assemble_label(word[:-1]) 58 | else: 59 | self.assemble_reference(encode_normal, word) 60 | def assemble_reference(self, encoding, text): 61 | if is_integer(text): 62 | return self.assemble(encoding(int(text))) 63 | elif text in self.labels: 64 | return self.assemble(encoding(self.labels[text])) 65 | else: 66 | rel = Relocation(encoding, len(self.memory)) 67 | self.backpatches.setdefault(text, []).append(rel) 68 | return self.assemble(0) 69 | def assemble(self, number): 70 | assert isinstance(number, int) 71 | self.memory.append(number) 72 | def assemble_label(self, label): 73 | self.resolve(label, len(self.memory)) 74 | def resolve(self, label, value): 75 | if label in self.backpatches: 76 | for item in self.backpatches[label]: 77 | item.resolve(self, value) 78 | del self.backpatches[label] 79 | self.labels[label] = value 80 | def warn_undefined_labels(self): 81 | for label in self.backpatches: 82 | sys.stderr.write('WARNING: label %r undefined\n' % label) 83 | def dump(self, outfile): 84 | self.warn_undefined_labels() 85 | for number in self.memory: 86 | outfile.write('%d\n' % number) 87 | 88 | if __name__ == '__main__': 89 | import cgitb 90 | cgitb.enable(format='text') 91 | p = Program() 92 | p.assemble_words(words(sys.stdin)) 93 | p.dump(sys.stdout) 94 | -------------------------------------------------------------------------------- /cavosim.c: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | 5 | #define BIT(n) (1 << (n)) 6 | enum { nbits = 
12, immediate_mask = BIT(nbits-1) }; 7 | #define ONES(number_of_ones) (BIT(number_of_ones) - 1) 8 | #define SIGN_EXTEND(x) ((((x) & BIT(nbits-2)) << 1) | (x & ONES(nbits-1))) 9 | #define INSTRUCTION_MASK (ONES(nbits-1) & ~ONES(nbits-4)) 10 | enum instructions { jump = 0, subtract, nand, fetch, store, nop }; 11 | 12 | typedef int cavo_word; /* only use the bottom 12 bits */ 13 | enum { memory_size = BIT(nbits-1) }; 14 | cavo_word p, a, x, i, tmp, mem[memory_size]; /* CPU registers and memory */ 15 | 16 | void panic() { 17 | perror("panicking"); 18 | abort(); 19 | } 20 | 21 | void do_store(cavo_word addr, cavo_word value) { 22 | printf("mem[%d] ← %d\n", (int)addr, (int)value); 23 | mem[addr] = value; 24 | } 25 | 26 | 27 | void go() { 28 | for (;;) { 29 | 30 | /* fetch */ 31 | printf("[%d] ", p); 32 | fflush(stdout); 33 | i = mem[p++]; 34 | p %= memory_size; 35 | 36 | /* execute */ 37 | 38 | if (i & immediate_mask) { /* $ */ 39 | x = a; 40 | a = SIGN_EXTEND(i); 41 | 42 | } else { 43 | switch ((i & INSTRUCTION_MASK) >> (nbits-4)) { 44 | 45 | case jump: /* . */ 46 | if (BIT(nbits-1) & x) a = x; 47 | else { 48 | tmp = a; 49 | a = p; 50 | p = tmp % memory_size; 51 | } 52 | break; 53 | 54 | case subtract: /* - */ 55 | a = (x - a) & ONES(nbits); 56 | break; 57 | 58 | case nand: /* | */ 59 | a = ~(a & x); 60 | break; 61 | 62 | case fetch: /* @ */ 63 | a = mem[a & ONES(nbits-1)]; 64 | break; 65 | 66 | case store: /* ! 
*/ 67 | do_store(a & ONES(nbits-1), x); 68 | break; 69 | 70 | case nop: 71 | break; 72 | 73 | default: 74 | panic(); 75 | } 76 | } 77 | } 78 | } 79 | 80 | int main(int argc, char **argv) { 81 | FILE *f = fopen(argv[1], "rb"); 82 | int i; 83 | if (!f) panic(); 84 | for (i = 0; i < memory_size; i++) { 85 | if (EOF == fscanf(f, "%d\n", &mem[i])) break; 86 | } 87 | p = 0; 88 | go(); 89 | return 0; 90 | } 91 | -------------------------------------------------------------------------------- /count.s: -------------------------------------------------------------------------------- 1 | ; next simplest interesting program: counting loop 2 | me: 3 | $counter @ $-1 - $counter ! $0 $me . 4 | counter: 127 5 | -------------------------------------------------------------------------------- /differences.s: -------------------------------------------------------------------------------- 1 | ; tabulate a polynomial by the method of differences 2 | 3 | $0 $main . 4 | a0: 1 5 | a1: -1 6 | a2: 1 7 | a3: -1 8 | a4: 1 9 | a5: -1 10 | tmp: 0 11 | main: 12 | $0 $a5 @ - $tmp ! $a4 @ $tmp @ - $a5 ! 13 | $0 $a4 @ - $tmp ! $a3 @ $tmp @ - $a4 ! 14 | $0 $a3 @ - $tmp ! $a2 @ $tmp @ - $a3 ! 15 | $0 $a2 @ - $tmp ! $a1 @ $tmp @ - $a2 ! 16 | $0 $a1 @ - $tmp ! $a0 @ $tmp @ - $a1 ! 17 | ; $a5 @ $arg1 ! 18 | ; $0 $output . # we don't have an output routine yet 19 | ; Instead, ./cavosim differences.image| grep 'mem\[8]' | head 20 | $0 $main . 21 | -------------------------------------------------------------------------------- /iloop.s: -------------------------------------------------------------------------------- 1 | ; simplest interesting program: infinite loop. 2 | me: 3 | $0 $me . 
--------------------------------------------------------------------------------
/method-of-differences.c:
--------------------------------------------------------------------------------
#include <stdio.h>

/* tabulate -x³ + 2x² - x + 4 */
/* int a0 = 0, a1 = 0, a2 = -6, a3 = -2, a4 = 0, a5 = 4; */

/* tabulate a more boring polynomial */
int a0 = 1, a1 = -1, a2 = 1, a3 = -1, a4 = 1, a5 = -1;

/* Runs forever; the Makefile pipes the output through head. */
int main(void) {
  for (;;) {
    printf("%d\n", a5);
    /* each value absorbs the difference one level below it */
    a5 += a4;
    a4 += a3;
    a3 += a2;
    a2 += a1;
    a1 += a0;
  }
}
--------------------------------------------------------------------------------
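
For comparison, here is a bounded Python sketch of the same tabulation, useful for seeing what the first lines of `./method-of-differences | head -11` should look like. The generator name `tabulate` is an assumption for this example; the update order follows the C loop above.

```python
from itertools import islice


def tabulate(a0, a1, a2, a3, a4, a5):
    """Yield successive values of a5, updating the differences in the
    same order as the C loop."""
    while True:
        yield a5
        a5 += a4
        a4 += a3
        a3 += a2
        a2 += a1
        a1 += a0


first = list(islice(tabulate(1, -1, 1, -1, 1, -1), 11))
print(first)  # [-1, 0, 0, 0, 0, 0, 1, 6, 21, 56, 126]
```

Python integers don't overflow, so this matches the C program only for as long as the values still fit in an `int`.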