├── .gitignore
├── Makefile
├── README.md
├── cavoasm.py
├── cavosim.c
├── count.s
├── differences.s
├── iloop.s
└── method-of-differences.c
/.gitignore:
--------------------------------------------------------------------------------
1 | *~
2 | count.image
3 | differences.image
4 | iloop.image
5 | method-of-differences
6 | cavosim
7 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | CFLAGS=-g
2 |
3 | all: count.image iloop.image differences.image test
4 | clean:
5 | 	$(RM) count.image iloop.image differences.image method-of-differences cavosim
6 |
7 | %.image: %.s cavoasm.py
8 | 	./cavoasm.py < $< > $@
9 |
10 | test: method-of-differences cavosim differences.image
11 | 	-./method-of-differences | head -11 # it outputs the initial a5
12 | 	-./cavosim differences.image | grep 'mem\[8]' | head # it doesn't
13 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Calculus Vaporis CPU Design
2 | ===========================
3 |
4 | This is a sketch of a design for a very small zero-operand 12-bit CPU
5 | in about 1000 NAND gates, or a smaller number of more powerful
6 | components (e.g. 70 lines of C). It’s my first CPU design, so it may
7 | be deeply flawed.
8 |
9 | Initially I called it the “Dumbass CPU”, but I thought that didn’t
10 | seem like a name Charles Babbage would have used. So I called it
11 | “Calculus Vaporis”, which means “counting-stone of steam” in Latin, I
12 | hope.
13 |
14 | This directory contains a simulator written in C, a few simple
15 | programs in a simple assembly language, and an assembler written in
16 | Python. Run `make` to try them out.
17 |
18 | Logical Organization
19 | --------------------
20 |
21 | The machine has four 12-bit registers: P, A, X, and I.
22 |
23 | P is the program counter. It generally holds the address of the next
24 | instruction to fetch. It could actually be only 11 bits.
25 |
26 | A is the top-of-stack register.
27 |
28 | X is the second-on-stack register.
29 |
30 | I is the instruction-decoding register. It holds the instruction word
31 | currently being decoded.
32 |
33 | The machine additionally has an 11-bit address bus B, a 12-bit data
34 | bus D, and a bus operation O, which are treated as pseudo-registers
35 | for the purpose of the RTL description. The narrow memory bus is weird
36 | but simplifies the usage of immediate constants.
37 |
38 | Instruction Set
39 | ---------------
40 |
41 | The machine cycle has four microcycles: fetch 1, fetch 2, execute 1,
42 | execute 2. Fetch 1 and 2 are the same for all instructions. All
43 | instructions but `@` do nothing during execute 2. Unless noted, the RTL
44 | below describes what each instruction does during execute 1.
45 |
46 | Following Python, I am using `R[-1]` to refer to the highest bit of
47 | register `R`, `R[:-1]` to refer to all but the highest bit, `R[-2]` to
48 | refer to the next-highest bit, and so on.
49 |
50 | There are seven instructions: `$`, `.`, `-`, `|`, `@`, `!`, and `nop`.
51 | They are stored one per machine word, which is grossly wasteful of
52 | memory space but makes for a simpler instruction decode cycle.
53 | Avoiding the gross waste of memory space would probably require adding
54 | at least one more register to the processor, because as it is defined
55 | now, roughly half the instructions in any code are `$`, which needs a
56 | whole machine word most of the time anyway.
57 |
58 | There are even simpler instruction sets, such as the
59 | subtract-indirect-and-branch-indirect-if-negative OISC and the MOV
60 | machine. They are simpler to explain, but they would save little:
61 | the existing decoding logic is fairly small, about 20 out of the 1000
62 | NAND gates, and these simpler instruction sets call for a more
63 | complicated machine cycle.
64 |
65 | - `$` is the “load immediate” instruction; it’s represented by a
66 |   12-bit word with its highest bit set. In code sequences it is
67 |   written as `$` followed by the value of the other 11 bits. It
68 |   does:
69 |
70 |   > X ← A, A[:-1] ← I[:-1], A[-1] ← I[-2]
71 |
72 |   That is, the 11-bit value is sign-extended into A. Note that this is
73 |   the only way to change the value of X: by copying A’s old value into it.
74 |
75 | - `.` is the “conditional call/jump” instruction. It does:
76 |
77 |   > if X[-1] == 0:
78 |   >     P ← A[:-1], A[:-1] ← P, A[-1] ← 0;
79 |   > else:
80 |   >     A ← X; # not very important, kind of arbitrary, maybe drop it
81 |
82 | - `-` is the “subtract” instruction. It does:
83 |
84 |   > A ← X - A;
85 |
86 | - `|` is the “NAND” instruction (named after the Sheffer stroke). In C
87 |   notation for bitwise operators, it does:
88 |
89 |   > A ← ~(A & X);
90 |
91 | - `@` is the “fetch” instruction. It is the only instruction that needs
92 |   both execute microcycles. During execute 1, it does:
93 |
94 |   > B ← A[:-1], O ← “read”;
95 |
96 |   Then, during execute 2:
97 |
98 |   > A ← D;
99 |
100 | - `!` is the “store” instruction. It does:
101 |
102 |   > B ← A[:-1], D ← X, O ← “write”, A ← X;
103 |
104 | - `nop` does nothing.
105 |
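The instruction definitions above can be collected into a small executable model. The Python sketch below is not part of this repository. The opcode numbering (`.` = 0 through `nop` = 5, held in bits 8 to 10 of the word) is taken from `cavoasm.py` and `cavosim.c`; the names `step`, `sign_extend`, and `call_demo` and the dictionary-based state are made up for illustration. It folds the four microcycles into a single step, so it says nothing about timing.

```python
MASK12 = 0xFFF      # registers are 12 bits wide
MASK11 = 0x7FF      # addresses are 11 bits wide
HIGH_BIT = 0x800    # marks a `$` word in I, and the sign bit in X

JUMP, SUBTRACT, NAND, FETCH, STORE, NOP = range(6)

def sign_extend(word):
    """A[:-1] <- I[:-1], A[-1] <- I[-2]: copy bit 10 into bit 11."""
    low = word & MASK11
    return low | ((low & 0x400) << 1)

def step(state, mem):
    """Run one machine cycle. `state` is a dict with keys 'P', 'A', 'X'."""
    i = mem[state["P"]] & MASK12               # fetch 1 and fetch 2
    state["P"] = (state["P"] + 1) & MASK11
    a, x = state["A"], state["X"]
    if i & HIGH_BIT:                           # $
        state["X"], state["A"] = a, sign_extend(i)
        return
    opcode = (i >> 8) & 0x7
    if opcode == JUMP:                         # .
        if x & HIGH_BIT:
            state["A"] = x                     # jump not taken
        else:
            state["P"], state["A"] = a & MASK11, state["P"]
    elif opcode == SUBTRACT:                   # -
        state["A"] = (x - a) & MASK12
    elif opcode == NAND:                       # |
        state["A"] = ~(a & x) & MASK12
    elif opcode == FETCH:                      # @
        state["A"] = mem[a & MASK11] & MASK12
    elif opcode == STORE:                      # !
        mem[a & MASK11] = x
        state["A"] = x
    elif opcode != NOP:
        raise ValueError("undefined opcode %d" % opcode)

def call_demo():
    """Run `$0 $5 .` from address 0: a call to address 5."""
    mem = [0] * 2048
    mem[0:3] = [HIGH_BIT | 0, HIGH_BIT | 5, JUMP << 8]
    state = {"P": 0, "A": 0, "X": 0}
    for _ in range(3):
        step(state, mem)
    return state

print(call_demo())   # {'P': 5, 'A': 3, 'X': 0}
```

After the three words run, P holds the jump target and A holds the address of the word following the `.`, which is the return address.
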
106 | Code Snippets
107 | -------------
108 |
109 | These show that the instruction set is usable, just barely.
110 |
111 | Call the subroutine at address `x`, passing the return address in A:
112 |
113 |     $0 $x .
114 |
115 | Store the return address passed in A at a fixed address `ra` (for
116 | non-reentrant subroutines):
117 |
118 |     $ra !
119 |
120 | Return to the address thus passed, trashing both A and X:
121 |
122 |     $0 $ra @ .
123 |
124 | Decrement the memory location `sp`, used as, for example, a stack
125 | pointer:
126 |
127 |     $sp @ $1 - $sp !
128 |
129 | Store a return address at the location pointed to by `sp` after
130 | decrementing `sp`, using an address `tmp` for temporary storage:
131 |
132 |     $tmp ! $sp @ $1 - $sp ! $tmp @ $sp @ !
133 |
134 | Fetch that return address from the stack, increment `sp`, and return
135 | to the address:
136 |
137 |     $sp @ @ $tmp ! $sp @ $-1 - $sp ! $0 $tmp @ .
138 |
139 | Negate the value stored at a memory location `var`:
140 |
141 |     $0 $var @ - $var !
142 |
143 | Add the values stored at memory locations `a` and `b` with the aid of
144 | a third temporary location `tmp`, leaving the sum in the A register:
145 |
146 |     $0 $a @ - $tmp ! $b @ $tmp @ -
147 |
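Straight-line snippets like these can be checked by tracing A and X by hand, or with a few lines of Python. The sketch below is an illustration, not a tool from this repository: `run_snippet`, the symbol addresses (100 to 103), and the initial memory contents are all invented for the example. It handles only `$`, `-`, `|`, `@`, and `!` (no jumps), and it assumes immediates lie in the range -1024 to 1023.

```python
def run_snippet(source, symbols, memory):
    """Evaluate a straight-line snippet and return the final (A, X)."""
    a = x = 0
    for token in source.split():
        if token.startswith("$"):              # load immediate
            name = token[1:]
            value = symbols[name] if name in symbols else int(name)
            x, a = a, value & 0xFFF
        elif token == "-":                     # A <- X - A
            a = (x - a) & 0xFFF
        elif token == "|":                     # A <- ~(A & X)
            a = ~(a & x) & 0xFFF
        elif token == "@":                     # A <- M[A]
            a = memory[a & 0x7FF] & 0xFFF
        elif token == "!":                     # M[A] <- X, A <- X
            memory[a & 0x7FF] = x
            a = x
        else:
            raise ValueError("unsupported token %r" % token)
    return a, x

# Hypothetical addresses and initial values, chosen only for this example.
symbols = {"a": 100, "b": 101, "tmp": 102, "sp": 103}
memory = [0] * 2048
memory[100], memory[101], memory[103] = 30, 12, 200

total, _ = run_snippet("$0 $a @ - $tmp ! $b @ $tmp @ -", symbols, memory)
print(total)          # 42

run_snippet("$sp @ $1 - $sp !", symbols, memory)
print(memory[103])    # 199
```

The addition works by storing the negation of `a` in `tmp` and then computing `b - (-a)`, since subtraction is the only arithmetic the machine has.
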
148 | The rest of the RTL
149 | -------------------
150 |
151 | The instruction definitions above define what happens at the RTL level
152 | during the instruction execution microcycles. The RTL for the other
153 | two microcycles of the machine cycle follows:
154 |
155 | Fetch 1:
156 |
157 | > B ← P, P ← P + 1, O ← “read”;
158 |
159 | Fetch 2:
160 |
161 | > I ← D;
162 |
163 | The presumed memory interface semantics are:
164 |
165 | When O ← “read”, on the next microcycle:
166 |
167 | > D ← M[B];
168 |
169 | When O ← “write”:
170 |
171 | > M[B] ← D.
172 |
173 | I don’t know enough about memory to know how realistic that is.
174 |
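One way to read the presumed memory semantics is as a bus whose read requests complete one microcycle later. The Python sketch below models only that reading; the `Bus` class, its `microcycle` method, and the `fetch` helper are hypothetical names, and real memory parts may behave quite differently.

```python
class Bus:
    """Toy model: a read requested in one microcycle delivers D in the next."""

    def __init__(self, memory):
        self.memory = memory
        self.D = 0
        self._pending_read = None

    def microcycle(self, O=None, B=0, data=None):
        """Advance one microcycle; O is 'read', 'write', or None."""
        # A read requested on the previous microcycle completes now.
        if self._pending_read is not None:
            self.D = self.memory[self._pending_read]
            self._pending_read = None
        if O == "read":
            self._pending_read = B
        elif O == "write":
            self.D = data
            self.memory[B] = data          # M[B] <- D

def fetch(bus, P):
    """Fetch 1 and fetch 2; returns the new (I, P)."""
    bus.microcycle(O="read", B=P)          # fetch 1: B <- P, O <- "read"
    P = (P + 1) & 0x7FF                    #          P <- P + 1
    bus.microcycle()                       # fetch 2: D now holds M[B]
    return bus.D, P                        #          I <- D

bus = Bus([2048, 2053, 0])                 # the words for `$0 $5 .`
print(fetch(bus, 0))                       # (2048, 1)
print(fetch(bus, 1))                       # (2053, 2)
```
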
175 | Translating the RTL design into gates
176 | -------------------------------------
177 |
178 | (This part contains a number of errors. Hopefully it’s accurate
179 | enough that the number of gates can be meaningfully estimated.)
180 |
181 | We need a multiplexer attached to the input of most of the registers
182 | to implement the RTL described earlier. Here are the places each
183 | thing can come from, and when:
184 |
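185 | <table>
186 | <tr><th>register</th><th>is set from</th><th>when</th></tr>
187 | <tr><td>A</td><td>I[:-1] sign-extended</td><td>($, execute 1)</td></tr>
188 | <tr><td></td><td>P</td><td>(., execute 1, if X[-1] == 0)</td></tr>
189 | <tr><td></td><td>X - A</td><td>(-, execute 1)</td></tr>
190 | <tr><td></td><td>!(X & A)</td><td>(|, execute 1)</td></tr>
191 | <tr><td></td><td>D</td><td>(@, execute 2)</td></tr>
192 | <tr><td></td><td>X</td><td>(!, execute 1; or ., execute 1, when X[-1] == 1)</td></tr>
193 | <tr><td>X</td><td>A</td><td>($, execute 1)</td></tr>
194 | <tr><td>P</td><td>P + 1</td><td>(fetch 1)</td></tr>
195 | <tr><td></td><td>A</td><td>(., execute 1, if X[-1] == 0)</td></tr>
196 | <tr><td>I</td><td>D</td><td>(fetch 2)</td></tr>
197 | <tr><td>B</td><td>P</td><td>(fetch 1)</td></tr>
198 | <tr><td></td><td>A[:-1]</td><td>(@ or !, execute 1)</td></tr>
199 | <tr><td>D</td><td>X</td><td>(!, execute 1)</td></tr>
200 | <tr><td>O</td><td>“read”</td><td>(fetch 1, or @, execute 1)</td></tr>
201 | <tr><td></td><td>“write”</td><td>(!, execute 1)</td></tr>
202 | </table>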
203 |
204 | At all other times, registers continue with their current values.
205 |
206 | The overall design, then, looks something like this. Some of the
207 | “wires” are N bits wide:
208 |
209 |     machine():
210 |         register(P_output, P_input, P_write_enable)
211 |         register(A_output, A_input, A_write_enable)
212 |         register(X_output, X_input, X_write_enable)
213 |         register(I_output, I_input, I_write_enable)
214 |         instruction_decoder(I_output[-4:], fetch_enable, instruction_select)
215 |         # everything up to execute_1 is the inputs; everything after
216 |         # that is an output
217 |         execute_1_controller(instruction_select, fetch_enable, X_output[-1],
218 |                              execute_1, A_select, jump, memory_write,
219 |                              send_A_to_B, X_write_enable, D_write_enable)
220 |
221 |         # All of these are outputs
222 |         microcycle_counter(execute_1, execute_2, fetch_1, fetch_2)
223 |         fetch_2 = I_write_enable # that is, they’re two names for the same wire
224 |         # the last argument is the output
225 |         AND(fetch_enable, execute_2, get_A_from_D)
226 |         OR_6input(get_A_from_D, ..., A_write_enable)
227 |         A_input_mux(A_select, get_A_from_D,
228 |                     I_sign_extended, P_output, X_minus_A, X_nand_A, D, X_output,
229 |                     A_input)
230 |
231 |         increment_P = fetch_1
232 |         P_input_controller(jump, increment_P, A_output, B_output, P_input)
233 |         OR(fetch_1, ???, memory_read)
234 |
235 |         I_sign_extended = I[-2] || I[:-1]
236 |
237 | Most of these pieces are simple multiplexers or registers of one sort
238 | or another. The `microcycle_counter` is just a 4-bit ring counter;
239 | the `instruction_decoder` is just a 6-way decoder of its 4-bit input.
240 | However, the `execute_1_controller` requires a little more
241 | clarification.
242 |
243 |     execute_1_controller(instruction_select, instruction_is_fetch, X_highbit,
244 |                          execute_1, A_select, jump, memory_write,
245 |                          send_A_to_B, X_write_enable, D_write_enable):
246 |         # The A_select output is 5 bits; it’s used to determine where
247 |         # the A register gets loaded from if it gets loaded during the
248 |         # execute_1 microcycle. We don’t care what value it has the
249 |         # rest of the time.
250 |         A_select = (get_A_from_I_sign_extended, get_A_from_P,
251 |                     get_A_from_X_minus_A, get_A_from_X_nand_A,
252 |                     get_A_from_X)
253 |         # The instruction_select input is also 5 bits, representing
254 |         # the five instructions that can affect A, other than fetch.
255 |         instruction_select = (instruction_is_immediate, instruction_is_jump,
256 |                               instruction_is_subtract, instruction_is_nand,
257 |                               instruction_is_store)
258 |
259 |         # Here’s how those outputs are computed.
260 |         get_A_from_I_sign_extended = instruction_is_immediate
261 |         NOT(X_highbit, X_not_highbit)
262 |         AND(instruction_is_jump, X_not_highbit, get_A_from_P)
263 |         get_A_from_X_minus_A = instruction_is_subtract
264 |         get_A_from_X_nand_A = instruction_is_nand
265 |         AND(instruction_is_jump, X_highbit, failed_jump)
266 |         OR(failed_jump, instruction_is_store, get_A_from_X)
267 |
268 |         jump = get_A_from_P
269 |         AND(execute_1, instruction_is_store, memory_write)
270 |
271 |         OR(instruction_is_store, instruction_is_fetch, memory_access)
272 |         AND(execute_1, memory_access, send_A_to_B)
273 |
274 |         AND(execute_1, instruction_is_immediate, X_write_enable)
275 |
276 |         AND(execute_1, instruction_is_store, D_write_enable)
277 |
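The controller equations are small enough to transcribe into Python and check exhaustively. The sketch below copies the pseudocode as written, including the fact that `jump` and the `A_select` lines are not gated by `execute_1`. The function signature and the dictionary outputs are conveniences invented for this example. It confirms that at most one `A_select` line is active for any instruction and any value of the high bit of X, which the `A_input_mux` presumably relies on.

```python
def execute_1_controller(instruction, x_highbit, execute_1):
    """Boolean transcription of the pseudocode; returns the output wires."""
    is_immediate = instruction == "$"
    is_jump = instruction == "."
    is_subtract = instruction == "-"
    is_nand = instruction == "|"
    is_fetch = instruction == "@"
    is_store = instruction == "!"

    get_A_from_P = is_jump and not x_highbit
    failed_jump = is_jump and x_highbit
    a_select = {
        "I_sign_extended": is_immediate,
        "P": get_A_from_P,
        "X_minus_A": is_subtract,
        "X_nand_A": is_nand,
        "X": failed_jump or is_store,
    }
    return {
        "A_select": a_select,
        "jump": get_A_from_P,
        "memory_write": execute_1 and is_store,
        "send_A_to_B": execute_1 and (is_store or is_fetch),
        "X_write_enable": execute_1 and is_immediate,
        "D_write_enable": execute_1 and is_store,
    }

for instruction in ["$", ".", "-", "|", "@", "!", "nop"]:
    for x_highbit in (False, True):
        wires = execute_1_controller(instruction, x_highbit, True)
        active = [name for name, on in wires["A_select"].items() if on]
        assert len(active) <= 1            # the mux never sees two sources
        print(instruction, int(x_highbit), active)
```
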
278 | The ALU simply has to compute `X_minus_A` and `X_nand_A`, which
279 | constantly feed into the `A_input_mux`. `X_minus_A` requires 12 full
280 | subtractors (analogous to full adders) and `X_nand_A` requires 12 NAND
281 | gates.
282 |
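As a rough check on the ALU cost, here is one possible full subtractor built only from 2-input NAND gates, written as a Python sketch. This construction is an assumption made for illustration, not the author's design: it mirrors the well-known nine-NAND full adder and so uses nine gates per bit, or 108 for the 12-bit ripple-borrow chain. The function names are invented.

```python
def nand(a, b):
    """2-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)

def full_subtractor(x, a, borrow_in):
    """One bit of X - A using nine NAND gates; returns (diff, borrow_out)."""
    n1 = nand(x, a)
    n2 = nand(x, n1)
    n3 = nand(a, n1)               # 0 exactly when a = 1 and x = 0
    half = nand(n2, n3)            # x XOR a
    m1 = nand(half, borrow_in)
    m2 = nand(half, m1)
    m3 = nand(borrow_in, m1)       # 0 exactly when borrow_in = 1 and half = 0
    diff = nand(m2, m3)            # x XOR a XOR borrow_in
    borrow_out = nand(n3, m3)
    return diff, borrow_out

def subtract12(x, a):
    """12-bit ripple-borrow X - A, discarding the final borrow."""
    result, borrow = 0, 0
    for bit in range(12):
        d, borrow = full_subtractor((x >> bit) & 1, (a >> bit) & 1, borrow)
        result |= d << bit
    return result

print(subtract12(5, 7) == ((5 - 7) & 0xFFF))   # True
```
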
283 | Probable errors:
284 |
285 | - I wasn’t clear who was responsible for setting the write enables on
286 | the various registers.
287 | - Some of the `execute_1_controller` outputs probably need to be zero
288 | when `execute_1` is zero.
289 | - I don’t know anything about memory interfaces and so the memory
290 | controller is omitted entirely.
291 |
292 | Size Estimate
293 | -------------
294 |
295 | My initial estimates for number of 2-input NAND gates were:
296 |
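297 | <table><tr><th></th><th>bit-serial</th><th>12-bit parallel</th></tr>
298 | <tr><td>microcycle counter</td><td>28 NANDs</td><td>28</td></tr>
299 | <tr><td>rest of control</td><td>127</td><td>578</td></tr>
300 | <tr><td>registers</td><td>384</td><td>260 (no shifting needed)</td></tr>
301 | <tr><td>subtractor</td><td>25</td><td>204</td></tr>
302 | <tr><td>NAND</td><td>1</td><td>12</td></tr>
303 | <tr><td>bit counter</td><td>70</td><td>0</td></tr>
304 | <tr><td>total</td><td>635</td><td>1082</td></tr>
305 | </table>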
306 |
307 | I was estimating a 5-gate D latch per bit in the parallel case, or an
308 | 8-gate master-slave D flip-flop per bit in the serial case.
309 | My microcycle counter was going to be a pair of D flip-flops, four AND
310 | gates to compute the output bits, and an OR on two of those outputs to
311 | compute the new high bit.
312 |
313 | I figured on a ripple-carry subtractor that would cost about 17 NAND
314 | gates per output bit, although I didn’t actually design one.
315 |
316 | Because of the relative paucity of N-bit-wide data paths, going
317 | bit-serial doesn’t actually save many gates, but it would slow the
318 | machine down by a substantial factor.
319 |
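The totals in the table can be re-added in a few lines of Python. This is only a tally of the figures quoted above, with the dictionary layout invented for the example; it also checks the two per-bit figures the text gives (17 gates for each of 12 subtractor bits, and 8 gates for each of the 48 bits in four 12-bit registers).

```python
# (bit-serial, 12-bit parallel) NAND-gate estimates, copied from the table.
estimates = {
    "microcycle counter": (28, 28),
    "rest of control": (127, 578),
    "registers": (384, 260),
    "subtractor": (25, 204),
    "NAND": (1, 12),
    "bit counter": (70, 0),
}

serial_total = sum(serial for serial, parallel in estimates.values())
parallel_total = sum(parallel for serial, parallel in estimates.values())
print(serial_total, parallel_total)                # 635 1082
print(round(serial_total / parallel_total, 2))     # 0.59

assert 17 * 12 == estimates["subtractor"][1]       # ripple-carry subtractor
assert 8 * 4 * 12 == estimates["registers"][0]     # serial flip-flops
```

On these figures the bit-serial version needs a little under 60% of the gates of the parallel one.
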
320 | Possible Improvements
321 | ---------------------
322 |
323 | Dropping the get-A-from-X path on skipped jumps would simplify the
324 | processor, probably without making it any harder to use.
325 |
326 | A third register (or more) on the stack wouldn’t affect the
327 | instruction set at all, but would simplify some code. For example,
328 | the code for “add the values stored at memory locations `a` and `b`,
329 | leaving the sum in the A register” would simplify from:
330 |
331 |     $0 $a @ - $tmp ! $b @ $tmp @ -
332 |
333 | to:
334 |
335 |     $a @ $0 $b @ - -
336 |
337 | Using one-hot encodings of the instructions would require using seven
338 | bits of the I register instead of four, but would almost eliminate the
339 | instruction decoder. (You'd still need to ensure that the instruction
340 | wasn’t $.)
341 |
342 | Alternatively, you could pack three, four, or five instructions into
343 | each 12-bit word of memory, as Chuck Moore’s c18 core does, instead of
344 | one. The `$` instruction, if encodable in lower-order positions,
345 | would have fewer immediate bits to sign-extend, reducing the number of
346 | encodable immediate constants in such a case, but not to zero.
347 |
348 | You could try a different bit width. 2048 instructions of code is
349 | close to a minimum to run, say, a BASIC interpreter.
350 |
351 | If N × N → N bit LUTs and registers are available as primitives, the
352 | total complexity of the device could drop by an order of magnitude.
353 | For example, with 5 × 5 → 5 bit LUTs, you could construct a 4-bit
354 | subtractor with borrow-in and borrow-out as a single LUT, and chain
355 | three of them together to get a 12-bit subtractor. Even if you have
356 | only 4 → 1 bit LUTs, like on a normal FPGA, you can implement a full
357 | subtractor in two LUTs attached to the same inputs, instead of 17 (or
358 | however many) discrete NAND gates.
359 |
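The two-LUT full subtractor can be sketched in Python by tabulating the difference and borrow functions over the same inputs. The names below are invented for the example, and the fourth LUT input is simply left unused; twelve such pairs (24 LUTs) chained through the borrow give the 12-bit subtractor.

```python
def make_lut(function):
    """Tabulate a 4-input, 1-output function as a 16-entry list."""
    return [function(n & 1, (n >> 1) & 1, (n >> 2) & 1, (n >> 3) & 1)
            for n in range(16)]

# Both LUTs share inputs (x, a, borrow_in); the fourth input is ignored.
DIFFERENCE_LUT = make_lut(lambda x, a, borrow, unused: x ^ a ^ borrow)
BORROW_LUT = make_lut(lambda x, a, borrow, unused: int(x < a + borrow))

def lut_subtract12(x, a):
    """12-bit X - A from twelve LUT pairs chained through the borrow."""
    result, borrow = 0, 0
    for bit in range(12):
        index = ((x >> bit) & 1) | (((a >> bit) & 1) << 1) | (borrow << 2)
        result |= DIFFERENCE_LUT[index] << bit
        borrow = BORROW_LUT[index]
    return result

print(lut_subtract12(5, 7) == ((5 - 7) & 0xFFF))   # True
```
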
360 | If the stack were a little deeper, a “dup”, “swap”, or “over”
361 | instruction might help a lot with certain code sequences by
362 | reducing the number of immediate constants.
363 |
364 |
365 |
--------------------------------------------------------------------------------
/cavoasm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | """Assembler for the Calculus Vaporis CPU."""
3 | import sys, re
4 |
5 | def words(fileobj):
6 |     for line in fileobj:
7 |         line = re.sub(';.*', '', line)
8 |         for word in line.split():
9 |             yield word
10 |
11 | instructions = { '.': 0, '-': 1, '|': 2, '@': 3, '!': 4, 'nop': 5 }
12 |
13 | nbits = 12
14 |
15 | def bit(n):
16 |     return 1 << n
17 |
18 | immediate = bit(nbits-1)
19 |
20 | def bits(n):
21 |     return bit(n) - 1
22 |
23 | def is_integer(word):
24 |     try:
25 |         int(word)
26 |         return True
27 |     except ValueError:
28 |         return False
29 |
30 | class Relocation:
31 |     def __init__(self, encoding, destination):
32 |         self.encoding = encoding
33 |         self.destination = destination
34 |     def resolve(self, program, resolution):
35 |         program.memory[self.destination] = self.encoding(resolution)
36 |
37 | def encode_immediate(value):
38 |     return (value & bits(nbits-1)) | immediate
39 |
40 | def encode_normal(value):
41 |     return value
42 |
43 | class Program:
44 |     def __init__(self):
45 |         self.memory = []
46 |         self.labels = {}
47 |         self.backpatches = {}
48 |     def assemble_words(self, words):
49 |         for word in words:
50 |             self.assemble_word(word)
51 |     def assemble_word(self, word):
52 |         if word in instructions:
53 |             self.assemble(instructions[word] << (nbits - 4))
54 |         elif word.startswith('$'):
55 |             self.assemble_reference(encode_immediate, word[1:])
56 |         elif word.endswith(':'):
57 |             self.assemble_label(word[:-1])
58 |         else:
59 |             self.assemble_reference(encode_normal, word)
60 |     def assemble_reference(self, encoding, text):
61 |         if is_integer(text):
62 |             return self.assemble(encoding(int(text)))
63 |         elif text in self.labels:
64 |             return self.assemble(encoding(self.labels[text]))
65 |         else:
66 |             rel = Relocation(encoding, len(self.memory))
67 |             self.backpatches.setdefault(text, []).append(rel)
68 |             return self.assemble(0)
69 |     def assemble(self, number):
70 |         assert isinstance(number, int)
71 |         self.memory.append(number)
72 |     def assemble_label(self, label):
73 |         self.resolve(label, len(self.memory))
74 |     def resolve(self, label, value):
75 |         if label in self.backpatches:
76 |             for item in self.backpatches[label]:
77 |                 item.resolve(self, value)
78 |             del self.backpatches[label]
79 |         self.labels[label] = value
80 |     def warn_undefined_labels(self):
81 |         for label in self.backpatches:
82 |             sys.stderr.write('WARNING: label %r undefined\n' % label)
83 |     def dump(self, outfile):
84 |         self.warn_undefined_labels()
85 |         for number in self.memory:
86 |             outfile.write('%d\n' % number)
87 |
88 | if __name__ == '__main__':
89 |     import cgitb
90 |     cgitb.enable(format='text')
91 |     p = Program()
92 |     p.assemble_words(words(sys.stdin))
93 |     p.dump(sys.stdout)
94 |
--------------------------------------------------------------------------------
/cavosim.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>   /* printf, perror, fopen, fscanf */
2 | #include <stdlib.h>  /* abort */
3 |
4 |
5 | #define BIT(n) (1 << (n))
6 | enum { nbits = 12, immediate_mask = BIT(nbits-1) };
7 | #define ONES(number_of_ones) (BIT(number_of_ones) - 1)
8 | #define SIGN_EXTEND(x) ((((x) & BIT(nbits-2)) << 1) | ((x) & ONES(nbits-1)))
9 | #define INSTRUCTION_MASK (ONES(nbits-1) & ~ONES(nbits-4))
10 | enum instructions { jump = 0, subtract, nand, fetch, store, nop };
11 |
12 | typedef int cavo_word;          /* only use the bottom 12 bits */
13 | enum { memory_size = BIT(nbits-1) };
14 | cavo_word p, a, x, i, tmp, mem[memory_size]; /* CPU registers and memory */
15 |
16 | void panic() {
17 |   perror("panicking");
18 |   abort();
19 | }
20 |
21 | void do_store(cavo_word addr, cavo_word value) {
22 |   printf("mem[%d] ← %d\n", (int)addr, (int)value);
23 |   mem[addr] = value;
24 | }
25 |
26 |
27 | void go() {
28 |   for (;;) {
29 |
30 |     /* fetch */
31 |     printf("[%d] ", p);
32 |     fflush(stdout);
33 |     i = mem[p++];
34 |     p %= memory_size;
35 |
36 |     /* execute */
37 |
38 |     if (i & immediate_mask) {   /* $ */
39 |       x = a;
40 |       a = SIGN_EXTEND(i);
41 |
42 |     } else {
43 |       switch ((i & INSTRUCTION_MASK) >> (nbits-4)) {
44 |
45 |       case jump:                /* . */
46 |         if (BIT(nbits-1) & x) a = x;
47 |         else {
48 |           tmp = a;
49 |           a = p;
50 |           p = tmp % memory_size;
51 |         }
52 |         break;
53 |
54 |       case subtract:            /* - */
55 |         a = (x - a) & ONES(nbits);
56 |         break;
57 |
58 |       case nand:                /* | */
59 |         a = ~(a & x);
60 |         break;
61 |
62 |       case fetch:               /* @ */
63 |         a = mem[a & ONES(nbits-1)];
64 |         break;
65 |
66 |       case store:               /* ! */
67 |         do_store(a & ONES(nbits-1), x); a = x; /* A ← X, as in the README */
68 |         break;
69 |
70 |       case nop:
71 |         break;
72 |
73 |       default:
74 |         panic();
75 |       }
76 |     }
77 |   }
78 | }
79 |
80 | int main(int argc, char **argv) {
81 |   FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL; /* no filename: panic */
82 |   int i;
83 |   if (!f) panic();
84 |   for (i = 0; i < memory_size; i++) {
85 |     if (EOF == fscanf(f, "%d\n", &mem[i])) break;
86 |   }
87 |   p = 0;
88 |   go();
89 |   return 0;
90 | }
91 |
--------------------------------------------------------------------------------
/count.s:
--------------------------------------------------------------------------------
1 | ; next simplest interesting program: counting loop
2 | me:
3 | $counter @ $-1 - $counter ! $0 $me .
4 | counter: 127
5 |
--------------------------------------------------------------------------------
/differences.s:
--------------------------------------------------------------------------------
1 | ; tabulate a polynomial by the method of differences
2 |
3 | $0 $main .
4 | a0: 1
5 | a1: -1
6 | a2: 1
7 | a3: -1
8 | a4: 1
9 | a5: -1
10 | tmp: 0
11 | main:
12 | $0 $a5 @ - $tmp ! $a4 @ $tmp @ - $a5 !
13 | $0 $a4 @ - $tmp ! $a3 @ $tmp @ - $a4 !
14 | $0 $a3 @ - $tmp ! $a2 @ $tmp @ - $a3 !
15 | $0 $a2 @ - $tmp ! $a1 @ $tmp @ - $a2 !
16 | $0 $a1 @ - $tmp ! $a0 @ $tmp @ - $a1 !
17 | ; $a5 @ $arg1 !
18 | ; $0 $output . # we don't have an output routine yet
19 | ; Instead, ./cavosim differences.image| grep 'mem\[8]' | head
20 | $0 $main .
21 |
--------------------------------------------------------------------------------
/iloop.s:
--------------------------------------------------------------------------------
1 | ; simplest interesting program: infinite loop.
2 | me:
3 | $0 $me .
4 |
--------------------------------------------------------------------------------
/method-of-differences.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 |
3 | /* tabulate -x³ + 2x² - x + 4 */
4 | /* int a0 = 0, a1 = 0, a2 = -6, a3 = -2, a4 = 0, a5 = 4; */
5 |
6 | /* tabulate a more boring polynomial */
7 | int a0 = 1, a1 = -1, a2 = 1, a3 = -1, a4 = 1, a5 = -1;
8 |
9 | int main(void) {
10 |   for (;;) {
11 |     printf("%d\n", a5);
12 |     a5 += a4;
13 |     a4 += a3;
14 |     a3 += a2;
15 |     a2 += a1;
16 |     a1 += a0;
17 |   }
18 | }
19 |
--------------------------------------------------------------------------------