├── README.md
├── hardware-ARCH.PNG
├── hardware
    ├── Aggregation.v
    ├── README.md
    ├── columnbuffer.v
    ├── hardware-ARCH.PNG
    ├── indiceDoubleBuffer.v
    ├── indptrDoubleBuffer.v
    ├── reluArray.v
    ├── reluUnit.v
    ├── rowbuffer.v
    ├── sysArray.v
    ├── sysUnit.v
    ├── system.v
    ├── transformationPlusActivation.v
    ├── vectorAdd.v
    └── weightBuffer.v
└── perf_model
    └── perf_analysis.py


/README.md:
--------------------------------------------------------------------------------
 1 | ## GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platform [[PDF](https://arxiv.org/abs/2001.02498)]
 2 | ---
 3 | ### Abstract 
 4 | Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a light-weight pre-processing step based on a graph theoretic approach. Such pre-processing performed on the CPU significantly reduces the memory access requirements and the computation to be performed on the FPGA.  To accelerate the weight update in GCN layers, we propose a systolic array based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load-balance and resource utilization by accurate performance modeling. We evaluate our design on Intel Stratix 10 board hosted by a 40-core Xeon server. On three large graphs, we achieve an order of magnitude training speedup with negligible accuracy loss, compared with state-of-the-art implementation on a multi-core platform. 
 5 | 
 6 | 
 7 | ### Implementation details
 8 | - FPGA platform: **Intel Stratix 10 1SX280LH3F55I3XG**
 9 |   - Tool: **Quartus Prime Pro 20.1** for synthesis
10 | - CPU platform: 40-core Xeon server (E5-2698 v4 @2.2GHz, hyper-threaded)
11 | 
12 | ### Hardware architecture
13 | 
14 | ![arch](./hardware-ARCH.PNG)
15 | 
16 | ### Directory Structure
17 | - /hardware: contains the Verilog code of the design
18 |   - system.v: main module
19 |   - Aggregation.v: feature aggregation module
20 |   - reluArray.v: activation module
21 |   - transformationPlusActivation.v: transformation and activation modules
22 |   - Data buffer:
23 |     - columnbuffer.v
24 |     - indiceDoubleBuffer.v
25 |     - indptrDoubleBuffer.v
26 |     - weightBuffer.v
27 |     - rowbuffer.v
28 | - /perf_model: performance model, implemented as a Python script
29 | 
30 | ### Configuring IP Cores
31 | - Floating-point accumulator--sysAcc
32 |   - Find: IP catalog => basic function => arithmetic => floating point function
33 |   - Name: Accumulator
34 |   - Other Info: choose Generate Enable and generate HDL
35 | - Floating-point multiplier--mul
36 |   - Find: IP catalog => basic function => arithmetic => floating point function
37 |   - Name: Multiply
38 |   - Other Info: in Functionality, choose Generate Enable and generate HDL
39 | - Floating-point comparator
40 |   - Find: IP catalog => basic function => arithmetic => functionality => Comparison
41 |   - Name: Greater than
42 |   - Other Info: in Functionality, choose Generate Enable and generate HDL
43 | - RAM 1 port--weight buffer
44 |   - Find: IP catalog => on-chip memory => RAM: 1-port Intel FPGA IP
45 |   - Name: weightbuffer
46 |   - Other Info: choose Generate Enable and generate HDL
47 | - RAM 2 port--row buffer
48 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
49 |   - Name: row buffer
50 |   - Other Info: choose Generate Enable and generate HDL
51 | - RAM 2 port--column buffer
52 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
53 |   - Name: column buffer
54 |   - Other Info: choose Generate Enable and generate HDL
55 | - RAM 2 port--indptr buffer
56 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
57 |   - Name: indptr buffer
58 |   - Other Info: choose Generate Enable and generate HDL
63 | - RAM 2 port--indice buffer
64 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
65 |   - Name: indice buffer
66 |   - Other Info: choose Generate Enable and generate HDL
67 |   
68 | ### Setting up the project
69 | 1. Create a new project in Quartus Prime Pro 20.1 and set **Intel Stratix 10 1SX280LH3F55I3XG** as the target device
70 | 2. Add all the .v files in the /hardware directory to the project
71 | 3. Set system.v as the top-level module
72 | 4. Start the compilation
73 | 
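
The RAM IP cores listed under "Configuring IP Cores" need address and data widths that agree with the Verilog wrappers. The sketch below recomputes those widths in Python from the parameter expressions declared in hardware/Aggregation.v (`k`, `dataWidth`, `pvadd`). The helper names `clog2` and `buffer_widths` are hypothetical and not part of the repository; treat the output as a cross-check under those assumptions, not as the authors' configuration procedure.

```python
def clog2(n: int) -> int:
    """Ceiling of log2(n) for n >= 1, mirroring Verilog's $clog2."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return (n - 1).bit_length()


def buffer_widths(k: int = 1024, data_width: int = 32, pvadd: int = 128) -> dict:
    """Port widths in bits, using the same expressions as Aggregation.v.

    The defaults mirror the parameter defaults of the Aggregation module.
    Verilog's integer division k*k/32 is written as k * k // 32 here.
    """
    return {
        "indptr_address": clog2(k + 1),
        "indptr_data": clog2(k * k // 32),
        "indice_address": clog2(k * k // 32),
        "indice_data": clog2(k),
        "columnbuffer_address": clog2(k),
        "columnbuffer_data": data_width * pvadd,
    }


if __name__ == "__main__":
    for name, bits in buffer_widths().items():
        print(f"{name}: {bits} bits")
```

Note that the wrapper in hardware/columnbuffer.v defaults to `pvadd = 256` (an 8192-bit data port), while Aggregation.v defaults to `pvadd = 128` (4096 bits), so the generated RAM IP presumably has to be sized for whichever value is actually used.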


--------------------------------------------------------------------------------
/hardware-ARCH.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pgroupATusc/GraphACT/f0ecd30500247647105762934f5967350a0d8eab/hardware-ARCH.PNG


--------------------------------------------------------------------------------
/hardware/Aggregation.v:
--------------------------------------------------------------------------------
  1 | `timescale 1ns / 1ps
  2 | //////////////////////////////////////////////////////////////////////////////////
  3 | // Company: 
  4 | // Engineer: 
  5 | // 
  6 | // Create Date: 12/19/2019 12:49:59 PM
  7 | // Design Name: 
  8 | // Module Name: Aggregation
  9 | // Project Name: 
 10 | // Target Devices: 
 11 | // Tool Versions: 
 12 | // Description: 
 13 | // 
 14 | // Dependencies: 
 15 | // 
 16 | // Revision:
 17 | // Revision 0.01 - File Created
 18 | // Additional Comments:
 19 | // 
 20 | //////////////////////////////////////////////////////////////////////////////////
 21 | module Aggregation #(
 22 | 	parameter dataWidth = 32,   // the data bit width
 23 | 	parameter pvadd = 128,   // the parallelism in the feature adder
 24 | 	parameter k = 1024  // define k as the block size
 25 | )
 26 | (
 27 | 	// system signal
 28 | 	clk,
 29 | 	rst,
 30 | 	enable,
 31 | 
 32 | 	// port for indptrDoubleBuffer
 33 | 	indptr_writeEnableA1,
 34 | 	indptr_writeEnableB1,
 35 | 	indptr_addressportA1_outside,
 36 | 	indptr_addressportB1_outside,
 37 | 	indptr_writeportA1,
 38 | 	indptr_writeportB1,
 39 | 
 40 | 	indptr_startAddressA,
 41 | 	indptr_startAddressB,
 42 | 	indptr_endAddress,
 43 | 
 44 | 	// port for indiceDoubleBuffer
 45 | 	indice_writeEnableA1,
 46 | 	indice_writeEnableB1,
 47 | 	indice_addressportA1_outside,
 48 | 	indice_addressportB1_outside,
 49 | 	indice_writeportA1,
 50 | 	indice_writeportB1,
 51 | 
 52 | 	// port for DoubleColumnBuffer
 53 | 	columnbuffer_writeEnableA1,
 54 | 	columnbuffer_writeEnableB1,
 55 | 	columnbuffer_addressportA1_outside,
 56 | 	columnbuffer_addressportB1_outside,
 57 | 	columnbuffer_writeportA1,
 58 | 	columnbuffer_writeportB1,
 59 | 
 60 | 	// port for Double Row Buffer
 61 | 
 62 | 	rowbuffer_addressportA1_outside,
 63 | 	rowbuffer_dataOut
 64 | 
 65 | 
 66 | );
 67 | 
 68 | input clk;
 69 | input rst;
 70 | input enable;
 71 | // two-state FSM that swaps the roles of the A and B buffer banks
 72 | wire finish;
 73 | reg triggerA;
 74 | reg triggerB;
 75 | 
 76 | reg [$clog2(k*k/32) - 1:0] indiceCounterA;  // same width as indice_addressWidth, which is declared further down
 77 | reg [$clog2(k*k/32) - 1:0] indiceCounterB;
 78 | 
 79 | parameter S1 = 1'b0;
 80 | parameter S2 = 1'b1;
 81 | 
 82 | reg state;
 83 | 
 84 | always @(posedge clk or negedge rst) begin : proc_state
 85 | 	if(~rst) begin
 86 | 		state <= S1;
 87 | 	end else begin
 88 | 		if(finish)begin
 89 | 			if(state == S1)
 90 | 				state <= S2 ;
 91 | 			else
 92 | 				state <= S1;
 93 | 		end
 94 | 		else
 95 | 			state <= state;
 96 | 	end
 97 | end
 98 | 
 99 | 
100 | 
101 | // parameters and ports for indptrDoubleBuffer
102 | // the indptr double buffer is written from outside this module
103 | 
104 | parameter indptr_addressWidth = $clog2(k + 1);
105 | parameter indptr_dataportWidth =  $clog2(k*k/32);
106 | 
107 | reg [indptr_addressWidth - 1:0] indptrCounterA;
108 | reg [indptr_addressWidth - 1:0] indptrCounterB;
109 | 
110 | input indptr_writeEnableA1;
111 | input indptr_writeEnableB1;
112 | input [indptr_dataportWidth - 1:0] indptr_writeportA1;
113 | input [indptr_dataportWidth - 1:0] indptr_writeportB1;
114 | 
115 | wire [indptr_dataportWidth - 1:0]  indptr_readportA1;
116 | wire [indptr_dataportWidth - 1:0]  indptr_readportA2;
117 | wire [indptr_dataportWidth - 1:0]  indptr_readportB1;
118 | wire [indptr_dataportWidth - 1:0]  indptr_readportB2;
119 | 
120 | 
121 | wire [indptr_addressWidth - 1:0] indptr_addressportA1;
122 | wire [indptr_addressWidth - 1:0] indptr_addressportA2;
123 | wire [indptr_addressWidth - 1:0] indptr_addressportB1;
124 | wire [indptr_addressWidth - 1:0] indptr_addressportB2;
125 | 
126 | input [indptr_addressWidth - 1:0] indptr_addressportA1_outside;
127 | input [indptr_addressWidth - 1:0] indptr_addressportB1_outside;
128 | 
129 | 
130 | 
131 | 
132 | 
133 | wire [indptr_dataportWidth - 1:0] indptrAout;
134 | 
135 | assign indptrAout = (state == S1)? indptr_readportA1:indptr_readportB1;
136 | 
137 | wire [indptr_dataportWidth - 1:0] indptrBout;
138 | 
139 | assign indptrBout = (state == S1)? indptr_readportA2:indptr_readportB2;
140 | 
141 | 
142 | // ping pong operation
143 | assign indptr_addressportA1 = (state == S1)? indptr_addressportA1_outside: indptrCounterA;
144 | assign indptr_addressportA2 = (state == S1)? indptr_addressportB1_outside: indptrCounterB;
145 | 
146 | assign indptr_addressportB1 = (state == S1)? indptrCounterA: indptr_addressportA1_outside;
147 | assign indptr_addressportB2 = (state == S1)? indptrCounterB: indptr_addressportB1_outside;
148 | 
149 | 
150 | 
151 | 
152 | indptrDoubleBuffer #(
153 |     .k(k) 
154 | )  singleindptrDoubleBuffer
155 | (
156 | 	.clk(clk),
157 | 	.rst(rst),
158 | 	.enableA(enable),
159 | 	.enableB(enable),
160 | 	.writeEnableA1(indptr_writeEnableA1),
161 | 	.writeEnableA2(1'b0),
162 | 	.writeEnableB1(indptr_writeEnableB1),
163 | 	.writeEnableB2(1'b0),
164 | 	.readportA1(indptr_readportA1),
165 | 	.readportA2(indptr_readportA2),
166 | 	.writeportA1(indptr_writeportA1),
167 | 	.writeportA2(0),
168 | 	.readportB1(indptr_readportB1),
169 | 	.readportB2(indptr_readportB2),
170 | 	.writeportB1(indptr_writeportB1),
171 | 	.writeportB2(0),
172 |     .addressportA1(indptr_addressportA1),
173 |     .addressportA2(indptr_addressportA2),
174 |     .addressportB1(indptr_addressportB1),
175 |     .addressportB2(indptr_addressportB2)
176 | );
177 | 
178 | 
179 | 
180 | 
181 | //parameter for indiceDoubleBuffer
182 | 
183 | parameter indice_addressWidth = $clog2(k*k/32);
184 | parameter indice_dataportWidth = $clog2(k);
185 | 
186 | 
187 | input indice_writeEnableA1;
188 | input indice_writeEnableB1;
189 | input [indice_dataportWidth - 1:0] indice_writeportA1;
190 | input [indice_dataportWidth - 1:0] indice_writeportB1;
191 | 
192 | 
193 | wire [indice_dataportWidth - 1:0] indice_readportA1;
194 | wire [indice_dataportWidth - 1:0] indice_readportA2;
195 | wire [indice_dataportWidth - 1:0] indice_readportB1;
196 | wire [indice_dataportWidth - 1:0] indice_readportB2;
197 | 
198 | wire [indice_addressWidth - 1:0] addressportA1;
199 | wire [indice_addressWidth - 1:0] addressportA2;
200 | wire [indice_addressWidth - 1:0] addressportB1;
201 | wire [indice_addressWidth - 1:0] addressportB2;
202 | 
203 | input [indice_addressWidth - 1:0] indice_addressportA1_outside;
204 | input [indice_addressWidth - 1:0] indice_addressportB1_outside;
205 | 
206 | 
207 | assign addressportA1 = (state == S1)? indice_addressportA1_outside: indiceCounterA;
208 | assign addressportA2 = (state == S1)? indice_addressportB1_outside: indiceCounterB;
209 | 
210 | assign addressportB1 = (state == S1)? indiceCounterA: indice_addressportA1_outside;
211 | assign addressportB2 = (state == S1)? indiceCounterB: indice_addressportB1_outside;
212 | 
213 | 
214 | wire [indice_dataportWidth - 1:0] indiceForColumnBufferA;
215 | wire [indice_dataportWidth - 1:0] indiceForColumnBufferB;
216 | 
217 | assign indiceForColumnBufferA = (state == S1)? indice_readportA1: indice_readportB1;
218 | assign indiceForColumnBufferB = (state == S1)? indice_readportA2: indice_readportB2;
219 | 
220 | 
221 | indiceDoubleBuffer #(
222 |     .k(k)
223 | )  singleindiceDoubleBuffer
224 | (
225 | 	.clk(clk),
226 | 	.rst(rst),
227 | 	.enableA(enable),
228 | 	.enableB(enable),
229 | 	.writeEnableA1(indice_writeEnableA1),
230 | 	.writeEnableA2(1'b0),
231 | 	.writeEnableB1(indice_writeEnableB1),
232 | 	.writeEnableB2(1'b0),
233 | 	.readportA1(indice_readportA1),
234 | 	.readportA2(indice_readportA2),
235 | 	.writeportA1(indice_writeportA1),
236 | 	.writeportA2(0),
237 | 	.readportB1(indice_readportB1),
238 | 	.readportB2(indice_readportB2),
239 | 	.writeportB1(indice_writeportB1),
240 | 	.writeportB2(0),
241 |     .addressportA1(addressportA1),
242 |     .addressportA2(addressportA2),
243 |     .addressportB1(addressportB1),
244 |     .addressportB2(addressportB2)
245 | );
246 | 
247 | 
248 | 
249 | 
250 | // define the parameter for columnBuffer
251 | 
252 | parameter columnbuffer_addressWidth = $clog2(k);
253 | parameter columnbuffer_dataportWidth = dataWidth*pvadd;
254 | 
255 | 
256 | 
257 | 
258 | 
259 | input columnbuffer_writeEnableA1;
260 | input columnbuffer_writeEnableB1;
261 | 
262 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportA1;
263 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportA2;
264 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportB1;
265 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportB2;
266 | 
267 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportA1;
268 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportA2;
269 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportB1;
270 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportB2;
271 | 
272 | input [columnbuffer_addressWidth - 1:0] columnbuffer_addressportA1_outside;
273 | input [columnbuffer_addressWidth - 1:0] columnbuffer_addressportB1_outside;
274 | input [columnbuffer_dataportWidth - 1:0] columnbuffer_writeportA1;
275 | input [columnbuffer_dataportWidth - 1:0] columnbuffer_writeportB1;
276 | 
277 | // parameter for columnbuffer
278 | 
279 | assign columnbuffer_addressportA1 = (state == S1)? columnbuffer_addressportA1_outside: indiceCounterA;
280 | assign columnbuffer_addressportA2 = (state == S1)? columnbuffer_addressportB1_outside: indiceCounterB;
281 | 
282 | assign columnbuffer_addressportB1 = (state == S1)? indiceCounterA: columnbuffer_addressportA1_outside;
283 | assign columnbuffer_addressportB2 = (state == S1)? indiceCounterB: columnbuffer_addressportB1_outside;
284 | 
285 | 
286 | wire [columnbuffer_dataportWidth - 1:0] dataForVadderA;
287 | wire [columnbuffer_dataportWidth - 1:0] dataForVadderB;
288 | 
289 | 
290 | assign dataForVadderA = (state == S1)? columnbuffer_readportA1: columnbuffer_readportB1;
291 | assign dataForVadderB = (state == S1)? columnbuffer_readportA2: columnbuffer_readportB2;
292 | 
293 | 
294 | 
295 | 
296 | columnbuffer  #(
297 | 	.dataWidth(dataWidth),   // the data bit width
298 |     .pvadd(pvadd),   // the parallelism in the feature adder
299 |     .k(k)  // define k as the block size
300 | )  singlecolumnbuffer
301 | (
302 | 	.clk(clk),
303 | 	.rst(rst),
304 | 	.enableA(enable),
305 | 	.enableB(enable),
306 | 	.writeEnableA1(columnbuffer_writeEnableA1),
307 | 	.writeEnableA2(1'b0),
308 | 	.writeEnableB1(columnbuffer_writeEnableB1),
309 | 	.writeEnableB2(1'b0),
310 | 	.readportA1(columnbuffer_readportA1),
311 | 	.readportA2(columnbuffer_readportA2),
312 | 	.writeportA1(columnbuffer_writeportA1),
313 | 	.writeportA2(32'b0),
314 | 	.readportB1(columnbuffer_readportB1),
315 | 	.readportB2(columnbuffer_readportB2),
316 | 	.writeportB1(columnbuffer_writeportB1),
317 | 	.writeportB2(32'b0),
318 |   	.addressportA1(columnbuffer_addressportA1),
319 |   	.addressportA2(columnbuffer_addressportA2),
320 |   	.addressportB1(columnbuffer_addressportB1),
321 |   	.addressportB2(columnbuffer_addressportB2)
322 | );
323 | 
324 | 
325 | 
326 | 
327 | 
328 | 
329 | 
330 | wire [columnbuffer_dataportWidth - 1:0] DataVectorAout;
331 | wire [columnbuffer_dataportWidth - 1:0] DataVectorBout;
332 | 
333 | 
334 | 
335 | //  vectorAdder1
336 | 
337 | vectorAdd #(
338 |     .dataWidth(dataWidth),
339 |     .pvadd(pvadd)
340 | ) vectorAdderA
341 | (
342 | 	.clk(clk),
343 | 	.rst(rst),
344 | 	.lastin(1'b1),
345 | 	.lastout(),
346 | 	.vectorIn(dataForVadderA),
347 | 	.vectorOut(DataVectorAout)    
348 | );
349 | 
350 | //  vectorAdder2
351 | vectorAdd #(
352 |     .dataWidth(dataWidth),
353 |     .pvadd(pvadd)
354 | ) vectorAdderB
355 | (
356 | 	.clk(clk),
357 | 	.rst(rst),
358 | 	.lastin(1'b1),
359 | 	.lastout(),
360 | 	.vectorIn(dataForVadderB),
361 | 	.vectorOut(DataVectorBout)    
362 | );
363 | 
364 | 
365 | /// define the dataport for the row buffer
366 | 
367 | wire rowbuffer_writeEnableA1;
368 | wire rowbuffer_writeEnableA2;
369 | wire rowbuffer_writeEnableB1;
370 | wire rowbuffer_writeEnableB2;
371 | 
372 | 
373 | assign rowbuffer_writeEnableA1 = (state == S1)? 1'b1: 1'b0;
374 | assign rowbuffer_writeEnableA2 = (state == S1)? 1'b1: 1'b0;
375 | 
376 | assign rowbuffer_writeEnableB1 = (state == S1)? 1'b0: 1'b1;
377 | assign rowbuffer_writeEnableB2 = (state == S1)? 1'b0: 1'b1;
378 | 
379 | // logic for the row buffer
380 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportA1;
381 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportA2;
382 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportB1;
383 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportB2;
384 | 
385 | 
386 | input [columnbuffer_addressWidth - 1:0] rowbuffer_addressportA1_outside;
387 | 
388 | assign rowbuffer_addressportA1 = (state == S1)? indptrCounterA: rowbuffer_addressportA1_outside;
389 | assign rowbuffer_addressportA2 = (state == S1)? indptrCounterB: rowbuffer_addressportA1_outside;
390 | 
391 | assign rowbuffer_addressportB1 = (state == S1)? rowbuffer_addressportA1_outside: indptrCounterA;
392 | assign rowbuffer_addressportB2 = (state == S1)? rowbuffer_addressportA1_outside: indptrCounterB;
393 | 
394 | 
395 | wire [columnbuffer_dataportWidth - 1:0] rowbuffer_readportA1;
396 | 
397 | wire [columnbuffer_dataportWidth - 1:0] rowbuffer_readportB1;
398 | 
399 | 
400 | output [columnbuffer_dataportWidth - 1:0] rowbuffer_dataOut;
401 | 
402 | assign rowbuffer_dataOut = (state == S1)? rowbuffer_readportB1: rowbuffer_readportA1;
403 | 
404 | 
405 | rowbuffer #(
406 | 	.dataWidth(dataWidth) ,   // the data bit width
407 |     .pvadd(pvadd),   // the parallelism in the feature adder
408 |     .k(k)  // define k as the block size
409 | ) singlerowBuffer
410 | (
411 | 	.clk(clk),
412 | 	.rst(rst),
413 | 	.enableA(enable),
414 | 	.enableB(enable),
415 | 	.writeEnableA1(rowbuffer_writeEnableA1),
416 | 	.writeEnableA2(rowbuffer_writeEnableA2),
417 | 	.writeEnableB1(rowbuffer_writeEnableB1),
418 | 	.writeEnableB2(rowbuffer_writeEnableB2),
419 | 	.readportA1(rowbuffer_readportA1),
420 | 	.readportA2(),
421 | 	.writeportA1(DataVectorAout),
422 | 	.writeportA2(DataVectorBout),
423 | 	.readportB1(rowbuffer_readportB1),
424 | 	.readportB2(),
425 | 	.writeportB1(DataVectorAout),
426 | 	.writeportB2(DataVectorBout),
427 |   	.addressportA1(rowbuffer_addressportA1),
428 |   	.addressportA2(rowbuffer_addressportA2),
429 |   	.addressportB1(rowbuffer_addressportB1),
430 |   	.addressportB2(rowbuffer_addressportB2)
431 | );
432 | 
433 | 
434 | 
435 | 
436 | 
437 | // define the communication signal
438 | 
439 | 
440 | 
441 | // 
442 | 
443 | 
444 | 
445 | reg [indice_addressWidth - 1:0] indptrA1;
446 | reg [indice_addressWidth - 1:0] indptrA2;
447 | 
448 | reg [indice_addressWidth - 1:0] indptrB1;
449 | reg [indice_addressWidth - 1:0] indptrB2;
450 | 
451 | 
452 | 
453 | 
454 | 
455 | input [indptr_addressWidth - 1:0] indptr_startAddressA;
456 | input [indptr_addressWidth - 1:0] indptr_startAddressB;
457 | input [indptr_addressWidth - 1:0] indptr_endAddress;
458 | 
459 | always @(posedge clk or negedge rst) begin : proc_indptrCounterA
460 | 	if(~rst) begin
461 | 		indptrCounterA <= 0;
462 | 	end else begin
463 | 		if(finish) begin 
464 | 			indptrCounterA <= indptr_startAddressA;
465 | 		end
466 | 		else if(triggerA)begin 
467 | 			indptrCounterA <= indptrCounterA + 1;
468 | 		end
469 | 		else begin
470 | 			indptrCounterA <= indptrCounterA; 
471 | 		end			
472 | 	end
473 | end
474 | 
475 | always @(posedge clk or negedge rst) begin : proc_indptrCounterB
476 | 	if(~rst) begin
477 | 		indptrCounterB <= 0;
478 | 	end else begin
479 | 		if(finish) begin 
480 | 			indptrCounterB <= indptr_startAddressB;
481 | 		end
482 | 		else if(triggerB)begin 
483 | 			indptrCounterB <= indptrCounterB + 1;
484 | 		end
485 | 		else begin
486 | 			indptrCounterB <= indptrCounterB; 
487 | 		end
488 | 	end
489 | end
490 | 
491 | assign finish = (indptrCounterA == indptr_startAddressB && indptrCounterB == indptr_endAddress) ? 1'b1: 1'b0;
492 | 
493 | 
494 | 
495 | 
496 | always @(posedge clk or negedge rst) begin : proc_indptrA2
497 | 	if(~rst) begin
498 | 		indptrA2 <= 0;
499 | 	end else begin
500 | 		if(finish) begin
501 | 			indptrA2 <= indptrAout;
502 | 		end
503 | 		else if(indiceCounterA == indptrA2) begin 
504 | 			indptrA2 <= indptrAout;
505 | 		end
506 | 		else begin 
507 | 			indptrA2 <= indptrA2;
508 | 		end
509 | 	end
510 | end
511 | 
512 | always @(posedge clk or negedge rst) begin : proc_indptrA1
513 | 	if(~rst) begin
514 | 		indptrA1 <= 0;
515 | 	end else begin
516 | 		if(finish) begin 
517 | 			indptrA1 <= indptrAout;
518 | 		end
519 | 		else if(indiceCounterA == indptrA2) begin 
520 | 			indptrA1 <= indptrA2;
521 | 		end
522 | 		else begin 
523 | 			indptrA1 <= indptrA1;
524 | 		end
525 | 	end
526 | end
527 | 
528 | always @(posedge clk or negedge rst) begin : proc_indiceCounterA
529 | 	if(~rst) begin
530 | 		indiceCounterA <= 0;
531 | 	end else begin
532 | 		if(triggerA) begin 
533 | 			indiceCounterA <= indptrA1;
534 | 		end
535 | 		else if(indiceCounterA < indptrA2 - 1) begin 
536 | 			indiceCounterA <=  indiceCounterA + 1;
537 | 		end
538 | 		else begin 
539 | 			indiceCounterA <= indiceCounterA;
540 | 		end
541 | 	end
542 | end
543 | 
544 | always @(posedge clk or negedge rst) begin : proc_triggerA
545 | 	if(~rst) begin
546 | 		triggerA <= 0;
547 | 	end else begin
548 | 		if(indiceCounterA == indptrA2 - 1) begin 
549 | 			triggerA <= 1'b1;
550 | 		end
551 | 		else begin 
552 | 			triggerA <= 1'b0;
553 | 		end
554 | 	end
555 | end
556 | 
557 | 
558 | 
559 | 
560 | ///////////////////////////////////////////////////////////////////////////////
561 | 
562 | always @(posedge clk or negedge rst) begin : proc_indptrB2
563 | 	if(~rst) begin
564 | 		indptrB2 <= 0;
565 | 	end else begin
566 | 		if(finish) begin
567 | 			indptrB2 <= indptrBout;
568 | 		end
569 | 		else if(indiceCounterB == indptrB2) begin 
570 | 			indptrB2 <= indptrBout;
571 | 		end
572 | 		else begin 
573 | 			indptrB2 <= indptrB2;
574 | 		end
575 | 	end
576 | end
577 | 
578 | always @(posedge clk or negedge rst) begin : proc_indptrB1
579 | 	if(~rst) begin
580 | 		indptrB1 <= 0;
581 | 	end else begin
582 | 		if(finish) begin 
583 | 			indptrB1 <= indptrBout;
584 | 		end
585 | 		else if(indiceCounterB == indptrB2) begin 
586 | 			indptrB1 <= indptrB2;
587 | 		end
588 | 		else begin 
589 | 			indptrB1 <= indptrB1;
590 | 		end
591 | 	end
592 | end
593 | 
594 | always @(posedge clk or negedge rst) begin : proc_indiceCounterB
595 | 	if(~rst) begin
596 | 		indiceCounterB <= 0;
597 | 	end else begin
598 | 		if(triggerB) begin 
599 | 			indiceCounterB <= indptrB1;
600 | 		end
601 | 		else if(indiceCounterB < indptrB2 - 1) begin 
602 | 			indiceCounterB <=  indiceCounterB + 1;
603 | 		end
604 | 		else begin 
605 | 			indiceCounterB <= indiceCounterB;
606 | 		end
607 | 	end
608 | end
609 | 
610 | always @(posedge clk or negedge rst) begin : proc_triggerB
611 | 	if(~rst) begin
612 | 		triggerB <= 0;
613 | 	end else begin
614 | 		if(indiceCounterB == indptrB2 - 1) begin 
615 | 			triggerB <= 1'b1;
616 | 		end
617 | 		else begin 
618 | 			triggerB <= 1'b0;
619 | 		end
620 | 	end
621 | end
622 | 
623 | 
624 | //////////////////////////////////////////////////////////////
625 | 
626 | 
627 | 
628 | 
629 | endmodule
630 | 
631 | 
632 | 
633 | 
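
The port names `indptr` and `indice` suggest that the adjacency matrix of a subgraph is stored in compressed sparse row (CSR) form, with the column buffer holding input feature vectors and the row buffer receiving the aggregated rows. Assuming that reading is right, the following Python sketch shows the same computation in software: each output row is the sum of the feature vectors of its neighbors. The function name `aggregate_csr` and the list-based data layout are illustrative assumptions, not code from the repository.

```python
from typing import List, Sequence


def aggregate_csr(
    indptr: Sequence[int],
    indices: Sequence[int],
    features: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Sum neighbor feature vectors for every row of a CSR adjacency matrix.

    Row i owns the neighbor indices indices[indptr[i]:indptr[i + 1]].
    """
    if not features:
        return []
    width = len(features[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0.0] * width
        # Walk the slice of `indices` that belongs to this row.
        for j in range(indptr[row], indptr[row + 1]):
            neighbor = features[indices[j]]
            for c in range(width):
                acc[c] += neighbor[c]
        out.append(acc)
    return out


if __name__ == "__main__":
    # 3-node path graph 0 - 1 - 2, two features per node
    indptr = [0, 1, 3, 4]
    indices = [1, 0, 2, 1]
    features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(aggregate_csr(indptr, indices, features))
```

The RTL appears to split the row range between two counters (A and B, bounded by `indptr_startAddressA`, `indptr_startAddressB`, and `indptr_endAddress`) that run in parallel; this sequential sketch does not model that split or any pipeline timing.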


--------------------------------------------------------------------------------
/hardware/README.md:
--------------------------------------------------------------------------------
 1 | ## GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platform [[PDF](https://arxiv.org/abs/2001.02498)]
 2 | ---
 3 | ### Abstract 
 4 | Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a light-weight pre-processing step based on a graph theoretic approach. Such pre-processing performed on the CPU significantly reduces the memory access requirements and the computation to be performed on the FPGA.  To accelerate the weight update in GCN layers, we propose a systolic array based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load-balance and resource utilization by accurate performance modeling. We evaluate our design on Intel Stratix 10 board hosted by a 40-core Xeon server. On three large graphs, we achieve an order of magnitude training speedup with negligible accuracy loss, compared with state-of-the-art implementation on a multi-core platform. 
 5 | 
 6 | 
 7 | ### Implementation details
 8 | - FPGA platform: **Intel Stratix 10 1SX280LH3F55I3XG**
 9 |   - Tool: **Quartus Prime Pro 20.1** for synthesis
10 | - CPU platform: 40-core Xeon server (E5-2698 v4 @2.2GHz, hyper-threaded)
11 | 
12 | ### Hardware architecture
13 | 
14 | ![arch](./hardware-ARCH.PNG)
15 | 
16 | ### Directory Structure
17 | - /hardware: contains the Verilog code of the design
18 |   - system.v: main module
19 |   - Aggregation.v: feature aggregation module
20 |   - reluArray.v: activation module
21 |   - transformationPlusActivation.v: transformation and activation modules
22 |   - Data buffer:
23 |     - columnbuffer.v
24 |     - indiceDoubleBuffer.v
25 |     - indptrDoubleBuffer.v
26 |     - weightBuffer.v
27 |     - rowbuffer.v
28 | - /perf_model: performance model, implemented as a Python script
29 | 
30 | ### Configuring IP Cores
31 | - Floating-point accumulator--sysAcc
32 |   - Find: IP catalog => basic function => arithmetic => floating point function
33 |   - Name: Accumulator
34 |   - Other Info: choose Generate Enable and generate HDL
35 | - Floating-point multiplier--mul
36 |   - Find: IP catalog => basic function => arithmetic => floating point function
37 |   - Name: Multiply
38 |   - Other Info: in Functionality, choose Generate Enable and generate HDL
39 | - Floating-point comparator
40 |   - Find: IP catalog => basic function => arithmetic => functionality => Comparison
41 |   - Name: Greater than
42 |   - Other Info: in Functionality, choose Generate Enable and generate HDL
43 | - RAM 1 port--weight buffer
44 |   - Find: IP catalog => on-chip memory => RAM: 1-port Intel FPGA IP
45 |   - Name: weightbuffer
46 |   - Other Info: choose Generate Enable and generate HDL
47 | - RAM 2 port--row buffer
48 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
49 |   - Name: row buffer
50 |   - Other Info: choose Generate Enable and generate HDL
51 | - RAM 2 port--column buffer
52 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
53 |   - Name: column buffer
54 |   - Other Info: choose Generate Enable and generate HDL
55 | - RAM 2 port--indptr buffer
56 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
57 |   - Name: indptr buffer
58 |   - Other Info: choose Generate Enable and generate HDL
63 | - RAM 2 port--indice buffer
64 |   - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP
65 |   - Name: indice buffer
66 |   - Other Info: choose Generate Enable and generate HDL
67 |   
68 | ### Setting up the project
69 | 1. Create a new project in Quartus Prime Pro 20.1 and set **Intel Stratix 10 1SX280LH3F55I3XG** as the target device
70 | 2. Add all the .v files in the /hardware directory to the project
71 | 3. Set system.v as the top-level module
72 | 4. Start the compilation
73 | 


--------------------------------------------------------------------------------
/hardware/columnbuffer.v:
--------------------------------------------------------------------------------
 1 | module columnbuffer #(
 2 | 	parameter dataWidth = 32,   // the data bit width
 3 |     parameter pvadd = 256,   // the parallelism in the feature adder
 4 |     parameter k = 1024  // define k as the block size
 5 | )
 6 | (
 7 | 	clk,
 8 | 	rst,
 9 | 	enableA,
10 | 	enableB,
11 | 	writeEnableA1,
12 | 	writeEnableA2,
13 | 	writeEnableB1,
14 | 	writeEnableB2,
15 | 	readportA1,
16 | 	readportA2,
17 | 	writeportA1,
18 | 	writeportA2,
19 | 	readportB1,
20 | 	readportB2,
21 | 	writeportB1,
22 | 	writeportB2,
23 | 	addressportA1,
24 | 	addressportA2,
25 | 	addressportB1,
26 | 	addressportB2
27 | );
28 | 
29 | 
30 | parameter addressWidth = $clog2(k);
31 | parameter dataportWidth = dataWidth*pvadd;
32 | parameter memorySize = k * dataportWidth;
33 | 
34 | 
35 | // define the input
36 | input clk;
37 | input rst;
38 | input enableA;
39 | input enableB;
40 | input writeEnableA1;
41 | input writeEnableA2;
42 | input writeEnableB1;
43 | input writeEnableB2;
44 | 
45 | //define the addressport
46 | 
47 | input [addressWidth - 1 : 0] addressportA1;
48 | input [addressWidth - 1 : 0] addressportA2;
49 | input [addressWidth - 1 : 0] addressportB1;
50 | input [addressWidth - 1 : 0] addressportB2;
51 | 
52 | // define the dataport input
53 | 
54 | 
55 | output [dataportWidth - 1:0] readportA1;
56 | output [dataportWidth - 1:0] readportA2;
57 | input [dataportWidth - 1:0] writeportA1;
58 | input [dataportWidth - 1:0] writeportA2;
59 | output [dataportWidth - 1:0] readportB1;
60 | output [dataportWidth - 1:0] readportB2;
61 | input [dataportWidth - 1:0] writeportB1;
62 | input [dataportWidth - 1:0] writeportB2;
63 | 
64 | 	rowbuffer u0 (
65 | 		.data_a    (writeportA1),    //   input,  width = 8192,    data_a.datain_a
66 | 		.q_a       (readportA1),       //  output,  width = 8192,       q_a.dataout_a
67 | 		.data_b    (writeportA2),    //   input,  width = 8192,    data_b.datain_b
68 | 		.q_b       (readportA2),       //  output,  width = 8192,       q_b.dataout_b
69 | 		.address_a (addressportA1), //   input,    width = 10, address_a.address_a
70 | 		.address_b (addressportA2), //   input,    width = 10, address_b.address_b
71 | 		.wren_a    (writeEnableA1),    //   input,     width = 1,    wren_a.wren_a
72 | 		.wren_b    (writeEnableA2),    //   input,     width = 1,    wren_b.wren_b
73 | 		.clock     (clk)      //   input,     width = 1,     clock.clk
74 | 	);
75 | 
76 | 	rowbuffer u1 (
77 | 		.data_a    (writeportB1),    //   input,  width = 8192,    data_a.datain_a
78 | 		.q_a       (readportB1),       //  output,  width = 8192,       q_a.dataout_a
79 | 		.data_b    (writeportB2),    //   input,  width = 8192,    data_b.datain_b
80 | 		.q_b       (readportB2),       //  output,  width = 8192,       q_b.dataout_b
81 | 		.address_a (addressportB1), //   input,    width = 10, address_a.address_a
82 | 		.address_b (addressportB2), //   input,    width = 10, address_b.address_b
83 | 		.wren_a    (writeEnableB1),    //   input,     width = 1,    wren_a.wren_a
84 | 		.wren_b    (writeEnableB2),    //   input,     width = 1,    wren_b.wren_b
85 | 		.clock     (clk)      //   input,     width = 1,     clock.clk
86 | 	);
87 | 
88 | 
89 | 
90 | endmodule
91 | 


--------------------------------------------------------------------------------
/hardware/hardware-ARCH.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pgroupATusc/GraphACT/f0ecd30500247647105762934f5967350a0d8eab/hardware/hardware-ARCH.PNG


--------------------------------------------------------------------------------
/hardware/indiceDoubleBuffer.v:
--------------------------------------------------------------------------------
 1 | module indiceDoubleBuffer #(
 2 |     parameter k = 1024  // define k as the block size
 3 | )
 4 | (
 5 | 	clk,
 6 | 	rst,
 7 | 	enableA,
 8 | 	enableB,
 9 | 	writeEnableA1,
10 | 	writeEnableA2,
11 | 	writeEnableB1,
12 | 	writeEnableB2,
13 | 	readportA1,
14 | 	readportA2,
15 | 	writeportA1,
16 | 	writeportA2,
17 | 	readportB1,
18 | 	readportB2,
19 | 	writeportB1,
20 | 	writeportB2,
21 |     addressportA1,
22 |     addressportA2,
23 |     addressportB1,
24 |     addressportB2
25 | );
26 | 
27 | 
28 | parameter addressWidth = $clog2(k*k/32);
29 | parameter dataportWidth = $clog2(k);
30 | parameter memorySize = k*k/32 * dataportWidth;
31 | 
32 | 
33 | // define the input
34 | input clk;
35 | input rst;
36 | input enableA;
37 | input enableB;
38 | input writeEnableA1;
39 | input writeEnableA2;
40 | input writeEnableB1;
41 | input writeEnableB2;
42 | 
43 | //define the addressport
44 | 
45 | input [addressWidth - 1 : 0] addressportA1;
46 | input [addressWidth - 1 : 0] addressportA2;
47 | input [addressWidth - 1 : 0] addressportB1;
48 | input [addressWidth - 1 : 0] addressportB2;
49 | 
50 | // define the data ports (read ports are outputs, write ports are inputs)
51 | 
52 | 
53 | output [dataportWidth - 1:0] readportA1;
54 | output [dataportWidth - 1:0] readportA2;
55 | input [dataportWidth - 1:0] writeportA1;
56 | input [dataportWidth - 1:0] writeportA2;
57 | output [dataportWidth - 1:0] readportB1;
58 | output [dataportWidth - 1:0] readportB2;
59 | input [dataportWidth - 1:0] writeportB1;
60 | input [dataportWidth - 1:0] writeportB2;
61 | 
62 | indicebuffer u0 (
63 | 	.data_a    (writeportA1),    //   input,  width = 10,    data_a.datain_a
64 | 	.q_a       (readportA1),       //  output,  width = 10,       q_a.dataout_a
65 | 	.data_b    (writeportA2),    //   input,  width = 10,    data_b.datain_b
66 | 	.q_b       (readportA2),       //  output,  width = 10,       q_b.dataout_b
67 | 	.address_a (addressportA1), //   input,  width = 15, address_a.address_a
68 | 	.address_b (addressportA2), //   input,  width = 15, address_b.address_b
69 | 	.wren_a    (writeEnableA1),    //   input,   width = 1,    wren_a.wren_a
70 | 	.wren_b    (writeEnableA2),    //   input,   width = 1,    wren_b.wren_b
71 | 	.clock     (clk)      //   input,   width = 1,     clock.clk
72 | );
73 | 
74 | indicebuffer u1 (
75 | 	.data_a    (writeportB1),    //   input,  width = 10,    data_a.datain_a
76 | 	.q_a       (readportB1),       //  output,  width = 10,       q_a.dataout_a
77 | 	.data_b    (writeportB2),    //   input,  width = 10,    data_b.datain_b
78 | 	.q_b       (readportB2),       //  output,  width = 10,       q_b.dataout_b
79 | 	.address_a (addressportB1), //   input,  width = 15, address_a.address_a
80 | 	.address_b (addressportB2), //   input,  width = 15, address_b.address_b
81 | 	.wren_a    (writeEnableB1),    //   input,   width = 1,    wren_a.wren_a
82 | 	.wren_b    (writeEnableB2),    //   input,   width = 1,    wren_b.wren_b
83 | 	.clock     (clk)      //   input,   width = 1,     clock.clk
84 | );
85 | 
86 | 
87 | 
88 | 
89 | 
90 | endmodule


--------------------------------------------------------------------------------
/hardware/indptrDoubleBuffer.v:
--------------------------------------------------------------------------------
 1 | module indptrDoubleBuffer #(
 2 |     parameter k = 1024  // define k as the block size
 3 | )
 4 | (
 5 | 	clk,
 6 | 	rst,
 7 | 	enableA,
 8 | 	enableB,
 9 | 	writeEnableA1,
10 | 	writeEnableA2,
11 | 	writeEnableB1,
12 | 	writeEnableB2,
13 | 	readportA1,
14 | 	readportA2,
15 | 	writeportA1,
16 | 	writeportA2,
17 | 	readportB1,
18 | 	readportB2,
19 | 	writeportB1,
20 | 	writeportB2,
21 |     addressportA1,
22 |     addressportA2,
23 |     addressportB1,
24 |     addressportB2
25 | );
26 | 
27 | 
28 | parameter addressWidth = $clog2(k + 1);
29 | parameter dataportWidth =  $clog2(k*k/32);
30 | parameter memorySize = dataportWidth * (k+1) ;
31 | 
32 | 
33 | // define the input
34 | input clk;
35 | input rst;
36 | input enableA;
37 | input enableB;
38 | input writeEnableA1;
39 | input writeEnableA2;
40 | input writeEnableB1;
41 | input writeEnableB2;
42 | 
43 | //define the addressport
44 | 
45 | input [addressWidth - 1 : 0] addressportA1;
46 | input [addressWidth - 1 : 0] addressportA2;
47 | input [addressWidth - 1 : 0] addressportB1;
48 | input [addressWidth - 1 : 0] addressportB2;
49 | 
50 | // define the data ports (read ports are outputs, write ports are inputs)
51 | 
52 | 
53 | output [dataportWidth - 1:0] readportA1;
54 | output [dataportWidth - 1:0] readportA2;
55 | input [dataportWidth - 1:0] writeportA1;
56 | input [dataportWidth - 1:0] writeportA2;
57 | output [dataportWidth - 1:0] readportB1;
58 | output [dataportWidth - 1:0] readportB2;
59 | input [dataportWidth - 1:0] writeportB1;
60 | input [dataportWidth - 1:0] writeportB2;
61 | 
62 | indptrbuffer u0 (
63 | 	.data_a    (writeportA1),    //   input,  width = 15,    data_a.datain_a
64 | 	.q_a       (readportA1),       //  output,  width = 15,       q_a.dataout_a
65 | 	.data_b    (writeportA2),    //   input,  width = 15,    data_b.datain_b
66 | 	.q_b       (readportA2),       //  output,  width = 15,       q_b.dataout_b
67 | 	.address_a (addressportA1), //   input,  width = 11, address_a.address_a
68 | 	.address_b (addressportA2), //   input,  width = 11, address_b.address_b
69 | 	.wren_a    (writeEnableA1),    //   input,   width = 1,    wren_a.wren_a
70 | 	.wren_b    (writeEnableA2),    //   input,   width = 1,    wren_b.wren_b
71 | 	.clock     (clk)      //   input,   width = 1,     clock.clk
72 | );
73 | 
74 | indptrbuffer u1 (
75 | 	.data_a    (writeportB1),    //   input,  width = 15,    data_a.datain_a
76 | 	.q_a       (readportB1),       //  output,  width = 15,       q_a.dataout_a
77 | 	.data_b    (writeportB2),    //   input,  width = 15,    data_b.datain_b
78 | 	.q_b       (readportB2),       //  output,  width = 15,       q_b.dataout_b
79 | 	.address_a (addressportB1), //   input,  width = 11, address_a.address_a
80 | 	.address_b (addressportB2), //   input,  width = 11, address_b.address_b
81 | 	.wren_a    (writeEnableB1),    //   input,   width = 1,    wren_a.wren_a
82 | 	.wren_b    (writeEnableB2),    //   input,   width = 1,    wren_b.wren_b
83 | 	.clock     (clk)      //   input,   width = 1,     clock.clk
84 | );
85 | 
86 | 
87 | 
88 | endmodule
89 | 


--------------------------------------------------------------------------------
/hardware/reluArray.v:
--------------------------------------------------------------------------------
 1 | 
 2 | module reluArray #(
 3 | 	parameter dataWidth = 32,
 4 |     parameter pactivation= 128
 5 | )
 6 | (
 7 | 	clk,
 8 | 	rst,
 9 | 	inputArray,
10 | 	outputArray
11 | );
12 | 
13 | // define the input
14 | input clk;
15 | input rst;
16 | input [dataWidth*pactivation -1:0] inputArray;
17 | 
18 | //define the output
19 | output [dataWidth*pactivation -1:0] outputArray;
20 | 
21 | wire [dataWidth*pactivation -1:0] outputArray;
22 | 
23 | 
24 | genvar i;
25 | generate
26 | 	for(i = 0;i < pactivation; i = i+1) begin : singlerelu
27 | 		reluUnit #(.dataWidth(dataWidth)) reluUnitinstance
28 | 	    (
29 |     		.clk(clk), // system clock 
30 |     		.rst(rst),  // system rst
31 |     		.z(inputArray[dataWidth*(i+1) -1: dataWidth*i]),  // input value Z
32 |     		.a(outputArray[dataWidth*(i+1) -1: dataWidth*i])  // output activation A
33 |     	);
34 | 
35 | 	end
36 | endgenerate
37 | 
38 | endmodule


--------------------------------------------------------------------------------
/hardware/reluUnit.v:
--------------------------------------------------------------------------------
 1 | 
 2 | module reluUnit #(
 3 |     parameter dataWidth = 32
 4 | )
 5 | (
 6 |     clk, // system clock 
 7 |     rst,  // system rst
 8 |     z,  // input value Z
 9 |     a); // output activation A
10 | input clk;
11 | input rst;
12 | input[dataWidth -1 : 0] z;
13 | output[dataWidth - 1: 0] a;
14 | 
15 | 
16 | 
17 | wire [7:0] compare;
18 | reg [dataWidth -1 : 0] a;
19 | 
20 | 
21 | 
22 | comp compareUnit (
23 | 		.clk    (clk),    //   input,   width = 1,    clk.clk
24 | 		.areset (rst), //   input,   width = 1, areset.reset
25 | 		.a      (z),      //   input,  width = 32,      a.a
26 | 		.b      (32'd0),      //   input,  width = 32,      b.b
27 | 		.q      (compare)       //  output,   width = 1,      q.q
28 | 	);
29 | 
30 | 
31 | reg [dataWidth - 1: 0] z1;
32 | reg [dataWidth - 1: 0] z2;
33 | 
34 | always @(posedge clk or negedge rst) begin : proc_z1
35 | 	if(~rst) begin
36 | 		z1 <= 0;
37 | 	end else begin
38 | 		z1 <= z;
39 | 	end
40 | end
41 | 
42 | always @(posedge clk or negedge rst) begin : proc_z2
43 | 	if(~rst) begin
44 | 		z2 <= 0;
45 | 	end else begin
46 | 		z2 <= z1;
47 | 	end
48 | end
49 | 
50 | 
51 | always @(posedge clk or negedge rst) begin : proc_a
52 | 	if(~rst) begin
53 | 		a <= 0;
54 | 	end else begin
55 | 		if(compare[0])
56 | 			begin
57 | 				a <= z2; 
58 | 			end
59 | 		else
60 | 			begin
61 | 				a <= 32'd0; 
62 | 			end
63 | 	end
64 | end
65 | 
66 | 
67 | 
68 | 
69 | endmodule
70 | 


--------------------------------------------------------------------------------
/hardware/rowbuffer.v:
--------------------------------------------------------------------------------
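 1 | module rowbuffer #(  // NOTE: shares its name with the RAM instances u0/u1 below, whose ports (data_a, q_a, ...) are not ports of this wrapper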
 2 | 	parameter dataWidth = 32,   // the data bit width
 3 |     parameter pvadd= 256,   // the parallelism in the feature adder
 4 |     parameter k = 1024  // define k as the block size
 5 | )
 6 | (
 7 | 	clk,
 8 | 	rst,
 9 | 	enableA,
10 | 	enableB,
11 | 	writeEnableA1,
12 | 	writeEnableA2,
13 | 	writeEnableB1,
14 | 	writeEnableB2,
15 | 	readportA1,
16 | 	readportA2,
17 | 	writeportA1,
18 | 	writeportA2,
19 | 	readportB1,
20 | 	readportB2,
21 | 	writeportB1,
22 | 	writeportB2,
23 |   addressportA1,
24 |   addressportA2,
25 |   addressportB1,
26 |   addressportB2
27 | );
28 | 
29 | 
30 | parameter addressWidth = $clog2(k);
31 | parameter dataportWidth = dataWidth*pvadd;
32 | parameter memorySize = k * dataportWidth;
33 | 
34 | 
35 | // define the input
36 | input clk;
37 | input rst;
38 | input enableA;
39 | input enableB;
40 | input writeEnableA1;
41 | input writeEnableA2;
42 | input writeEnableB1;
43 | input writeEnableB2;
44 | 
45 | //define the addressport
46 | 
47 | input [addressWidth - 1 : 0] addressportA1;
48 | input [addressWidth - 1 : 0] addressportA2;
49 | input [addressWidth - 1 : 0] addressportB1;
50 | input [addressWidth - 1 : 0] addressportB2;
51 | 
52 | // define the data ports (read ports are outputs, write ports are inputs)
53 | 
54 | 
55 | output [dataportWidth - 1:0] readportA1;
56 | output [dataportWidth - 1:0] readportA2;
57 | input [dataportWidth - 1:0] writeportA1;
58 | input [dataportWidth - 1:0] writeportA2;
59 | output [dataportWidth - 1:0] readportB1;
60 | output [dataportWidth - 1:0] readportB2;
61 | input [dataportWidth - 1:0] writeportB1;
62 | input [dataportWidth - 1:0] writeportB2;
63 | 
64 | rowbuffer u0 (
65 | 	.data_a    (writeportA1),    //   input,  width = 8192,    data_a.datain_a
66 | 	.q_a       (readportA1),       //  output,  width = 8192,       q_a.dataout_a
67 | 	.data_b    (writeportA2),    //   input,  width = 8192,    data_b.datain_b
68 | 	.q_b       (readportA2),       //  output,  width = 8192,       q_b.dataout_b
69 | 	.address_a (addressportA1), //   input,    width = 10, address_a.address_a
70 | 	.address_b (addressportA2), //   input,    width = 10, address_b.address_b
71 | 	.wren_a    (writeEnableA1),    //   input,     width = 1,    wren_a.wren_a
72 | 	.wren_b    (writeEnableA2),    //   input,     width = 1,    wren_b.wren_b
73 | 	.clock     (clk)      //   input,     width = 1,     clock.clk
74 | );
75 | 
76 | rowbuffer u1 (
77 | 	.data_a    (writeportB1),    //   input,  width = 8192,    data_a.datain_a
78 | 	.q_a       (readportB1),       //  output,  width = 8192,       q_a.dataout_a
79 | 	.data_b    (writeportB2),    //   input,  width = 8192,    data_b.datain_b
80 | 	.q_b       (readportB2),       //  output,  width = 8192,       q_b.dataout_b
81 | 	.address_a (addressportB1), //   input,    width = 10, address_a.address_a
82 | 	.address_b (addressportB2), //   input,    width = 10, address_b.address_b
83 | 	.wren_a    (writeEnableB1),    //   input,     width = 1,    wren_a.wren_a
84 | 	.wren_b    (writeEnableB2),    //   input,     width = 1,    wren_b.wren_b
85 | 	.clock     (clk)      //   input,     width = 1,     clock.clk
86 | );
87 | 
88 | 
89 | 
90 | endmodule
91 | 


--------------------------------------------------------------------------------
/hardware/sysArray.v:
--------------------------------------------------------------------------------
  1 | 
  2 | module sysArray #(
  3 | 	parameter dataWidth = 32,
  4 | 	parameter SysDimension = 32,
  5 | 	parameter featureLen = 128
  6 | )
  7 | (
  8 | 	clk, // define the input clock
  9 | 	rst, // define the input reset
 10 | 	enable,
 11 | 	weightArray, // define the input weight array
 12 | 	featureArray, // define the input feature array
 13 | 	outputArray // define the output feature array
 14 | );
 15 | 
 16 | //define the input
 17 | input clk;
 18 | input rst;
 19 | input enable;
 20 | input [dataWidth*SysDimension - 1:0] weightArray;
 21 | input [dataWidth*SysDimension - 1:0] featureArray;
 22 | output [dataWidth*SysDimension - 1:0] outputArray;
 23 | reg [dataWidth*SysDimension - 1:0] outputArray;
 24 | 
 25 | wire [dataWidth*SysDimension - 1:0] weightWireArray[0:SysDimension - 1];
 26 | wire [dataWidth*SysDimension - 1:0] featureWireArray[0:SysDimension - 1];
 27 | 
 28 | wire [dataWidth*SysDimension - 1:0] resultWireArray[0:SysDimension - 1];
 29 | 
 30 | 
 31 | genvar i, j;
 32 | generate
 33 | 	for(i = 0;i < SysDimension; i = i + 1) begin : rowUroll
 34 | 		for(j = 0; j < SysDimension; j = j + 1) begin : columnUroll
 35 | 
 36 | 			// define weight wire connection
 37 | 
 38 | 
 39 | 			wire [dataWidth - 1:0] iweightwire;
 40 | 			
 41 | 			assign iweightwire = (j == 0) ? weightArray[(i + 1)*dataWidth - 1: i * dataWidth]: weightWireArray[j - 1][(i + 1)*dataWidth - 1: i * dataWidth];
 42 | 
 43 | 			wire [dataWidth - 1:0] iweightwireOut;
 44 | 
 45 | 			assign iweightwireOut = weightWireArray[j][(i + 1)*dataWidth  - 1: i * dataWidth];
 46 | 
 47 | 			// define feature wire connection
 48 | 
 49 | 			wire [dataWidth - 1:0] ifeaturewire ;
 50 | 
 51 | 			assign ifeaturewire = (i == 0) ? featureArray[(j + 1)*dataWidth - 1: j * dataWidth]: featureWireArray[i - 1][(j + 1)*dataWidth - 1: j * dataWidth];
 52 | 
 53 | 			wire [dataWidth - 1:0] ifeaturewireOut;
 54 | 
 55 | 			assign ifeaturewireOut = featureWireArray[i][(j + 1)*dataWidth - 1: j * dataWidth];
 56 | 
 57 | 			// define the result wire connection
 58 | 
 59 | 			wire [dataWidth - 1:0] iresultIn;
 60 | 
 61 | 			assign iresultIn = (i == 0)? 0:  resultWireArray[i - 1][(j + 1)*dataWidth - 1: j * dataWidth];
 62 | 
 63 | 			wire [dataWidth - 1:0] iresultOut;
 64 | 
 65 | 			assign resultWireArray[i][(j + 1)*dataWidth - 1: j * dataWidth] = iresultOut;
 66 | 
 67 | 			
 68 | 
 69 | 			sysUnit #(
 70 |    	 			.dataWidth(dataWidth),
 71 |     			.RowIndex(i),
 72 |     			.ColumnIndex(j),
 73 |     			.featureLen(featureLen),
 74 |     			.SysDimension(SysDimension)
 75 | 			) singleSysUnit
 76 | 			(
 77 | 				.clk(clk),                    // input clock
 78 | 				.rst(rst),                    // input reset
 79 | 				.enable(enable),			  // input enable		
 80 | 				.weight(iweightwire),         // input weight
 81 | 				.featureIn(ifeaturewire),     // input feature
 82 | 				.preresult(iresultIn),        // input previous result
 83 | 				.featureOut(iresultOut),      // output feature  
 84 | 				.featurePass(ifeaturewireOut),// pass feature to the next row
 85 | 				.weightPass(iweightwireOut)   // pass weight to the next column
 86 | 			);
 87 | 
 88 | 		end
 89 | 	end
 90 | endgenerate
 91 | 
 92 | 
 93 | always @(posedge clk or negedge rst) begin : proc_outputArray
 94 |  	if(~rst) begin
 95 |  		outputArray <= 0;
 96 |  	end else begin
 97 |  		outputArray <= resultWireArray[SysDimension - 1];
 98 |  	end
 99 |  end 
100 | 
101 | endmodule


--------------------------------------------------------------------------------
/hardware/sysUnit.v:
--------------------------------------------------------------------------------
  1 | module sysUnit #(
  2 |     parameter dataWidth = 32,
  3 |     parameter RowIndex = 0,
  4 |     parameter ColumnIndex = 0,
  5 |     parameter featureLen = 128,
  6 |     parameter SysDimension = 32
  7 | )
  8 | (
  9 | 	clk,               // input clock
 10 | 	rst,               // input reset
 11 | 	enable,			   // input enable		
 12 | 	weight,            // input weight
 13 | 	featureIn,         // input feature
 14 | 	preresult,         // input previous result
 15 | 	featureOut,        // output feature  
 16 | 	featurePass,       // pass feature
 17 | 	weightPass         // pass weight 
 18 | );
 19 | 
 20 | // define the input
 21 | input clk;
 22 | input rst;
 23 | input enable;
 24 | input [dataWidth - 1:0] weight;
 25 | input [dataWidth - 1:0] featureIn;
 26 | input [dataWidth - 1:0] preresult;
 27 | 
 28 | //define the output
 29 | output [dataWidth - 1:0] featureOut;
 30 | reg [dataWidth - 1:0] featureOut;
 31 | output [dataWidth - 1:0] featurePass;
 32 | reg [dataWidth - 1:0] featurePass;
 33 | output [dataWidth - 1:0] weightPass;
 34 | reg [dataWidth - 1:0] weightPass;
 35 | 
 36 | 
 37 | parameter InitialLantency = 45;
 38 | parameter StartPlace = RowIndex + ColumnIndex + featureLen + InitialLantency  + 1;
 39 | parameter counterWidth = $clog2(InitialLantency + RowIndex + ColumnIndex + featureLen + 8);
 40 | 
 41 | 
 42 | parameter InitialStat = 1'b0;
 43 | parameter PipelineStat = 1'b1;
 44 | 
 45 | always @(posedge clk or negedge rst) begin : proc_weightPass
 46 | 	if(~rst) begin
 47 | 		weightPass <= 0;
 48 | 	end else begin
 49 | 		if(enable)
 50 | 			weightPass <= weight;
 51 | 		else
 52 | 			weightPass <= weightPass;
 53 | 	end
 54 | end
 55 | 
 56 | always @(posedge clk or negedge rst) begin : proc_featurePass
 57 | 	if(~rst) begin
 58 | 		featurePass <= 0;
 59 | 	end else begin
 60 | 		if(enable)
 61 | 			featurePass <= featureIn;
 62 | 		else
 63 | 			featurePass <= featurePass;
 64 | 	end
 65 | end
 66 | 
 67 | 
 68 | reg [counterWidth - 1: 0] counter;
 69 | 
 70 | reg state;
 71 | 
 72 | always @(posedge clk or negedge rst) begin : proc_state
 73 | 	if(~rst) begin
 74 | 		state <= InitialStat;
 75 | 	end else begin
 76 | 		if(state == InitialStat && counter == StartPlace) begin 
 77 | 			state <= PipelineStat;
 78 | 		end
 79 | 		else begin 
 80 | 			state <= state;
 81 | 		end
 82 | 	end
 83 | end
 84 | 
 85 | 
 86 | always @(posedge clk or negedge rst) begin : proc_counter
 87 | 	if(~rst) begin
 88 | 		counter <= 0;
 89 | 	end else begin
 90 | 		if(enable) begin 
 91 | 			if(state == InitialStat && counter < StartPlace - 1)
 92 | 				counter <= counter + 1;
 93 | 			else if(state == InitialStat && counter ==  StartPlace - 1)
 94 | 				counter <= 0;
 95 | 			else if(state == PipelineStat && counter < featureLen - 1)
 96 | 				counter <= counter + 1;
 97 | 			else if(state == PipelineStat && counter == featureLen - 1)
 98 | 			    counter <= 0;
 99 | 			else
100 | 				counter <= counter + 1;
101 | 		end
102 | 		else begin 
103 | 			counter <= counter;
104 | 		end
105 | 	end
106 | end
107 | 
108 | 
109 | 
110 | reg dataValid;
111 | 
112 | always @(posedge clk or negedge rst) begin : proc_dataValid
113 | 	if(~rst) begin
114 | 		dataValid <= 1'b0;
115 | 	end else begin
116 | 		if( state == InitialStat  && enable && counter >= RowIndex + ColumnIndex )
117 | 			dataValid <= 1'b1;
118 | 		else if(state == PipelineStat && enable)
119 | 			dataValid <= 1'b1;
120 | 		else
121 | 			dataValid <= 1'b0;
122 | 	end
123 | end
124 | 
125 | reg dataLast;
126 | 
127 | always @(posedge clk or negedge rst) begin : proc_dataLast
128 | 	if(~rst) begin
129 | 		dataLast <= 0;
130 | 	end else begin
131 | 		if(state == InitialStat && enable && counter == RowIndex + ColumnIndex  + featureLen)
132 | 			dataLast <= 1'b1;
133 | 		else if(state == PipelineStat && enable && counter == featureLen - InitialLantency)
134 | 			dataLast <= 1'b1;
135 | 		else
136 | 			dataLast <= 1'b0;
137 | 	end
138 | end
139 | 
140 | wire Mulvalid; // NOTE: never driven in this module, although it feeds the acc input of singleAcc below
141 | wire [dataWidth - 1:0] Mulresult;
142 | wire Mullast;
143 | 
144 | 
145 | 
146 | mul u0 (
147 | 	.clk    (clk),    //   input,   width = 1,    clk.clk
148 | 	.areset (rst), //   input,   width = 1, areset.reset
149 | 	.a      (weight),      //   input,  width = 32,      a.a
150 | 	.b      (featureIn),      //   input,  width = 32,      b.b
151 | 	.q      (Mulresult)       //  output,  width = 32,      q.q
152 | );
153 | 
154 | 
155 | 
156 | 
157 | wire Accvalid;
158 | wire [dataWidth - 1:0] Accresult;
159 | wire Acclast;
160 | 
161 | 
162 | 	singleAcc u1 (
163 | 		.clk    (clk),    //   input,   width = 1,    clk.clk
164 | 		.areset (rst), //   input,   width = 1, areset.reset
165 | 		.a      (Mulresult),      //   input,  width = 32,      a.a
166 | 		.q      (Accresult),      //  output,  width = 32,      q.q
167 | 		.acc    (Mulvalid)     //   input,   width = 1,    acc.acc
168 | 	);
169 | 
170 | assign Acclast = 1'b1;
171 | assign Accvalid = 1'b1;
172 | // 
173 | reg [dataWidth - 1:0] temResult;
174 | 
175 | always @(posedge clk or negedge rst) begin : proc_temResult
176 | 	if(~rst) begin
177 | 		temResult <= 0;
178 | 	end else begin
179 | 		if(Acclast && Accvalid)
180 | 			temResult <= Accresult;
181 | 		else
182 | 			temResult <= temResult;
183 | 	end
184 | end
185 | 
186 | always @(posedge clk or negedge rst) begin : proc_featureOut
187 | 	if(~rst) begin
188 | 		featureOut <= 0;
189 | 	end else begin
190 | 		if(state == PipelineStat && enable && counter == 2*SysDimension - (RowIndex + ColumnIndex))
191 | 			featureOut <= temResult;
192 | 		else if (state == PipelineStat && enable && counter == 2*SysDimension - (RowIndex + ColumnIndex)) // NOTE: identical to the condition above, so this preresult branch is never taken
193 | 			featureOut <= preresult;
194 | 		else
195 | 			featureOut <= featureOut;
196 | 	end
197 | end
198 | 
199 | 
200 |  
201 | endmodule
202 | 


--------------------------------------------------------------------------------
/hardware/system.v:
--------------------------------------------------------------------------------
  1 | `timescale 1ns / 1ps
  2 | //////////////////////////////////////////////////////////////////////////////////
  3 | // Company: 
  4 | // Engineer: 
  5 | // 
  6 | // Create Date: 12/21/2019 05:26:25 PM
  7 | // Design Name: 
  8 | // Module Name: system
  9 | // Project Name: 
 10 | // Target Devices: 
 11 | // Tool Versions: 
 12 | // Description: 
 13 | // 
 14 | // Dependencies: 
 15 | // 
 16 | // Revision:
 17 | // Revision 0.01 - File Created
 18 | // Additional Comments:
 19 | // 
 20 | //////////////////////////////////////////////////////////////////////////////////
 21 | 
 22 | 
 23 | 
 24 | module system(
 25 | 
 26 | 	clk,
 27 | 	rst,
 28 | 	enable,
 29 | 	mode,
 30 | 	indptr_addressportA1_outside,
 31 | 	indptr_addressportB1_outside,
 32 | 	indptr_startAddressA,
 33 | 	indptr_startAddressB,
 34 | 	indptr_endAddress,
 35 | 	indptr_writeportA1,
 36 | 	indptr_writeportB1,
 37 | 	indice_addressportA1_outside,
 38 | 	indice_addressportB1_outside,
 39 | 	indice_writeportA1,
 40 | 	indice_writeportB1,
 41 | 	columnbuffer_addressportA1_outside,
 42 | 	columnbuffer_addressportB1_outside,
 43 | 	columnbuffer_writeportA1_narrow,
 44 | 	columnbuffer_writeportB1_narrow,
 45 | 
 46 | 	weight_narrow,
 47 | 	resultOut_narrow
 48 | 
 49 |  );
 50 | 
 51 | parameter dataWidth = 32;
 52 | parameter k = 1024;
 53 | parameter psys = 24;
 54 | parameter featureLen = 256;
 55 | parameter pactivation = 128;
 56 | parameter pvadd = 256;
 57 | parameter pavdd = 256;
 58 | 
 59 | 
 60 | input clk;
 61 | input rst;
 62 | input enable;
 63 | input mode;
 64 | 
 65 | input [$clog2(k + 1) - 1:0] indptr_addressportA1_outside;
 66 | input [$clog2(k + 1) - 1:0] indptr_addressportB1_outside;
 67 | input [$clog2(k + 1) - 1:0] indptr_startAddressA;
 68 | input [$clog2(k + 1) - 1:0] indptr_startAddressB;
 69 | input [$clog2(k + 1) - 1:0] indptr_endAddress;
 70 | 
 71 | input [$clog2(k*k/32) - 1:0] indptr_writeportA1;
 72 | input [$clog2(k*k/32) - 1:0] indptr_writeportB1;
 73 | 
 74 | input [$clog2(k*k/32) - 1:0] indice_addressportA1_outside;
 75 | input [$clog2(k*k/32) - 1:0] indice_addressportB1_outside;
 76 | 
 77 | input [$clog2(k) - 1:0] indice_writeportA1; 
 78 | input [$clog2(k) - 1:0] indice_writeportB1;
 79 | 
 80 | input [$clog2(k) - 1:0] columnbuffer_addressportA1_outside;
 81 | input [$clog2(k) - 1:0] columnbuffer_addressportB1_outside;
 82 | 
 83 | input wire [dataWidth - 1:0] columnbuffer_writeportA1_narrow;
 84 | input wire [dataWidth - 1:0] columnbuffer_writeportB1_narrow;
 85 | 
 86 | input wire [dataWidth - 1:0] weight_narrow;
 87 | output wire [dataWidth - 1:0] resultOut_narrow;
 88 | 
 89 | wire [dataWidth*psys - 1:0] resultOut;
 90 | 
 91 | wire [dataWidth*pvadd - 1:0] columnbuffer_writeportA1;
 92 | wire [dataWidth*pvadd - 1:0] columnbuffer_writeportB1;
 93 | 
 94 | 
 95 | wire [dataWidth*psys - 1:0] weightIn;
 96 | 
 97 | genvar i;
 98 | generate
 99 | 	for (i = 0; i< pvadd;i = i+1) begin:portnarrowTowide
100 | 		assign columnbuffer_writeportA1[ (i + 1)*dataWidth - 1 : i*dataWidth] = columnbuffer_writeportA1_narrow; 
101 | 		assign columnbuffer_writeportB1[ (i + 1)*dataWidth - 1 : i*dataWidth] = columnbuffer_writeportB1_narrow; 
102 | 	end
103 | endgenerate
104 | 
105 | genvar j;
106 | generate
107 | 	for (j = 0; j < psys ; j = j+1) begin: weightGenration
108 | 		assign weightIn[(j+1)*dataWidth - 1: j*dataWidth] = weight_narrow;
109 | 	end
110 | endgenerate
111 | 
112 | wire [dataWidth - 1:0] weight_narrow_r[0:psys - 1];
113 | 
114 | genvar l;
115 | generate
116 | 	 for (l = 0; l < psys - 1; l = l+1) begin : resultoutgeneration
117 | 	 	if(l == 0) begin : firstloop
118 | 	 		assign weight_narrow_r[l] = resultOut[(l + 2)*dataWidth - 1:(l+1)*dataWidth] & resultOut[(l+1)*dataWidth - 1:l*dataWidth];
119 | 	 	end
120 | 	 	else begin : remainingloop
121 | 	 		assign weight_narrow_r[l] = weight_narrow_r[l-1] & resultOut[(l + 2)*dataWidth - 1:(l+1)*dataWidth];
122 | 	 	end
123 | 	 end
124 | endgenerate
125 | 
126 | assign resultOut_narrow = weight_narrow_r[psys - 2];
127 | 
128 | wire [$clog2(k) - 1:0] rowAddressConnection;
129 | wire [pvadd*dataWidth - 1:0] rowdataConnection;
130 | 
131 | Aggregation #(
132 | 	.dataWidth(dataWidth),   // the data bit width
133 | 	.pvadd(pvadd),   // the parallelism in the feature adder
134 | 	.k(k)  // define k as the block size
135 | ) singleAggregation
136 | (
137 | 	// system signal
138 | 	.clk(clk),
139 | 	.rst(rst),
140 | 	.enable(enable),
141 | 
142 | 	// port for indptrDoubleBuffer
143 | 	.indptr_writeEnableA1(enable),
144 | 	.indptr_writeEnableB1(enable),
145 | 	.indptr_addressportA1_outside(indptr_addressportA1_outside),
146 | 	.indptr_addressportB1_outside(indptr_addressportB1_outside),
147 | 	.indptr_writeportA1(indptr_writeportA1),
148 | 	.indptr_writeportB1(indptr_writeportB1),
149 | 
150 | 	.indptr_startAddressA(indptr_startAddressA),
151 | 	.indptr_startAddressB(indptr_startAddressB),
152 | 	.indptr_endAddress(indptr_endAddress),
153 | 
154 | 	// port for indiceDoubleBuffer
155 | 	.indice_writeEnableA1(enable),
156 | 	.indice_writeEnableB1(enable),
157 | 	.indice_addressportA1_outside(indice_addressportA1_outside),
158 | 	.indice_addressportB1_outside(indice_addressportB1_outside),
159 | 	.indice_writeportA1(indice_writeportA1),
160 | 	.indice_writeportB1(indice_writeportB1),
161 | 
162 | 	// port for DoubleColumnBuffer
163 | 	.columnbuffer_writeEnableA1(enable),
164 | 	.columnbuffer_writeEnableB1(enable),
165 | 	.columnbuffer_addressportA1_outside(columnbuffer_addressportA1_outside),
166 | 	.columnbuffer_addressportB1_outside(columnbuffer_addressportB1_outside),
167 | 	.columnbuffer_writeportA1(columnbuffer_writeportA1),
168 | 	.columnbuffer_writeportB1(columnbuffer_writeportB1),
169 | 
170 | 	// port for Double Row Buffer
171 | 
172 | 	.rowbuffer_addressportA1_outside(rowAddressConnection),
173 | 	.rowbuffer_dataOut(rowdataConnection)
174 | 
175 | 
176 | );
177 | 
178 | 
179 | transformationPlusActivation #(
180 | 	.dataWidth(dataWidth),
181 | 	.psys(psys),
182 | 	.featureLen(featureLen),
183 | 	.pactivation(pactivation),
184 | 	.pavdd(pavdd),
185 | 	.k(k)
186 | ) singletransformationPlusActivation
187 | (
188 | 	.clk(clk),
189 | 	.rst(rst),
190 | 	.enable(enable),
191 | 	// define port for weight buffer
192 | 	.weightWriteEnable(enable),
193 | 	.weigthDataIn(weightIn),
194 | 
195 | 	//define port for systolic array
196 | 	
197 | 
198 | 	//define the port for aggregation in
199 | 	.rowbuffer_address(rowAddressConnection),
200 | 	.rowbuffer_dataOut(rowdataConnection),
201 | 
202 | 	//define the mode
203 | 	.mode(mode), 
204 | 
205 | 	//define the port for the result output
206 | 	.resultOut(resultOut)
207 | 
208 | );
209 | 
210 | 
211 | 
212 | endmodule


--------------------------------------------------------------------------------
/hardware/transformationPlusActivation.v:
--------------------------------------------------------------------------------
  1 | `timescale 1ns / 1ps
  2 | //////////////////////////////////////////////////////////////////////////////////
  3 | // Company: 
  4 | // Engineer: 
  5 | // 
  6 | // Create Date: 12/21/2019 09:53:24 AM
  7 | // Design Name: 
  8 | // Module Name: transformationPlusActivation
  9 | // Project Name: 
 10 | // Target Devices: 
 11 | // Tool Versions: 
 12 | // Description: 
 13 | // 
 14 | // Dependencies: 
 15 | // 
 16 | // Revision:
 17 | // Revision 0.01 - File Created
 18 | // Additional Comments:
 19 | // 
 20 | //////////////////////////////////////////////////////////////////////////////////
 21 | 
 22 | 
 23 | module transformationPlusActivation #(
 24 | 	parameter dataWidth = 32,
 25 | 	parameter psys = 32,
 26 | 	parameter featureLen = 128,
 27 | 	parameter pactivation = 128,
 28 | 	parameter pavdd = 128,
 29 | 	parameter k = 1024
 30 | )
 31 | (
 32 | 	clk,
 33 | 	rst,
 34 | 	enable,
 35 | 	// define port for weight buffer
 36 | 	weightWriteEnable,
 37 | 	weigthDataIn,
 38 | 
 39 | 	//define port for systolic array
 40 | 	
 41 | 
 42 | 	//define the port for aggregation in
 43 | 	rowbuffer_address,
 44 | 	rowbuffer_dataOut,
 45 | 
 46 | 	//define the mode
 47 | 	mode, 
 48 | 
 49 | 	//
 50 | 	resultOut
 51 | 
 52 | );
 53 | input clk;
 54 | input rst;
 55 | input enable;
 56 | input weightWriteEnable;
 57 | input [dataWidth*psys -1 :0] weigthDataIn;
 58 | input mode;
 59 | 
 60 | input [featureLen*dataWidth - 1:0] rowbuffer_dataOut;
 61 | 
 62 | output wire [psys*dataWidth - 1:0] resultOut;
 63 | 
 64 | 
 65 | wire [psys*dataWidth - 1:0] reluIn;
 66 | wire [psys*dataWidth - 1:0] reluOut;
 67 | 
 68 | 
 69 | wire [dataWidth*psys -1 :0] weightBufferTosysArray;
 70 | 
 71 | reg [$clog2(k) - 1:0] rowAddressCounter;
 72 | output wire [$clog2(k) - 1:0] rowbuffer_address;
 73 | 
 74 | 
 75 | always @(posedge clk or negedge rst) begin : rowAddressCounter_r
 76 | 	if(~rst) begin
 77 | 		rowAddressCounter <= 0;
 78 | 	end else begin
 79 | 		rowAddressCounter <= rowAddressCounter + 1 ;
 80 | 	end
 81 | end
 82 | 
 83 | assign rowbuffer_address = rowAddressCounter;
 84 | 
 85 | 
 86 | parameter integer round = featureLen / psys; // integer division already truncates, so $floor is not needed
 87 | 
 88 | 
 89 | wire [dataWidth*psys -1 :0] sysin[0:round- 2];
 90 | 
 91 | 
 92 | wire [dataWidth*psys -1 :0] sysDataOut;
 93 | 
 94 | 
 95 | genvar i;
 96 | generate
 97 | for(i = 0; i< round - 1;i=i+1) begin:sysin_loop // drives sysin[0] .. sysin[round-2], the entry read by the mode muxes
 98 | 	if(i == 0) begin : round_r
 99 | 		assign sysin[i] = rowbuffer_dataOut[(i+1)*psys * dataWidth - 1: i*psys*dataWidth] & rowbuffer_dataOut[(i+2)*psys * dataWidth - 1: (i + 1)*psys*dataWidth];
100 | 	end
101 | 	else begin : round_p
102 | 		assign sysin[i] = sysin[i - 1] & rowbuffer_dataOut[(i+2)*psys * dataWidth - 1: (i + 1)*psys*dataWidth];
103 | 	end
104 | end
105 | endgenerate
106 | 
107 | 
108 | 
109 | wire [dataWidth*psys -1 :0] sysarrayIn;
110 | 
111 | assign sysarrayIn = (mode == 1'b0)? sysin[round - 2]: reluOut;
112 | 
113 | assign reluIn = (mode == 1'b0)? sysDataOut: sysin[round - 2];
114 | 
115 | assign resultOut = (mode == 1'b0)? reluOut: sysDataOut;
116 | 
117 | sysArray #(
118 | 	.dataWidth(dataWidth),
119 | 	.SysDimension(psys),
120 | 	.featureLen(featureLen)
121 | ) singlesysArray
122 | (
123 | 	.clk(clk), 								// define the input clock
124 | 	.rst(rst), 								// define the input reset
125 | 	.enable(enable),						// define the enable signal port
126 | 	.weightArray(weightBufferTosysArray),   // define the input weight array
127 | 	.featureArray(sysarrayIn),                        // define the input feature array
128 | 	.outputArray(sysDataOut) 				// define the output feature array
129 | );
130 | 
131 | parameter weightBufferAddressWidth = $clog2(featureLen * featureLen / psys);
132 | 
133 | 
134 | reg[weightBufferAddressWidth - 1:0] WeightAddr;
135 | 
136 | 
137 | always @(posedge clk or negedge rst) begin : proc_WeightAddr
138 | 	if(~rst) begin
139 | 		WeightAddr <= 0;
140 | 	end else begin
141 | 		WeightAddr <= WeightAddr + 1;
142 | 	end
143 | end
144 | 
145 | weightBuffer #(
146 | 	.dataWidth(dataWidth),
147 | 	.featureLen(featureLen),
148 | 	.psys(psys)
149 | ) singleWeightBuffer
150 | (
151 | 	.clk(clk),
152 | 	.rst(rst),
153 | 	.enable(enable),
154 | 	.writenable(weightWriteEnable),
155 | 	.din(weigthDataIn),
156 | 	.dout(weightBufferTosysArray),
157 | 	.addr(WeightAddr)
158 | );
159 | 
160 | 
161 | 
162 | 
163 | 
164 | 
165 | reluArray #(
166 | 	.dataWidth(dataWidth),
167 |     .pactivation(pactivation)
168 | ) singlereluArray
169 | (
170 | 	.clk(clk),
171 | 	.rst(rst),
172 | 	.inputArray(reluIn),
173 | 	.outputArray(reluOut)
174 | );
175 | 
176 | endmodule
177 | 
178 | 
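
The chunked reduction and the `mode` multiplexers in `transformationPlusActivation` can be checked with a small behavioral model. The sketch below is not part of the repository. It assumes the chain is meant to fold all `featureLen / psys` chunks of a row into one `psys`-wide word, and the helper names (`split_chunks`, `and_chain`, `route`) are hypothetical. The systolic array and ReLU array are stood in for by plain Python callables.

```python
def split_chunks(row_bits, feature_len, psys, data_width):
    """Split a packed row word into feature_len // psys chunks, LSB chunk first."""
    rounds = feature_len // psys
    chunk_width = psys * data_width
    mask = (1 << chunk_width) - 1
    return [(row_bits >> (i * chunk_width)) & mask for i in range(rounds)]


def and_chain(chunks):
    """Mirror the sysin[] wires: sysin[0] = c0 & c1, sysin[i] = sysin[i-1] & c[i+1]."""
    sysin = []
    for i in range(len(chunks) - 1):
        prev = chunks[0] if i == 0 else sysin[i - 1]
        sysin.append(prev & chunks[i + 1])
    return sysin


def route(mode, folded, systolic, relu):
    """Mirror the three mode muxes.

    mode 0: folded -> systolic array -> ReLU -> resultOut
    mode 1: folded -> ReLU -> systolic array -> resultOut
    """
    if mode == 0:
        return relu(systolic(folded))
    return systolic(relu(folded))


if __name__ == "__main__":
    chunks = split_chunks(0xF7B3, feature_len=4, psys=1, data_width=4)
    sysin = and_chain(chunks)
    # The last entry is the one the mode muxes consume (index rounds - 2).
    print(len(chunks), len(sysin), hex(sysin[-1]))
```

For `rounds` chunks the chain has `rounds - 1` entries, so the entry the muxes consume sits at index `rounds - 2`.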


--------------------------------------------------------------------------------
/hardware/vectorAdd.v:
--------------------------------------------------------------------------------
 1 | module vectorAdd #(
 2 |     parameter dataWidth = 32,
 3 |     parameter pvadd= 128
 4 | )
 5 | (
 6 | 	clk,
 7 | 	rst,
 8 | 	lastin,
 9 | 	lastout,
10 | 	vectorIn,
11 | 	vectorOut    
12 | );
13 | 
14 | // define the input
15 | input clk;
16 | input rst;
17 | input lastin;
18 | input [dataWidth*pvadd -1: 0] vectorIn;
19 | 
20 | //define the output
21 | 
22 | output wire lastout;
23 | output [dataWidth*pvadd -1: 0] vectorOut;
24 | 
25 | //define the type of output
26 | 
27 | 
28 | wire [dataWidth*pvadd -1: 0] vectorOut;
29 | // vectorOut is driven lane by lane by the singleAcc instances below; a constant assign here would add a second driver
30 | genvar i;
31 | generate
32 | 	for(i = 0;i < pvadd; i = i+1) begin : single
33 | 		
34 | 		singleAcc u0 (
35 | 		.clk    (clk),    //   input,   width = 1,    clk.clk
36 | 		.areset (rst), //   input,   width = 1, areset.reset
37 | 		.a      (vectorIn[(i + 1)*dataWidth - 1: i*dataWidth]),      //   input,  width = 32,      a.a
38 | 		.q      (vectorOut[(i + 1)*dataWidth - 1: i*dataWidth]),      //  output,  width = 32,      q.q
39 | 		.acc    (lastin)     //   input,   width = 1,    acc.acc
40 | 	);
41 | 
42 | 
43 | 
44 | 
45 | 	end
46 | endgenerate
47 | 
48 | endmodule


--------------------------------------------------------------------------------
/hardware/weightBuffer.v:
--------------------------------------------------------------------------------
 1 | 
 2 | module weightBuffer #(
 3 | 	parameter dataWidth = 32,
 4 | 	parameter featureLen = 256,
 5 | 	parameter psys = 24
 6 | )
 7 | (
 8 | 	clk,
 9 | 	rst,
10 | 	enable,
11 | 	writenable,
12 | 	din,
13 | 	dout,
14 | 	addr
15 | 
16 | );
17 | 
18 | parameter addressWidth = $clog2(featureLen * featureLen / psys);
19 | parameter dataportWidth = dataWidth*psys;
20 | parameter memorySize = 288 * 288 * dataWidth;
21 | 
22 | 
23 | 
24 | // define the input port
25 | input clk;
26 | input rst;
27 | input enable;
28 | input writenable;
29 | input [dataportWidth - 1:0] din;
30 | input [addressWidth -1:0] addr;
31 | output [dataportWidth - 1:0] dout;
32 | 
33 | weightbuffer u0 (
34 | 	.data      (din),      //   input,  width = 768,      data.datain
35 | 	.q         (dout),         //  output,  width = 768,         q.dataout
36 | 	.wraddress ({1'b0, addr}), //   input,   width = 12, wraddress.wraddress
37 | 	.rdaddress ({1'b0, addr}), //   input,   width = 12, rdaddress.rdaddress
38 | 	.wren      (writenable),      //   input,    width = 1,      wren.wren
39 | 	.clock     (clk)      //   input,    width = 1,     clock.clk
40 | );
41 | 
42 | 
43 | 
44 | 
45 | endmodule
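
The buffer geometry implied by the `weightBuffer` parameters can be reproduced outside the simulator. This is a minimal sketch, assuming Verilog's truncating integer division for `featureLen * featureLen / psys`. The `clog2` helper and the function name `weight_buffer_geometry` are hypothetical and only mirror the parameter expressions above.

```python
def clog2(n):
    """Ceiling of log2(n) for n >= 1, matching Verilog's $clog2 on positive values."""
    if n < 1:
        raise ValueError("clog2 is only defined here for n >= 1")
    return (n - 1).bit_length()


def weight_buffer_geometry(feature_len, psys, data_width):
    """Return depth, address width and data port width for the weight buffer."""
    depth = feature_len * feature_len // psys  # truncating division, as in Verilog
    return {
        "depth": depth,
        "address_width": clog2(depth),
        "port_width": data_width * psys,
    }


if __name__ == "__main__":
    # Defaults of weightBuffer: dataWidth = 32, featureLen = 256, psys = 24
    print(weight_buffer_geometry(256, 24, 32))
    # Defaults of transformationPlusActivation: featureLen = 128, psys = 32
    print(weight_buffer_geometry(128, 32, 32))
```

With the `weightBuffer` defaults the port width comes out as 768 bits, which matches the width noted in the comments on the `weightbuffer` instance.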


--------------------------------------------------------------------------------
/perf_model/perf_analysis.py:
--------------------------------------------------------------------------------
  1 | import numpy as np
  2 | import sys,math
  3 | 
  4 | 
  5 | class perf_analysis:
  6 | 	def __init__(self,RDSP,RBRAM,RDDR):
  7 | 		"""
  8 | 		Define the resources of the target FPGA
  9 | 
 10 | 		Inputs:
 11 | 			RDSP (int)			total num of DSP slices on-chip
 12 | 			RBRAM (float)		total BRAM size (in mega bits)
 13 | 			RDDR (float)		total off-chip bandwidth (in GB/s)
 14 | 		Output:
 15 | 			None
 16 | 		"""
 17 | 		self.RDSP = RDSP
 18 | 		self.RBRAM = RBRAM
 19 | 		self.RDDR = RDDR
 20 | 		self.dim_hid = None
 21 | 		self.dim_class = None
 22 | 		self.dim_node = None
 23 | 		self.reset()
 24 | 
 25 | 	def set_hardware_spec(self,FREQ,COST_word,COST_mult,COST_add,COST_mac):
 26 | 		"""
 27 | 		Specify the key cost values for the target accelerator
 28 | 
 29 | 		Inputs:
 30 | 			FREQ (float)		expected frequency (in MHz)
 31 | 			COST_word (int)		num bit per word (e.g., for single precision 
 32 | 								floating point, this value = 32)
 33 | 			COST_mult (int)		num of DSP slices to implement one multiplier
 34 | 			COST_add (int)		num of DSP slices to implement one adder
 35 | 			COST_mac (int)		num of DSP slices to implement one MAC
 36 | 		Output:
 37 | 			None
 38 | 		"""
 39 | 		self.FREQ = FREQ
 40 | 		self.COST_word = COST_word
 41 | 		self.COST_mult = COST_mult
 42 | 		self.COST_add = COST_add
 43 | 		self.COST_mac = COST_mac
 44 | 
 45 | 	def set_GNN_spec(self,L,dim_hid,num_epoch):
 46 | 		"""
 47 | 		Specify the GNN architecture and training parameters
 48 | 
 49 | 		Inputs:
 50 | 			L (int)				num of graph conv layers
 51 | 			dim_hid (int)		hidden dimension of graph conv layers
 52 | 			num_epoch (int)		estimated num of epochs till convergence
 53 | 		Output:
 54 | 			None
 55 | 		"""
 56 | 		self.L = L
 57 | 		self.dim_hid = dim_hid
 58 | 		self.num_epoch = num_epoch
 59 | 		if self.dim_node:
 60 | 			self.dim_all_l = [self.dim_node]+[self.dim_hid]*self.L+[self.dim_class]
 61 | 
 62 | 	def set_graph_spec(self,V,gamma,beta,dim_node,dim_class):
 63 | 		"""
 64 | 		Specify the graph parameters
 65 | 
 66 | 		Inputs:
 67 | 			V (int)				num training nodes
 68 | 			gamma (float)		redundancy reduction ratio
 69 | 			beta (float)		overhead in storing the partial sum
 70 | 			dim_node (int)		dim of initial node feature
 71 | 			dim_class (int)		num classes
 72 | 		Output:
 73 | 			None
 74 | 		"""
 75 | 		self.V = V
 76 | 		self.gamma, self.beta = gamma, beta
 77 | 		self.dim_node, self.dim_class = dim_node, dim_class
 78 | 		if self.dim_hid:
 79 | 			self.dim_all_l = [self.dim_node]+[self.dim_hid]*self.L+[self.dim_class]
 80 | 
 81 | 	def set_design(self,Psys,Pagg,Vsub,dsub):
 82 | 		"""
 83 | 		Set the design parameters of the architecture
 84 | 		Inputs:
 85 | 			Psys (int)			dim of systolic array
 86 | 			Pagg (int)			dim of feat aggr array
 87 | 			Vsub (int)			size of the sampled subgraph
 88 | 			dsub (int)			avg degree of the sampled subgraph
 89 | 		Output:
 90 | 			None
 91 | 		"""
 92 | 		self.Psys, self.Pagg = Psys, Pagg
 93 | 		self.Vsub, self.dsub = Vsub, dsub
 94 | 
 95 | 	def ret_design(self):
 96 | 		return self.Psys, self.Pagg, self.Vsub, self.dsub
 97 | 
 98 | 	def reset(self):
 99 | 		# self.dim_hid = None
100 | 		# self.dim_class = None
101 | 		# self.dim_node = None
102 | 		self.Psys = None
103 | 		self.Pagg = None
104 | 		self.Vsub = None
105 | 		self.dsub = None
106 | 
107 | 	def _consumption_BRAM(self):
108 | 		"""
109 | 		Compute the consumption of on-chip BRAM
110 | 		
111 | 		Inputs:
112 | 			None
113 | 		Output:
114 | 			None
115 | 		"""
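116 | 		# All sizes below are in words. set_GNN_spec() and set_design()
117 | 		# must be called first; otherwise this raises on the unset
118 | 		# attributes (L, dim_hid, Psys, Vsub).
119 | 		_buf_weight = 2*(self.L+1)*self.dim_hid**2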
120 | 		_buf_WT = self.Psys*self.Vsub
121 | 		# size of buffer for storing the partial sum
122 | 		_dim_max = max(self.dim_hid,self.dim_node,self.dim_class)
123 | 		_buf_partial_agg = _dim_max*self.beta*self.Vsub
124 | 		# buffer to store the grad wrt X (to back-prop to prev layer)
125 | 		_buf_grad_X = 2*max(self.dim_hid,self.dim_class)*self.Vsub
126 | 		_buf_X = sum([f*self.Vsub for f in self.dim_all_l])	# computed but not added to buf_actual below
127 | 		_buf_inter = _dim_max*self.Vsub
128 | 		self.buf_actual = _buf_weight+_buf_WT+_buf_partial_agg+_buf_grad_X+_buf_inter
129 | 		# buf_pred = (self.L+5)*self.dim_hid*self.Vsub \
130 | 		# 		 + self.dim_hid*self.beta*self.Vsub \
131 | 		# 		 + self.Psys*self.Vsub + 2*(self.L+1)*self.dim_hid**2
132 | 	    
133 | 	def _consumption_DSP(self):
134 | 		self.dsp_actual = self.Pagg*self.COST_add + self.Psys**2*self.COST_mac
135 | 
136 | 	def is_feasible(self):
137 | 		self._consumption_BRAM()
138 | 		self._consumption_DSP()
139 | 		# feasible when BRAM words, DSP slices and Pagg all fit their bounds
140 | 		return (self.buf_actual < self.RBRAM/self.COST_word*1e6) \
141 | 		   and (self.dsp_actual < self.RDSP) \
142 | 		   and (self.Pagg <= self.dim_hid)
143 | 
144 | 	def cycle_batch(self):
145 | 		# forward prop: one aggr + two matmul per graph conv layer
146 | 		t_agg_forward = [int(self.Vsub*self.dsub*self.gamma*math.ceil(f/self.Pagg)\
147 | 				      + math.ceil(self.Vsub/self.Psys)*f) for f in self.dim_all_l[:-2]]
148 | 		t_sys_forward_half = [int(math.ceil(0.5*self.dim_all_l[l+1]/self.Psys)\
149 | 					  * self.dim_all_l[l] * math.ceil(self.Vsub/self.Psys)) for l in range(self.L)]
150 | 		t_forward = [max(t_agg_forward[l],t_sys_forward_half[l])+t_sys_forward_half[l] for l in range(self.L)]
151 | 		t_forward.append(self.dim_all_l[-2]*math.ceil(self.dim_all_l[-1]/self.Psys)*math.ceil(self.Vsub/self.Psys))
152 | 		# backward prop: two aggr + four matmul per graph conv layer
153 | 		t_agg_backward_1 = [int(self.Vsub*self.dsub*self.gamma*math.ceil(f/self.Pagg) 
154 | 	    	 			 + math.ceil(f/self.Psys)*self.Vsub) for f in self.dim_all_l[:-2]]
155 | 		t_agg_backward_2 = [int(self.Vsub*self.dsub*self.gamma*math.ceil(0.5*f/self.Pagg)) for f in self.dim_all_l[2:-1]]
156 | 		t_sys_backward_w_half = [int(math.ceil(0.5*self.dim_all_l[l+1]/self.Psys)\
157 | 	    					  * math.ceil(self.dim_all_l[l]/self.Psys)*self.Vsub) for l in range(self.L)]
158 | 		t_sys_backward_x_half = [int(0.5*self.dim_all_l[l+2]*math.ceil(self.Vsub/self.Psys)\
159 | 	    					  * math.ceil(self.dim_all_l[l+1]/self.Psys)) for l in range(self.L-1)]
160 | 		t_backward = [max(t_agg_backward_1[l],t_sys_backward_w_half[l])+t_sys_backward_w_half[l] for l in range(self.L)]\
161 | 	               + [max(t_agg_backward_2[l],t_sys_backward_x_half[l])+t_sys_backward_x_half[l] for l in range(self.L-1)]
162 | 		t_backward.append(math.ceil(self.dim_all_l[-1]/self.Psys)*math.ceil(self.dim_all_l[-2]/self.Psys)*self.Vsub)
163 | 		t_backward.append(math.ceil(self.Vsub/self.Psys)*math.ceil(self.dim_all_l[-1]/self.Psys)*self.dim_all_l[-2])
164 | 		# forward and backward prop
165 | 		t_total = sum(t_forward) + sum(t_backward)
166 | 		return t_total
167 | 
168 | 	def time_converge(self, cyc_bat):
169 | 		return cyc_bat/(self.FREQ*1e6)*math.ceil(self.V/self.Vsub)*self.num_epoch
170 | 
171 | 	def ops_total(self):
172 | 		ops_agg_forward = [int(self.gamma*self.Vsub*self.dsub*f) for f in self.dim_all_l[:-2]]
173 | 		ops_sys_forward = [2*self.Vsub*self.dim_all_l[l]*self.dim_all_l[l+1] for l in range(self.L+1)]
174 | 		ops_agg_backward_1 = [int(self.Vsub*self.dsub*self.gamma*f) for f in self.dim_all_l[:-2]]
175 | 		ops_agg_backward_2 = [int(self.Vsub*self.dsub*self.gamma*0.5*f) for f in self.dim_all_l[2:-1]]
176 | 		ops_sys_backward_1 = [2*self.Vsub*self.dim_all_l[l]*self.dim_all_l[l+1] for l in range(self.L+1)]
177 | 		ops_sys_backward_2 = [2*self.Vsub*self.dim_all_l[l+1]*self.dim_all_l[l+2] for l in range(self.L-1)]
178 | 		return sum(ops_agg_forward)+sum(ops_sys_forward)\
179 | 	    	 + sum(ops_agg_backward_1)+sum(ops_agg_backward_2)\
180 | 	    	 + sum(ops_sys_backward_1)+sum(ops_sys_backward_2)
181 | 
182 | 
183 | 
184 | 
185 | def DSE():
186 | 	RDSP, RBRAM, RDDR = 5760, 229, 0
187 | 	pa = perf_analysis(RDSP, RBRAM, RDDR)
188 | 	FREQ, COST_word, COST_mult, COST_add, COST_mac = 200, 32, 1, 1, 2
189 | 	pa.set_hardware_spec(FREQ, COST_word, COST_mult, COST_add, COST_mac)
190 | 	L, dim_hid, num_epoch = 2, 256, 10
191 | 	pa.set_GNN_spec(L, dim_hid, num_epoch)
192 | 	V, gamma, beta, dim_node, dim_class = int(716847*0.66), 0.7, 1, 300, 100
193 | 	pa.set_graph_spec(V, gamma, beta, dim_node, dim_class)
194 | 	Psys_bound = int((RDSP/COST_mac)**0.5)
195 | 	best_design = None
196 | 	min_time = float('inf')
197 | 	for _ps in range(1,Psys_bound):
198 | 		for _pa in range(16,dim_hid,16):
199 | 			for _vs in range(500,8000,500):
200 | 				pa.reset()
201 | 				pa.set_design(_ps,_pa,_vs,15)
202 | 				if not pa.is_feasible():
203 | 					continue
204 | 				_cyc = pa.cycle_batch()
205 | 				_time = pa.time_converge(_cyc)
206 | 				if _time < min_time:
207 | 					min_time = _time
208 | 					best_design = pa.ret_design()
209 | 	if best_design is None:
210 | 		raise RuntimeError("no feasible design found in the search space")
211 | 	print("Best design: Psys = {}\tPaggr = {}\tVsub = {}\tdsub = {}\nConvergence time: {}".format(*best_design, min_time))
212 | 
213 | 
214 | if __name__ == "__main__":
215 | 	DSE()
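
The resource checks behind `is_feasible()` and the `Psys_bound` used in `DSE()` can be reproduced in isolation. The sketch below restates those formulas as free functions. The function names are hypothetical, and the numbers in the usage lines are the ones hard-coded in `DSE()`.

```python
def psys_upper_bound(rdsp, cost_mac):
    """Largest systolic dimension whose Psys^2 MACs could fit in rdsp DSP slices."""
    return int((rdsp / cost_mac) ** 0.5)


def dsp_usage(psys, pagg, cost_add, cost_mac):
    """DSP slices used by Pagg adders plus a Psys x Psys array of MACs."""
    return pagg * cost_add + psys ** 2 * cost_mac


def bram_capacity_words(rbram_mbit, cost_word):
    """On-chip capacity in words, given BRAM size in megabits and bits per word."""
    return rbram_mbit / cost_word * 1e6


if __name__ == "__main__":
    rdsp, rbram, cost_word, cost_add, cost_mac = 5760, 229, 32, 1, 2
    bound = psys_upper_bound(rdsp, cost_mac)
    print("Psys bound:", bound)
    print("DSP at bound with Pagg = 16:", dsp_usage(bound, 16, cost_add, cost_mac))
    print("BRAM capacity (words):", bram_capacity_words(rbram, cost_word))
```

Since `range(1, Psys_bound)` stops one short of the bound, the search in `DSE()` never evaluates `Psys = Psys_bound` itself.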

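
The per-layer forward timing in `cycle_batch()` overlaps aggregation with the first half of the systolic-array work, then adds the second half. The sketch below isolates that term for one layer, together with the conversion in `time_converge()`. It is a restatement under the same formulas, with hypothetical function names, and the example values are illustrative rather than measured.

```python
import math


def forward_layer_cycles(vsub, dsub, gamma, pagg, psys, f_in, f_out):
    """Cycles for one forward graph conv layer, following cycle_batch()."""
    # Aggregation: edge traversal scaled by the redundancy ratio gamma,
    # plus the term that depends on the systolic dimension.
    t_agg = int(vsub * dsub * gamma * math.ceil(f_in / pagg)
                + math.ceil(vsub / psys) * f_in)
    # One half of the weight transformation on the systolic array.
    t_sys_half = int(math.ceil(0.5 * f_out / psys) * f_in * math.ceil(vsub / psys))
    # Aggregation overlaps with the first half; the second half follows.
    return max(t_agg, t_sys_half) + t_sys_half


def seconds_to_converge(cycles_per_batch, freq_mhz, v, vsub, num_epoch):
    """Convert cycles per subgraph batch into total training time in seconds."""
    return cycles_per_batch / (freq_mhz * 1e6) * math.ceil(v / vsub) * num_epoch


if __name__ == "__main__":
    cyc = forward_layer_cycles(vsub=2000, dsub=15, gamma=0.5,
                               pagg=128, psys=32, f_in=300, f_out=256)
    print("forward cycles for one layer:", cyc)
    print("seconds:", seconds_to_converge(cyc, 200, 473119, 2000, 10))
```

Whichever of `t_agg` and `t_sys_half` is larger sets the overlapped phase, so a design is balanced when the two are close.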

--------------------------------------------------------------------------------