├── README.md ├── hardware-ARCH.PNG ├── hardware ├── Aggregation.v ├── README.md ├── columnbuffer.v ├── hardware-ARCH.PNG ├── indiceDoubleBuffer.v ├── indptrDoubleBuffer.v ├── reluArray.v ├── reluUnit.v ├── rowbuffer.v ├── sysArray.v ├── sysUnit.v ├── system.v ├── transformationPlusActivation.v ├── vectorAdd.v └── weightBuffer.v └── perf_model └── perf_analysis.py /README.md: -------------------------------------------------------------------------------- 1 | ## GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platform [[PDF](https://arxiv.org/abs/2001.02498)] 2 | --- 3 | ### Abstract 4 | Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a light-weight pre-processing step based on a graph theoretic approach. Such pre-processing performed on the CPU significantly reduces the memory access requirements and the computation to be performed on the FPGA. To accelerate the weight update in GCN layers, we propose a systolic array based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load-balance and resource utilization by accurate performance modeling. We evaluate our design on Intel Stratix 10 board hosted by a 40-core Xeon server. 
On three large graphs, we achieve an order of magnitude training speedup with negligible accuracy loss, compared with state-of-the-art implementation on a multi-core platform. 5 | 6 | 7 | ### Implementation details 8 | - FPGA platform: **Intel Stratix 10 1SX280LH3F55I3XG** 9 | - Tool: **Quartus Prime Pro 20.1** for synthesis 10 | - CPU platform: 40-core Xeon server (E5-2698 v4 @ 2.2 GHz, hyper-threaded) 11 | 12 | ### Hardware architecture 13 | 14 | ![arch](./hardware-ARCH.PNG) 15 | 16 | ### Directory Structure 17 | - /hardware: contains the Verilog code of the design 18 | - system.v: main module 19 | - Aggregation.v: feature aggregation module 20 | - reluArray.v: activation module 21 | - transformationPlusActivation.v: transformation and activation modules 22 | - Data buffers: 23 | - columnbuffer.v 24 | - indiceDoubleBuffer.v 25 | - indptrDoubleBuffer.v 26 | - weightBuffer.v 27 | - rowbuffer.v 28 | - /perf_model: performance modeling with a Python script 29 | 30 | ### Configuring IP Cores 31 | - Floating-point accumulator--sysAcc 32 | - Find: IP catalog => basic function => arithmetic => floating point function 33 | - Name: Accumulator 34 | - Other Info: choose Generate Enable and generate HDL 35 | - Floating-point multiplier--mul 36 | - Find: IP catalog => basic function => arithmetic => floating point function 37 | - Name: Multiply 38 | - Other Info: in Functionality, choose Generate Enable and generate HDL 39 | - Floating-point comparator 40 | - Find: IP catalog => basic function => arithmetic => functionality => Comparison 41 | - Name: Greater than 42 | - Other Info: in Functionality, choose Generate Enable and generate HDL 43 | - RAM 1 port--weight buffer 44 | - Find: IP catalog => on-chip memory => RAM: 1-port Intel FPGA IP 45 | - Name: weightbuffer 46 | - Other Info: choose Generate Enable and generate HDL 47 | - RAM 2 port--row buffer 48 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 49 | - Name: row buffer 50 | - Other Info: choose 
Generate Enable and generate HDL 51 | - RAM 2 port--column buffer 52 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 53 | - Name: column buffer 54 | - Other Info: choose Generate Enable and generate HDL 55 | - RAM 2 port--indptr buffer 56 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 57 | - Name: indptr buffer 58 | - Other Info: choose Generate Enable and generate HDL 63 | - RAM 2 port--indice buffer 64 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 65 | - Name: indice buffer 66 | - Other Info: choose Generate Enable and generate HDL 67 | 68 | ### Setting up the project 69 | 1. Create a new project in Quartus Prime Pro 20.1 and set the **Intel Stratix 10 1SX280LH3F55I3XG** as the target device 70 | 2. Add all the .v files in the /hardware directory to the project 71 | 3. Set system.v as the top-level module 72 | 4. 
Start the compilation 73 | -------------------------------------------------------------------------------- /hardware-ARCH.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pgroupATusc/GraphACT/f0ecd30500247647105762934f5967350a0d8eab/hardware-ARCH.PNG -------------------------------------------------------------------------------- /hardware/Aggregation.v: -------------------------------------------------------------------------------- 1 | `timescale 1ns / 1ps 2 | ////////////////////////////////////////////////////////////////////////////////// 3 | // Company: 4 | // Engineer: 5 | // 6 | // Create Date: 12/19/2019 12:49:59 PM 7 | // Design Name: 8 | // Module Name: Aggregation 9 | // Project Name: 10 | // Target Devices: 11 | // Tool Versions: 12 | // Description: 13 | // 14 | // Dependencies: 15 | // 16 | // Revision: 17 | // Revision 0.01 - File Created 18 | // Additional Comments: 19 | // 20 | ////////////////////////////////////////////////////////////////////////////////// 21 | module Aggregation #( 22 | parameter dataWidth = 32, // the data bit width 23 | parameter pvadd= 128, // the parallelism of the feature adder 24 | parameter k = 1024 // define k as the block size 25 | ) 26 | ( 27 | // system signals 28 | clk, 29 | rst, 30 | enable, 31 | 32 | // ports for indptrDoubleBuffer 33 | indptr_writeEnableA1, 34 | indptr_writeEnableB1, 35 | indptr_addressportA1_outside, 36 | indptr_addressportB1_outside, 37 | indptr_writeportA1, 38 | indptr_writeportB1, 39 | 40 | indptr_startAddressA, 41 | indptr_startAddressB, 42 | indptr_endAddress, 43 | 44 | // ports for indiceDoubleBuffer 45 | indice_writeEnableA1, 46 | indice_writeEnableB1, 47 | indice_addressportA1_outside, 48 | indice_addressportB1_outside, 49 | indice_writeportA1, 50 | indice_writeportB1, 51 | 52 | // ports for the double column buffer 53 | columnbuffer_writeEnableA1, 54 | columnbuffer_writeEnableB1, 55 | columnbuffer_addressportA1_outside, 56 | 
columnbuffer_addressportB1_outside, 57 | columnbuffer_writeportA1, 58 | columnbuffer_writeportB1, 59 | 60 | // port for Double Row Buffer 61 | 62 | rowbuffer_addressportA1_outside, 63 | rowbuffer_dataOut 64 | 65 | 66 | ); 67 | 68 | input clk; 69 | input rst; 70 | input enable; 71 | // define the two state FSM 72 | wire finish; 73 | reg triggerA; 74 | reg triggerB; 75 | 76 | reg [indice_addressWidth - 1:0] indiceCounterA; 77 | reg [indice_addressWidth - 1:0] indiceCounterB; 78 | 79 | parameter S1 = 1'b0; 80 | parameter S2 = 1'b1; 81 | 82 | reg state; 83 | 84 | always @(posedge clk or negedge rst) begin : proc_state 85 | if(~rst) begin 86 | state <= S1; 87 | end else begin 88 | if(finish)begin 89 | if(state == S1) 90 | state <= S2 ; 91 | else 92 | state <= S1; 93 | end 94 | else 95 | state <= state; 96 | end 97 | end 98 | 99 | 100 | 101 | //define the parameter and port for indptrDoubleBuffer 102 | // write the indptrDouble buffer is from outside 103 | 104 | reg [indptr_addressWidth - 1:0] indptrCounterA; 105 | reg [indptr_addressWidth - 1:0] indptrCounterB; 106 | 107 | parameter indptr_addressWidth = $clog2(k + 1); 108 | parameter indptr_dataportWidth = $clog2(k*k/32); 109 | 110 | input indptr_writeEnableA1; 111 | input indptr_writeEnableB1; 112 | input [indptr_dataportWidth - 1:0] indptr_writeportA1; 113 | input [indptr_dataportWidth - 1:0] indptr_writeportB1; 114 | 115 | wire [indptr_dataportWidth - 1:0] indptr_readportA1; 116 | wire [indptr_dataportWidth - 1:0] indptr_readportA2; 117 | wire [indptr_dataportWidth - 1:0] indptr_readportB1; 118 | wire [indptr_dataportWidth - 1:0] indptr_readportB2; 119 | 120 | 121 | wire [indptr_addressWidth - 1:0] indptr_addressportA1; 122 | wire [indptr_addressWidth - 1:0] indptr_addressportA2; 123 | wire [indptr_addressWidth - 1:0] indptr_addressportB1; 124 | wire [indptr_addressWidth - 1:0] indptr_addressportB2; 125 | 126 | input [indptr_addressWidth - 1:0] indptr_addressportA1_outside; 127 | input [indptr_addressWidth - 1:0] 
indptr_addressportB1_outside; 128 | 129 | 130 | 131 | 132 | 133 | wire [indptr_dataportWidth - 1:0] indptrAout; 134 | 135 | assign indptrAout = (state == S1)? indptr_readportA1:indptr_readportB1; 136 | 137 | wire [indptr_dataportWidth - 1:0] indptrBout; 138 | 139 | assign indptrBout = (state == S1)? indptr_readportA2:indptr_readportB2; 140 | 141 | 142 | // ping pong operation 143 | assign indptr_addressportA1 = (state == S1)? indptr_addressportA1_outside: indptrCounterA; 144 | assign indptr_addressportA2 = (state == S1)? indptr_addressportB1_outside: indptrCounterB; 145 | 146 | assign indptr_addressportB1 = (state == S1)? indptrCounterA: indptr_addressportA1_outside; 147 | assign indptr_addressportB2 = (state == S1)? indptrCounterB: indptr_addressportB1_outside; 148 | 149 | 150 | 151 | 152 | indptrDoubleBuffer #( 153 | .k(k) 154 | ) singleindptrDoubleBuffer 155 | ( 156 | .clk(clk), 157 | .rst(rst), 158 | .enableA(enable), 159 | .enableB(enable), 160 | .writeEnableA1(indptr_writeEnableA1), 161 | .writeEnableA2(1'b0), 162 | .writeEnableB1(indptr_writeEnableB1), 163 | .writeEnableB2(1'b0), 164 | .readportA1(indptr_readportA1), 165 | .readportA2(indptr_readportA2), 166 | .writeportA1(indptr_writeportA1), 167 | .writeportA2(0), 168 | .readportB1(indptr_readportB1), 169 | .readportB2(indptr_readportB2), 170 | .writeportB1(indptr_writeportB1), 171 | .writeportB2(0), 172 | .addressportA1(indptr_addressportA1), 173 | .addressportA2(indptr_addressportA2), 174 | .addressportB1(indptr_addressportB1), 175 | .addressportB2(indptr_addressportB2) 176 | ); 177 | 178 | 179 | 180 | 181 | //parameter for indiceDoubleBuffer 182 | 183 | parameter indice_addressWidth = $clog2(k*k/32); 184 | parameter indice_dataportWidth = $clog2(k); 185 | 186 | 187 | input indice_writeEnableA1; 188 | input indice_writeEnableB1; 189 | input [indice_dataportWidth - 1:0] indice_writeportA1; 190 | input [indice_dataportWidth - 1:0] indice_writeportB1; 191 | 192 | 193 | wire [indice_dataportWidth - 1:0] 
indice_readportA1; 194 | wire [indice_dataportWidth - 1:0] indice_readportA2; 195 | wire [indice_dataportWidth - 1:0] indice_readportB1; 196 | wire [indice_dataportWidth - 1:0] indice_readportB2; 197 | 198 | wire [indice_addressWidth - 1:0] addressportA1; 199 | wire [indice_addressWidth - 1:0] addressportA2; 200 | wire [indice_addressWidth - 1:0] addressportB1; 201 | wire [indice_addressWidth - 1:0] addressportB2; 202 | 203 | input [indice_addressWidth - 1:0] indice_addressportA1_outside; 204 | input [indice_addressWidth - 1:0] indice_addressportB1_outside; 205 | 206 | 207 | assign addressportA1 = (state == S1)? indice_addressportA1_outside: indiceCounterA; 208 | assign addressportA2 = (state == S1)? indice_addressportB1_outside: indiceCounterB; 209 | 210 | assign addressportB1 = (state == S1)? indiceCounterA: indice_addressportA1_outside; 211 | assign addressportB2 = (state == S1)? indiceCounterB: indice_addressportB1_outside; 212 | 213 | 214 | wire [indice_dataportWidth - 1:0] indiceForColumnBufferA; 215 | wire [indice_dataportWidth - 1:0] indiceForColumnBufferB; 216 | 217 | assign indiceForColumnBufferA = (state == S1)? indice_readportA1: indice_readportB1; 218 | assign indiceForColumnBufferB = (state == S1)? 
indice_readportA2: indice_readportB2; 219 | 220 | 221 | indiceDoubleBuffer #( 222 | .k(k) 223 | ) singleindiceDoubleBuffer 224 | ( 225 | .clk(clk), 226 | .rst(rst), 227 | .enableA(enable), 228 | .enableB(enable), 229 | .writeEnableA1(indice_writeEnableA1), 230 | .writeEnableA2(1'b0), 231 | .writeEnableB1(indice_writeEnableB1), 232 | .writeEnableB2(1'b0), 233 | .readportA1(indice_readportA1), 234 | .readportA2(indice_readportA2), 235 | .writeportA1(indice_writeportA1), 236 | .writeportA2(0), 237 | .readportB1(indice_readportB1), 238 | .readportB2(indice_readportB2), 239 | .writeportB1(indice_writeportB1), 240 | .writeportB2(0), 241 | .addressportA1(addressportA1), 242 | .addressportA2(addressportA2), 243 | .addressportB1(addressportB1), 244 | .addressportB2(addressportB2) 245 | ); 246 | 247 | 248 | 249 | 250 | // define the parameter for columnBuffer 251 | 252 | parameter columnbuffer_addressWidth = $clog2(k); 253 | parameter columnbuffer_dataportWidth = dataWidth*pvadd; 254 | 255 | 256 | 257 | 258 | 259 | input columnbuffer_writeEnableA1; 260 | input columnbuffer_writeEnableB1; 261 | 262 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportA1; 263 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportA2; 264 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportB1; 265 | wire [columnbuffer_dataportWidth - 1:0] columnbuffer_readportB2; 266 | 267 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportA1; 268 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportA2; 269 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportB1; 270 | wire [columnbuffer_addressWidth - 1:0] columnbuffer_addressportB2; 271 | 272 | input [columnbuffer_addressWidth - 1:0] columnbuffer_addressportA1_outside; 273 | input [columnbuffer_addressWidth - 1:0] columnbuffer_addressportB1_outside; 274 | input [columnbuffer_dataportWidth - 1:0] columnbuffer_writeportA1; 275 | input [columnbuffer_dataportWidth - 1:0] columnbuffer_writeportB1; 
276 | 277 | // parameter for columnbuffer 278 | 279 | assign columnbuffer_addressportA1 = (state == S1)? columnbuffer_addressportA1_outside: indiceCounterA; 280 | assign columnbuffer_addressportA2 = (state == S1)? columnbuffer_addressportB1_outside: indiceCounterB; 281 | 282 | assign columnbuffer_addressportB1 = (state == S1)? indiceCounterA: columnbuffer_addressportA1_outside; 283 | assign columnbuffer_addressportB2 = (state == S1)? indiceCounterB: columnbuffer_addressportB1_outside; 284 | 285 | 286 | wire [columnbuffer_dataportWidth - 1:0] dataForVadderA; 287 | wire [columnbuffer_dataportWidth - 1:0] dataForVadderB; 288 | 289 | 290 | assign dataForVadderA = (state == S1)? columnbuffer_readportA1: columnbuffer_readportB1; 291 | assign dataForVadderB = (state == S1)? columnbuffer_readportA2: columnbuffer_readportB2; 292 | 293 | 294 | 295 | 296 | columnbuffer #( 297 | .dataWidth(dataWidth), // the data bit width 298 | .pvadd(pvadd), // the parallelism of the feature adder 299 | .k(k) // define k as the block size 300 | ) singlecolumnbuffer 301 | ( 302 | .clk(clk), 303 | .rst(rst), 304 | .enableA(enable), 305 | .enableB(enable), 306 | .writeEnableA1(columnbuffer_writeEnableA1), 307 | .writeEnableA2(1'b0), 308 | .writeEnableB1(columnbuffer_writeEnableB1), 309 | .writeEnableB2(1'b0), 310 | .readportA1(columnbuffer_readportA1), 311 | .readportA2(columnbuffer_readportA2), 312 | .writeportA1(columnbuffer_writeportA1), 313 | .writeportA2(32'b0), 314 | .readportB1(columnbuffer_readportB1), 315 | .readportB2(columnbuffer_readportB2), 316 | .writeportB1(columnbuffer_writeportB1), 317 | .writeportB2(32'b0), 318 | .addressportA1(columnbuffer_addressportA1), 319 | .addressportA2(columnbuffer_addressportA2), 320 | .addressportB1(columnbuffer_addressportB1), 321 | .addressportB2(columnbuffer_addressportB2) 322 | ); 323 | 324 | 325 | 326 | 327 | 328 | 329 | 330 | wire [columnbuffer_dataportWidth - 1:0] DataVectorAout; 331 | wire [columnbuffer_dataportWidth - 1:0] DataVectorBout; 332 | 
333 | 334 | 335 | // vectorAdder1 336 | 337 | vectorAdd #( 338 | .dataWidth(dataWidth), 339 | .pvadd(pvadd) 340 | ) vectorAdderA 341 | ( 342 | .clk(clk), 343 | .rst(rst), 344 | .lastin(1'b1), 345 | .lastout(), 346 | .vectorIn(dataForVadderA), 347 | .vectorOut(DataVectorAout) 348 | ); 349 | 350 | // vectorAdder2 351 | vectorAdd #( 352 | .dataWidth(dataWidth), 353 | .pvadd(pvadd) 354 | ) vectorAdderB 355 | ( 356 | .clk(clk), 357 | .rst(rst), 358 | .lastin(1'b1), 359 | .lastout(), 360 | .vectorIn(dataForVadderB), 361 | .vectorOut(DataVectorBout) 362 | ); 363 | 364 | 365 | /// define the dataport for the row buffer 366 | 367 | wire rowbuffer_writeEnableA1; 368 | wire rowbuffer_writeEnableA2; 369 | wire rowbuffer_writeEnableB1; 370 | wire rowbuffer_writeEnableB2; 371 | 372 | 373 | assign rowbuffer_writeEnableA1 = (state == S1)? 1'b1: 1'b0; 374 | assign rowbuffer_writeEnableA2 = (state == S1)? 1'b1: 1'b0; 375 | 376 | assign rowbuffer_writeEnableB1 = (state == S1)? 1'b0: 1'b1; 377 | assign rowbuffer_writeEnableB2 = (state == S1)? 1'b0: 1'b1; 378 | 379 | // logic for the row buffer 380 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportA1; 381 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportA2; 382 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportB1; 383 | wire [columnbuffer_addressWidth - 1:0] rowbuffer_addressportB2; 384 | 385 | 386 | input [columnbuffer_addressWidth - 1:0] rowbuffer_addressportA1_outside; 387 | 388 | assign rowbuffer_addressportA1 = (state == S1)? indptrCounterA: rowbuffer_addressportA1_outside; 389 | assign rowbuffer_addressportA2 = (state == S1)? indptrCounterB: rowbuffer_addressportA1_outside; 390 | 391 | assign rowbuffer_addressportB1 = (state == S1)? rowbuffer_addressportA1_outside: indptrCounterA; 392 | assign rowbuffer_addressportB2 = (state == S1)? 
rowbuffer_addressportA1_outside: indptrCounterB; 393 | 394 | 395 | wire [columnbuffer_dataportWidth - 1:0] rowbuffer_readportA1; 396 | 397 | wire [columnbuffer_dataportWidth - 1:0] rowbuffer_readportB1; 398 | 399 | 400 | output [columnbuffer_dataportWidth - 1:0] rowbuffer_dataOut; 401 | 402 | assign rowbuffer_dataOut = (state == S1)? rowbuffer_readportB1: rowbuffer_readportA1; 403 | 404 | 405 | rowbuffer #( 406 | .dataWidth(dataWidth) , // the data bit width 407 | .pvadd(pvadd), // the parallelism of the feature adder 408 | .k(k) // define k as the block size 409 | ) singlerowBuffer 410 | ( 411 | .clk(clk), 412 | .rst(rst), 413 | .enableA(enable), 414 | .enableB(enable), 415 | .writeEnableA1(rowbuffer_writeEnableA1), 416 | .writeEnableA2(rowbuffer_writeEnableA2), 417 | .writeEnableB1(rowbuffer_writeEnableB1), 418 | .writeEnableB2(rowbuffer_writeEnableB2), 419 | .readportA1(rowbuffer_readportA1), 420 | .readportA2(), 421 | .writeportA1(DataVectorAout), 422 | .writeportA2(DataVectorBout), 423 | .readportB1(rowbuffer_readportB1), 424 | .readportB2(), 425 | .writeportB1(DataVectorAout), 426 | .writeportB2(DataVectorBout), 427 | .addressportA1(rowbuffer_addressportA1), 428 | .addressportA2(rowbuffer_addressportA2), 429 | .addressportB1(rowbuffer_addressportB1), 430 | .addressportB2(rowbuffer_addressportB2) 431 | ); 432 | 433 | 434 | 435 | 436 | 437 | // define the communication signal 438 | 439 | 440 | 441 | // 442 | 443 | 444 | 445 | reg [indice_addressWidth - 1:0] indptrA1; 446 | reg [indice_addressWidth - 1:0] indptrA2; 447 | 448 | reg [indice_addressWidth - 1:0] indptrB1; 449 | reg [indice_addressWidth - 1:0] indptrB2; 450 | 451 | 452 | 453 | 454 | 455 | input [indptr_addressWidth - 1:0] indptr_startAddressA; 456 | input [indptr_addressWidth - 1:0] indptr_startAddressB; 457 | input [indptr_addressWidth - 1:0] indptr_endAddress; 458 | 459 | always @(posedge clk or negedge rst) begin : proc_indptrCounterA 460 | if(~rst) begin 461 | indptrCounterA <= 0; 462 | end else begin 
463 | if(finish) begin 464 | indptrCounterA <= indptr_startAddressA; 465 | end 466 | else if(triggerA)begin 467 | indptrCounterA <= indptrCounterA + 1; 468 | end 469 | else begin 470 | indptrCounterA <= indptrCounterA; 471 | end 472 | end 473 | end 474 | 475 | always @(posedge clk or negedge rst) begin : proc_indptrCounterB 476 | if(~rst) begin 477 | indptrCounterB <= 0; 478 | end else begin 479 | if(finish) begin 480 | indptrCounterB <= indptr_startAddressB; 481 | end 482 | else if(triggerA)begin 483 | indptrCounterB <= indptrCounterB + 1; 484 | end 485 | else begin 486 | indptrCounterB <= indptrCounterB; 487 | end 488 | end 489 | end 490 | 491 | assign finish = (indptrCounterA == indptr_startAddressB && indptrCounterB == indptr_endAddress) ? 1'b1: 1'b0; 492 | 493 | 494 | 495 | 496 | always @(posedge clk or negedge rst) begin : proc_indptrA2 497 | if(~rst) begin 498 | indptrA2 <= 0; 499 | end else begin 500 | if(finish) begin 501 | indptrA2 <= indptrAout; 502 | end 503 | else if(indiceCounterA == indptrA2) begin 504 | indptrA2 <= indptrAout; 505 | end 506 | else begin 507 | indptrA2 <= indptrA2; 508 | end 509 | end 510 | end 511 | 512 | always @(posedge clk or negedge rst) begin : proc_indptrA1 513 | if(~rst) begin 514 | indptrA1 <= 0; 515 | end else begin 516 | if(finish) begin 517 | indptrA1 <= indptrAout; 518 | end 519 | else if(indiceCounterA == indptrA2) begin 520 | indptrA1 <= indptrA2; 521 | end 522 | else begin 523 | indptrA1 <= indptrA1; 524 | end 525 | end 526 | end 527 | 528 | always @(posedge clk or negedge rst) begin : proc_indiceCounterA 529 | if(~rst) begin 530 | indiceCounterA <= 0; 531 | end else begin 532 | if(triggerA) begin 533 | indiceCounterA <= indptrA1; 534 | end 535 | else if(indiceCounterA < indptrA2 - 1) begin 536 | indiceCounterA <= indiceCounterA + 1; 537 | end 538 | else begin 539 | indiceCounterA <= indiceCounterA; 540 | end 541 | end 542 | end 543 | 544 | always @(posedge clk or negedge rst) begin : proc_triggerA 545 | if(~rst) 
begin 546 | triggerA <= 0; 547 | end else begin 548 | if(indiceCounterA == indptrA2 - 1) begin 549 | triggerA <= 1'b1; 550 | end 551 | else begin 552 | triggerA <= 1'b0; 553 | end 554 | end 555 | end 556 | 557 | 558 | 559 | 560 | /////////////////////////////////////////////////////////////////////////////// 561 | 562 | always @(posedge clk or negedge rst) begin : proc_indptrB2 563 | if(~rst) begin 564 | indptrB2 <= 0; 565 | end else begin 566 | if(finish) begin 567 | indptrB2 <= indptrBout; 568 | end 569 | else if(indiceCounterB == indptrB2) begin 570 | indptrB2 <= indptrBout; 571 | end 572 | else begin 573 | indptrB2 <= indptrB2; 574 | end 575 | end 576 | end 577 | 578 | always @(posedge clk or negedge rst) begin : proc_indptrB1 579 | if(~rst) begin 580 | indptrB1 <= 0; 581 | end else begin 582 | if(finish) begin 583 | indptrB1 <= indptrBout; 584 | end 585 | else if(indiceCounterB == indptrB2) begin 586 | indptrB1 <= indptrB2; 587 | end 588 | else begin 589 | indptrB1 <= indptrB1; 590 | end 591 | end 592 | end 593 | 594 | always @(posedge clk or negedge rst) begin : proc_indiceCounterB 595 | if(~rst) begin 596 | indiceCounterB <= 0; 597 | end else begin 598 | if(triggerB) begin 599 | indiceCounterB <= indptrB1; 600 | end 601 | else if(indiceCounterB < indptrB2 - 1) begin 602 | indiceCounterB <= indiceCounterB + 1; 603 | end 604 | else begin 605 | indiceCounterB <= indiceCounterB; 606 | end 607 | end 608 | end 609 | 610 | always @(posedge clk or negedge rst) begin : proc_triggerB 611 | if(~rst) begin 612 | triggerB <= 0; 613 | end else begin 614 | if(indiceCounterB == indptrB2 - 1) begin 615 | triggerB <= 1'b1; 616 | end 617 | else begin 618 | triggerB <= 1'b0; 619 | end 620 | end 621 | end 622 | 623 | 624 | ////////////////////////////////////////////////////////////// 625 | 626 | 627 | 628 | 629 | endmodule 630 | 631 | 632 | 633 | -------------------------------------------------------------------------------- /hardware/README.md: 
-------------------------------------------------------------------------------- 1 | ## GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platform [[PDF](https://arxiv.org/abs/2001.02498)] 2 | --- 3 | ### Abstract 4 | Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a light-weight pre-processing step based on a graph theoretic approach. Such pre-processing performed on the CPU significantly reduces the memory access requirements and the computation to be performed on the FPGA. To accelerate the weight update in GCN layers, we propose a systolic array based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load-balance and resource utilization by accurate performance modeling. We evaluate our design on Intel Stratix 10 board hosted by a 40-core Xeon server. On three large graphs, we achieve an order of magnitude training speedup with negligible accuracy loss, compared with state-of-the-art implementation on a multi-core platform. 
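The two costs named in the abstract can be illustrated in software. The following is a minimal Python sketch, not the accelerator's implementation. It assumes a subgraph stored in CSR form (the `indptr` and `indices` names echo the buffer names in `/hardware`), a plain neighbor sum with no normalization, and a dense weight matrix followed by ReLU. All function names are hypothetical.

```python
def aggregate(indptr, indices, features):
    """Feature propagation: for each vertex, sum the feature rows of its neighbors.

    indptr/indices describe the adjacency in CSR form (assumed layout):
    the neighbors of vertex v are indices[indptr[v]:indptr[v + 1]].
    """
    dim = len(features[0]) if features else 0
    out = []
    for v in range(len(indptr) - 1):
        acc = [0.0] * dim
        for e in range(indptr[v], indptr[v + 1]):
            src = features[indices[e]]  # irregular, data-dependent access
            for j in range(dim):
                acc[j] += src[j]
        out.append(acc)
    return out


def transform_relu(rows, weight):
    """Dense step: multiply each aggregated row by the weight matrix, then apply ReLU."""
    out_dim = len(weight[0])
    return [
        [max(0.0, sum(row[i] * weight[i][j] for i in range(len(row))))
         for j in range(out_dim)]
        for row in rows
    ]


def gcn_layer(indptr, indices, features, weight):
    """One forward layer under the assumptions above: ReLU((A @ H) @ W)."""
    return transform_relu(aggregate(indptr, indices, features), weight)
```

Going by the Directory Structure list, the first function appears to correspond to the role of Aggregation.v and the second to transformationPlusActivation.v; the sketch makes no claim about how either module schedules the work.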
5 | 6 | 7 | ### Implementation details 8 | - FPGA platform: **Intel Stratix 10 1SX280LH3F55I3XG** 9 | - Tool: **Quartus Prime Pro 20.1** for synthesis 10 | - CPU platform: 40-core Xeon server (E5-2698 v4 @ 2.2 GHz, hyper-threaded) 11 | 12 | ### Hardware architecture 13 | 14 | ![arch](./hardware-ARCH.PNG) 15 | 16 | ### Directory Structure 17 | - /hardware: contains the Verilog code of the design 18 | - system.v: main module 19 | - Aggregation.v: feature aggregation module 20 | - reluArray.v: activation module 21 | - transformationPlusActivation.v: transformation and activation modules 22 | - Data buffers: 23 | - columnbuffer.v 24 | - indiceDoubleBuffer.v 25 | - indptrDoubleBuffer.v 26 | - weightBuffer.v 27 | - rowbuffer.v 28 | - /perf_model: performance modeling with a Python script 29 | 30 | ### Configuring IP Cores 31 | - Floating-point accumulator--sysAcc 32 | - Find: IP catalog => basic function => arithmetic => floating point function 33 | - Name: Accumulator 34 | - Other Info: choose Generate Enable and generate HDL 35 | - Floating-point multiplier--mul 36 | - Find: IP catalog => basic function => arithmetic => floating point function 37 | - Name: Multiply 38 | - Other Info: in Functionality, choose Generate Enable and generate HDL 39 | - Floating-point comparator 40 | - Find: IP catalog => basic function => arithmetic => functionality => Comparison 41 | - Name: Greater than 42 | - Other Info: in Functionality, choose Generate Enable and generate HDL 43 | - RAM 1 port--weight buffer 44 | - Find: IP catalog => on-chip memory => RAM: 1-port Intel FPGA IP 45 | - Name: weightbuffer 46 | - Other Info: choose Generate Enable and generate HDL 47 | - RAM 2 port--row buffer 48 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 49 | - Name: row buffer 50 | - Other Info: choose Generate Enable and generate HDL 51 | - RAM 2 port--column buffer 52 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 53 | - Name: column buffer 54 | - Other 
Info: choose Generate Enable and generate HDL 55 | - RAM 2 port--indptr buffer 56 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 57 | - Name: indptr buffer 58 | - Other Info: choose Generate Enable and generate HDL 63 | - RAM 2 port--indice buffer 64 | - Find: IP catalog => on-chip memory => RAM: 2-port Intel FPGA IP 65 | - Name: indice buffer 66 | - Other Info: choose Generate Enable and generate HDL 67 | 68 | ### Setting up the project 69 | 1. Create a new project in Quartus Prime Pro 20.1 and set the **Intel Stratix 10 1SX280LH3F55I3XG** as the target device 70 | 2. Add all the .v files in the /hardware directory to the project 71 | 3. Set system.v as the top-level module 72 | 4. Start the compilation 73 | -------------------------------------------------------------------------------- /hardware/columnbuffer.v: -------------------------------------------------------------------------------- 1 | module columnbuffer #( 2 | parameter dataWidth = 32, // the data bit width 3 | parameter pvadd= 256, // the parallelism of the feature adder 4 | parameter k = 1024 // define k as the block size 5 | ) 6 | ( 7 | clk, 8 | rst, 9 | enableA, 10 | enableB, 11 | writeEnableA1, 12 | writeEnableA2, 13 | writeEnableB1, 14 | writeEnableB2, 15 | readportA1, 16 | readportA2, 17 | writeportA1, 18 | writeportA2, 19 | readportB1, 20 | readportB2, 21 | writeportB1, 22 | writeportB2, 23 | addressportA1, 24 | addressportA2, 25 | addressportB1, 26 | addressportB2 27 | ); 28 | 29 | 30 | parameter addressWidth = $clog2(k); 31 | parameter dataportWidth = dataWidth*pvadd; 32 | parameter memorySize = k * dataportWidth; 33 | 34 | 35 | // define the input 36 | input clk; 37 | input rst; 38 | input enableA; 39 | input enableB; 40 | input writeEnableA1; 41 | input writeEnableA2; 42 | input writeEnableB1; 
43 | input writeEnableB2; 44 | 45 | //define the addressport 46 | 47 | input [addressWidth - 1 : 0] addressportA1; 48 | input [addressWidth - 1 : 0] addressportA2; 49 | input [addressWidth - 1 : 0] addressportB1; 50 | input [addressWidth - 1 : 0] addressportB2; 51 | 52 | // define the dataport input 53 | 54 | 55 | output [dataportWidth - 1:0] readportA1; 56 | output [dataportWidth - 1:0] readportA2; 57 | input [dataportWidth - 1:0] writeportA1; 58 | input [dataportWidth - 1:0] writeportA2; 59 | output [dataportWidth - 1:0] readportB1; 60 | output [dataportWidth - 1:0] readportB2; 61 | input [dataportWidth - 1:0] writeportB1; 62 | input [dataportWidth - 1:0] writeportB2; 63 | 64 | rowbuffer u0 ( 65 | .data_a (writeportA1), // input, width = 8192, data_a.datain_a 66 | .q_a (readportA1), // output, width = 8192, q_a.dataout_a 67 | .data_b (writeportA2), // input, width = 8192, data_b.datain_b 68 | .q_b (readportA2), // output, width = 8192, q_b.dataout_b 69 | .address_a (addressportA1), // input, width = 10, address_a.address_a 70 | .address_b (addressportA2), // input, width = 10, address_b.address_b 71 | .wren_a (writeEnableA1), // input, width = 1, wren_a.wren_a 72 | .wren_b (writeEnableA2), // input, width = 1, wren_b.wren_b 73 | .clock (clk) // input, width = 1, clock.clk 74 | ); 75 | 76 | rowbuffer u1 ( 77 | .data_a (writeportB1), // input, width = 8192, data_a.datain_a 78 | .q_a (readportB1), // output, width = 8192, q_a.dataout_a 79 | .data_b (writeportB2), // input, width = 8192, data_b.datain_b 80 | .q_b (readportB2), // output, width = 8192, q_b.dataout_b 81 | .address_a (addressportB1), // input, width = 10, address_a.address_a 82 | .address_b (addressportB2), // input, width = 10, address_b.address_b 83 | .wren_a (writeEnableB1), // input, width = 1, wren_a.wren_a 84 | .wren_b (writeEnableB2), // input, width = 1, wren_b.wren_b 85 | .clock (clk) // input, width = 1, clock.clk 86 | ); 87 | 88 | 89 | 90 | endmodule 91 | 
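The port widths noted in the comments above (8192-bit data, 10-bit address) follow from the module parameters. A small Python sketch, assuming only the parameter formulas that appear in columnbuffer.v, indiceDoubleBuffer.v and indptrDoubleBuffer.v, reproduces them. `clog2` and the `*_geometry` helpers are hypothetical names used for illustration and are not part of the repository.

```python
def clog2(n: int) -> int:
    """Ceiling of log2(n), matching Verilog's $clog2 for n >= 1."""
    return 0 if n <= 1 else (n - 1).bit_length()


def feature_buffer_geometry(data_width: int = 32, pvadd: int = 256, k: int = 1024) -> dict:
    """Row/column buffer: k entries, each holding pvadd values of data_width bits."""
    port_width = data_width * pvadd          # dataportWidth = dataWidth*pvadd
    return {
        "address_width": clog2(k),           # addressWidth = $clog2(k)
        "data_width": port_width,
        "memory_bits": k * port_width,       # memorySize = k * dataportWidth
    }


def indice_buffer_geometry(k: int = 1024) -> dict:
    """Indice buffer: k*k/32 entries, each wide enough to index one of k columns."""
    depth = k * k // 32                      # integer division, as in Verilog
    data_width = clog2(k)                    # dataportWidth = $clog2(k)
    return {
        "address_width": clog2(depth),       # addressWidth = $clog2(k*k/32)
        "data_width": data_width,
        "memory_bits": depth * data_width,   # memorySize = k*k/32 * dataportWidth
    }


def indptr_buffer_geometry(k: int = 1024) -> dict:
    """Indptr buffer: k + 1 row pointers, each an offset into the indice buffer."""
    return {
        "address_width": clog2(k + 1),       # addressWidth = $clog2(k + 1)
        "data_width": clog2(k * k // 32),    # dataportWidth = $clog2(k*k/32)
    }
```

With the defaults above the feature buffer comes out at a 10-bit address and an 8192-bit data port, and the indice buffer at a 15-bit address and a 10-bit data port, which agrees with the width comments on the generated RAM instances. Note that Aggregation.v defaults to `pvadd = 128`, for which the same formula gives a 4096-bit port, so the generated RAM IP presumably has to be sized for whichever value is actually used.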
-------------------------------------------------------------------------------- /hardware/hardware-ARCH.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pgroupATusc/GraphACT/f0ecd30500247647105762934f5967350a0d8eab/hardware/hardware-ARCH.PNG -------------------------------------------------------------------------------- /hardware/indiceDoubleBuffer.v: -------------------------------------------------------------------------------- 1 | module indiceDoubleBuffer #( 2 | parameter k = 1024 // define k as the block size 3 | ) 4 | ( 5 | clk, 6 | rst, 7 | enableA, 8 | enableB, 9 | writeEnableA1, 10 | writeEnableA2, 11 | writeEnableB1, 12 | writeEnableB2, 13 | readportA1, 14 | readportA2, 15 | writeportA1, 16 | writeportA2, 17 | readportB1, 18 | readportB2, 19 | writeportB1, 20 | writeportB2, 21 | addressportA1, 22 | addressportA2, 23 | addressportB1, 24 | addressportB2 25 | ); 26 | 27 | 28 | parameter addressWidth = $clog2(k*k/32); 29 | parameter dataportWidth = $clog2(k); 30 | parameter memorySize = k*k/32 * dataportWidth; 31 | 32 | 33 | // define the input 34 | input clk; 35 | input rst; 36 | input enableA; 37 | input enableB; 38 | input writeEnableA1; 39 | input writeEnableA2; 40 | input writeEnableB1; 41 | input writeEnableB2; 42 | 43 | //define the addressport 44 | 45 | input [addressWidth - 1 : 0] addressportA1; 46 | input [addressWidth - 1 : 0] addressportA2; 47 | input [addressWidth - 1 : 0] addressportB1; 48 | input [addressWidth - 1 : 0] addressportB2; 49 | 50 | // define the dataport input 51 | 52 | 53 | output [dataportWidth - 1:0] readportA1; 54 | output [dataportWidth - 1:0] readportA2; 55 | input [dataportWidth - 1:0] writeportA1; 56 | input [dataportWidth - 1:0] writeportA2; 57 | output [dataportWidth - 1:0] readportB1; 58 | output [dataportWidth - 1:0] readportB2; 59 | input [dataportWidth - 1:0] writeportB1; 60 | input [dataportWidth - 1:0] writeportB2; 61 | 62 | indicebuffer u0 ( 
63 | .data_a (writeportA1), // input, width = 10, data_a.datain_a 64 | .q_a (readportA1), // output, width = 10, q_a.dataout_a 65 | .data_b (writeportA2), // input, width = 10, data_b.datain_b 66 | .q_b (readportA2), // output, width = 10, q_b.dataout_b 67 | .address_a (addressportA1), // input, width = 15, address_a.address_a 68 | .address_b (addressportA2), // input, width = 15, address_b.address_b 69 | .wren_a (writeEnableA1), // input, width = 1, wren_a.wren_a 70 | .wren_b (writeEnableA2), // input, width = 1, wren_b.wren_b 71 | .clock (clk) // input, width = 1, clock.clk 72 | ); 73 | 74 | indicebuffer u1 ( 75 | .data_a (writeportB1), // input, width = 10, data_a.datain_a 76 | .q_a (readportB1), // output, width = 10, q_a.dataout_a 77 | .data_b (writeportB2), // input, width = 10, data_b.datain_b 78 | .q_b (readportB2), // output, width = 10, q_b.dataout_b 79 | .address_a (addressportB1), // input, width = 15, address_a.address_a 80 | .address_b (addressportB2), // input, width = 15, address_b.address_b 81 | .wren_a (writeEnableB1), // input, width = 1, wren_a.wren_a 82 | .wren_b (writeEnableB2), // input, width = 1, wren_b.wren_b 83 | .clock (clk) // input, width = 1, clock.clk 84 | ); 85 | 86 | 87 | 88 | 89 | 90 | endmodule -------------------------------------------------------------------------------- /hardware/indptrDoubleBuffer.v: -------------------------------------------------------------------------------- 1 | module indptrDoubleBuffer #( 2 | parameter k = 1024 // define k as the block size 3 | ) 4 | ( 5 | clk, 6 | rst, 7 | enableA, 8 | enableB, 9 | writeEnableA1, 10 | writeEnableA2, 11 | writeEnableB1, 12 | writeEnableB2, 13 | readportA1, 14 | readportA2, 15 | writeportA1, 16 | writeportA2, 17 | readportB1, 18 | readportB2, 19 | writeportB1, 20 | writeportB2, 21 | addressportA1, 22 | addressportA2, 23 | addressportB1, 24 | addressportB2 25 | ); 26 | 27 | 28 | parameter addressWidth = $clog2(k + 1); 29 | parameter dataportWidth = $clog2(k*k/32); 30 | 
parameter memorySize = dataportWidth * (k+1) ; 31 | 32 | 33 | // define the input 34 | input clk; 35 | input rst; 36 | input enableA; 37 | input enableB; 38 | input writeEnableA1; 39 | input writeEnableA2; 40 | input writeEnableB1; 41 | input writeEnableB2; 42 | 43 | //define the addressport 44 | 45 | input [addressWidth - 1 : 0] addressportA1; 46 | input [addressWidth - 1 : 0] addressportA2; 47 | input [addressWidth - 1 : 0] addressportB1; 48 | input [addressWidth - 1 : 0] addressportB2; 49 | 50 | // define the dataport input 51 | 52 | 53 | output [dataportWidth - 1:0] readportA1; 54 | output [dataportWidth - 1:0] readportA2; 55 | input [dataportWidth - 1:0] writeportA1; 56 | input [dataportWidth - 1:0] writeportA2; 57 | output [dataportWidth - 1:0] readportB1; 58 | output [dataportWidth - 1:0] readportB2; 59 | input [dataportWidth - 1:0] writeportB1; 60 | input [dataportWidth - 1:0] writeportB2; 61 | 62 | indptrbuffer u0 ( 63 | .data_a (writeportA1), // input, width = 15, data_a.datain_a 64 | .q_a (readportA1), // output, width = 15, q_a.dataout_a 65 | .data_b (writeportA2), // input, width = 15, data_b.datain_b 66 | .q_b (readportA2), // output, width = 15, q_b.dataout_b 67 | .address_a (addressportA1), // input, width = 11, address_a.address_a 68 | .address_b (addressportA2), // input, width = 11, address_b.address_b 69 | .wren_a (writeEnableA1), // input, width = 1, wren_a.wren_a 70 | .wren_b (writeEnableA2), // input, width = 1, wren_b.wren_b 71 | .clock (clk) // input, width = 1, clock.clk 72 | ); 73 | 74 | indptrbuffer u1 ( 75 | .data_a (writeportB1), // input, width = 15, data_a.datain_a 76 | .q_a (readportB1), // output, width = 15, q_a.dataout_a 77 | .data_b (writeportB2), // input, width = 15, data_b.datain_b 78 | .q_b (readportB2), // output, width = 15, q_b.dataout_b 79 | .address_a (addressportB1), // input, width = 11, address_a.address_a 80 | .address_b (addressportB2), // input, width = 11, address_b.address_b 81 | .wren_a (writeEnableB1), // 
input, width = 1, wren_a.wren_a 82 | .wren_b (writeEnableB2), // input, width = 1, wren_b.wren_b 83 | .clock (clk) // input, width = 1, clock.clk 84 | ); 85 | 86 | 87 | 88 | endmodule 89 | -------------------------------------------------------------------------------- /hardware/reluArray.v: -------------------------------------------------------------------------------- 1 | 2 | module reluArray #( 3 | parameter dataWidth = 32, 4 | parameter pactivation= 128 5 | ) 6 | ( 7 | clk, 8 | rst, 9 | inputArray, 10 | outputArray 11 | ); 12 | 13 | // define the input 14 | input clk; 15 | input rst; 16 | input [dataWidth*pactivation -1:0] inputArray; 17 | 18 | //define the output 19 | output [dataWidth*pactivation -1:0] outputArray; 20 | 21 | wire [dataWidth*pactivation -1:0] outputArray; 22 | 23 | 24 | genvar i; 25 | generate 26 | for(i = 0;i < pactivation; i = i+1) begin : singlerelu 27 | reluUnit #(.dataWidth(dataWidth)) reluUnitinstance 28 | ( 29 | .clk(clk), // system clock 30 | .rst(rst), // system rst 31 | .z(inputArray[dataWidth*(i+1) -1: dataWidth*i]), // input value Z 32 | .a(outputArray[dataWidth*(i+1) -1: dataWidth*i]) 33 | ); // output activation A 34 | 35 | end 36 | endgenerate 37 | 38 | endmodule -------------------------------------------------------------------------------- /hardware/reluUnit.v: -------------------------------------------------------------------------------- 1 | 2 | module reluUnit #( 3 | parameter dataWidth = 32 4 | ) 5 | ( 6 | clk, // system clock 7 | rst, // system rst 8 | z, // input value Z 9 | a); // output activation A 10 | input clk; 11 | input rst; 12 | input[dataWidth -1 : 0] z; 13 | output[dataWidth - 1: 0] a; 14 | 15 | 16 | 17 | wire [7:0] compare; 18 | reg [dataWidth -1 : 0] a; 19 | 20 | 21 | 22 | comp compareUnit ( 23 | .clk (clk), // input, width = 1, clk.clk 24 | .areset (rst), // input, width = 1, areset.reset 25 | .a (z), // input, width = 32, a.a 26 | .b (32'd0), // input, width = 32, b.b 27 | .q (compare) // output, width 
= 1, q.q 28 | ); 29 | 30 | 31 | reg [dataWidth - 1: 0] z1; 32 | reg [dataWidth - 1: 0] z2; 33 | 34 | always @(posedge clk or negedge rst) begin : proc_z1 35 | if(~rst) begin 36 | z1 <= 0; 37 | end else begin 38 | z1 <= z; 39 | end 40 | end 41 | 42 | always @(posedge clk or negedge rst) begin : proc_z2 43 | if(~rst) begin 44 | z2 <= 0; 45 | end else begin 46 | z2 <= z1; 47 | end 48 | end 49 | 50 | 51 | always @(posedge clk or negedge rst) begin : proc_a 52 | if(~rst) begin 53 | a <= 0; 54 | end else begin 55 | if(compare[0]) 56 | begin 57 | a <= z2; 58 | end 59 | else 60 | begin 61 | a <= 32'd0; 62 | end 63 | end 64 | end 65 | 66 | 67 | 68 | 69 | endmodule 70 | -------------------------------------------------------------------------------- /hardware/rowbuffer.v: -------------------------------------------------------------------------------- 1 | module rowbuffer #( 2 | parameter dataWidth = 32, // the data bit width 3 | parameter pvadd= 256, // the parallelism in feature adder 4 | parameter k = 1024 // define k as the block size 5 | ) 6 | ( 7 | clk, 8 | rst, 9 | enableA, 10 | enableB, 11 | writeEnableA1, 12 | writeEnableA2, 13 | writeEnableB1, 14 | writeEnableB2, 15 | readportA1, 16 | readportA2, 17 | writeportA1, 18 | writeportA2, 19 | readportB1, 20 | readportB2, 21 | writeportB1, 22 | writeportB2, 23 | addressportA1, 24 | addressportA2, 25 | addressportB1, 26 | addressportB2 27 | ); 28 | 29 | 30 | parameter addressWidth = $clog2(k); 31 | parameter dataportWidth = dataWidth*pvadd; 32 | parameter memorySize = k * dataportWidth; 33 | 34 | 35 | // define the input 36 | input clk; 37 | input rst; 38 | input enableA; 39 | input enableB; 40 | input writeEnableA1; 41 | input writeEnableA2; 42 | input writeEnableB1; 43 | input writeEnableB2; 44 | 45 | //define the addressport 46 | 47 | input [addressWidth - 1 : 0] addressportA1; 48 | input [addressWidth - 1 : 0] addressportA2; 49 | input [addressWidth - 1 : 0] addressportB1; 50 | input [addressWidth - 1 : 0] 
addressportB2; 51 | 52 | // define the dataport input 53 | 54 | 55 | output [dataportWidth - 1:0] readportA1; 56 | output [dataportWidth - 1:0] readportA2; 57 | input [dataportWidth - 1:0] writeportA1; 58 | input [dataportWidth - 1:0] writeportA2; 59 | output [dataportWidth - 1:0] readportB1; 60 | output [dataportWidth - 1:0] readportB2; 61 | input [dataportWidth - 1:0] writeportB1; 62 | input [dataportWidth - 1:0] writeportB2; 63 | 64 | rowbuffer u0 ( 65 | .data_a (writeportA1), // input, width = 8192, data_a.datain_a 66 | .q_a (readportA1), // output, width = 8192, q_a.dataout_a 67 | .data_b (writeportA2), // input, width = 8192, data_b.datain_b 68 | .q_b (readportA2), // output, width = 8192, q_b.dataout_b 69 | .address_a (addressportA1), // input, width = 10, address_a.address_a 70 | .address_b (addressportA2), // input, width = 10, address_b.address_b 71 | .wren_a (writeEnableA1), // input, width = 1, wren_a.wren_a 72 | .wren_b (writeEnableA2), // input, width = 1, wren_b.wren_b 73 | .clock (clk) // input, width = 1, clock.clk 74 | ); 75 | 76 | rowbuffer u1 ( 77 | .data_a (writeportB1), // input, width = 8192, data_a.datain_a 78 | .q_a (readportB1), // output, width = 8192, q_a.dataout_a 79 | .data_b (writeportB2), // input, width = 8192, data_b.datain_b 80 | .q_b (readportB2), // output, width = 8192, q_b.dataout_b 81 | .address_a (addressportB1), // input, width = 10, address_a.address_a 82 | .address_b (addressportB2), // input, width = 10, address_b.address_b 83 | .wren_a (writeEnableB1), // input, width = 1, wren_a.wren_a 84 | .wren_b (writeEnableB2), // input, width = 1, wren_b.wren_b 85 | .clock (clk) // input, width = 1, clock.clk 86 | ); 87 | 88 | 89 | 90 | endmodule 91 | -------------------------------------------------------------------------------- /hardware/sysArray.v: -------------------------------------------------------------------------------- 1 | 2 | module sysArray #( 3 | parameter dataWidth = 32, 4 | parameter SysDimension = 32, 5 | 
parameter featureLen = 128 6 | ) 7 | ( 8 | clk, // define the input clock 9 | rst, // define the input reset 10 | enable, 11 | weightArray, // define the input weight array 12 | featureArray, // define the input feature array 13 | outputArray // define the output feature array 14 | ); 15 | 16 | //define the input 17 | input clk; 18 | input rst; 19 | input enable; 20 | input [dataWidth*SysDimension - 1:0] weightArray; 21 | input [dataWidth*SysDimension - 1:0] featureArray; 22 | output [dataWidth*SysDimension - 1:0] outputArray; 23 | reg [dataWidth*SysDimension - 1:0] outputArray; 24 | 25 | wire [dataWidth*SysDimension - 1:0] weightWireArray[0:SysDimension - 1]; 26 | wire [dataWidth*SysDimension - 1:0] featureWireArray[0:SysDimension - 1]; 27 | 28 | wire [dataWidth*SysDimension - 1:0] resultWireArray[0:SysDimension - 1]; 29 | 30 | 31 | genvar i, j; 32 | generate 33 | for(i = 0;i < SysDimension; i = i + 1) begin : rowUroll 34 | for(j = 0; j < SysDimension; j = j + 1) begin : columnUroll 35 | 36 | // define weight wire connection 37 | 38 | 39 | wire [dataWidth - 1:0] iweightwire; 40 | 41 | assign iweightwire = (j == 0) ? weightArray[(i + 1)*dataWidth - 1: i * dataWidth]: weightWireArray[j - 1][(i + 1)*dataWidth - 1: i * dataWidth]; 42 | 43 | wire [dataWidth - 1:0] iweightwireOut; 44 | 45 | assign iweightwireOut = weightWireArray[j][(i + 1)*dataWidth - 1: i * dataWidth]; 46 | 47 | // define feature wire connection 48 | 49 | wire [dataWidth - 1:0] ifeaturewire ; 50 | 51 | assign ifeaturewire = (i == 0) ? featureArray[(j + 1)*dataWidth - 1: j * dataWidth]: featureWireArray[i - 1][(j + 1)*dataWidth - 1: j * dataWidth]; 52 | 53 | wire [dataWidth - 1:0] ifeaturewireOut; 54 | 55 | assign ifeaturewireOut = featureWireArray[i][(j + 1)*dataWidth - 1: j * dataWidth]; 56 | 57 | // define the result wire connection 58 | 59 | wire [dataWidth - 1:0] iresultIn; 60 | 61 | assign iresultIn = (i == 0)? 
0: resultWireArray[i - 1][(j + 1)*dataWidth - 1: j * dataWidth]; 62 | 63 | wire [dataWidth - 1:0] iresultOut; 64 | 65 | assign resultWireArray[i][(j + 1)*dataWidth - 1: j * dataWidth] = iresultOut; 66 | 67 | 68 | 69 | sysUnit #( 70 | .dataWidth(dataWidth), 71 | .RowIndex(i), 72 | .ColumnIndex(j), 73 | .featureLen(featureLen), 74 | .SysDimension(SysDimension) 75 | ) singleSysUnit 76 | ( 77 | .clk(clk), // input clock 78 | .rst(rst), // input reset 79 | .enable(enable), // input enable 80 | .weight(iweightwire), // input weight 81 | .featureIn(ifeaturewire), // input feature 82 | .preresult(iresultIn), // input previous result 83 | .featureOut(iresultOut), // output feature 84 | .featurePass(ifeaturewireOut),// output feature 85 | .weightPass(iweightwireOut) 86 | ); 87 | 88 | end 89 | end 90 | endgenerate 91 | 92 | 93 | always @(posedge clk or negedge rst) begin : proc_outputArray 94 | if(~rst) begin 95 | outputArray <= 0; 96 | end else begin 97 | outputArray <= resultWireArray[SysDimension - 1]; 98 | end 99 | end 100 | 101 | endmodule -------------------------------------------------------------------------------- /hardware/sysUnit.v: -------------------------------------------------------------------------------- 1 | module sysUnit #( 2 | parameter dataWidth = 32, 3 | parameter RowIndex = 0, 4 | parameter ColumnIndex = 0, 5 | parameter featureLen = 128, 6 | parameter SysDimension = 32 7 | ) 8 | ( 9 | clk, // input clock 10 | rst, // input reset 11 | enable, // input enable 12 | weight, // input weight 13 | featureIn, // input feature 14 | preresult, // input previous result 15 | featureOut, // output feature 16 | featurePass, // pass feature 17 | weightPass // pass weight 18 | ); 19 | 20 | // define the input 21 | input clk; 22 | input rst; 23 | input enable; 24 | input [dataWidth - 1:0] weight; 25 | input [dataWidth - 1:0] featureIn; 26 | input [dataWidth - 1:0] preresult; 27 | 28 | //define the output 29 | output [dataWidth - 1:0] featureOut; 30 | reg [dataWidth 
- 1:0] featureOut; 31 | output [dataWidth - 1:0] featurePass; 32 | reg [dataWidth - 1:0] featurePass; 33 | output [dataWidth - 1:0] weightPass; 34 | reg [dataWidth - 1:0] weightPass; 35 | 36 | 37 | parameter InitialLantency = 45; 38 | parameter StartPlace = RowIndex + ColumnIndex + featureLen + InitialLantency + 1; 39 | parameter counterWidth = $clog2(InitialLantency + RowIndex + ColumnIndex + featureLen + 8); 40 | 41 | 42 | parameter InitialStat = 1'b0; 43 | parameter PipelineStat = 1'b1; 44 | 45 | always @(posedge clk or negedge rst) begin : proc_weightPass 46 | if(~rst) begin 47 | weightPass <= 0; 48 | end else begin 49 | if(enable) 50 | weightPass <= weight; 51 | else 52 | weightPass <= weightPass; 53 | end 54 | end 55 | 56 | always @(posedge clk or negedge rst) begin : proc_featurePass 57 | if(~rst) begin 58 | featurePass <= 0; 59 | end else begin 60 | if(enable) 61 | featurePass <= featureIn; 62 | else 63 | featurePass <= featurePass; 64 | end 65 | end 66 | 67 | 68 | reg [counterWidth - 1: 0] counter; 69 | 70 | reg state; 71 | 72 | always @(posedge clk or negedge rst) begin : proc_state 73 | if(~rst) begin 74 | state <= InitialStat; 75 | end else begin 76 | if(state == InitialStat && counter == StartPlace) begin 77 | state <= PipelineStat; 78 | end 79 | else begin 80 | state <= state; 81 | end 82 | end 83 | end 84 | 85 | 86 | always @(posedge clk or negedge rst) begin : proc_counter 87 | if(~rst) begin 88 | counter <= 0; 89 | end else begin 90 | if(enable) begin 91 | if(state == InitialStat && counter < StartPlace - 1) 92 | counter <= counter + 1; 93 | else if(state == InitialStat && counter == StartPlace - 1) 94 | counter <= 0; 95 | else if(state == PipelineStat && counter < featureLen - 1) 96 | counter <= counter + 1; 97 | else if(state == PipelineStat && counter == featureLen - 1) 98 | counter <= 0; 99 | else 100 | counter <= counter + 1; 101 | end 102 | else begin 103 | counter <= counter; 104 | end 105 | end 106 | end 107 | 108 | 109 | 110 | reg 
dataValid; 111 | 112 | always @(posedge clk or negedge rst) begin : proc_dataValid 113 | if(~rst) begin 114 | dataValid <= 1'b0; 115 | end else begin 116 | if( state == InitialStat && enable && counter >= RowIndex + ColumnIndex ) 117 | dataValid <= 1'b1; 118 | else if(state == PipelineStat && enable) 119 | dataValid <= 1'b1; 120 | else 121 | dataValid <= 1'b0; 122 | end 123 | end 124 | 125 | reg dataLast; 126 | 127 | always @(posedge clk or negedge rst) begin : proc_dataLast 128 | if(~rst) begin 129 | dataLast <= 0; 130 | end else begin 131 | if(state == InitialStat && enable && counter == RowIndex + ColumnIndex + featureLen) 132 | dataLast <= 1'b1; 133 | else if(state == PipelineStat && enable && counter == featureLen - InitialLantency) 134 | dataLast <= 1'b1; 135 | else 136 | dataLast <= 1'b0; 137 | end 138 | end 139 | 140 | wire Mulvalid; 141 | wire [dataWidth - 1:0] Mulresult; 142 | wire Mullast; 143 | 144 | 145 | 146 | mul u0 ( 147 | .clk (clk), // input, width = 1, clk.clk 148 | .areset (rst), // input, width = 1, areset.reset 149 | .a (weight), // input, width = 32, a.a 150 | .b (featureIn), // input, width = 32, b.b 151 | .q (Mulresult) // output, width = 32, q.q 152 | ); 153 | 154 | 155 | 156 | 157 | wire Accvalid; 158 | wire [dataWidth - 1:0] Accresult; 159 | wire Acclast; 160 | 161 | 162 | singleAcc u1 ( 163 | .clk (clk), // input, width = 1, clk.clk 164 | .areset (rst), // input, width = 1, areset.reset 165 | .a (Mulresult), // input, width = 32, a.a 166 | .q (Accresult), // output, width = 32, q.q 167 | .acc (Mulvalid) // input, width = 1, acc.acc 168 | ); 169 | 170 | assign Acclast = 1'b1; 171 | assign Accvalid = 1'b1; 172 | // 173 | reg [dataWidth - 1:0] temResult; 174 | 175 | always @(posedge clk or negedge rst) begin : proc_temResult 176 | if(~rst) begin 177 | temResult <= 0; 178 | end else begin 179 | if(Acclast && Accvalid) 180 | temResult <= Accresult; 181 | else 182 | temResult <= temResult; 183 | end 184 | end 185 | 186 | always @(posedge clk 
or negedge rst) begin : proc_featureOut 187 | if(~rst) begin 188 | featureOut <= 0; 189 | end else begin 190 | if(state == PipelineStat && enable && counter == 2*SysDimension - (RowIndex + ColumnIndex)) 191 | featureOut <= temResult; 192 | else if (state == PipelineStat && enable && counter == 2*SysDimension - (RowIndex + ColumnIndex)) 193 | featureOut <= preresult; 194 | else 195 | featureOut <= featureOut; 196 | end 197 | end 198 | 199 | 200 | 201 | endmodule 202 | -------------------------------------------------------------------------------- /hardware/system.v: -------------------------------------------------------------------------------- 1 | `timescale 1ns / 1ps 2 | ////////////////////////////////////////////////////////////////////////////////// 3 | // Company: 4 | // Engineer: 5 | // 6 | // Create Date: 12/21/2019 05:26:25 PM 7 | // Design Name: 8 | // Module Name: system 9 | // Project Name: 10 | // Target Devices: 11 | // Tool Versions: 12 | // Description: 13 | // 14 | // Dependencies: 15 | // 16 | // Revision: 17 | // Revision 0.01 - File Created 18 | // Additional Comments: 19 | // 20 | ////////////////////////////////////////////////////////////////////////////////// 21 | 22 | 23 | 24 | module system( 25 | 26 | clk, 27 | rst, 28 | enable, 29 | mode, 30 | indptr_addressportA1_outside, 31 | indptr_addressportB1_outside, 32 | indptr_startAddressA, 33 | indptr_startAddressB, 34 | indptr_endAddress, 35 | indptr_writeportA1, 36 | indptr_writeportB1, 37 | indice_addressportA1_outside, 38 | indice_addressportB1_outside, 39 | indice_writeportA1, 40 | indice_writeportB1, 41 | columnbuffer_addressportA1_outside, 42 | columnbuffer_addressportB1_outside, 43 | columnbuffer_writeportA1_narrow, 44 | columnbuffer_writeportB1_narrow, 45 | 46 | weight_narrow, 47 | resultOut_narrow 48 | 49 | ); 50 | 51 | parameter dataWidth = 32; 52 | parameter k = 1024; 53 | parameter psys = 24; 54 | parameter featureLen = 256; 55 | parameter pactivation = 128; 56 | parameter pvadd = 
256; 57 | parameter pavdd = 256; 58 | 59 | 60 | input clk; 61 | input rst; 62 | input enable; 63 | input mode; 64 | 65 | input [$clog2(k + 1) - 1:0] indptr_addressportA1_outside; 66 | input [$clog2(k + 1) - 1:0] indptr_addressportB1_outside; 67 | input [$clog2(k + 1) - 1:0] indptr_startAddressA; 68 | input [$clog2(k + 1) - 1:0] indptr_startAddressB; 69 | input [$clog2(k + 1) - 1:0] indptr_endAddress; 70 | 71 | input [$clog2(k*k/32) - 1:0] indptr_writeportA1; 72 | input [$clog2(k*k/32) - 1:0] indptr_writeportB1; 73 | 74 | input [$clog2(k*k/32) - 1:0] indice_addressportA1_outside; 75 | input [$clog2(k*k/32) - 1:0] indice_addressportB1_outside; 76 | 77 | input [$clog2(k) - 1:0] indice_writeportA1; 78 | input [$clog2(k) - 1:0] indice_writeportB1; 79 | 80 | input [$clog2(k) - 1:0] columnbuffer_addressportA1_outside; 81 | input [$clog2(k) - 1:0] columnbuffer_addressportB1_outside; 82 | 83 | input wire [dataWidth - 1:0] columnbuffer_writeportA1_narrow; 84 | input wire [dataWidth - 1:0] columnbuffer_writeportB1_narrow; 85 | 86 | input wire [dataWidth - 1:0] weight_narrow; 87 | output wire [dataWidth - 1:0] resultOut_narrow; 88 | 89 | wire [dataWidth*psys - 1:0] resultOut; 90 | 91 | wire [dataWidth*pvadd - 1:0] columnbuffer_writeportA1; 92 | wire [dataWidth*pvadd - 1:0] columnbuffer_writeportB1; 93 | 94 | 95 | wire [dataWidth*psys - 1:0] weightIn; 96 | 97 | genvar i; 98 | generate 99 | for (i = 0; i< pvadd;i = i+1) begin:portnarrowTowide 100 | assign columnbuffer_writeportA1[ (i + 1)*dataWidth - 1 : i*dataWidth] = columnbuffer_writeportA1_narrow; 101 | assign columnbuffer_writeportB1[ (i + 1)*dataWidth - 1 : i*dataWidth] = columnbuffer_writeportB1_narrow; 102 | end 103 | endgenerate 104 | 105 | genvar j; 106 | generate 107 | for (j = 0; j < psys ; j = j+1) begin: weightGenration 108 | assign weightIn[(j+1)*dataWidth - 1: j*dataWidth] = weight_narrow; 109 | end 110 | endgenerate 111 | 112 | wire [dataWidth - 1:0] weight_narrow_r[0:psys - 1]; 113 | 114 | genvar l; 115 | 
generate 116 | for (l = 0; l < psys - 1; l = l+1) begin : resultoutgeneration 117 | if(l == 0) begin : firstloop 118 | assign weight_narrow_r[l] = resultOut[(l + 2)*dataWidth - 1:(l+1)*dataWidth] & resultOut[(l+1)*dataWidth - 1:l*dataWidth]; 119 | end 120 | else begin : remainingloop 121 | assign weight_narrow_r[l] = weight_narrow_r[l-1] & resultOut[(l + 2)*dataWidth - 1:(l+1)*dataWidth]; 122 | end 123 | end 124 | endgenerate 125 | 126 | assign resultOut_narrow = weight_narrow_r[psys - 2]; 127 | 128 | wire [$clog2(k) - 1:0] rowAddressConnection; 129 | wire [pvadd*dataWidth - 1:0] rowdataConnection; 130 | 131 | Aggregation #( 132 | .dataWidth(dataWidth), // the data bit width 133 | .pvadd(pvadd), // the parallelism in feature adder 134 | .k(k) // define k as the block size 135 | ) singleAggregation 136 | ( 137 | // system signal 138 | .clk(clk), 139 | .rst(rst), 140 | .enable(enable), 141 | 142 | // port for indptrDoubleBuffer 143 | .indptr_writeEnableA1(enable), 144 | .indptr_writeEnableB1(enable), 145 | .indptr_addressportA1_outside(indptr_addressportA1_outside), 146 | .indptr_addressportB1_outside(indptr_addressportB1_outside), 147 | .indptr_writeportA1(indptr_writeportA1), 148 | .indptr_writeportB1(indptr_writeportB1), 149 | 150 | .indptr_startAddressA(indptr_startAddressA), 151 | .indptr_startAddressB(indptr_startAddressB), 152 | .indptr_endAddress(indptr_endAddress), 153 | 154 | // port for indiceDoubleBuffer 155 | .indice_writeEnableA1(enable), 156 | .indice_writeEnableB1(enable), 157 | .indice_addressportA1_outside(indice_addressportA1_outside), 158 | .indice_addressportB1_outside(indice_addressportB1_outside), 159 | .indice_writeportA1(indice_writeportA1), 160 | .indice_writeportB1(indice_writeportB1), 161 | 162 | // port for DoubleColumnBuffer 163 | .columnbuffer_writeEnableA1(enable), 164 | .columnbuffer_writeEnableB1(enable), 165 | .columnbuffer_addressportA1_outside(columnbuffer_addressportA1_outside), 166 | 
.columnbuffer_addressportB1_outside(columnbuffer_addressportB1_outside), 167 | .columnbuffer_writeportA1(columnbuffer_writeportA1), 168 | .columnbuffer_writeportB1(columnbuffer_writeportB1), 169 | 170 | // port for Double Row Buffer 171 | 172 | .rowbuffer_addressportA1_outside(rowAddressConnection), 173 | .rowbuffer_dataOut(rowdataConnection) 174 | 175 | 176 | ); 177 | 178 | 179 | transformationPlusActivation #( 180 | .dataWidth(dataWidth), 181 | .psys(psys), 182 | .featureLen(featureLen), 183 | .pactivation(pactivation), 184 | .pavdd(pavdd), 185 | .k(k) 186 | ) singletransformationPlusActivation 187 | ( 188 | .clk(clk), 189 | .rst(rst), 190 | .enable(enable), 191 | // define port for weight buffer 192 | .weightWriteEnable(enable), 193 | .weigthDataIn(weightIn), 194 | 195 | //define port for systolic array 196 | 197 | 198 | //define the port for aggregation in 199 | .rowbuffer_address(rowAddressConnection), 200 | .rowbuffer_dataOut(rowdataConnection), 201 | 202 | //define the model 203 | .mode(mode), 204 | 205 | // 206 | .resultOut(resultOut) 207 | 208 | ); 209 | 210 | 211 | 212 | endmodule -------------------------------------------------------------------------------- /hardware/transformationPlusActivation.v: -------------------------------------------------------------------------------- 1 | `timescale 1ns / 1ps 2 | ////////////////////////////////////////////////////////////////////////////////// 3 | // Company: 4 | // Engineer: 5 | // 6 | // Create Date: 12/21/2019 09:53:24 AM 7 | // Design Name: 8 | // Module Name: transformationPlusActivation 9 | // Project Name: 10 | // Target Devices: 11 | // Tool Versions: 12 | // Description: 13 | // 14 | // Dependencies: 15 | // 16 | // Revision: 17 | // Revision 0.01 - File Created 18 | // Additional Comments: 19 | // 20 | ////////////////////////////////////////////////////////////////////////////////// 21 | 22 | 23 | module transformationPlusActivation #( 24 | parameter dataWidth = 32, 25 | parameter psys = 32, 26 | 
parameter featureLen = 128, 27 | parameter pactivation = 128, 28 | parameter pavdd = 128, 29 | parameter k = 1024 30 | ) 31 | ( 32 | clk, 33 | rst, 34 | enable, 35 | // define port for weight buffer 36 | weightWriteEnable, 37 | weigthDataIn, 38 | 39 | //define port for systolic array 40 | 41 | 42 | //define the port for aggregation in 43 | rowbuffer_address, 44 | rowbuffer_dataOut, 45 | 46 | //define the mode 47 | mode, 48 | 49 | // 50 | resultOut 51 | 52 | ); 53 | input clk; 54 | input rst; 55 | input enable; 56 | input weightWriteEnable; 57 | input [dataWidth*psys -1 :0] weigthDataIn; 58 | input mode; 59 | 60 | input [featureLen*dataWidth - 1:0] rowbuffer_dataOut; 61 | 62 | output wire [psys*dataWidth - 1:0] resultOut; 63 | 64 | 65 | wire [psys*dataWidth - 1:0] reluIn; 66 | wire [psys*dataWidth - 1:0] reluOut; 67 | 68 | 69 | wire [dataWidth*psys -1 :0] weightBufferTosysArray; 70 | 71 | reg [$clog2(k) - 1:0] rowAddressCounter; 72 | output wire [$clog2(k) - 1:0] rowbuffer_address; 73 | 74 | 75 | always @(posedge clk or negedge rst) begin : rowAddressCounter_r 76 | if(~rst) begin 77 | rowAddressCounter <= 0; 78 | end else begin 79 | rowAddressCounter <= rowAddressCounter + 1 ; 80 | end 81 | end 82 | 83 | assign rowbuffer_address = rowAddressCounter; 84 | 85 | 86 | parameter integer round = $floor(featureLen / psys); 87 | 88 | 89 | wire [dataWidth*psys -1 :0] sysin[0:round- 2]; 90 | 91 | 92 | wire [dataWidth*psys -1 :0] sysDataOut; 93 | 94 | 95 | genvar i; 96 | generate 97 | for(i = 0; i< round - 1;i=i+1) begin:sysin_loop // drive sysin[0] .. sysin[round - 2], the last of which is read below 98 | if(i == 0) begin : round_r 99 | assign sysin[i] = rowbuffer_dataOut[(i+1)*psys * dataWidth - 1: i*psys*dataWidth] & rowbuffer_dataOut[(i+2)*psys * dataWidth - 1: (i + 1)*psys*dataWidth]; 100 | end 101 | else begin : round_p 102 | assign sysin[i] = sysin[i - 1] & rowbuffer_dataOut[(i+2)*psys * dataWidth - 1: (i + 1)*psys*dataWidth]; 103 | end 104 | end 105 | endgenerate 106 | 107 | 108 | 109 | wire [dataWidth*psys -1 :0] sysarrayIn; 110 | 111 | 
assign sysarrayIn = (mode == 1'b0)? sysin[round - 2]: reluOut; 112 | 113 | assign reluIn = (mode == 1'b0)? sysDataOut: sysin[round - 2]; 114 | 115 | assign resultOut = (mode == 1'b0)? reluOut: sysDataOut; 116 | 117 | sysArray #( 118 | .dataWidth(dataWidth), 119 | .SysDimension(psys), 120 | .featureLen(featureLen) 121 | ) singlesysArray 122 | ( 123 | .clk(clk), // define the input clock 124 | .rst(rst), // define the input reset 125 | .enable(enable), // define the enable signal port 126 | .weightArray(weightBufferTosysArray), // define the input weight array 127 | .featureArray(sysarrayIn), // define the input feature array 128 | .outputArray(sysDataOut) // define the output feature array 129 | ); 130 | 131 | parameter weightBufferAddressWidth = $clog2(featureLen * featureLen / psys); 132 | 133 | 134 | reg[weightBufferAddressWidth - 1:0] WeightAddr; 135 | 136 | 137 | always @(posedge clk or negedge rst) begin : proc_WeightAddr 138 | if(~rst) begin 139 | WeightAddr <= 0; 140 | end else begin 141 | WeightAddr <= WeightAddr + 1; 142 | end 143 | end 144 | 145 | weightBuffer #( 146 | .dataWidth(dataWidth), 147 | .featureLen(featureLen), 148 | .psys(psys) 149 | ) singleWeightBuffer 150 | ( 151 | .clk(clk), 152 | .rst(rst), 153 | .enable(enable), 154 | .writenable(weightWriteEnable), 155 | .din(weigthDataIn), 156 | .dout(weightBufferTosysArray), 157 | .addr(WeightAddr) 158 | ); 159 | 160 | 161 | 162 | 163 | 164 | 165 | reluArray #( 166 | .dataWidth(dataWidth), 167 | .pactivation(pactivation) 168 | ) singlereluArray 169 | ( 170 | .clk(clk), 171 | .rst(rst), 172 | .inputArray(reluIn), 173 | .outputArray(reluOut) 174 | ); 175 | 176 | endmodule 177 | 178 | -------------------------------------------------------------------------------- /hardware/vectorAdd.v: -------------------------------------------------------------------------------- 1 | module vectorAdd #( 2 | parameter dataWidth = 32, 3 | parameter pvadd= 128 4 | ) 5 | ( 6 | clk, 7 | rst, 8 | lastin, 9 | lastout, 10 | 
vectorIn, 11 | vectorOut 12 | ); 13 | 14 | // define the input 15 | input clk; 16 | input rst; 17 | input lastin; 18 | input [dataWidth*pvadd -1: 0] vectorIn; 19 | 20 | //define the output 21 | 22 | output wire lastout; 23 | output [dataWidth*pvadd -1: 0] vectorOut; 24 | 25 | //define the type of output 26 | 27 | 28 | wire [dataWidth*pvadd -1: 0] vectorOut; 29 | assign vectorOut = 1'b1; 30 | genvar i; 31 | generate 32 | for(i = 0;i < pvadd; i = i+1) begin : single 33 | 34 | singleAcc u0 ( 35 | .clk (clk), // input, width = 1, clk.clk 36 | .areset (rst), // input, width = 1, areset.reset 37 | .a (vectorIn[(i + 1)*dataWidth - 1: i*dataWidth]), // input, width = 32, a.a 38 | .q (vectorOut[(i + 1)*dataWidth - 1: i*dataWidth]), // output, width = 32, q.q 39 | .acc (lastin) // input, width = 1, acc.acc 40 | ); 41 | 42 | 43 | 44 | 45 | end 46 | endgenerate 47 | 48 | endmodule -------------------------------------------------------------------------------- /hardware/weightBuffer.v: -------------------------------------------------------------------------------- 1 | 2 | module weightBuffer #( 3 | parameter dataWidth = 32, 4 | parameter featureLen = 256, 5 | parameter psys = 24 6 | ) 7 | ( 8 | clk, 9 | rst, 10 | enable, 11 | writenable, 12 | din, 13 | dout, 14 | addr 15 | 16 | ); 17 | 18 | parameter addressWidth = $clog2(featureLen * featureLen / psys); 19 | parameter dataportWidth = dataWidth*psys; 20 | parameter memorySize = 288 * 288 * dataWidth; 21 | 22 | 23 | 24 | // define the input port 25 | input clk; 26 | input rst; 27 | input enable; 28 | input writenable; 29 | input [dataportWidth - 1:0] din; 30 | input [addressWidth -1:0] addr; 31 | output [dataportWidth - 1:0] dout; 32 | 33 | weightbuffer u0 ( 34 | .data (din), // input, width = 768, data.datain 35 | .q (dout), // output, width = 768, q.dataout 36 | .wraddress ({1'b0, addr}), // input, width = 12, wraddress.wraddress 37 | .rdaddress ({1'b0, addr}), // input, width = 12, rdaddress.rdaddress 38 | .wren 
               (writenable),   // input,  width = 1,   wren.wren
    .clock     (clk)           // input,  width = 1,   clock.clk
);

endmodule

--------------------------------------------------------------------------------
/perf_model/perf_analysis.py:
--------------------------------------------------------------------------------
import math


class perf_analysis:
    def __init__(self,RDSP,RBRAM,RDDR):
        """
        Define the resources of the target FPGA

        Inputs:
            RDSP        (int)       total num of DSP slices on-chip
            RBRAM       (float)     total BRAM size (in mega bits)
            RDDR        (float)     total off-chip bandwidth (in GB/s)
        Output:
            None
        """
        self.RDSP = RDSP
        self.RBRAM = RBRAM
        self.RDDR = RDDR
        self.dim_hid = None
        self.dim_class = None
        self.dim_node = None
        self.reset()

    def set_hardware_spec(self,FREQ,COST_word,COST_mult,COST_add,COST_mac):
        """
        Specify the key cost values for the target accelerator

        Inputs:
            FREQ        (float)     expected frequency (in MHz)
            COST_word   (int)       num bit per word (e.g., for single precision
                                    floating point, this value = 32)
            COST_mult   (int)       num of DSP slices to implement one multiplier
            COST_add    (int)       num of DSP slices to implement one adder
            COST_mac    (int)       num of DSP slices to implement one MAC
        Output:
            None
        """
        self.FREQ = FREQ
        self.COST_word = COST_word
        self.COST_mult = COST_mult
        self.COST_add = COST_add
        self.COST_mac = COST_mac

    def set_GNN_spec(self,L,dim_hid,num_epoch):
        """
        Specify the GNN architecture and training parameters

        Inputs:
            L           (int)       num of graph conv layers
            dim_hid     (int)       hidden dimension of graph conv layers
            num_epoch   (int)       estimated num of epochs till convergence
        Output:
            None
        """
        self.L = L
        self.dim_hid = dim_hid
        self.num_epoch = num_epoch
        if self.dim_node:
            self.dim_all_l = [self.dim_node]+[self.dim_hid]*self.L+[self.dim_class]

    def set_graph_spec(self,V,gamma,beta,dim_node,dim_class):
        """
        Specify the graph parameters

        Inputs:
            V           (int)       num training nodes
            gamma       (float)     redundancy reduction ratio
            beta        (float)     overhead in storing the partial sum
            dim_node    (int)       dim of initial node feature
            dim_class   (int)       num classes
        Output:
            None
        """
        self.V = V
        self.gamma, self.beta = gamma, beta
        self.dim_node, self.dim_class = dim_node, dim_class
        if self.dim_hid:
            self.dim_all_l = [self.dim_node]+[self.dim_hid]*self.L+[self.dim_class]

    def set_design(self,Psys,Pagg,Vsub,dsub):
        """
        Set the design parameters of the architecture

        Inputs:
            Psys        (int)       dim of systolic array
            Pagg        (int)       dim of feat aggr array
            Vsub        (int)       size of the sampled subgraph
            dsub        (int)       degree of the sampled subgraph
        Output:
            None
        """
        self.Psys, self.Pagg = Psys, Pagg
        self.Vsub, self.dsub = Vsub, dsub

    def ret_design(self):
        return self.Psys, self.Pagg, self.Vsub, self.dsub

    def reset(self):
        # self.dim_hid = None
        # self.dim_class = None
        # self.dim_node = None
        self.Psys = None
        self.Pagg = None
        self.Vsub = None
        self.dsub = None

    def _consumption_BRAM(self):
        """
        Compute the consumption of on-chip BRAM

        Inputs:
            None
        Output:
            None (the result, in words, is stored in self.buf_actual)
        """
        _buf_weight = 2*(self.L+1)*self.dim_hid**2
        _buf_WT = self.Psys*self.Vsub
        # size of buffer for storing the partial sum
        _dim_max = max(self.dim_hid,self.dim_node,self.dim_class)
        _buf_partial_agg = _dim_max*self.beta*self.Vsub
        # buffer to store the grad wrt X (to back-prop to prev layer)
        _buf_grad_X = 2*max(self.dim_hid,self.dim_class)*self.Vsub
        _buf_X = sum([f*self.Vsub for f in self.dim_all_l])
        _buf_inter = _dim_max*self.Vsub
        # NOTE: _buf_X is computed above but is not part of this sum; the
        # (L+5)*dim_hid*Vsub term of the commented-out estimate below would
        # correspond to _buf_grad_X + _buf_X + _buf_inter.
        self.buf_actual = _buf_weight+_buf_WT+_buf_partial_agg+_buf_grad_X+_buf_inter
        # buf_pred = (self.L+5)*self.dim_hid*self.Vsub \
        #             + self.dim_hid*self.beta*self.Vsub \
        #             + self.Psys*self.Vsub + 2*(self.L+1)*self.dim_hid**2

    def _consumption_DSP(self):
        self.dsp_actual = self.Pagg*self.COST_add + self.Psys**2*self.COST_mac

    def is_feasible(self):
        self._consumption_BRAM()
        self._consumption_DSP()
        # RBRAM is in mega bits, so RBRAM/COST_word*1e6 is the capacity in words
        return (self.buf_actual < self.RBRAM/self.COST_word*1e6) \
                and (self.dsp_actual < self.RDSP) \
                and (self.Pagg <= self.dim_hid)

    def cycle_batch(self):
        # forward prop: one aggr + two matmul per graph conv layer
        t_agg_forward = [int(self.Vsub*self.dsub*self.gamma*math.ceil(f/self.Pagg)\
                        + math.ceil(self.Vsub/self.Psys)*f) for f in self.dim_all_l[:-2]]
        t_sys_forward_half = [int(math.ceil(0.5*self.dim_all_l[l+1]/self.Psys)\
                        * self.dim_all_l[l] * math.ceil(self.Vsub/self.Psys)) for l in range(self.L)]
        t_forward = [max(t_agg_forward[l],t_sys_forward_half[l])+t_sys_forward_half[l] for l in range(self.L)]
        t_forward.append(self.dim_all_l[-2]*math.ceil(self.dim_all_l[-1]/self.Psys)*math.ceil(self.Vsub/self.Psys))
        # backward prop: two aggr + four matmul per graph conv layer
        t_agg_backward_1 = [int(self.Vsub*self.dsub*self.gamma*math.ceil(f/self.Pagg)
                        + math.ceil(f/self.Psys)*self.Vsub) for f in self.dim_all_l[:-2]]
        t_agg_backward_2 = [int(self.Vsub*self.dsub*self.gamma*math.ceil(0.5*f/self.Pagg)) for f in self.dim_all_l[2:-1]]
        t_sys_backward_w_half = [int(math.ceil(0.5*self.dim_all_l[l+1]/self.Psys)\
                        * math.ceil(self.dim_all_l[l]/self.Psys)*self.Vsub) for l in range(self.L)]
        t_sys_backward_x_half = [int(0.5*self.dim_all_l[l+2]*math.ceil(self.Vsub/self.Psys)\
                        * math.ceil(self.dim_all_l[l+1]/self.Psys)) for l in range(self.L-1)]
        t_backward = [max(t_agg_backward_1[l],t_sys_backward_w_half[l])+t_sys_backward_w_half[l] for l in range(self.L)]\
                   + [max(t_agg_backward_2[l],t_sys_backward_x_half[l])+t_sys_backward_x_half[l] for l in range(self.L-1)]
        t_backward.append(math.ceil(self.dim_all_l[-1]/self.Psys)*math.ceil(self.dim_all_l[-2]/self.Psys)*self.Vsub)
        t_backward.append(math.ceil(self.Vsub/self.Psys)*math.ceil(self.dim_all_l[-1]/self.Psys)*self.dim_all_l[-2])
        # forward and backward prop
        t_total = sum(t_forward) + sum(t_backward)
        return t_total

    def time_converge(self, cyc_bat):
        # seconds per batch * batches per epoch * epochs (FREQ is in MHz)
        return cyc_bat/(self.FREQ*1e6)*math.ceil(self.V/self.Vsub)*self.num_epoch

    def ops_total(self):
        ops_agg_forward = [int(self.gamma*self.Vsub*self.dsub*f) for f in self.dim_all_l[:-2]]
        ops_sys_forward = [2*self.Vsub*self.dim_all_l[l]*self.dim_all_l[l+1] for l in range(self.L+1)]
        ops_agg_backward_1 = [int(self.Vsub*self.dsub*self.gamma*f) for f in self.dim_all_l[:-2]]
        ops_agg_backward_2 = [int(self.Vsub*self.dsub*self.gamma*0.5*f) for f in self.dim_all_l[2:-1]]
        ops_sys_backward_1 = [2*self.Vsub*self.dim_all_l[l]*self.dim_all_l[l+1] for l in range(self.L+1)]
        ops_sys_backward_2 = [2*self.Vsub*self.dim_all_l[l+1]*self.dim_all_l[l+2] for l in range(self.L-1)]
        return sum(ops_agg_forward)+sum(ops_sys_forward)\
             + sum(ops_agg_backward_1)+sum(ops_agg_backward_2)\
             + sum(ops_sys_backward_1)+sum(ops_sys_backward_2)


def DSE():
    RDSP, RBRAM, RDDR = 5760, 229, 0
    pa = perf_analysis(RDSP, RBRAM, RDDR)
    FREQ, COST_word, COST_mult, COST_add, COST_mac = 200, 32, 1, 1, 2
    pa.set_hardware_spec(FREQ, COST_word, COST_mult, COST_add, COST_mac)
    L, dim_hid, num_epoch = 2, 256, 10
    pa.set_GNN_spec(L, dim_hid, num_epoch)
    V, gamma, beta, dim_node, dim_class = int(716847*0.66), 0.7, 1, 300, 100
    pa.set_graph_spec(V, gamma, beta, dim_node, dim_class)
    # largest systolic array dimension whose MACs alone fit in the DSP budget
    Psys_bound = int((RDSP/COST_mac)**0.5)
    best_design = None
    min_time = float('inf')
    # exhaustive search over (Psys, Pagg, Vsub) with dsub fixed to 15
    for _ps in range(1,Psys_bound):
        for _pa in range(16,dim_hid,16):
            for _vs in range(500,8000,500):
                pa.reset()
                pa.set_design(_ps,_pa,_vs,15)
                if not pa.is_feasible():
                    continue
                _cyc = pa.cycle_batch()
                _time = pa.time_converge(_cyc)
                if _time < min_time:
                    min_time = _time
                    best_design = pa.ret_design()
    print("Best design: Psys = {}\tPaggr = {}\tVsub = {}\tdsub = {}"\
            .format(*best_design))
    print("Convergence time: {}".format(min_time))


if __name__ == "__main__":
    DSE()
--------------------------------------------------------------------------------
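
The DSP budget check and the convergence-time estimate used by the design space exploration in `perf_analysis.py` can be tried out in isolation. The following is a minimal standalone sketch, not part of the repository: the function names `dsp_slices`, `psys_upper_bound` and `time_to_converge` are hypothetical, and the default costs (1 DSP slice per adder, 2 per MAC) are simply the values that `DSE()` passes in.

```python
import math


def dsp_slices(psys, pagg, cost_add=1, cost_mac=2):
    """DSP slices for `pagg` adders plus a psys x psys array of MAC units.

    Mirrors the expression in `_consumption_DSP`; the default costs are
    assumptions taken from the values used in `DSE()`.
    """
    return pagg * cost_add + psys ** 2 * cost_mac


def psys_upper_bound(rdsp, cost_mac=2):
    """Largest array dimension whose MAC units alone fit in `rdsp` slices."""
    return int(math.sqrt(rdsp / cost_mac))


def time_to_converge(cycles_per_batch, freq_mhz, num_nodes, vsub, num_epoch):
    """Seconds until convergence: batch time x batches per epoch x epochs."""
    seconds_per_batch = cycles_per_batch / (freq_mhz * 1e6)
    batches_per_epoch = math.ceil(num_nodes / vsub)
    return seconds_per_batch * batches_per_epoch * num_epoch


if __name__ == "__main__":
    # Illustrative design point only; not a result reported by the authors.
    print("DSP slices:", dsp_slices(psys=24, pagg=128))
    print("Psys bound:", psys_upper_bound(5760))
    print("Seconds:", time_to_converge(200e6, 200, 1000, 500, 10))
```

Keeping these as plain functions makes it easy to check a single candidate design by hand before running the full nested search.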