├── Makefile
├── .gitignore
├── img
│   ├── linear.png
│   ├── Logarithmic.png
│   ├── interleaved-visual.png
│   └── sequential-visual.png
├── README.md
└── Maximum.cu


/Makefile:
--------------------------------------------------------------------------------
all:
	nvcc -o Maximum Maximum.cu
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.csv
*.exe
*.exp
*.lib
--------------------------------------------------------------------------------
/img/linear.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MaxKotlan/Cuda-Find-Max-Using-Parallel-Reduction/HEAD/img/linear.png
--------------------------------------------------------------------------------
/img/Logarithmic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MaxKotlan/Cuda-Find-Max-Using-Parallel-Reduction/HEAD/img/Logarithmic.png
--------------------------------------------------------------------------------
/img/interleaved-visual.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MaxKotlan/Cuda-Find-Max-Using-Parallel-Reduction/HEAD/img/interleaved-visual.png
--------------------------------------------------------------------------------
/img/sequential-visual.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MaxKotlan/Cuda-Find-Max-Using-Parallel-Reduction/HEAD/img/sequential-visual.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Maximum Value Algorithm Variants Using Parallel Reduction

The purpose of this program is to benchmark different variants of an algorithm for finding the maximum value in a set of elements. The data sets are randomly generated with `rand`, and the program writes the performance of each variant to the command line in CSV format.

## What is Parallel Reduction?

Parallel reduction is a common design pattern for executing associative operations (operations whose result does not depend on how the operands are grouped or ordered, such as addition, multiplication, and maximum) in parallel: the input is combined pairwise, halving the number of candidates on each pass, so n elements can be reduced in roughly log2(n) parallel steps instead of n - 1 sequential ones.
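As an illustration of the idea, here is a minimal host-side sketch in plain C (the function `reduce_max` is hypothetical and not part of the benchmarked program; the CUDA kernels in `Maximum.cu` parallelize the combines within each pass across threads):

```c
#include <stdio.h>

/* Tree reduction on the host: each pass folds the back half of the
   remaining candidates into the front half, so n elements take about
   log2(n) passes. The combines within one pass are independent of each
   other, which is what a GPU exploits by running them as threads. */
float reduce_max(float* v, int n){
    for (int active = n; active > 1; active = (active + 1) / 2)
        for (int i = 0; i < active / 2; i++)
            if (v[active - 1 - i] > v[i]) v[i] = v[active - 1 - i];
    return v[0]; /* note: reduces in place */
}

int main(void){
    float v[] = {3.f, 41.f, 5.f, 9.f, 26.f, 8.f, 97.f, 2.f};
    printf("%g\n", reduce_max(v, 8)); /* prints 97 after 3 passes */
    return 0;
}
```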
## RESULTS
The following graphs were created in Excel from data generated through the benchmark mode of the application.

![Maximum Value Algorithm Linear Scale](img/linear.png)
![Maximum Value Algorithm Logarithmic Scale](img/Logarithmic.png)

Three algorithms were tested in this benchmark:

- Interleaved Addressing and Global Memory
- Interleaved Addressing and Shared Memory
- Sequential Addressing and Shared Memory

Sequential addressing with shared memory was the fastest variant. Sequential addressing beats interleaved addressing because all of the memory that needs to be referenced is grouped together in consecutive addresses. The following visuals come from [*Nvidia's Optimizing Parallel Reduction in CUDA*](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf).

## Interleaved Addressing
![Interleaved Addressing](img/interleaved-visual.png)
Interleaved addressing pairs up elements separated by a stride. Each iteration the stride doubles, until the stride length equals the size of the data set. The result of each pairwise combine is stored in the pair's first element, so as the stride grows, the active elements spread farther and farther apart in memory. This is less efficient than sequential addressing because of the way the GPU accesses global memory: when a thread requests a single element, the hardware actually fetches an entire contiguous chunk of memory, not just that element.

## Sequential Addressing
![Sequential Addressing](img/sequential-visual.png)
Since sequential addressing stores each result consecutively, fewer memory requests are needed: a single request returns a large chunk of consecutive addresses, and every element in that chunk is useful. This is why sequential addressing is so much faster.

## Shared Memory
Shared memory is on-chip, close to the streaming multiprocessor, and is significantly faster to read and write than global memory. This is why the shared-memory version of the interleaved algorithm is faster than the global-memory version of the same algorithm.
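The following condensed sketch combines the two ideas above, staging a tile of data into shared memory and reducing it with sequential addressing. It is illustrative only: the kernel name and the separate `out` buffer are not from the benchmarked program (which writes back only block 0's result into `data[0]`), and it assumes a power-of-two block size of at most 1024 threads.

```cuda
#include <float.h>   /* FLT_MAX */

/* One maximum per block: stage a tile into fast on-chip shared memory,
   then reduce it with sequential addressing. */
__global__ void max_shared_sequential(const float* in, float* out, int n){
    __shared__ float sdata[1024];   /* one slot per thread in the block */
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    /* Stage this block's slice; threads past the end of the data
       contribute -FLT_MAX, so a partially filled block never reduces
       uninitialized values. */
    sdata[threadIdx.x] = (idx < n) ? in[idx] : -FLT_MAX;
    __syncthreads();

    /* Sequential addressing: the active threads and the values they read
       stay packed at the front of sdata, halving each pass. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2){
        if (threadIdx.x < stride && sdata[threadIdx.x + stride] > sdata[threadIdx.x])
            sdata[threadIdx.x] = sdata[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = sdata[0];
}
```

A second launch over `out` (or a host-side pass) would then combine the per-block maxima into the final answer.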
--------------------------------------------------------------------------------
/Maximum.cu:
--------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <limits.h>  /* INT_MAX */
#include <float.h>   /* FLT_MAX */

#define MAX_CUDA_THREADS_PER_BLOCK 1024

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

struct Startup{
    int random_range = INT_MAX;
    int threads_per_block = MAX_CUDA_THREADS_PER_BLOCK;
} startup;

struct DataSet{
    float* values;
    int size;
};

struct Result{
    float MaxValue;
    float KernelExecutionTime; /* milliseconds */
};

DataSet generateRandomDataSet(int size){
    DataSet data;
    data.size = size;
    data.values = (float*)malloc(sizeof(float)*data.size);

    for (int i = 0; i < data.size; i++)
        data.values[i] = (float)(rand()%startup.random_range);

    return data;
}

/* Interleaved addressing directly on global memory. Note: __syncthreads()
   only synchronizes within a block, so each block effectively reduces its
   own tile; the bounds check keeps the last, partially filled block from
   reading past the end of the data. */
__global__ void Max_Interleaved_Addressing_Global(float* data, int data_size){
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    for(int stride=1; stride < data_size; stride *= 2) {
        if (idx % (2*stride) == 0 && idx + stride < data_size) {
            float lhs = data[idx];
            float rhs = data[idx + stride];
            data[idx] = lhs < rhs ? rhs : lhs;
        }
        __syncthreads();
    }
}

__global__ void Max_Interleaved_Addressing_Shared(float* data, int data_size){
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ float sdata[MAX_CUDA_THREADS_PER_BLOCK];

    /* copy to shared memory; out-of-range threads contribute -FLT_MAX so a
       partially filled block never reduces uninitialized values (this also
       keeps every thread reaching __syncthreads below) */
    sdata[threadIdx.x] = (idx < data_size) ? data[idx] : -FLT_MAX;
    __syncthreads();

    for(int stride=1; stride < blockDim.x; stride *= 2) {
        if (threadIdx.x % (2*stride) == 0) {
            float lhs = sdata[threadIdx.x];
            float rhs = sdata[threadIdx.x + stride];
            sdata[threadIdx.x] = lhs < rhs ? rhs : lhs;
        }
        __syncthreads();
    }
    /* each block computes the max of its own tile; only block 0's result
       is written back */
    if (idx == 0) data[0] = sdata[0];
}


__global__ void Max_Sequential_Addressing_Shared(float* data, int data_size){
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ float sdata[MAX_CUDA_THREADS_PER_BLOCK];

    /* copy to shared memory, padding out-of-range threads as above */
    sdata[threadIdx.x] = (idx < data_size) ? data[idx] : -FLT_MAX;
    __syncthreads();

    /* sequential addressing: the active threads and the values they touch
       stay packed at the front of sdata */
    for(int stride=blockDim.x/2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            float lhs = sdata[threadIdx.x];
            float rhs = sdata[threadIdx.x + stride];
            sdata[threadIdx.x] = lhs < rhs ? rhs : lhs;
        }
        __syncthreads();
    }
    if (idx == 0) data[0] = sdata[0];
}

/*Algorithm information. Includes pointers to the kernels, so they can be executed dynamically*/
const int Algorithm_Count = 3;
typedef void (*Kernel)(float *, int);
const char* Algorithm_Name[Algorithm_Count] = {"Max_Interleaved_Addressing_Global", "Max_Interleaved_Addressing_Shared", "Max_Sequential_Addressing_Shared"};
const Kernel Algorithm[Algorithm_Count] = { Max_Interleaved_Addressing_Global, Max_Interleaved_Addressing_Shared, Max_Sequential_Addressing_Shared};

Result calculateMaxValue(DataSet data, Kernel algorithm){
    float* device_data;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    gpuErrchk(cudaMalloc((void **)&device_data, sizeof(float)*data.size));
    gpuErrchk(cudaMemcpy(device_data, data.values, sizeof(float)*data.size, cudaMemcpyHostToDevice));

    /* round the block count up so every element is covered */
    int blocks_needed = (data.size + startup.threads_per_block - 1) / startup.threads_per_block;
    cudaEventRecord(start);
    algorithm<<<blocks_needed, startup.threads_per_block>>>(device_data, data.size);
    cudaEventRecord(stop);
    gpuErrchk(cudaEventSynchronize(stop));

    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    float max_value;
    gpuErrchk(cudaMemcpy(&max_value, device_data, sizeof(float), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaFree(device_data));
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    Result r = {max_value, milliseconds};
    return r;
}

Result calculateMaxValue(DataSet data){
    return calculateMaxValue(data, Algorithm[Algorithm_Count - 1]);
}

void printDataSet(DataSet data){
    for (int i = 0; i < data.size; i++)
        printf("%.6g, ", data.values[i]);
    printf("\n");
}

void benchmarkCSV(){
    /*Print headers*/
    printf("Elements, ");
    for (int algoID = 0; algoID < Algorithm_Count; algoID++)
        printf("%s, ", Algorithm_Name[algoID]);
    printf("\n");
    /*Benchmark; stop at 2^30 so dataSize *= 2 cannot overflow a signed int*/
    for (int dataSize = 2; dataSize <= (1 << 30); dataSize*=2){
        DataSet random = generateRandomDataSet(dataSize);
        printf("%d, ", dataSize);
        for (int algoID = 0; algoID < Algorithm_Count; algoID++) {
            Result r = calculateMaxValue(random, Algorithm[algoID]);
            printf("%g, ", r.KernelExecutionTime);
        }
        printf("\n");
        free(random.values);
    }
}

int main(int argc, char** argv){
    srand(time(nullptr));
    benchmarkCSV();
}
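/* Example (hypothetical, not part of the benchmark) of exercising a single
   kernel directly instead of the full CSV sweep; 1 << 20 elements is an
   arbitrary illustration:

       DataSet d = generateRandomDataSet(1 << 20);
       Result r = calculateMaxValue(d, Max_Sequential_Addressing_Shared);
       printf("max = %g (%.3f ms)\n", r.MaxValue, r.KernelExecutionTime);
       free(d.values);
*/
--------------------------------------------------------------------------------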