├── oldcode ├── misc │ ├── cf │ ├── sgemmN │ ├── README │ ├── cmpG │ ├── compileCG │ ├── gpuFunctions.linkinfo │ ├── compileG │ ├── compileC │ ├── compileGso │ ├── compileCX │ ├── devinfo_cr.py │ ├── vector.c │ ├── simple.cu │ ├── ctypes_extra.py │ ├── kernelGL.cu │ ├── devinfo_cu.py │ ├── simple.cubin │ ├── mklMath.py │ ├── sgemmN.log │ ├── gpuFunctions.py │ ├── ctypes_array_test.py │ ├── simple.py │ ├── cpuFunctions.py │ ├── matadd.txt │ ├── ctypes_array.py │ └── utilities.py ├── examples │ ├── __init__.py │ ├── TODO │ └── bw_test.py ├── cu │ └── __init__.py ├── cufft │ ├── __init__.py │ ├── cufft_defs.py │ └── cufft_api.py ├── cublas │ ├── __init__.py │ └── cublas_defs.py └── cuda │ └── __init__.py ├── cuda ├── cu │ └── __init__.py ├── cublas │ └── __init__.py ├── cuda │ └── __init__.py ├── sugar │ ├── memory │ │ ├── __init__.py │ │ └── linear.py │ ├── fft │ │ ├── __init__.py │ │ ├── fftconvolve2d_kernel.cu │ │ ├── conv_gold.py │ │ └── fft.py │ ├── query │ │ ├── __init__.py │ │ ├── cu_utils.py │ │ └── cuda_utils.py │ ├── blas │ │ ├── __init__.py │ │ ├── sdot.py │ │ ├── saxpy.py │ │ └── sgemm.py │ ├── kernel │ │ ├── __init__.py │ │ ├── tests │ │ │ ├── matrix_mul.py │ │ │ └── matrix_mul_kernel.cu │ │ ├── kernelfactorydrv.py │ │ ├── kernelfactoryrt.py │ │ └── compiler.py │ └── __init__.py ├── cufft │ └── __init__.py ├── utils │ ├── __init__.py │ ├── libutils.py │ └── logger.py └── __init__.py ├── .gitignore ├── mkdist ├── tests ├── cuda │ └── todo │ │ ├── cuda_GLimg.png │ │ ├── cuda_gflops.py │ │ ├── cuda_saxpy.py │ │ ├── cuda_poly.py │ │ ├── cuda_trig.py │ │ ├── cuda_add.py │ │ ├── cuda_blsc.py │ │ ├── cuda_sgemm.py │ │ ├── cuda_GL.py │ │ ├── cuda_streams.py │ │ └── cuda_QtGL.py ├── cufft │ ├── todo │ │ ├── manyfft.py │ │ ├── bfft.py │ │ ├── dfft.py │ │ ├── sfft.py │ │ ├── gfft_cu.py │ │ ├── gfft_cuda.py │ │ ├── xfft.py │ │ ├── cuda_fft.py │ │ └── cu_fft.py │ ├── cufft_fft.py │ └── fftlab.py ├── cu │ └── todo │ │ ├── cu_gflops.py │ │ ├── cu_saxpy.py │ │ ├── 
cu_add.py │ │ ├── cu_trig.py │ │ ├── cu_poly.py │ │ ├── cu_blsc.py │ │ ├── cu_sgemm.py │ │ └── cu_streams.py └── test_cublas.py ├── README.rst ├── xml ├── generate-xml.sh.orig ├── generate-xml_macosx.sh ├── generate-xml_linux.sh ├── generate-xml.sh └── createbindings.py └── setup.py /oldcode/misc/cf: -------------------------------------------------------------------------------- 1 | cu_fft.py -------------------------------------------------------------------------------- /oldcode/examples/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | -------------------------------------------------------------------------------- /cuda/cu/__init__.py: -------------------------------------------------------------------------------- 1 | from cudadrv import * 2 | -------------------------------------------------------------------------------- /cuda/cublas/__init__.py: -------------------------------------------------------------------------------- 1 | from cublas import * 2 | -------------------------------------------------------------------------------- /cuda/cuda/__init__.py: -------------------------------------------------------------------------------- 1 | from cudart import * 2 | -------------------------------------------------------------------------------- /oldcode/cu/__init__.py: -------------------------------------------------------------------------------- 1 | from cu_api import * 2 | -------------------------------------------------------------------------------- /oldcode/cufft/__init__.py: -------------------------------------------------------------------------------- 1 | from cufft_api import * 2 | -------------------------------------------------------------------------------- /cuda/sugar/memory/__init__.py: -------------------------------------------------------------------------------- 1 | from linear import * 2 | -------------------------------------------------------------------------------- 
/oldcode/cublas/__init__.py: -------------------------------------------------------------------------------- 1 | from cublas_api import * 2 | -------------------------------------------------------------------------------- /oldcode/cuda/__init__.py: -------------------------------------------------------------------------------- 1 | from cuda_api import * 2 | 3 | -------------------------------------------------------------------------------- /cuda/sugar/fft/__init__.py: -------------------------------------------------------------------------------- 1 | from fft import * 2 | from fftconvolve import * 3 | -------------------------------------------------------------------------------- /cuda/sugar/query/__init__.py: -------------------------------------------------------------------------------- 1 | from cu_utils import * 2 | from cuda_utils import * 3 | -------------------------------------------------------------------------------- /cuda/cufft/__init__.py: -------------------------------------------------------------------------------- 1 | from cufft import * 2 | CUFFT_FORWARD = -1 3 | CUFFT_INVERSE = 1 4 | -------------------------------------------------------------------------------- /oldcode/misc/sgemmN: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/npinto/python-cuda/HEAD/oldcode/misc/sgemmN -------------------------------------------------------------------------------- /cuda/sugar/blas/__init__.py: -------------------------------------------------------------------------------- 1 | from saxpy import * 2 | from sdot import * 3 | from sgemm import * 4 | -------------------------------------------------------------------------------- /cuda/sugar/kernel/__init__.py: -------------------------------------------------------------------------------- 1 | from kernelfactorydrv import * 2 | from kernelfactoryrt import * 3 | 
-------------------------------------------------------------------------------- /cuda/sugar/__init__.py: -------------------------------------------------------------------------------- 1 | import memory 2 | import kernel 3 | import fft 4 | import blas 5 | import query 6 | -------------------------------------------------------------------------------- /cuda/utils/__init__.py: -------------------------------------------------------------------------------- 1 | from logger import * 2 | from libutils import * 3 | from decorator import memoize 4 | 5 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.py~ 3 | junk 4 | build 5 | *egg* 6 | *testbed* 7 | *.so 8 | *.linkinfo 9 | *.swp 10 | -------------------------------------------------------------------------------- /mkdist: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | rm -rf build MANIFEST 3 | python setup.py bdist --formats=gztar 4 | rm -rf build MANIFEST 5 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_GLimg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/npinto/python-cuda/HEAD/tests/cuda/todo/cuda_GLimg.png -------------------------------------------------------------------------------- /oldcode/misc/README: -------------------------------------------------------------------------------- 1 | just a temp place to put stuff that we haven't sorted yet (e.g. from the old 'examples') 2 | 3 | 4 | -------------------------------------------------------------------------------- /oldcode/examples/TODO: -------------------------------------------------------------------------------- 1 | - put 6.963 examples (converted to python-cuda) 2 | - put SDK examples (converted to python-cuda) 3 | - anything else! 
-------------------------------------------------------------------------------- /oldcode/misc/cmpG: -------------------------------------------------------------------------------- 1 | #!/bin/tcsh -f 2 | set flopt="--maxrregcount 32 --use_fast_math --gpu-name sm_11" 3 | set lib="" 4 | nvcc -o ${1} -O3 $flopt ${1}.cu -lcublas 5 | -------------------------------------------------------------------------------- /oldcode/misc/compileCG: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | echo "Compiling CPU functions" 3 | compileC cpuFunctions 4 | echo "Compiling GPU functions" 5 | compileG gpuFunctions 6 | compileG simple 7 | -------------------------------------------------------------------------------- /oldcode/misc/gpuFunctions.linkinfo: -------------------------------------------------------------------------------- 1 | --import %laneid,%ctaid,%nctaid,%smid,A7,%pm3,%pm2,%pm1,__CC-temp__0__,%pm0,%tid,%clock,%warpid,%ntid,%gridid --export gpuScale,gpuPOLY5,gpuPOLY20,gpuPOLY40,init_array,gpuSAXPY,gpuSGEMM,gpuPOLY10,gpuVADD,gpuGFLOPS,gpuTRIG,gpuBLSC -------------------------------------------------------------------------------- /oldcode/misc/compileG: -------------------------------------------------------------------------------- 1 | #!/bin/tcsh -f 2 | set flag1="--ptx" 3 | set flag2="--cubin" 4 | #set flopt="--maxrregcount 12 --use_fast_math --gpu-architecture sm_11" 5 | set flopt="--use_fast_math --gpu-architecture sm_11" 6 | set lib="" 7 | nvcc $flag1 $flopt ${1}.cu |& grep -iv warning 8 | nvcc $flag2 $flopt ${1}.ptx |& grep -iv warning 9 | -------------------------------------------------------------------------------- /oldcode/misc/compileC: -------------------------------------------------------------------------------- 1 | #!/bin/tcsh -f 2 | set inc="-I/opt/local/Library/Frameworks/Python.framework/Versions/2.6/include/" 3 | #set inc="-I/usr/include/python2.5" 4 | set lib="" 5 | set flags="-fPIC -O2 -msse2"# 
-malign-double" 6 | gcc -c $flags $inc ${1}.c 7 | gcc -shared $lib -o _${1}.so ${1}.o 8 | strip -x _${1}.so 9 | rm *.o 10 | -------------------------------------------------------------------------------- /oldcode/misc/compileGso: -------------------------------------------------------------------------------- 1 | #!/bin/tcsh -f 2 | set flag1="" 3 | #set flopt="--maxrregcount 12 --use_fast_math --gpu-code sm_11" 4 | set flopt="--use_fast_math --gpu-code sm_11 --ptxas-options=-v" 5 | set lib="-L$CUDA/lib -lcudart -lcuda" 6 | #nvcc ${flag1} ${flopt} ${1}.cu -c -o ${1}.o |& grep -iv warning 7 | nvcc ${flag1} ${flopt} ${1}.cu -c -o ${1}.o 8 | g++ -shared ${lib} -o lib${1}.so ${1}.o 9 | strip -x lib${1}.so 10 | rm ${1}.o 11 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | Description 2 | =========== 3 | 4 | Python bindings for CUDA 2.1 with numpy integration 5 | 6 | Authors 7 | ------- 8 | 9 | Justin Riley (jtriley@mit.edu) 10 | Nicolas Pinto (pinto@mit.edu) 11 | 12 | Mailing List 13 | ============ 14 | 15 | http://groups.google.com/group/python-cuda 16 | 17 | Bug Tracker 18 | =========== 19 | 20 | http://npinto.lighthouseapp.com/projects/24960-python-cuda 21 | 22 | License 23 | ======= 24 | 25 | see the LICENSE file 26 | 27 | -------------------------------------------------------------------------------- /oldcode/misc/compileCX: -------------------------------------------------------------------------------- 1 | #!/bin/tcsh -f 2 | 3 | # Link all needed Intel routines into _vector.so 4 | # thus avoiding later problems with dynamic loading 5 | 6 | set src="vector" 7 | set mkl="/opt/intel/mkl/10.0.1.014/lib/32" 8 | set lib="$mkl/libmkl_intel.a $mkl/libmkl_intel_thread.a $mkl/libmkl_core.a $mkl/libguide.a -lpthread" 9 | set flags="-static -fPIC -O2 -msse2 -malign-double" 10 | gcc -c $flags $src.c 11 | gcc -shared -o _$src.so $src.o $lib 12 | 
strip -x _$src.so 13 | rm *.o 14 | -------------------------------------------------------------------------------- /tests/cufft/todo/manyfft.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | 4 | import sys 5 | from utilities import System 6 | 7 | cmd = "cu_fft.py %d,%d,%d %s" 8 | scm = "" 9 | 10 | if len(sys.argv) > 1: 11 | scm = sys.argv[1] 12 | 13 | vz = (64,128,256) 14 | if "x" not in scm: 15 | vz = (64,128,256,512) 16 | 17 | for nx in (128,256): 18 | for ny in (128,256): 19 | for nz in vz: 20 | s,o,e = System(cmd % (nx,ny,nz,scm)) 21 | print o[-1] 22 | print 23 | print "---\n" 24 | -------------------------------------------------------------------------------- /oldcode/misc/devinfo_cr.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | 4 | from ctypes import * 5 | from cuda.cuda_api import * 6 | 7 | if __name__ == "__main__": 8 | print "+------------------------+" 9 | print "| CUDA Device Info |" 10 | print "| using CUDA runtime API |" 11 | print "+------------------------+\n" 12 | count = c_int() 13 | cudaGetDeviceCount(byref(count)) 14 | print "number of devices =", count.value 15 | props = cudaDeviceProp() 16 | cudaGetDeviceProperties(props, 0) 17 | print props 18 | -------------------------------------------------------------------------------- /xml/generate-xml.sh.orig: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | python createbindings.py -H $CUDA_PATH/include/cuda.h -l $CUDA_PATH/lib/libcuda.$LIB_EXT -x cu.xml -p ../cuda/cu/cudadrv.py 4 | python createbindings.py -H $CUDA_PATH/include/cuda_runtime.h -l $CUDA_PATH/lib/libcudart.$LIB_EXT -x cudart.xml -p ../cuda/cuda/cudart.py 5 | python createbindings.py -H $CUDA_PATH/include/cublas.h -l $CUDA_PATH/lib/libcublas.$LIB_EXT -x cublas.xml -p 
../cuda/cublas/cublas.py 6 | python createbindings.py -H $CUDA_PATH/include/cufft.h -l $CUDA_PATH/lib/libcufft.$LIB_EXT -x cufft.xml -p ../cuda/cufft/cufft.py 7 | 8 | #find . -iname \*.py -exec python {} \; 9 | 10 | 11 | -------------------------------------------------------------------------------- /cuda/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import cuda 4 | import cu 5 | 6 | import cublas 7 | import cufft 8 | 9 | import sugar 10 | import utils 11 | 12 | import platform 13 | 14 | # add $CUDA_ROOT/bin to %PATH% in windows 15 | if platform.system() == "Windows": 16 | import _winreg as wreg 17 | reg = wreg.ConnectRegistry(None, wreg.HKEY_LOCAL_MACHINE) 18 | key = wreg.OpenKey(reg, r"SOFTWARE\NVIDIA Corporation\Installed Products\NVIDIA CUDA") 19 | import os 20 | cuda_bin = os.path.join(wreg.QueryValueEx(key, "InstallDir")[0],"bin") 21 | os.environ['PATH'] += os.path.pathsep + cuda_bin 22 | 23 | import atexit 24 | atexit.register(cuda.cudaThreadExit) 25 | 26 | def debug(): 27 | utils.enable_debug() 28 | -------------------------------------------------------------------------------- /oldcode/misc/vector.c: -------------------------------------------------------------------------------- 1 | // To force linking in all needed Intel routines 2 | // See compileCX for details (linking against .a) 3 | void Dummy(void) 4 | { 5 | vsAdd(); 6 | vsSub(); 7 | vsDiv(); 8 | vsSqr(); 9 | vsMul(); 10 | vsAbs(); 11 | vsInv(); 12 | 13 | vsSin(); 14 | vsCos(); 15 | vsSinCos(); 16 | vsTan(); 17 | vsAsin(); 18 | vsAcos(); 19 | vsAtan(); 20 | vsAtan2(); 21 | 22 | vsSinh(); 23 | vsCosh(); 24 | vsTanh(); 25 | vsAsinh(); 26 | vsAcosh(); 27 | vsAtanh(); 28 | 29 | vsPow(); 30 | vsPowx(); 31 | vsSqrt(); 32 | vsCbrt(); 33 | vsInvSqrt(); 34 | vsInvCbrt(); 35 | vsHypot(); 36 | 37 | vsFloor(); 38 | vsCeil(); 39 | vsRound(); 40 | vsTrunc(); 41 | vsRint(); 42 | vsNearbyInt(); 43 | vsModf(); 44 | 45 | vsExp(); 46 | 
vsLn(); 47 | vsLog10(); 48 | 49 | vsErf(); 50 | vsErfc(); 51 | vsErfInv(); 52 | } 53 | -------------------------------------------------------------------------------- /xml/generate-xml_macosx.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # NP: do we need a python script for that ? 4 | 5 | python createbindings.py -H /usr/local/cuda/include/cuda.h -l /usr/local/cuda/lib/libcuda.dylib -I /usr/local/cuda/include -x cu.xml -p ../cuda/cu/cudadrv.py 6 | python createbindings.py -H /usr/local/cuda/include/cuda_runtime.h -l /usr/local/cuda/lib/libcudart.dylib -I /usr/local/cuda/include -x cudart.xml -p ../cuda/cuda/cudart.py 7 | python createbindings.py -H /usr/local/cuda/include/cublas.h -l /usr/local/cuda/lib/libcublas.dylib -I /usr/local/cuda/include -x cublas.xml -p ../cuda/cublas/cublas.py 8 | python createbindings.py -H /usr/local/cuda/include/cufft.h -l /usr/local/cuda/lib/libcufft.dylib -I /usr/local/cuda/include -x cufft.xml -p ../cuda/cufft/cufft.py 9 | 10 | #find . 
-iname \*.py -exec python {} \; 11 | 12 | python ../cuda/cu/cudadrv.py 13 | python ../cuda/cuda/cudart.py 14 | python ../cuda/cublas/cublas.py 15 | python ../cuda/cufft/cufft.py 16 | -------------------------------------------------------------------------------- /oldcode/misc/simple.cu: -------------------------------------------------------------------------------- 1 | // © Arno Pähler, 2007-08 2 | extern "C" { 3 | typedef float _F; 4 | typedef const float _cF; 5 | typedef const unsigned int _cI; 6 | 7 | texture<float, 1, cudaReadModeElementType> Arg; 8 | 9 | __global__ void TRIG 10 | (_F *d_Out1, _F *d_Out2, _cF *d_In1, _cI size ) 11 | { 12 | _cI tid = blockDim.x * blockIdx.x + threadIdx.x; 13 | _cI tsz = blockDim.x * gridDim.x; 14 | int i; 15 | 16 | for (i = tid; i < size; i += tsz) 17 | { 18 | d_Out1[i] = cosf(d_In1[i]); 19 | d_Out2[i] = sinf(d_In1[i]); 20 | } 21 | } 22 | 23 | __global__ void TRIGTex 24 | (_F *d_Out1, _F *d_Out2, _cI size ) 25 | { 26 | _cI tid = blockDim.x * blockIdx.x + threadIdx.x; 27 | _cI tsz = blockDim.x * gridDim.x; 28 | int i; 29 | float x; // per-thread value; a __shared__ scalar here would race across threads 30 | 31 | for (i = tid; i < size; i += tsz) 32 | { 33 | x = tex1Dfetch(Arg,i); 34 | d_Out1[i] = cosf(x); 35 | d_Out2[i] = sinf(x); 36 | } 37 | } 38 | } 39 | -------------------------------------------------------------------------------- /xml/generate-xml_linux.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # NP: do we need a python script for that ? 
4 | 5 | INCLUDES="-I ./ -I /usr/local/cuda/include" 6 | 7 | python createbindings.py -H my_CUDA2100_vector_types.h -H cuda.h -l /usr/lib/libcuda.so $INCLUDES -x cudadrv.xml -p cudadrv.py && \ 8 | python cudadrv.py && cp -vf cudadrv.py ../cuda/cu/ 9 | 10 | python createbindings.py -H my_CUDA2100_vector_types.h -H cuda_runtime.h -l /usr/local/cuda/lib/libcudart.so $INCLUDES -x cudart.xml -p cudart.py && \ 11 | python cudart.py && cp -vf cudart.py ../cuda/cuda/ 12 | 13 | python createbindings.py -H my_CUDA2100_vector_types.h -H cublas.h -l /usr/local/cuda/lib/libcublas.so $INCLUDES -x cublas.xml -p cublas.py && \ 14 | python cublas.py && cp -vf cublas.py ../cuda/cublas/ 15 | 16 | python createbindings.py -H my_CUDA2100_vector_types.h -H cufft.h -l /usr/local/cuda/lib/libcufft.so $INCLUDES -x cufft.xml -p cufft.py && \ 17 | python cufft.py && cp -vf cufft.py ../cuda/cufft/ 18 | 19 | #rm -vf cudadrv.xml cudadrv.py cudart.xml cudart.py cublas.xml cublas.py cufft.xml cufft.py -------------------------------------------------------------------------------- /oldcode/cublas/cublas_defs.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | 3 | from ctypes import * 4 | 5 | c_int_p = POINTER(c_int) 6 | c_uint_p = POINTER(c_uint) 7 | c_float_p = POINTER(c_float) 8 | 9 | ###include "cuComplex.h" /* import complex data type */ 10 | ## 11 | ##/* CUBLAS status returns */ 12 | ###define CUBLAS_STATUS_SUCCESS 0x00000000 13 | ###define CUBLAS_STATUS_NOT_INITIALIZED 0x00000001 14 | ###define CUBLAS_STATUS_ALLOC_FAILED 0x00000003 15 | ###define CUBLAS_STATUS_INVALID_VALUE 0x00000007 16 | ###define CUBLAS_STATUS_MAPPING_ERROR 0x0000000B 17 | ###define CUBLAS_STATUS_EXECUTION_FAILED 0x0000000D 18 | ###define CUBLAS_STATUS_INTERNAL_ERROR 0x0000000E 19 | CUBLAS_STATUS_SUCCESS = 0x00000000 20 | CUBLAS_STATUS_NOT_INITIALIZED = 0x00000001 21 | CUBLAS_STATUS_ALLOC_FAILED = 0x00000003 22 | CUBLAS_STATUS_INVALID_VALUE = 
0x00000007 23 | CUBLAS_STATUS_MAPPING_ERROR = 0x0000000B 24 | CUBLAS_STATUS_EXECUTION_FAILED = 0x0000000D 25 | CUBLAS_STATUS_INTERNAL_ERROR = 0x0000000E 26 | 27 | ##/* CUBLAS data types */ 28 | ##typedef unsigned int cublasStatus; 29 | cublasStatus = c_uint 30 | -------------------------------------------------------------------------------- /xml/generate-xml.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # NP: do we need a python script for that ? 4 | 5 | INCLUDES="-I ./ -I /usr/local/cuda/include" 6 | 7 | python createbindings.py -o cuda -H my_CUDA2100_vector_types.h -H cuda.h -l /usr/lib/libcuda.so $INCLUDES -x cudadrv.xml -p cudadrv.py && \ 8 | python cudadrv.py && cp -vf cudadrv.py ../cuda/cu/ 9 | #python cudadrv2.py 10 | 11 | python createbindings.py -o cuda -H my_CUDA2100_vector_types.h -H cuda_runtime.h -l /usr/local/cuda/lib/libcudart.so $INCLUDES -x cudart.xml -p cudart.py && \ 12 | python cudart.py && cp -vf cudart.py ../cuda/cuda/ 13 | #python cudart2.py 14 | 15 | python createbindings.py -o cuda -H my_CUDA2100_vector_types.h -H cublas.h -l /usr/local/cuda/lib/libcublas.so $INCLUDES -x cublas.xml -p cublas.py && \ 16 | python cublas.py && cp -vf cublas.py ../cuda/cublas/ 17 | #python cublas2.py 18 | 19 | python createbindings.py -o cuda -H my_CUDA2100_vector_types.h -H cufft.h -l /usr/local/cuda/lib/libcufft.so $INCLUDES -x cufft.xml -p cufft.py && \ 20 | python cufft.py && cp -vf cufft.py ../cuda/cufft/ 21 | #python cufft2.py 22 | 23 | #rm -vf cudadrv.xml cudadrv.py cudart.xml cudart.py cublas.xml cublas.py cufft.xml cufft.py 24 | 25 | -------------------------------------------------------------------------------- /cuda/utils/libutils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import ctypes 3 | import os 4 | import platform 5 | 6 | OSNAME = platform.system() 7 | 8 | def get_lib(name, cdll_opts = None): 9 | libname = None 
10 | if OSNAME == "Linux": 11 | libname = "lib" + name + ".so" 12 | elif OSNAME == "Darwin": 13 | libname = "lib" + name + ".dylib" 14 | elif OSNAME == "Windows": 15 | import _winreg as wreg 16 | reg = wreg.ConnectRegistry(None, wreg.HKEY_LOCAL_MACHINE) 17 | key = wreg.OpenKey(reg, r"SOFTWARE\NVIDIA Corporation\Installed Products\NVIDIA CUDA") 18 | cuda_bin = os.path.join(wreg.QueryValueEx(key, "InstallDir")[0],"bin") 19 | libname = os.path.join(cuda_bin, "%s.dll" % name) 20 | if name == "cuda": 21 | libname = "nvcuda.dll" 22 | lib = ctypes.windll.LoadLibrary( libname ) 23 | return lib 24 | if cdll_opts: 25 | lib = ctypes.CDLL(libname, cdll_opts) 26 | else: 27 | lib = ctypes.CDLL(libname) 28 | return lib 29 | 30 | if __name__ == "__main__": 31 | try: 32 | print "Loading libcuda..." 33 | get_lib("cuda") 34 | print "Test PASSED" 35 | except: 36 | print "Test FAILED" 37 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """ setuptools setup.py for python-cuda """ 5 | 6 | from ez_setup import use_setuptools 7 | use_setuptools(version='0.6c9') 8 | 9 | from setuptools import setup, find_packages 10 | 11 | setup( 12 | name = 'python-cuda', 13 | 14 | version = '2.1-0.0.1', 15 | 16 | packages = ['cuda', 17 | 'cuda.cu', 18 | 'cuda.cuda', 19 | 'cuda.cublas', 20 | 'cuda.cufft', 21 | 'cuda.sugar', 22 | 'cuda.sugar.memory', 23 | 'cuda.sugar.kernel', 24 | 'cuda.sugar.fft', 25 | 'cuda.sugar.blas', 26 | 'cuda.sugar.query', 27 | 'cuda.utils'], 28 | 29 | package_dir = {'cuda':'cuda'}, 30 | 31 | package_data = {'cuda.sugar.fft': ['*.cu'] }, 32 | 33 | install_requires=[ 34 | "numpy>=1.3.0", 35 | "scipy>=0.7.0", 36 | ], 37 | 38 | 39 | # author='', 40 | # author_email='', 41 | # url='', 42 | # description='Python bindings for CUDA 2.1 with numpy integration', 43 | # long_description = """ """, 44 | # 
download_url='', 45 | # license='?', 46 | # package_data = {} 47 | 48 | ) 49 | -------------------------------------------------------------------------------- /oldcode/misc/ctypes_extra.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from numpy import * 5 | from numpy.random import rand 6 | from ctypes_array import convert 7 | 8 | ao = addressof 9 | def fa(a,o,n,dtype=None): 10 | if dtype is None: 11 | t = a.__class__._type_ 12 | else: 13 | t = dtype 14 | s = sizeof(t) 15 | return (t*n).from_address(ao(a)+o*s) 16 | 17 | class x_float(c_float): 18 | pass 19 | 20 | b = convert(rand(10)) 21 | a = (x_float*10)(*b) 22 | z = (c_float*10)(*b) 23 | 24 | try: 25 | u = (c_float*2).from_address(ao(a[6])) 26 | su = sizeof(u._type_) 27 | print "0x%8.8x" % ao(u) 28 | print "%10.7f %10.7f" % (u[0],u[1]) 29 | except TypeError: 30 | print "x_float does not work" 31 | 32 | try: 33 | v = fa(a,6,2,c_float) 34 | sv = sizeof(v._type_) 35 | except TypeError: 36 | print "x_float does not work" 37 | 38 | try: 39 | w = fa(z,6,2) 40 | sw = sizeof(w._type_) 41 | except TypeError: 42 | print "c_float does not work" 43 | 44 | sz = sizeof(z._type_) 45 | for i in range(len(v)): 46 | print "0x%8.8x 0x%8.8x 0x%8.8x 0x%8.8x" % ( 47 | ao(a[i+6]),ao(v)+i*sv,ao(z)+(i+6)*sz,ao(w)+i*sw) 48 | print "%10.7f %10.7f %10.7f % 10.7f" % (a[i+6].value,v[i],z[i+6],w[i]) 49 | -------------------------------------------------------------------------------- /cuda/sugar/blas/sdot.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | 3 | from ctypes import c_float 4 | from time import time 5 | 6 | import cuda.cublas as cublas 7 | import cuda.cuda as cuda 8 | from cuda.sugar.memory import Linear 9 | 10 | import numpy 11 | from numpy.random import randn 12 | 13 | def gpu_sdot(a,b): 14 | assert a.size == b.size 15 | assert a.shape[0] 
== b.shape[1] 16 | cublas.cublasInit() 17 | cublas.cublasFree(0) 18 | d_X = Linear(a.shape).from_numpy(a) 19 | d_Y = Linear(b.shape).from_numpy(b) 20 | gpu_result = cublas.cublasSdot(a.shape[1], d_X.ref, 1, d_Y.ref, 1) 21 | cuda.cudaThreadSynchronize() 22 | cublas.cublasShutdown() 23 | return gpu_result 24 | 25 | def test(): 26 | vlength = 1024 27 | 28 | n2 = vlength*vlength 29 | 30 | h_X = randn(1,n2).astype('float32') 31 | h_Y = randn(1,n2).astype('float32') 32 | 33 | print "-"*80 34 | print "h_X:" 35 | print h_X 36 | print "-"*80 37 | 38 | print "-"*80 39 | print "h_Y:" 40 | print h_Y 41 | print "-"*80 42 | 43 | print "-"*80 44 | print numpy.dot(h_X,h_Y.transpose())[0][0] 45 | print "-"*80 46 | 47 | print "-"*80 48 | print "cublasSdot(d_X,d_Y):" 49 | print gpu_sdot(h_X, h_Y.transpose()) 50 | print "-"*80 51 | 52 | if __name__ == "__main__": 53 | test() 54 | -------------------------------------------------------------------------------- /oldcode/misc/kernelGL.cu: -------------------------------------------------------------------------------- 1 | extern "C" { 2 | __global__ void kernel1( 3 | float4* pos, unsigned int width, unsigned int height, float time) 4 | { 5 | unsigned int x = blockIdx.x*blockDim.x + threadIdx.x; 6 | unsigned int y = blockIdx.y*blockDim.y + threadIdx.y; 7 | 8 | // calculate uv coordinates 9 | float u = x / (float) width; 10 | float v = y / (float) height; 11 | u = u*2.0f - 1.0f; 12 | v = v*2.0f - 1.0f; 13 | 14 | // calculate simple sine wave pattern 15 | float freq = 4.0f; 16 | float w = sinf(u*freq + time) * cosf(v*freq + time) * 0.5f; 17 | 18 | // write output vertex 19 | pos[y*width+x] = make_float4(u, w, v, 1.0f); 20 | } 21 | 22 | __global__ void kernel2( 23 | float4* pos, unsigned int width, unsigned int height, float time) 24 | { 25 | unsigned int x = blockIdx.x*blockDim.x + threadIdx.x; 26 | unsigned int y = blockIdx.y*blockDim.y + threadIdx.y; 27 | 28 | // calculate uv coordinates 29 | float u = x / (float) width; 30 | float v = y 
/ (float) height; 31 | u = u*2.0f - 1.0f; 32 | v = v*2.0f - 1.0f; 33 | 34 | // calculate simple sine wave pattern 35 | float freq = 4.0f; 36 | float efac = .5f*exp(.5f*sin((u+v)*freq+time)); 37 | float w = (sinf(u*freq + time) + cosf(v*freq + time)) * efac; 38 | 39 | // write output vertex 40 | pos[y*width+x] = make_float4(u, w, v, 1.0f); 41 | } 42 | } 43 | -------------------------------------------------------------------------------- /tests/cufft/cufft_fft.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import numpy.fft 4 | from numpy.random import randn 5 | from numpy import allclose 6 | import cuda.sugar.fft 7 | 8 | def main(): 9 | print "-"*55 10 | print "-- --" 11 | print "-- python-cuda versions of numpy.fft.{fftn,ifftn} --" 12 | print "-- --" 13 | print "-"*55 14 | print ">>> Creating host signal..." 15 | 16 | try: 17 | size = int(sys.argv[1]) 18 | except Exception,e: 19 | size = 10 20 | 21 | print ">>> Signal Size = %s" % size 22 | 23 | numpy_array = randn(size).astype('complex64') 24 | numpy_array -= numpy_array.mean() 25 | numpy_array /= numpy_array.std() 26 | 27 | print ">>> Computing ffts on GPU (CUDA) ..." 28 | 29 | print "[*] Forward fft on gpu ..." 30 | fft_res = cuda.sugar.fft.fftn(numpy_array) 31 | 32 | print "[*] Inverse fft on gpu ..." 33 | ifft_res = cuda.sugar.fft.ifftn(fft_res) 34 | 35 | print ">>> Computing references on CPU (numpy) ..." 36 | 37 | print "[*] Forward fft on cpu ..." 38 | forward_ref = numpy.fft.fftn(numpy_array) 39 | 40 | print "[*] Inverse fft on cpu ..." 
41 | inverse_ref = numpy.fft.ifftn(forward_ref) 42 | 43 | print "l2norm fft: ", numpy.linalg.norm(fft_res - forward_ref) 44 | 45 | print "l2norm ifft: ", numpy.linalg.norm(ifft_res - inverse_ref) 46 | 47 | if __name__ == "__main__": 48 | main() 49 | -------------------------------------------------------------------------------- /oldcode/misc/devinfo_cu.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | 4 | from ctypes import * 5 | from cuda.cu_defs import CUdevprop 6 | from cuda.cu_api import * 7 | 8 | if __name__ == "__main__": 9 | print "+-----------------------+" 10 | print "| CUDA Device Info |" 11 | print "| using CUDA driver API |" 12 | print "+-----------------------+\n" 13 | cuInit(0) 14 | count = c_int() 15 | cuDeviceGetCount(byref(count)) 16 | device = CUdevice() 17 | name = (c_char*256)() 18 | cuDeviceGet(byref(device),0) 19 | cuDeviceGetName(name,256,device) 20 | memsize = c_uint() 21 | cuDeviceTotalMem(byref(memsize),device) 22 | major,minor = c_int(),c_int() 23 | cuDeviceComputeCapability(byref(major),byref(minor),device) 24 | props = CUdevprop() 25 | cuDeviceGetProperties(byref(props),device) 26 | 27 | cuContext = CUcontext() 28 | cuCtxCreate(byref(cuContext),0,device) 29 | free,total = c_uint(),c_uint() 30 | cuMemGetInfo(byref(free),byref(total)) 31 | free = free.value 32 | cuCtxDetach(cuContext) 33 | 34 | print "%-19s = %d" % ("number of devices",count.value) 35 | print "%-19s = %s" % ("device name",name.value) 36 | print "%-19s = %.f MB" % ("memory size",memsize.value/1024.**2) 37 | print "%-19s = %.f MB" % ("memory free",free/1024.**2) 38 | print "%-19s = %.f MHz" % ("clock rate",props.clockRate/1000.) 
39 | print "%-19s = %d" % ("major",major.value) 40 | print "%-19s = %d" % ("minor",minor.value) 41 | print 21*"-" 42 | print props 43 | -------------------------------------------------------------------------------- /cuda/sugar/kernel/tests/matrix_mul.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | from cuda.sugar.memory import Linear 4 | from cuda.sugar.kernel.kernelfactoryrt import SourceModule 5 | from cuda.cuda import dim3 6 | 7 | #from IPython.Shell import IPShellEmbed 8 | #ipshell = IPShellEmbed(argv=[]) 9 | 10 | 11 | BLOCK_SIZE = 16 12 | # Matrix A width 13 | WA = (3 * BLOCK_SIZE) 14 | # Matrix A height 15 | HA = (5 * BLOCK_SIZE) 16 | # Matrix B width 17 | WB = (8 * BLOCK_SIZE) 18 | # Matrix B height 19 | HB = WA 20 | # Matrix C width 21 | WC = WB 22 | # Matrix C height 23 | HC = HA 24 | 25 | matrixMul = SourceModule(open('matrix_mul_kernel.cu','r').read()) 26 | 27 | nA = np.random.random(size=(HA, WA)).astype(np.float32) 28 | nB = np.random.random(size=(HB, WB)).astype(np.float32) 29 | 30 | print 'Allocating arrays' 31 | dA = Linear(nA.shape).from_numpy(nA) 32 | dB = Linear(nB.shape).from_numpy(nB) 33 | dC = Linear((HC,WC)) 34 | 35 | print 'Calling kernel' 36 | grid = dim3(WC // BLOCK_SIZE, HC // BLOCK_SIZE, 1) 37 | block = dim3(BLOCK_SIZE, BLOCK_SIZE, 1) 38 | Mul = matrixMul.matrixMul(grid, block) 39 | Mul(dC.ref, dA.ref, dB.ref, WA, WB) 40 | 41 | print 'Collecting results' 42 | nC = dC.to_numpy() 43 | nC = nC.reshape((HC, WC)) # reshape returns a new array; keep the result 44 | 45 | print 'Freeing data' 46 | dA._free() 47 | dB._free() 48 | dC._free() 49 | 50 | print 'Calculating error' 51 | print 52 | goldC = np.dot(nA, nB) 53 | err = nC - goldC 54 | print 'L2 err: %r' % np.linalg.norm(err, 2) 55 | print 'L1 err: %r' % np.linalg.norm(err, 1) 56 | print 'Linf err: %r' % np.linalg.norm(err, np.inf) 57 | print 'Lfro err: %r' % np.linalg.norm(err, 'fro') 58 | -------------------------------------------------------------------------------- 
/cuda/sugar/blas/saxpy.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | from time import time 3 | from ctypes import cast,c_float, POINTER 4 | 5 | from numpy import empty_like,dot 6 | from numpy.random import randn 7 | 8 | from cuda.cublas import * 9 | from cuda.cuda import cudaThreadSynchronize 10 | from cuda.sugar.memory import Linear 11 | 12 | def embed_ipython(): 13 | from IPython.Shell import IPShellEmbed 14 | ipshell = IPShellEmbed(user_ns = dict()) 15 | ipshell() 16 | 17 | def cpu_saxpy(a,b, alpha): 18 | return (alpha*a+b) 19 | 20 | def gpu_saxpy(a,b,alpha): 21 | # init cublas lib 22 | cublasInit() 23 | 24 | # allocate device vectors from host 25 | d_X = Linear(a.shape).from_numpy(a) 26 | d_Y = Linear(b.shape).from_numpy(b) 27 | 28 | # execute cublasSaxpy and sync threads 29 | cublasSaxpy(a.shape[1],alpha,d_X.ref,1,d_Y.ref,1) 30 | cudaThreadSynchronize() 31 | 32 | return d_Y.to_numpy() 33 | 34 | def test(): 35 | vlength = 8192 36 | alpha = 1 37 | 38 | # allocate host vectors 39 | h_X = randn(1,vlength).astype('float32') 40 | h_Y = randn(1,vlength).astype('float32') 41 | 42 | print "-"*80 43 | print 'h_X:' 44 | print h_X 45 | print "-"*80 46 | 47 | print "-"*80 48 | print 'h_Y:' 49 | print h_Y 50 | print "-"*80 51 | 52 | print "-"*80 53 | print 'CPU RESULT:' 54 | print cpu_saxpy(h_X,h_Y,alpha) 55 | print "-"*80 56 | 57 | print "-"*80 58 | print 'GPU RESULT:' 59 | print gpu_saxpy(h_X, h_Y, alpha) 60 | print "-"*80 61 | 62 | if __name__ == "__main__": 63 | test() 64 | -------------------------------------------------------------------------------- /oldcode/misc/simple.cubin: -------------------------------------------------------------------------------- 1 | architecture {sm_11} 2 | abiversion {0} 3 | modname {cubin} 4 | sampler { 5 | name = Arg 6 | texunit = 0 7 | } 8 | code { 9 | name = TRIG 10 | lmem = 0 11 | smem = 32 12 | reg = 6 13 | bar = 0 14 | bincode { 15 | 0x10000205 0x40004780 0xa0000005 
0x04000780 16 | 0x60014c05 0x00204780 0x3001cffd 0x6420c7c8 17 | 0x30000003 0x00000280 0x10000201 0x40004780 18 | 0x3002020d 0xc4100780 0x3002ce05 0xc4300780 19 | 0x41002810 0x2103ec00 0x2101ec04 0x2103e808 20 | 0x2000ca0d 0x0420c780 0x30020811 0xc4100780 21 | 0xd00e0015 0x80c00780 0xb0000a15 0xc0000780 22 | 0x90000a15 0xa0000780 0xd00e0415 0xa0c00780 23 | 0xd00e0015 0x80c00780 0xb0000a15 0xc0000780 24 | 0x90000a15 0x80000780 0x20000001 0x04010780 25 | 0xd00e0615 0xa0c00780 0x300101fd 0x640047c8 26 | 0x20048408 0x2004860c 0x1000c003 0x00000280 27 | 0xf0000001 0xe0000001 28 | } 29 | } 30 | code { 31 | name = TRIGTex 32 | lmem = 0 33 | smem = 32 34 | reg = 9 35 | bar = 0 36 | bincode { 37 | 0x10000205 0x40004780 0xa0000005 0x04000780 38 | 0x60014c05 0x00204780 0x3001cdfd 0x6420c7c8 39 | 0x30000003 0x00000280 0x10000201 0x40004780 40 | 0x30020211 0xc4100780 0xa0017003 0000000000 41 | 0x3002cc15 0xc4300780 0x41002808 0x2104e80c 42 | 0x2104ea10 0x2105e818 0x30020415 0xc4100780 43 | 0x10000201 0x0403c780 0xf3000001 0x00000784 44 | 0xb000001d 0xc0000780 0x90000e21 0xa0000780 45 | 0xd00e0621 0xa0c00780 0x90000e1d 0x80000780 46 | 0x2000060d 0x04014780 0xd00e081d 0xa0c00780 47 | 0x300607fd 0x640047c8 0x20028204 0x20058810 48 | 0x1000c003 0x00000280 0x00000e01 0xe4200782 49 | 0xf0000001 0xe0000001 50 | } 51 | } 52 | -------------------------------------------------------------------------------- /tests/cufft/todo/bfft.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | 4 | from math import log 5 | 6 | import sys,time 7 | from ctypes import c_float 8 | 9 | import xfft 10 | from cpuFunctions import arrayInit,checkError 11 | 12 | try: 13 | kr = int(sys.argv[1]) 14 | dims = tuple([int(x) for x in sys.argv[2].split(",")]) 15 | except IndexError: 16 | sys.exit() 17 | 18 | doComplex = False 19 | if kr < 0: 20 | kr = -kr 21 | doComplex = True 22 | 23 | size = reduce(lambda 
x,y:x*y,dims) 24 | if doComplex: 25 | r = (c_float*(size*2))() 26 | else: 27 | r = (c_float*size)() 28 | arrayInit(r) 29 | 30 | sz = 1.e6/float(size) 31 | 32 | fftw_start = time.clock() 33 | wall_start = time.time() 34 | 35 | 36 | xr = float(.5 )/float(kr) 37 | 38 | if doComplex: 39 | text = "complex" 40 | rcfftx = xfft.ccfft 41 | crfftx = xfft.icfft 42 | else: 43 | text = " real" 44 | rcfftx = xfft.rcfft 45 | crfftx = xfft.crfft 46 | for k in range(0,kr): 47 | c = rcfftx(r,dims) 48 | z = crfftx(c,dims) 49 | 50 | fftw_end = time.clock() 51 | wall_end = time.time() 52 | 53 | dif = fftw_end - fftw_start 54 | wif = wall_end - wall_start 55 | print "\nfft elapsed real time : %8.3f seconds" % wif 56 | print "%d-D %s-to-complex fft: %8.3f seconds" % (len(dims),text,dif*xr) 57 | 58 | flops = 2.*5.e-9*log(size)*size*kr/log(2.) 59 | print "Performance : %8.3f GFlops" % (flops/wif) 60 | dif = dif * xr * sz 61 | print "%d-D %s-to-complex fft: %8.3f µs/point\n" % (len(dims),text,dif) 62 | 63 | rz = 1./size 64 | err,mxe = checkError(r,z) 65 | print "avg and max error : %8.1e %8.1e" % (err,mxe) 66 | -------------------------------------------------------------------------------- /tests/cu/todo/cu_gflops.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cu.cu_defs import * 7 | from cu.cu_api import * 8 | from utils.cu_utils import * 9 | 10 | from cpuFunctions import cpuGFLOPS 11 | 12 | BLOCK_SIZE_C = 192 13 | ITERATIONS_C = 512 14 | 15 | BLOCK_SIZE_G = 512 16 | GRID_SIZE_G = 512 17 | ITERATIONS_G = 512 18 | 19 | def main(device,loops = 1): 20 | 21 | gpuGFLOPS = device.functions["gpuGFLOPS"] 22 | 23 | cuFuncSetBlockShape(gpuGFLOPS,BLOCK_SIZE_G,1,1) 24 | 25 | t0 = time() 26 | for i in range(loops): 27 | cuCtxSynchronize() 28 | cuLaunchGrid(gpuGFLOPS,GRID_SIZE_G,1) 29 | cuCtxSynchronize() 30 | t0 = time()-t0 31 | 32 | 
flopsc = 4096.*ITERATIONS_C*BLOCK_SIZE_C 33 | flopsg = 4096.*ITERATIONS_G*BLOCK_SIZE_G*GRID_SIZE_G 34 | 35 | flopsc *= 1.e-9*float(loops) 36 | flopsg *= 1.e-9*float(loops) 37 | 38 | t1 = time() 39 | for i in range(loops): 40 | cpuGFLOPS() 41 | t1 = time()-t1 42 | # peakg = 4.*8.*2.*1.458 # 4MP*8SP/MP*2flops/SP/clock*clock[GHz] (8600GTS) 43 | peakg = 14.*8.*2.*1.512 # 14MP*8SP/MP*2flops/SP/clock*clock[GHz] (9800GT) 44 | print "%8.3f%8.2f%8.3f%8.2f [%.2f]" % ( 45 | t1,flopsc/t1,t0,flopsg/t0,peakg) 46 | 47 | if __name__ == "__main__": 48 | import sys 49 | 50 | device = cu_CUDA() 51 | device.getSourceModule("gpuFunctions.cubin") 52 | device.getFunction("gpuGFLOPS") 53 | 54 | loops = 1 55 | if len(sys.argv) > 1: 56 | loops = int(sys.argv[1]) 57 | print "%5d" % (loops), 58 | main(device,loops) 59 | cuCtxDetach(device.context) 60 | -------------------------------------------------------------------------------- /cuda/sugar/blas/sgemm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from cuda.cuda import cudaThreadSynchronize 3 | from cuda.cublas import cublasInit, cublasShutdown, cublasSgemm 4 | from cuda.sugar.memory import Linear 5 | 6 | import numpy 7 | from numpy.random import randn 8 | 9 | def gpu_sgemm(a,b, alpha=1): 10 | """ Single Precision Matrix Multiplication on GPU, expects two, two-dimensional numpy arrays as input. Arrays must be such that a.shape[1] == b.shape[0]. Optionally specify alpha for scalar multiplication""" 11 | # init cublas 12 | cublasInit() 13 | 14 | assert a.shape[1] == b.shape[0] 15 | 16 | c_shape = (a.shape[0], b.shape[1]) 17 | # allocate device matrices from host 18 | dA = Linear(a.shape, order='F').from_numpy(a) 19 | dB = Linear(b.shape, order='F').from_numpy(b) 20 | dC = Linear(c_shape, order='F') 21 | 22 | # transpose a/b ? 
t = yes, n = no 23 | transa = 'n' 24 | transb = 'n' 25 | 26 | # compute with CUBLAS 27 | cublasSgemm( transa, transb, a.shape[0], b.shape[1], a.shape[1], alpha, dA.ref, a.shape[0], dB.ref, b.shape[0], 0, dC.ref, a.shape[0] ) 28 | cudaThreadSynchronize() 29 | # shutdown 30 | cublasShutdown() 31 | return dC.to_numpy() 32 | 33 | 34 | 35 | def test(): 36 | # Number of rows of A 37 | N = 2 38 | 39 | # allocate host matrices; A is (N,3) so that A.shape[1] == B.shape[0] 40 | A = randn(N,3).astype('float32') 41 | B = randn(3,5).astype('float32') 42 | 43 | # compute the cpu reference 44 | ref = numpy.dot(A,B) 45 | 46 | print '-'*80 47 | print ref 48 | print '-'*80 49 | 50 | print '-'*80 51 | print gpu_sgemm(A,B) 52 | print '-'*80 53 | 54 | 55 | if __name__ == "__main__": 56 | test() 57 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_gflops.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cuda.cuda_defs import * 7 | from cuda.cuda_api import * 8 | 9 | from cpuFunctions import cpuGFLOPS 10 | from gpuFunctions import gpuGFLOPS 11 | 12 | BLOCK_SIZE_C = 192 13 | ITERATIONS_C = 512 14 | 15 | BLOCK_SIZE_G = 512 16 | GRID_SIZE_G = 512 17 | ITERATIONS_G = 512 18 | S4 = sizeof(c_float) 19 | 20 | # This is SUBSTANTIALLY slower than cu_gflops.py. Why? 21 | # Looping about 50 times almost as fast as cu_gflops.py.
22 | 23 | def main(loops = 1): 24 | 25 | blockDim = dim3(BLOCK_SIZE_G,1,1) 26 | gridDim = dim3(GRID_SIZE_G,1,1) 27 | 28 | t0 = time() 29 | cudaThreadSynchronize() 30 | for i in range(loops): 31 | cudaConfigureCall(gridDim,blockDim,0,0) 32 | gpuGFLOPS() 33 | cudaThreadSynchronize() 34 | t0 = time()-t0 35 | cudaThreadExit() 36 | 37 | flopsc = 4096.*ITERATIONS_C*BLOCK_SIZE_C 38 | flopsg = 4096.*ITERATIONS_G*BLOCK_SIZE_G*GRID_SIZE_G 39 | flopsc *= 1.e-9*float(loops) 40 | flopsg *= 1.e-9*float(loops) 41 | 42 | t1 = time() 43 | for i in range(loops): 44 | cpuGFLOPS() 45 | t1 = time()-t1 46 | # peakg = 4.*8.*2.*1.458 # 4MP*8SP/MP*2flops/SP/clock*clock[GHz] (8600GTS) 47 | peakg = 14.*8.*2.*1.512 # 14MP*8SP/MP*2flops/SP/clock*clock[GHz] (9800GT) 48 | print "%8.3f%8.2f%8.3f%8.2f [%.2f]" % (t1,flopsc/t1,t0,flopsg/t0,peakg) 49 | print "%8.3f%8.2f" % (flopsc/t1*2.8,flopsg/t0*1.512/112) 50 | 51 | if __name__ == "__main__": 52 | import sys 53 | 54 | cudaSetDevice(0) 55 | 56 | loops = 1 57 | if len(sys.argv) > 1: 58 | loops = int(sys.argv[1]) 59 | print "%5d" % (loops), 60 | main(loops) 61 | -------------------------------------------------------------------------------- /tests/cufft/todo/dfft.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | from ctypes import c_int,c_void_p 3 | from ctypes import CDLL,POINTER,RTLD_GLOBAL 4 | 5 | ## This version is for 64-bit floats 6 | 7 | _ci = c_int 8 | _cip = POINTER(c_int) 9 | _cvp = c_void_p 10 | 11 | cc = CDLL("/usr/lib/libfftw.so",mode=RTLD_GLOBAL) 12 | cr = CDLL("/usr/lib/librfftw.so") 13 | 14 | fftw_plan = _ci 15 | 16 | ##extern [r]fftwnd_plan fftwnd_create_plan( 17 | ## int rank, const int *n, 18 | ## fftw_direction dir, int flags); 19 | ## 20 | ## all plans have the same signature 21 | 22 | CreatePlan_c = cc.fftwnd_create_plan 23 | CreatePlan_c.restype = fftw_plan 24 | CreatePlan_c.argtypes = [ _ci, _cip, _ci, _ci ] 25 | 26 | CreatePlan_r = 
cr.rfftwnd_create_plan 27 | CreatePlan_r.restype = fftw_plan 28 | CreatePlan_r.argtypes = [ _ci, _cip, _ci, _ci ] 29 | 30 | ##extern void [r]fftwnd_destroy_plan(rfftwnd_plan plan); 31 | DestroyPlan_c = cc.fftwnd_destroy_plan 32 | DestroyPlan_c.restype = None 33 | DestroyPlan_c.argtypes = [ _ci ] 34 | 35 | DestroyPlan_r = cr.rfftwnd_destroy_plan 36 | DestroyPlan_r.restype = None 37 | DestroyPlan_r.argtypes = [ _ci ] 38 | 39 | ##extern void fftwnd_one(fftwnd_plan p, 40 | ## fftw_complex *in, fftw_complex *out); 41 | Execute_c2c = cc.fftwnd_one 42 | Execute_c2c.restype = None 43 | Execute_c2c.argtypes = [ _ci, _cvp, _cvp ] 44 | 45 | ##extern void rfftwnd_one_real_to_complex(rfftwnd_plan p, 46 | ## fftw_real *in, fftw_complex *out); 47 | ##extern void rfftwnd_one_complex_to_real(rfftwnd_plan p, 48 | ## fftw_complex *in, fftw_real *out); 49 | Execute_r2c = cr.rfftwnd_one_real_to_complex 50 | Execute_r2c.restype = None 51 | Execute_r2c.argtypes = [ _ci, _cvp, _cvp ] 52 | 53 | Execute_c2r = cr.rfftwnd_one_complex_to_real 54 | Execute_c2r.restype = None 55 | Execute_c2r.argtypes = [ _ci, _cvp, _cvp ] 56 | -------------------------------------------------------------------------------- /tests/test_cublas.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | 3 | import numpy as np 4 | from cuda.sugar.memory import Linear 5 | import cuda.sugar.blas as blas 6 | 7 | class TestCublas: 8 | 9 | def embed_ipython(): 10 | from IPython.Shell import IPShellEmbed 11 | ipshell = IPShellEmbed(user_ns = dict()) 12 | ipshell() 13 | 14 | def cpu_saxpy(self, a, b, alpha): 15 | return (alpha*a+b) 16 | 17 | def test_saxpy(self): 18 | vlength = 8192 19 | alpha = 1 20 | a = np.random.randn(1,vlength).astype('float32') 21 | b = np.random.randn(1,vlength).astype('float32') 22 | cpu_result = self.cpu_saxpy(a,b,alpha) 23 | gpu_result = blas.gpu_saxpy(a,b,alpha) 24 | 25 | print cpu_result 26 | print gpu_result 27 | 28 | assert 
np.allclose(cpu_result, gpu_result) == True 29 | 30 | def test_sdot(self): 31 | vlength = 1024 32 | n2 = vlength*vlength 33 | a = np.random.randn(1,n2).astype('float32') 34 | b = np.random.randn(1,n2).astype('float32') 35 | cpu_result = np.dot(a,b.transpose())[0][0] 36 | gpu_result = blas.gpu_sdot(a, b.transpose()) 37 | 38 | print cpu_result 39 | print gpu_result 40 | 41 | assert np.allclose([cpu_result], [gpu_result], atol=1e-1) == True 42 | 43 | def test_sgemm(self): 44 | M=7; N=5; P=3; 45 | a = np.random.randn(M,N).astype('float32') 46 | b = np.random.randn(N,P).astype('float32') 47 | cpu_result = np.dot(a,b) 48 | gpu_result = blas.gpu_sgemm(a,b) 49 | print cpu_result 50 | print gpu_result 51 | assert np.allclose(cpu_result, gpu_result) 52 | 53 | if __name__ == "__main__": 54 | test_cublas = TestCublas() 55 | test_cublas.test_saxpy() 56 | test_cublas.test_sdot() 57 | test_cublas.test_sgemm() 58 | -------------------------------------------------------------------------------- /tests/cufft/todo/sfft.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | # NP: remove absolute paths (argnnnnn!) 
3 | 4 | from ctypes import c_int,c_void_p 5 | from ctypes import CDLL,POINTER,RTLD_GLOBAL 6 | 7 | ## This version is for 32-bit floats 8 | 9 | _ci = c_int 10 | _cip = POINTER(c_int) 11 | _cvp = c_void_p 12 | 13 | cc = CDLL("libsfftw.so",mode=RTLD_GLOBAL) 14 | cr = CDLL("libsrfftw.so") 15 | 16 | fftw_plan = _ci 17 | 18 | ##extern [r]fftwnd_plan fftwnd_create_plan( 19 | ## int rank, const int *n, 20 | ## fftw_direction dir, int flags); 21 | ## 22 | ## all plans have the same signature 23 | 24 | CreatePlan_c = cc.fftwnd_create_plan 25 | CreatePlan_c.restype = fftw_plan 26 | CreatePlan_c.argtypes = [ _ci, _cip, _ci, _ci ] 27 | 28 | CreatePlan_r = cr.rfftwnd_create_plan 29 | CreatePlan_r.restype = fftw_plan 30 | CreatePlan_r.argtypes = [ _ci, _cip, _ci, _ci ] 31 | 32 | ##extern void [r]fftwnd_destroy_plan(rfftwnd_plan plan); 33 | DestroyPlan_c = cc.fftwnd_destroy_plan 34 | DestroyPlan_c.restype = None 35 | DestroyPlan_c.argtypes = [ _ci ] 36 | 37 | DestroyPlan_r = cr.rfftwnd_destroy_plan 38 | DestroyPlan_r.restype = None 39 | DestroyPlan_r.argtypes = [ _ci ] 40 | 41 | ##extern void fftwnd_one(fftwnd_plan p, 42 | ## fftw_complex *in, fftw_complex *out); 43 | Execute_c2c = cc.fftwnd_one 44 | Execute_c2c.restype = None 45 | Execute_c2c.argtypes = [ _ci, _cvp, _cvp ] 46 | 47 | ##extern void rfftwnd_one_real_to_complex(rfftwnd_plan p, 48 | ## fftw_real *in, fftw_complex *out); 49 | ##extern void rfftwnd_one_complex_to_real(rfftwnd_plan p, 50 | ## fftw_complex *in, fftw_real *out); 51 | Execute_r2c = cr.rfftwnd_one_real_to_complex 52 | Execute_r2c.restype = None 53 | Execute_r2c.argtypes = [ _ci, _cvp, _cvp ] 54 | 55 | Execute_c2r = cr.rfftwnd_one_complex_to_real 56 | Execute_c2r.restype = None 57 | Execute_c2r.argtypes = [ _ci, _cvp, _cvp ] 58 | -------------------------------------------------------------------------------- /cuda/utils/logger.py: -------------------------------------------------------------------------------- 1 | # Setup logging globally (ie root logger) 2 
| import types 3 | import logging 4 | import logging.handlers 5 | import platform 6 | 7 | INFO_NO_NEWLINE = logging.INFO + 1 8 | 9 | class MultipleFormatHandler(logging.StreamHandler): 10 | 11 | formatters = { logging.INFO: logging.Formatter(">>> %(message)s\n"), 12 | INFO_NO_NEWLINE: logging.Formatter(">>> %(message)s"), 13 | logging.DEBUG: logging.Formatter("%(filename)s:%(lineno)d - %(levelname)s - %(message)s\n"), 14 | logging.WARN: logging.Formatter("%(filename)s:%(lineno)d - %(levelname)s - %(message)s\n"), 15 | logging.CRITICAL: logging.Formatter("%(filename)s:%(lineno)d - %(levelname)s - %(message)s\n"), 16 | logging.ERROR: logging.Formatter("%(filename)s:%(lineno)d - %(levelname)s - %(message)s\n")} 17 | 18 | def format(self,record): 19 | return self.formatters[record.levelno].format(record) 20 | 21 | def emit(self, record): 22 | try: 23 | msg = self.format(record) 24 | fs = "%s" 25 | if not hasattr(types, "UnicodeType"): #if no unicode support... 26 | self.stream.write(fs % msg) 27 | else: 28 | try: 29 | self.stream.write(fs % msg) 30 | except UnicodeError: 31 | self.stream.write(fs % msg.encode("UTF-8")) 32 | self.flush() 33 | except (KeyboardInterrupt, SystemExit): 34 | raise 35 | except: 36 | self.handleError(record) 37 | 38 | logger = logging.getLogger('python-cuda') 39 | logger.setLevel(logging.INFO) 40 | 41 | mfh = MultipleFormatHandler() 42 | logger.addHandler(mfh) 43 | 44 | if platform.system() == "Linux": 45 | syslog_handler = logging.handlers.SysLogHandler(address='/dev/log') 46 | formatter = logging.Formatter("%(filename)s:%(lineno)d - %(levelname)s - %(message)s\n") 47 | syslog_handler.setFormatter(formatter) 48 | logger.addHandler(syslog_handler) 49 | 50 | def enable_debug(): 51 | logger.setLevel(logging.DEBUG) 52 | -------------------------------------------------------------------------------- /oldcode/cufft/cufft_defs.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 
2007-08 2 | 3 | from ctypes import * 4 | 5 | #// CUFFT API function return values 6 | #typedef enum cufftResult_t { 7 | # CUFFT_SUCCESS = 0x0, 8 | # CUFFT_INVALID_PLAN = 0x1, 9 | # CUFFT_ALLOC_FAILED = 0x2, 10 | # CUFFT_INVALID_TYPE = 0x3, 11 | # CUFFT_INVALID_VALUE = 0x4, 12 | # CUFFT_INTERNAL_ERROR = 0x5, 13 | # CUFFT_EXEC_FAILED = 0x6, 14 | # CUFFT_SETUP_FAILED = 0x7, 15 | # CUFFT_INVALID_SIZE = 0x8 16 | #} cufftResult; 17 | 18 | cufftResult = c_int 19 | 20 | CUFFT_SUCCESS = 0x0 21 | CUFFT_INVALID_PLAN = 0x1 22 | CUFFT_ALLOC_FAILED = 0x2 23 | CUFFT_INVALID_TYPE = 0x3 24 | CUFFT_INVALID_VALUE = 0x4 25 | CUFFT_INTERNAL_ERROR = 0x5 26 | CUFFT_EXEC_FAILED = 0x6 27 | CUFFT_SETUP_FAILED = 0x7 28 | CUFFT_INVALID_SIZE = 0x8 29 | 30 | #// CUFFT defines and supports the following data types 31 | # 32 | #// cufftHandle is a handle type used to store and access CUFFT plans. 33 | #typedef unsigned int cufftHandle; 34 | # 35 | #// cufftReal is a single-precision, floating-point real data type. 36 | #typedef float cufftReal; 37 | # 38 | #// cufftComplex is a single-precision, floating-point complex data type that 39 | #// consists of interleaved real and imaginary components. 
40 | #typedef float cufftComplex[2]; 41 | 42 | cufftHandle = c_uint 43 | cufftReal = c_float 44 | cufftComplex = (c_float*2) 45 | 46 | cufftHandle_p = POINTER(cufftHandle) 47 | 48 | #// CUFFT transform directions 49 | ##define CUFFT_FORWARD -1 // Forward FFT 50 | ##define CUFFT_INVERSE 1 // Inverse FFT 51 | 52 | CUFFT_FORWARD = -1 ## Forward FFT 53 | CUFFT_INVERSE = 1 ## Inverse FFT 54 | 55 | #// CUFFT supports the following transform types 56 | #typedef enum cufftType_t { 57 | # CUFFT_R2C = 0x2a, // Real to Complex (interleaved) 58 | # CUFFT_C2R = 0x2c, // Complex (interleaved) to Real 59 | # CUFFT_C2C = 0x29 // Complex to Complex, interleaved 60 | #} cufftType; 61 | 62 | cufftType = c_int 63 | 64 | CUFFT_R2C = 0x2a ## Real to Complex (interleaved) 65 | CUFFT_C2R = 0x2c ## Complex (interleaved) to Real 66 | CUFFT_C2C = 0x29 ## Complex to Complex, interleaved 67 | -------------------------------------------------------------------------------- /oldcode/misc/mklMath.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | 3 | from ctypes import CDLL 4 | from math import * 5 | 6 | vml = CDLL("./_vector.so") 7 | 8 | vcos = vml.vsCos 9 | vcos.restype = None 10 | 11 | vsin = vml.vsSin 12 | vsin.restype = None 13 | 14 | vsincos = vml.vsSinCos 15 | vsincos.restype = None 16 | 17 | vexp = vml.vsExp 18 | vexp.restype = None 19 | 20 | vlog = vml.vsLn 21 | vlog.restype = None 22 | 23 | vlog10 = vml.vsLog10 24 | vlog10.restype = None 25 | 26 | vsqrt = vml.vsSqrt 27 | vsqrt.restype = None 28 | 29 | def cpuTRIG(h_Y,h_Z,h_X): 30 | size = len(h_X) 31 | if False: 32 | vcos(size,h_X,h_Y) 33 | vsin(size,h_X,h_Z) 34 | else: # about 20% faster 35 | vsincos(size,h_X,h_Z,h_Y) 36 | 37 | ##//////////////////////////////////////////////////////////////////////////// 38 | ## Shared CPU/GPU functions, performing calculations for single option by 39 | ## Black-Scholes formula. 
40 | ##//////////////////////////////////////////////////////////////////////////// 41 | A1 = 0.319381530 42 | A2 = -0.356563782 43 | A3 = 1.781477937 44 | A4 = -1.821255978 45 | A5 = 1.3302744290 46 | RSQRT2PI = 0.3989422804 47 | 48 | ##Polynomial approximation of cumulative normal distribution function 49 | def CND(d): 50 | K = 1.0 / (1.0 + 0.2316419 * abs(d)) 51 | 52 | cnd = RSQRT2PI * exp(-0.5 * d * d) * \ 53 | (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5))))) 54 | 55 | if d > 0: 56 | cnd = 1.0 - cnd 57 | 58 | return cnd 59 | 60 | ## Calculate Black-Scholes formula for both calls and puts 61 | ## S, ##Stock price 62 | ## X, ##Option strike 63 | ## T, ##Option years 64 | ## R, ##Riskless rate 65 | ## V ##Volatility rate 66 | def BlackScholesBody( 67 | S, X, T, R, V ): 68 | sqrtT = sqrt(T) 69 | d1 = (log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT) 70 | d2 = d1 - V * sqrtT 71 | 72 | CNDD1 = CND(d1) 73 | CNDD2 = CND(d2) 74 | 75 | ##Calculate Call and Put simultaneously 76 | expRT = exp(- R * T) 77 | CallResult = S * CNDD1 - X * expRT * CNDD2 78 | PutResult = X * expRT * (1.0 - CNDD2) - S * (1.0 - CNDD1) 79 | 80 | return CallResult,PutResult 81 | 82 | def cpuBLSC(h_C,h_P,h_S,h_X,h_T,R,V,size): 83 | for i in range(size): 84 | h_C[i],h_P[i] = BlackScholesBody(h_S[i],h_X[i],h_T[i],R,V) 85 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_saxpy.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cuda.cuda_defs import * 7 | from cuda.cuda_api import * 8 | from cuda.cuda_utils import * 9 | 10 | from cpuFunctions import fixedInit,cpuSAXPY,checkError 11 | from gpuFunctions import gpuSAXPY 12 | 13 | BLOCK_SIZE = 64 14 | GRID_SIZE = 256 15 | S4 = sizeof(c_float) 16 | checkErrorFlag = False 17 | 18 | def main(vlength = 128,loops = 1): 19 | 20 | alfa = 
c_float(.5) 21 | n2 = vlength ## Vector length 22 | 23 | h_X = (c_float*n2)() 24 | h_Y = (c_float*n2)() 25 | g_Y = (c_float*n2)() 26 | 27 | fixedInit(h_X) 28 | 29 | d_X = getMemory(h_X) 30 | d_Y = getMemory(h_Y) 31 | 32 | blockDim = dim3(BLOCK_SIZE,1,1) 33 | gridDim = dim3(GRID_SIZE,1,1) 34 | 35 | t0 = time() 36 | cudaThreadSynchronize() 37 | for i in range(loops): 38 | cudaConfigureCall(gridDim,blockDim,0,0) 39 | gpuSAXPY(alfa,d_X,d_Y,n2) 40 | cudaThreadSynchronize() 41 | t0 = time()-t0 42 | 43 | flops = (2.e-9*n2)*float(loops) 44 | g_Y = (c_float*n2)() 45 | cudaMemcpy(g_Y,d_Y,S4*n2,cudaMemcpyDeviceToHost) 46 | cudaThreadSynchronize() 47 | 48 | cudaFree(d_X) 49 | cudaFree(d_Y) 50 | 51 | cudaThreadExit() 52 | t1 = time() 53 | for i in range(loops): 54 | cpuSAXPY(alfa,h_X,h_Y) 55 | t1 = time()-t1 56 | print "%10d%6.2f%6.2f" % (vlength,flops/t1,flops/t0) 57 | 58 | if checkErrorFlag: 59 | err,mxe = checkError(h_Y,g_Y) 60 | print "Avg and max rel error = %.2e %.2e" % (err,mxe) 61 | 62 | if __name__ == "__main__": 63 | import sys 64 | 65 | cudaSetDevice(0) 66 | 67 | lmin,lmax = 7,24 68 | if len(sys.argv) > 1: 69 | lmin = lmax = int(sys.argv[1]) 70 | loopx = -1 71 | if len(sys.argv) > 2: 72 | loopx = int(sys.argv[2]) 73 | lmax = min(max(0,lmax),24) 74 | lmin = min(max(0,lmin),lmax) 75 | for l in range(lmin,lmax+1): 76 | if l < 10: 77 | loops = 25000 78 | elif l < 17: 79 | loops = 10000 80 | elif l < 21: 81 | loops = 250 82 | else: 83 | loops = 25 84 | vlength = 1 << l 85 | if loopx > 0: 86 | loops = loopx 87 | print "%5d %5d" % (l,loops), 88 | main(vlength,loops) 89 | -------------------------------------------------------------------------------- /oldcode/misc/sgemmN.log: -------------------------------------------------------------------------------- 1 | 2 | testing sgemm( 'N', 'N', n, n, n, ... 
) 3 | 4 | n CUBLAS,Gflop/s we,Gflop/s "error" 5 | 64 22.97 12.61 0 6 | 128 40.94 51.73 0 7 | 192 100.26 101.37 0 8 | 256 97.20 98.01 0 9 | 320 120.21 120.92 0 10 | 384 126.81 125.59 0 11 | 448 132.74 132.28 0 12 | 512 136.32 132.46 0 13 | 576 135.26 140.52 0 14 | 704 138.64 139.08 0 15 | 832 156.95 138.04 0 16 | 960 139.01 139.58 0 17 | 1088 141.96 172.23 0 18 | 1216 142.39 142.22 0 19 | 1408 142.33 143.02 0 20 | 1600 143.17 143.08 0 21 | 1792 143.69 192.88 0 22 | 1984 143.02 143.77 0 23 | 2240 164.13 143.80 0 24 | 2496 143.48 143.55 0 25 | 2816 161.37 176.05 0 26 | 3136 143.76 165.94 0 27 | 3520 185.63 180.48 0 28 | 3904 189.89 162.13 0 29 | 30 | testing sgemm( 'N', 'T', n, n, n, ... ) 31 | 32 | n CUBLAS,Gflop/s we,Gflop/s "error" 33 | 64 22.87 12.82 0 34 | 128 40.74 51.19 0 35 | 192 101.35 102.00 0 36 | 256 98.56 99.11 0 37 | 320 122.89 122.78 0 38 | 384 128.55 127.23 0 39 | 448 134.89 133.69 0 40 | 512 133.56 138.56 0 41 | 576 141.31 136.77 0 42 | 704 150.16 139.81 0 43 | 832 137.56 138.31 0 44 | 960 139.53 139.73 0 45 | 1088 142.25 148.80 0 46 | 1216 142.20 143.25 0 47 | 1408 142.83 142.88 0 48 | 1600 142.78 143.60 0 49 | 1792 142.23 141.73 0 50 | 1984 143.83 143.91 0 51 | 2240 186.56 170.33 0 52 | 2496 143.83 143.89 0 53 | 2816 157.43 173.14 0 54 | 3136 184.27 163.58 0 55 | 3520 180.04 177.53 0 56 | 3904 166.91 193.43 0 57 | -------------------------------------------------------------------------------- /tests/cufft/todo/gfft_cu.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | 3 | from cuda.cufft_api import * 4 | from cuda.cu_utils import * 5 | 6 | def makePlan(dims,kind): 7 | """ 8 | dims : tuple of array dimensions (1-3 els.) 
9 | kind : type of transform desired 10 | returns plan to be used by transform 11 | """ 12 | spln = "cufftPlan%dd(*args)" 13 | ndim = len(dims) 14 | plan = cufftHandle() 15 | args = (byref(plan),)+dims+(kind,) 16 | if ndim == 1: 17 | args = args+(1,) 18 | eval(spln % ndim) 19 | return plan 20 | 21 | def rcfft(plan,r,SC=None,c=None): 22 | """ 23 | plan : plan created by fftw 24 | r : real array to be transformed 25 | SC : size of output array 26 | c : complex array, result of transform 27 | """ 28 | if c is None: 29 | if SC is None: 30 | cufftDestroy(plan) 31 | raise ValueError("array size missing") 32 | c = getMemory(SC) 33 | cufftExecR2C(plan,r,c) 34 | return c 35 | 36 | def crfft(plan,c,SC=None,r=None): 37 | """ 38 | plan : plan created by fftw 39 | c : complex array to be transformed 40 | SC : size of output array 41 | r : real array, result of transform 42 | """ 43 | if r is None: 44 | if SC is None: 45 | cufftDestroy(plan) 46 | raise ValueError("array size missing") 47 | r = getMemory(SC) 48 | cufftExecC2R(plan,c,r) 49 | return r 50 | 51 | def ccfft(plan,c,SC=None,z=None): 52 | """ 53 | plan : plan created by fftw 54 | c : complex array to be transformed 55 | SC : size of output array 56 | z : complex array, result of transform 57 | """ 58 | if z is None: 59 | if SC is None: 60 | cufftDestroy(plan) 61 | raise ValueError("array size missing") 62 | z = getMemory(SC) 63 | cufftExecC2C(plan,c,z,CUFFT_FORWARD) 64 | return z 65 | 66 | def icfft(plan,z,SC=None,c=None): 67 | """ 68 | plan : plan created by fftw 69 | z : complex array to be transformed 70 | SC : size of output array 71 | c : complex array, result of transform 72 | """ 73 | if c is None: 74 | if SC is None: 75 | cufftDestroy(plan) 76 | raise ValueError("array size missing") 77 | c = getMemory(SC) 78 | cufftExecC2C(plan,z,c,CUFFT_INVERSE) 79 | return c 80 | 81 | #shortcuts 82 | 83 | fft = ccfft 84 | ifft = icfft 85 | -------------------------------------------------------------------------------- 
/tests/cufft/todo/gfft_cuda.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | 3 | from cuda.cufft_api import * 4 | from cuda.cuda_utils import * 5 | 6 | def makePlan(dims,kind): 7 | """ 8 | dims : tuple of array dimensions (1-3 els.) 9 | kind : type of transform desired 10 | returns plan to be used by transform 11 | """ 12 | spln = "cufftPlan%dd(*args)" 13 | ndim = len(dims) 14 | plan = cufftHandle() 15 | args = (byref(plan),)+dims+(kind,) 16 | if ndim == 1: 17 | args = args+(1,) 18 | eval(spln % ndim) 19 | return plan 20 | 21 | def rcfft(plan,r,SC=None,c=None): 22 | """ 23 | plan : plan created by fftw 24 | r : real array to be transformed 25 | SC : size of output array 26 | c : complex array, result of transform 27 | """ 28 | if c is None: 29 | if SC is None: 30 | cufftDestroy(plan) 31 | raise ValueError("array size missing") 32 | c = getMemory(SC) 33 | cufftExecR2C(plan,r,c) 34 | return c 35 | 36 | def crfft(plan,c,SC=None,r=None): 37 | """ 38 | plan : plan created by fftw 39 | c : complex array to be transformed 40 | SC : size of output array 41 | r : real array, result of transform 42 | """ 43 | if r is None: 44 | if SC is None: 45 | cufftDestroy(plan) 46 | raise ValueError("array size missing") 47 | r = getMemory(SC) 48 | cufftExecC2R(plan,c,r) 49 | return r 50 | 51 | def ccfft(plan,c,SC=None,z=None): 52 | """ 53 | plan : plan created by fftw 54 | c : complex array to be transformed 55 | SC : size of output array 56 | z : complex array, result of transform 57 | """ 58 | if z is None: 59 | if SC is None: 60 | cufftDestroy(plan) 61 | raise ValueError("array size missing") 62 | z = getMemory(SC) 63 | cufftExecC2C(plan,c,z,CUFFT_FORWARD) 64 | return z 65 | 66 | def icfft(plan,z,SC=None,c=None): 67 | """ 68 | plan : plan created by fftw 69 | z : complex array to be transformed 70 | SC : size of output array 71 | c : complex array, result of transform 72 | """ 73 | if c is None: 74 | 
if SC is None: 75 | cufftDestroy(plan) 76 | raise ValueError("array size missing") 77 | c = getMemory(SC) 78 | cufftExecC2C(plan,z,c,CUFFT_INVERSE) 79 | return c 80 | 81 | #shortcuts 82 | 83 | fft = ccfft 84 | ifft = icfft 85 | -------------------------------------------------------------------------------- /tests/cu/todo/cu_saxpy.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cu.cu_defs import * 7 | from cu.cu_api import * 8 | from utils.cu_utils import * 9 | 10 | from cpuFunctions import fixedInit,cpuSAXPY,checkError 11 | 12 | BLOCK_SIZE = 64 13 | GRID_SIZE = 256 14 | S4 = sizeof(c_float) 15 | checkErrorFlag = False 16 | 17 | def main(device,vlength = 128,loops = 1): 18 | 19 | alfa = c_float(.5) 20 | n2 = vlength ## Vector length 21 | gpuSAXPY = device.functions["gpuSAXPY"] 22 | 23 | h_X = (c_float*n2)() 24 | h_Y = (c_float*n2)() 25 | g_Y = (c_float*n2)() 26 | 27 | fixedInit(h_X) 28 | 29 | d_X = getMemory(h_X) 30 | d_Y = getMemory(h_Y) 31 | 32 | cuFuncSetBlockShape(gpuSAXPY,BLOCK_SIZE,1,1) 33 | cuParamSetf(gpuSAXPY,0,alfa) 34 | cuParamSeti(gpuSAXPY,4,d_X) 35 | cuParamSeti(gpuSAXPY,8,d_Y) 36 | cuParamSeti(gpuSAXPY,12,n2) 37 | cuParamSetSize(gpuSAXPY,16) 38 | 39 | cuCtxSynchronize() 40 | t0 = time() 41 | for i in range(loops): 42 | cuLaunchGrid(gpuSAXPY,GRID_SIZE,1) 43 | cuCtxSynchronize() 44 | t0 = time()-t0 45 | 46 | flops = (2.e-9*n2)*float(loops) 47 | cuMemcpyDtoH(g_Y,d_Y,n2*S4) 48 | cuCtxSynchronize() 49 | 50 | cuMemFree(d_X) 51 | cuMemFree(d_Y) 52 | 53 | t1 = time() 54 | for i in range(loops): 55 | cpuSAXPY(alfa,h_X,h_Y) 56 | t1 = time()-t1 57 | print "%10d%6.2f%6.2f" % (vlength,flops/t1,flops/t0) 58 | 59 | if checkErrorFlag: 60 | err,mxe = checkError(h_Y,g_Y) 61 | print "Avg and max rel error = %.2e %.2e" % (err,mxe) 62 | 63 | if __name__ == "__main__": 64 | import sys 65 | 66 | device = 
cu_CUDA() 67 | device.getSourceModule("gpuFunctions.cubin") 68 | device.getFunction("gpuSAXPY") 69 | 70 | lmin,lmax = 7,24 71 | if len(sys.argv) > 1: 72 | lmin = lmax = int(sys.argv[1]) 73 | loopx = -1 74 | if len(sys.argv) > 2: 75 | loopx = int(sys.argv[2]) 76 | lmax = min(max(0,lmax),24) 77 | lmin = min(max(0,lmin),lmax) 78 | for l in range(lmin,lmax+1): 79 | if l < 10: 80 | loops = 25000 81 | elif l < 17: 82 | loops = 10000 83 | elif l < 21: 84 | loops = 250 85 | else: 86 | loops = 25 87 | vlength = 1 << l 88 | if loopx > 0: 89 | loops = loopx 90 | print "%5d %5d" % (l,loops), 91 | main(device,vlength,loops) 92 | cuCtxDetach(device.context) 93 | -------------------------------------------------------------------------------- /tests/cu/todo/cu_add.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | #from cuda.cu_defs import * 7 | from cu.cu_defs import * 8 | #from cuda.cu_api import * 9 | from cu.cu_api import * 10 | #from cuda.cu_utils import * 11 | from utils.cu_utils import * 12 | 13 | from cpuFunctions import fixedInit,cpuVADD,checkError 14 | 15 | BLOCK_SIZE = 64 16 | GRID_SIZE = 256 17 | S4 = sizeof(c_float) 18 | checkErrorFlag = False 19 | 20 | def main(device,vlength = 128,loops = 1): 21 | 22 | n2 = vlength ## Vector length 23 | gpuVADD = device.functions["gpuVADD"] 24 | 25 | h_X = (c_float*n2)() 26 | h_Y = (c_float*n2)() 27 | g_Y = (c_float*n2)() 28 | 29 | fixedInit(h_X) 30 | 31 | d_X = getMemory(h_X) 32 | d_Y = getMemory(h_Y) 33 | 34 | cuFuncSetBlockShape(gpuVADD,BLOCK_SIZE,1,1) 35 | cuParamSeti(gpuVADD,0,d_X) 36 | cuParamSeti(gpuVADD,4,d_Y) 37 | cuParamSeti(gpuVADD,8,n2) 38 | cuParamSetSize(gpuVADD,12) 39 | 40 | cuCtxSynchronize() 41 | t0 = time() 42 | for i in range(loops): 43 | cuLaunchGrid(gpuVADD,GRID_SIZE,1) 44 | cuCtxSynchronize() 45 | t0 = time()-t0 46 | 47 | flops = (1.e-9*n2)*float(loops) 
48 | cuMemcpyDtoH(g_Y,d_Y,n2*S4) 49 | cuCtxSynchronize() 50 | 51 | cuMemFree(d_X) 52 | cuMemFree(d_Y) 53 | 54 | t1 = time() 55 | for i in range(loops): 56 | cpuVADD(h_X,h_Y) 57 | t1 = time()-t1 58 | print "%10d%6.2f%6.2f" % (vlength,flops/t1,flops/t0) 59 | 60 | if checkErrorFlag: 61 | err,mxe = checkError(h_Y,g_Y) 62 | print "Avg and max rel error = %.2e %.2e" % (err,mxe) 63 | 64 | if __name__ == "__main__": 65 | import sys 66 | 67 | device = cu_CUDA() 68 | device.getSourceModule("gpuFunctions.cubin") 69 | device.getFunction("gpuVADD") 70 | 71 | lmin,lmax = 7,24 72 | if len(sys.argv) > 1: 73 | lmin = lmax = int(sys.argv[1]) 74 | loopx = -1 75 | if len(sys.argv) > 2: 76 | loopx = int(sys.argv[2]) 77 | lmax = min(max(0,lmax),24) 78 | lmin = min(max(0,lmin),lmax) 79 | for l in range(lmin,lmax+1): 80 | if l < 10: 81 | loops = 25000 82 | elif l < 17: 83 | loops = 10000 84 | elif l < 21: 85 | loops = 250 86 | else: 87 | loops = 25 88 | vlength = 1 << l 89 | if loopx > 0: 90 | loops = loopx 91 | print "%5d %5d" % (l,loops), 92 | main(device,vlength,loops) 93 | cuCtxDetach(device.context) 94 | -------------------------------------------------------------------------------- /cuda/sugar/kernel/kernelfactorydrv.py: -------------------------------------------------------------------------------- 1 | import ctypes 2 | from cuda.cuda import * 3 | 4 | 5 | 6 | class KernelGetter(object): 7 | """ Wraps a ctypes CDLL instance for accessing CUDA kernels. 
8 | 9 | Example 10 | ------- 11 | from ctypes import cdll 12 | mykernels = KernelGetter(cdll.LoadLibrary('libmykernels.so')) 13 | mykernels.FastKernel(grid, block)(x, y) 14 | # Equivalent CUDA call: 15 | # FastKernel<<<grid, block>>>(x, y) 16 | """ 17 | 18 | def __init__(self, dll): 19 | raise NotImplementedError 20 | # self.dll = dll 21 | 22 | # def __getattr__(self, name): 23 | # mangled_name = '__device_stub_%s' % name 24 | # try: 25 | # funcptr = getattr(self.dll, mangled_name) 26 | # except AttributeError: 27 | # raise AttributeError("could not find kernel named %r in %r" % (name, self.dll)) 28 | 29 | # # Return a factory function that will create the Kernel object. 30 | # factory = lambda *args, **kwds: Kernel(funcptr, *args, **kwds) 31 | 32 | # return factory 33 | 34 | 35 | # class Kernel(object): 36 | # """ Configure a CUDA kernel. 37 | # """ 38 | 39 | # def __init__(self, funcptr, gridDim, blockDim, sharedMem=0, tokens=0): 40 | # # The function pointer to the kernel. 41 | # self.funcptr = funcptr 42 | 43 | # # The configuration parameters for the call. These are the arguments 44 | # # inside the <<<>>> brackets in CUDA. 45 | # self.gridDim = gridDim 46 | # self.blockDim = blockDim 47 | # self.sharedMem = sharedMem 48 | # self.tokens = tokens 49 | 50 | # # Delegate .restype and .argtypes attribute access to the underlying 51 | # # function pointer. 52 | # def _get_restype(self): 53 | # return self.funcptr.restype 54 | # def _set_restype(self, val): 55 | # self.funcptr.restype = val 56 | # restype = property(_get_restype, _set_restype) 57 | 58 | # def _get_argtypes(self): 59 | # return self.funcptr.argtypes 60 | # def _set_argtypes(self, val): 61 | # self.funcptr.argtypes = val 62 | # argtypes = property(_get_argtypes, _set_argtypes) 63 | 64 | 65 | # def __call__(self, *args): 66 | # """ Call the kernel as configured. 
67 | # """ 68 | # cudart.cudaConfigureCall(self.gridDim, self.blockDim, self.sharedMem, self.tokens) 69 | # self.funcptr(*args) 70 | # # Check to make sure we didn't get an error. 71 | # err = cudart.getLastError() 72 | # cudart._checkCudaStatus(err) 73 | -------------------------------------------------------------------------------- /oldcode/cufft/cufft_api.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | 4 | # XXX 5 | # -- 6 | # Code forked from python-cuda-2.0_42 © Arno Pähler, 2007-08 7 | # -- 8 | 9 | from cufft_defs import * 10 | from cuda.utils import libutils 11 | 12 | cufft = libutils.get_lib("cufft", RTLD_GLOBAL) 13 | 14 | #cufftResult CUFFTAPI cufftPlan1d(cufftHandle *plan, 15 | # int nx, 16 | # cufftType type, 17 | # int batch); 18 | cufftPlan1d = cufft.cufftPlan1d 19 | cufftPlan1d.restype = cufftResult 20 | cufftPlan1d.argtypes = [ cufftHandle_p, 21 | c_int, cufftType, c_int ] 22 | 23 | #cufftResult CUFFTAPI cufftPlan2d(cufftHandle *plan, 24 | # int nx, int ny, 25 | # cufftType type); 26 | cufftPlan2d = cufft.cufftPlan2d 27 | cufftPlan2d.restype = cufftResult 28 | cufftPlan2d.argtypes = [ cufftHandle_p, 29 | c_int, c_int, cufftType ] 30 | 31 | #cufftResult CUFFTAPI cufftPlan3d(cufftHandle *plan, 32 | # int nx, int ny, int nz, 33 | # cufftType type); 34 | cufftPlan3d = cufft.cufftPlan3d 35 | cufftPlan3d.restype = cufftResult 36 | cufftPlan3d.argtypes = [ cufftHandle_p, 37 | c_int, c_int, c_int, cufftType ] 38 | 39 | #cufftResult CUFFTAPI cufftDestroy(cufftHandle plan); 40 | cufftDestroy = cufft.cufftDestroy 41 | cufftDestroy.restype = cufftResult 42 | cufftDestroy.argtypes = [ cufftHandle ] 43 | 44 | #cufftResult CUFFTAPI cufftExecC2C(cufftHandle plan, 45 | # cufftComplex *idata, 46 | # cufftComplex *odata, 47 | # int direction); 48 | cufftExecC2C = cufft.cufftExecC2C 49 | cufftExecC2C.restype = cufftResult 50 | cufftExecC2C.argtypes = [ cufftHandle, c_uint, 
c_uint, c_int ] 51 | 52 | #cufftResult CUFFTAPI cufftExecR2C(cufftHandle plan, 53 | # cufftReal *idata, 54 | # cufftComplex *odata); 55 | cufftExecR2C = cufft.cufftExecR2C 56 | cufftExecR2C.restype = cufftResult 57 | cufftExecR2C.argtypes = [ cufftHandle, c_uint, c_uint ] 58 | 59 | #cufftResult CUFFTAPI cufftExecC2R(cufftHandle plan, 60 | # cufftComplex *idata, 61 | # cufftReal *odata); 62 | cufftExecC2R = cufft.cufftExecC2R 63 | cufftExecC2R.restype = cufftResult 64 | cufftExecC2R.argtypes = [ cufftHandle, c_uint, c_uint ] 65 | -------------------------------------------------------------------------------- /oldcode/misc/gpuFunctions.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | from ctypes import * 3 | 4 | cvp = c_void_p 5 | _cf = c_float 6 | _ci = c_int 7 | 8 | lib = CDLL("./libgpuFunctions.so") 9 | 10 | #__global__ void gpuGFLOPS() 11 | gpuGFLOPS = lib.__device_stub_gpuGFLOPS 12 | gpuGFLOPS.restype = None 13 | gpuGFLOPS.argtypes = [ ] 14 | 15 | #__global__ void gpuBLSC( 16 | #float *d_Calls, float *d_Puts, 17 | #float *d_S, float *d_X, float *d_T, 18 | #float R, float V, int OptN) 19 | gpuBLSC = lib.__device_stub_gpuBLSC 20 | gpuBLSC.restype = None 21 | gpuBLSC.argtypes = [ cvp, cvp, cvp, cvp, cvp, 22 | _cf, _cf, _ci ] 23 | 24 | #__global__ void gpuPOLY5( 25 | #float *d_In1, float *d_Out1, int size ) 26 | gpuPOLY5 = lib.__device_stub_gpuPOLY5 27 | gpuPOLY5.restype = None 28 | gpuPOLY5.argtypes = [ cvp, cvp, _ci ] 29 | 30 | #__global__ void gpuPOLY10( 31 | #float *d_In1, float *d_Out1, int size ) 32 | gpuPOLY10 = lib.__device_stub_gpuPOLY10 33 | gpuPOLY10.restype = None 34 | gpuPOLY10.argtypes = [ cvp, cvp, _ci ] 35 | 36 | #__global__ void gpuPOLY20( 37 | #float *d_In1, float *d_Out1, int size ) 38 | gpuPOLY20 = lib.__device_stub_gpuPOLY20 39 | gpuPOLY20.restype = None 40 | gpuPOLY20.argtypes = [ cvp, cvp, _ci ] 41 | 42 | #__global__ void gpuPOLY40( 43 | #float *d_In1, float 
*d_Out1, int size ) 44 | gpuPOLY40 = lib.__device_stub_gpuPOLY40 45 | gpuPOLY40.restype = None 46 | gpuPOLY40.argtypes = [ cvp, cvp, _ci ] 47 | 48 | #__global__ void gpuSAXPY( 49 | #float Factor, float *d_In1, float *d_In2, int size ) 50 | gpuSAXPY = lib.__device_stub_gpuSAXPY 51 | gpuSAXPY.restype = None 52 | gpuSAXPY.argtypes = [ _cf, cvp, cvp, _ci ] 53 | 54 | #__global__ void gpuVADD( 55 | #float *d_In1, float *d_In2, int size ) 56 | gpuVADD = lib.__device_stub_gpuVADD 57 | gpuVADD.restype = None 58 | gpuVADD.argtypes = [ cvp, cvp, _ci ] 59 | 60 | #__global__ void gpuSGEMM( 61 | #float* C, float* A, float* B, int wA, int wB ) 62 | gpuSGEMM = lib.__device_stub_gpuSGEMM 63 | gpuSGEMM.restype = None 64 | gpuSGEMM.argtypes = [ cvp, cvp, cvp, _ci, _ci ] 65 | 66 | #__global__ void gpuTRIG( 67 | #float *d_Out1, float *d_Out2, float *d_In1, int size ) 68 | gpuTRIG = lib.__device_stub_gpuTRIG 69 | gpuTRIG.restype = None 70 | gpuTRIG.argtypes = [ cvp, cvp, cvp, _ci ] 71 | 72 | #__global__ void gpuScale( 73 | #float *d_Out1, _F *d_In1, _F scale, int size ) 74 | gpuScale = lib.__device_stub_gpuScale 75 | gpuScale.restype = None 76 | gpuScale.argtypes = [ cvp, cvp, _cf, _ci ] 77 | 78 | #// for streams example 79 | #__global__ void init_array( 80 | #int *g_data, int *factor){ 81 | init_array = lib.__device_stub_init_array 82 | init_array.restype = None 83 | init_array.argtypes = [ cvp, cvp ] 84 | -------------------------------------------------------------------------------- /cuda/sugar/kernel/kernelfactoryrt.py: -------------------------------------------------------------------------------- 1 | import ctypes 2 | from cuda.cuda import * 3 | from cuda.sugar.kernel.compiler import compile 4 | 5 | class SourceModule(object): 6 | """ Wraps a ctypes CDLL instance for accessing CUDA kernels. 
7 | 8 | Example 9 | ------- 10 | from ctypes import cdll 11 | mykernels = KernelGetter(cdll.LoadLibrary('libmykernels.so')) 12 | mykernels.FastKernel(grid, block)(x, y) 13 | # Equivalent CUDA call: 14 | # FastKernel<<<grid, block>>>(x, y) 15 | """ 16 | def __init__(self, source, nvcc="nvcc", options=[], keep=False, 17 | no_extern_c=False, arch=None, code=None, cache_dir=None, 18 | include_dirs=[]): 19 | 20 | self.dll = compile(source, nvcc, options, keep, no_extern_c, 21 | arch, code, cache_dir, include_dirs) 22 | 23 | def __getattr__(self, name): 24 | mangled_name = '__device_stub_%s' % name 25 | try: 26 | funcptr = getattr(self.dll, mangled_name) 27 | except AttributeError: 28 | raise AttributeError("could not find kernel named %r in %r" % (name, self.dll)) 29 | 30 | # Return a factory function that will create the Kernel object. 31 | factory = lambda *args, **kwds: Kernel(funcptr, *args, **kwds) 32 | 33 | return factory 34 | 35 | class Kernel(object): 36 | """ Configure a CUDA kernel. 37 | """ 38 | 39 | def __init__(self, funcptr, gridDim, blockDim, sharedMem=0, tokens=0): 40 | # The function pointer to the kernel. 41 | self.funcptr = funcptr 42 | 43 | # The configuration parameters for the call. These are the arguments 44 | # inside the <<<>>> brackets in CUDA. 45 | self.gridDim = gridDim 46 | self.blockDim = blockDim 47 | self.sharedMem = sharedMem 48 | self.tokens = tokens 49 | 50 | # Delegate .restype and .argtypes attribute access to the underlying 51 | # function pointer. 52 | def _get_restype(self): 53 | return self.funcptr.restype 54 | def _set_restype(self, val): 55 | self.funcptr.restype = val 56 | restype = property(_get_restype, _set_restype) 57 | 58 | def _get_argtypes(self): 59 | return self.funcptr.argtypes 60 | def _set_argtypes(self, val): 61 | self.funcptr.argtypes = val 62 | argtypes = property(_get_argtypes, _set_argtypes) 63 | 64 | 65 | def __call__(self, *args): 66 | """ Call the kernel as configured. 
67 | """ 68 | cudart.cudaConfigureCall(self.gridDim, self.blockDim, self.sharedMem, self.tokens) 69 | self.funcptr(*args) 70 | # Check to make sure we didn't get an error. 71 | #err = cudart.getLastError() 72 | #cudart._checkCudaStatus(err) 73 | -------------------------------------------------------------------------------- /oldcode/misc/ctypes_array_test.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | 3 | from ctypes import * 4 | from ctypes_array import * 5 | from numpy import * 6 | from numpy.random import rand 7 | 8 | def pif(a): 9 | ai = a.__array_interface__ 10 | b = ai["strides"] is not None 11 | return " ".join([ 12 | "%-5s:" % str(b), 13 | ":",className(a), 14 | str(ai["strides"]), 15 | str(ai["shape"]),]) 16 | 17 | ac = rand(4,8) 18 | af = array(ac,order="F")#ac.T 19 | 20 | def norm2(a): 21 | return round(sqrt(sum(a*a)),3) 22 | 23 | print "\nOriginal arrays ac, af = ac.T" 24 | print "hasStrides(ac):",pif(ac) 25 | print "hasStrides(af):",pif(af) 26 | print "type ac,af",type(ac),type(af) 27 | 28 | print "\nConvert to ctypes: ac => bc, af => bf" 29 | bc = convert(ac); print "hasStrides(bc):",pif(bc) 30 | bf = convert(af); print "hasStrides(bf):",pif(bf) 31 | print "\ntype bc,bf",type(bc),type(bf) 32 | 33 | print "\nCan combine numpy arrays and ctypes objects with array interface" 34 | delta = (ac-bc).flatten() 35 | print "L2-norm ac-bc",norm2(delta) 36 | delta = (af-bf).flatten() 37 | print "L2-norm af-bf",norm2(delta) 38 | 39 | print "\nConvert to numpy: bc => cc, bf => cf, bf => CF (Fortran)" 40 | cc = convert(bc);print "hasStrides(cc):",pif(cc) 41 | cf = convert(bf);print "hasStrides(cf):",pif(cf) 42 | CF = convert(bf,order="F"); print "hasStrides(CF):",pif(CF) 43 | print "\ntype cc,cf,CF",type(cc),type(cf),type(CF) 44 | 45 | delta = (af-CF).flatten() 46 | print "L2-norm af-CF",norm2(delta) 47 | 48 | print "\nConvert to ctypes ac => dc, af => df (dc,df = 1D)" 49 | dc = 
(eval(typeName(bc))*ac.size)() 50 | df = (eval(typeName(bf))*af.size)() 51 | convert(ac,None,None,dc) 52 | convert(af,None,None,df) 53 | print "type dc,df",type(dc),type(df) 54 | 55 | delta = (ac-dc).flatten() 56 | print "L2-norm ac-dc",norm2(delta) 57 | delta = (af-df).flatten() 58 | print "L2-norm af-df",norm2(delta) 59 | delta = af.flatten()-ac.flatten() 60 | print "\ncomparing flattened ac,af" 61 | print "L2-norm af-ac",norm2(delta) 62 | set_printoptions(precision=3) 63 | print "\nac[:3],af[:3]" 64 | print ac.flatten()[:3] 65 | print af.flatten()[:3] 66 | print "\ndc[:3],df[:3]" 67 | print "[%6.3f %6.3f %6.3f]" % tuple(dc[:3]) 68 | print "[%6.3f %6.3f %6.3f]" % tuple(df[:3]) 69 | 70 | print "\nConvert to numpy: dc => ec, df => ef" 71 | ec = convert(dc,(4,8),"C") 72 | ef = convert(df,(4,8),"F") 73 | print "1D dc,df ctypes objects=> 2D numpy arrays ec,ef" 74 | print ; print "ac, ec" 75 | print pif(ac) 76 | print pif(ec) 77 | print ; print "af, ef" 78 | print pif(af) 79 | print pif(ef) 80 | print "\nL2-norm ac-ec, af-ef, ec-ef" 81 | print norm2(ac-ec) 82 | print norm2(af-ef) 83 | print norm2(ec-ef) 84 | print "\nL2-norm ac-ef, af-ec (flattened)" 85 | print norm2(ac.flatten()-ef.flatten()) 86 | print norm2(af.flatten()-ec.flatten()) 87 | 88 | 89 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_poly.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cuda.cuda_defs import * 7 | from cuda.cuda_api import * 8 | from cuda.cuda_utils import * 9 | 10 | from cpuFunctions import vectorInit,checkError 11 | from cpuFunctions import cpuPOLY5,cpuPOLY10,cpuPOLY20,cpuPOLY40 12 | from gpuFunctions import gpuPOLY5,gpuPOLY10,gpuPOLY20,gpuPOLY40 13 | 14 | BLOCK_SIZE = 144 15 | GRID_SIZE = 192 16 | checkErrorFlag = False 17 | 18 | S4 = sizeof(c_float) 19 | psize = 5 20 
| 21 | def main(vlength = 128,loops = 1,m1 = 1): 22 | print "%5d %5d %5d" % (l,loops,m1), 23 | 24 | alfa = c_float(.5) 25 | n2 = vlength ## Vector length 26 | 27 | mp = 1 << (m1-1) 28 | print "%5d" % (mp*psize), 29 | gpuPOLY = eval("gpuPOLY%d"%(mp*psize)) 30 | h_X = (c_float*n2)() 31 | h_Y = (c_float*n2)() 32 | g_Y = (c_float*n2)() 33 | 34 | vectorInit(h_X) 35 | 36 | d_X = getMemory(h_X) 37 | d_Y = getMemory(h_Y) 38 | 39 | blockDim = dim3(BLOCK_SIZE,1,1) 40 | gridDim = dim3(GRID_SIZE,1,1) 41 | 42 | t0 = time() 43 | cudaThreadSynchronize() 44 | for i in range(loops): 45 | cudaConfigureCall(gridDim,blockDim,0,0) 46 | gpuPOLY(d_X,d_Y,n2) 47 | cudaThreadSynchronize() 48 | t0 = time()-t0 49 | 50 | flops = (2.e-9*m1*n2*(psize-1))*float(loops) 51 | cudaMemcpy(g_Y,d_Y,S4*n2,cudaMemcpyDeviceToHost) 52 | cudaThreadSynchronize() 53 | 54 | cudaFree(d_X) 55 | cudaFree(d_Y) 56 | 57 | cudaThreadExit() 58 | cpuPOLY = eval("cpuPOLY%d" % (mp*psize)) 59 | t1 = time() 60 | for i in range(loops): 61 | cpuPOLY(h_X,h_Y) 62 | t1 = time()-t1 63 | print "%10d%6.2f%6.2f" % (vlength,flops/t1,flops/t0) 64 | 65 | if checkErrorFlag: 66 | err,mxe = checkError(h_Y,g_Y) 67 | print "Avg and max rel error = %.2e %.2e" % (err,mxe) 68 | 69 | if __name__ == "__main__": 70 | import sys 71 | 72 | cudaSetDevice(0) 73 | 74 | lmin,lmax = 7,23 75 | if len(sys.argv) > 1: 76 | lmin = lmax = int(sys.argv[1]) 77 | loopx = -1 78 | if len(sys.argv) > 2: 79 | loopx = int(sys.argv[2]) 80 | m1 = 4 81 | if len(sys.argv) > 3: 82 | m1 = min(4,int(sys.argv[3])) 83 | lmax = min(max(0,lmax),23) 84 | lmin = min(max(0,lmin),lmax) 85 | 86 | for l in range(lmin,lmax+1): 87 | if l < 10: 88 | loops = 10000/m1 89 | elif l < 13: 90 | loops = 5000/m1 91 | elif l < 17: 92 | loops = 500/m1 93 | elif l < 21: 94 | loops = 250/m1 95 | else: 96 | loops = 100/m1 97 | vlength = 1 << l 98 | if loopx > 0: 99 | loops = loopx 100 | main(vlength,loops,m1) 101 | -------------------------------------------------------------------------------- 
/tests/cuda/todo/cuda_trig.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cuda.cu_defs import * 7 | from cuda.cu_api import * 8 | from cuda.cuda_utils import * 9 | 10 | from cpuFunctions import vectorInit,checkError 11 | from gpuFunctions import gpuTRIG 12 | 13 | UseVML = True 14 | if UseVML: 15 | from mklMath import cpuTRIG 16 | else: 17 | from cpuFunctions import cpuTRIG 18 | 19 | BLOCK_SIZE = 128 20 | GRID_SIZE = 192 21 | checkErrorFlag = False 22 | 23 | S4 = sizeof(c_float) 24 | 25 | def main(vlength = 128,loops = 1): 26 | 27 | n2 = vlength ## Vector length 28 | 29 | h_X = (c_float*n2)() 30 | h_Y = (c_float*n2)() 31 | h_Z = (c_float*n2)() 32 | 33 | vectorInit(h_X) 34 | 35 | d_X = getMemory(h_X) 36 | d_Y = getMemory(h_Y) 37 | d_Z = getMemory(h_Z) 38 | 39 | blockDim = dim3(BLOCK_SIZE,1,1) 40 | gridDim = dim3(GRID_SIZE,1,1) 41 | 42 | t0 = time() 43 | cudaThreadSynchronize() 44 | for i in range(loops): 45 | cudaConfigureCall(gridDim,blockDim,0,0) 46 | gpuTRIG(d_Y,d_Z,d_X,n2) 47 | cudaThreadSynchronize() 48 | t0 = time()-t0 49 | 50 | flops = (2.e-9*n2)*float(loops) 51 | g_Y = (c_float*n2)() 52 | cudaMemcpy(g_Y,d_Y,S4*n2,cudaMemcpyDeviceToHost) 53 | cudaThreadSynchronize() 54 | 55 | flops = (8.e-9*n2)*float(loops) 56 | g_Y = (c_float*n2)() 57 | g_Z = (c_float*n2)() 58 | cudaMemcpy(g_Y,d_Y,S4*n2,cudaMemcpyDeviceToHost) 59 | cudaMemcpy(g_Z,d_Z,S4*n2,cudaMemcpyDeviceToHost) 60 | cudaThreadSynchronize() 61 | 62 | cudaFree(d_X) 63 | cudaFree(d_Y) 64 | cudaFree(d_Z) 65 | 66 | cudaThreadExit() 67 | t1 = time() 68 | for i in range(loops): 69 | cpuTRIG(h_Y,h_Z,h_X) 70 | t1 = time()-t1 71 | print "%10d%6.2f%6.2f GFlops" % (vlength,flops/t1,flops/t0) 72 | 73 | if checkErrorFlag: 74 | err,mxe = checkError(h_Y,g_Y,n2) 75 | print "Avg and max rel error (cos) = %.2e %.2e" % (err,mxe) 76 | err,mxe = 
checkError(h_Z,g_Z,n2) 77 | print "Avg and max rel error (sin) = %.2e %.2e" % (err,mxe) 78 | 79 | if __name__ == "__main__": 80 | import sys 81 | 82 | cudaSetDevice(0) 83 | 84 | lmin,lmax = 7,23 85 | if len(sys.argv) > 1: 86 | lmin = lmax = int(sys.argv[1]) 87 | lmax = min(max(0,lmax),23) 88 | lmin = min(max(0,lmin),lmax) 89 | for l in range(lmin,lmax+1): 90 | if l < 10: 91 | loops = 10000 92 | elif l < 13: 93 | loops = 2000 94 | elif l < 17: 95 | loops = 250 96 | elif l < 21: 97 | loops = 100 98 | else: 99 | loops = 50 100 | vlength = 1 << l 101 | print "%5d %5d" % (l,loops), 102 | main(vlength,loops) 103 | -------------------------------------------------------------------------------- /tests/cu/todo/cu_trig.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cu.cu_defs import * 7 | from cu.cu_api import * 8 | from utils.cu_utils import * 9 | 10 | from cpuFunctions import vectorInit,checkError 11 | 12 | UseVML = True 13 | if UseVML: 14 | from mklMath import cpuTRIG 15 | else: 16 | from cpuFunctions import cpuTRIG 17 | 18 | BLOCK_SIZE = 128 19 | GRID_SIZE = 192 20 | checkErrorFlag = False 21 | 22 | S4 = sizeof(c_float) 23 | 24 | def main(device,vlength = 128,loops = 1): 25 | 26 | n2 = vlength ## Vector length 27 | gpuTRIG = device.functions["gpuTRIG"] 28 | 29 | h_X = (c_float*n2)() 30 | h_Y = (c_float*n2)() 31 | h_Z = (c_float*n2)() 32 | 33 | vectorInit(h_X) 34 | 35 | d_X = getMemory(h_X) 36 | d_Y = getMemory(h_Y) 37 | d_Z = getMemory(h_Z) 38 | 39 | cuFuncSetBlockShape(gpuTRIG,BLOCK_SIZE,1,1) 40 | cuParamSeti(gpuTRIG,0,d_Y) 41 | cuParamSeti(gpuTRIG,4,d_Z) 42 | cuParamSeti(gpuTRIG,8,d_X) 43 | cuParamSeti(gpuTRIG,12,n2) 44 | cuParamSetSize(gpuTRIG,16) 45 | 46 | cuCtxSynchronize() 47 | t0 = time() 48 | for i in range(loops): 49 | cuLaunchGrid(gpuTRIG,GRID_SIZE,1) 50 | cuCtxSynchronize() 51 | t0 = time()-t0 
52 | 53 | flops = (8.e-9*n2)*float(loops) 54 | g_Y = (c_float*n2)() 55 | g_Z = (c_float*n2)() 56 | cuMemcpyDtoH(g_Y,d_Y,S4*n2) 57 | cuMemcpyDtoH(g_Z,d_Z,S4*n2) 58 | cuCtxSynchronize() 59 | 60 | cuMemFree(d_X) 61 | cuMemFree(d_Y) 62 | cuMemFree(d_Z) 63 | 64 | t1 = time() 65 | for i in range(loops): 66 | cpuTRIG(h_Y,h_Z,h_X) 67 | t1 = time()-t1 68 | print "%10d%6.2f%6.2f GFlops" % (vlength,flops/t1,flops/t0) 69 | 70 | if checkErrorFlag: 71 | err,mxe = checkError(h_Y,g_Y,n2) 72 | print "Avg and max rel error (cos) = %.2e %.2e" % (err,mxe) 73 | err,mxe = checkError(h_Z,g_Z,n2) 74 | print "Avg and max rel error (sin) = %.2e %.2e" % (err,mxe) 75 | 76 | if __name__ == "__main__": 77 | import sys 78 | 79 | device = cu_CUDA() 80 | device.getSourceModule("gpuFunctions.cubin") 81 | device.getFunction("gpuTRIG") 82 | 83 | lmin,lmax = 7,23 84 | if len(sys.argv) > 1: 85 | lmin = lmax = int(sys.argv[1]) 86 | lmax = min(max(0,lmax),23) 87 | lmin = min(max(0,lmin),lmax) 88 | for l in range(lmin,lmax+1): 89 | if l < 10: 90 | loops = 10000 91 | elif l < 13: 92 | loops = 2000 93 | elif l < 17: 94 | loops = 250 95 | elif l < 21: 96 | loops = 100 97 | else: 98 | loops = 50 99 | vlength = 1 << l 100 | print "%5d %5d" % (l,loops), 101 | main(device,vlength,loops) 102 | cuCtxDetach(device.context) 103 | -------------------------------------------------------------------------------- /cuda/sugar/memory/linear.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Array-like objects for CUDA.""" 3 | 4 | from cuda.cuda import * 5 | from cuda.cublas import * 6 | import numpy 7 | import ctypes 8 | from ctypes import * 9 | 10 | # cuda <-> dtype conversion 11 | cudaDtypes = {'float32': ctypes.c_float, 12 | 'int32': ctypes.c_int, 13 | 'complex64': ctypes.c_float*2, 14 | } 15 | 16 | class Linear(object): 17 | 18 | ref = property(fget=lambda self: self._get_ref()) 19 | 20 | def __init__(self, shape=None, dtype='float32', order=None): 21 
| self.shape = shape 22 | self.size = numpy.prod(shape) 23 | self.dtype = numpy.dtype(dtype) 24 | self.order = order 25 | self.ctype = self._convert_type(self.dtype) 26 | self.nbytes = self.size*ctypes.sizeof(self.ctype) 27 | self.allocated = False 28 | self.data = None 29 | self._alloc() 30 | 31 | def __del__(self): 32 | self._free() 33 | 34 | def _convert_type(self, dtype): 35 | ct = cudaDtypes.get(dtype.name, None) 36 | if ct is None: 37 | raise TypeError("Unsupported dtype") 38 | return ct 39 | 40 | def _get_ref(self): 41 | return cast(self.data,POINTER(self._convert_type(self.dtype))) 42 | 43 | def _alloc(self): 44 | self.data = c_void_p() 45 | cudaMalloc(byref(self.data), self.nbytes) 46 | self.allocated = True 47 | 48 | def _free(self): 49 | if self.allocated: 50 | cudaFree(self.data) 51 | self.data = None 52 | self.allocated = False 53 | 54 | def to_numpy(self, a=None): 55 | if not self.allocated: 56 | raise Exception("Must first allocate") 57 | if a is None: 58 | a = numpy.empty(self.shape, dtype=self.dtype, order=self.order) 59 | else: 60 | # Check that the given array is appropriate. 61 | if a.size != self.size: 62 | raise ValueError("need an array of size %s; got %s" % (self.size, a.size)) 63 | if a.dtype.name != self.dtype.name: 64 | # XXX: compare dtypes directly? issubdtype? 
65 | raise ValueError("need an array of dtype %r; got %r" % (self.dtype, a.dtype)) 66 | cudaMemcpy(a.ctypes.data, self.ref, self.nbytes, cudaMemcpyDeviceToHost) 67 | a = a.reshape(self.shape, order=self.order) 68 | return a 69 | 70 | def from_numpy(self, a): 71 | if not self.allocated: 72 | raise Exception("Must first allocate") 73 | assert a.size == self.size, "size must be the same" 74 | assert a.dtype == self.dtype, "dtype must be the same" 75 | a = numpy.ascontiguousarray(a,dtype=None) 76 | if self.order == 'F': 77 | a = numpy.asfortranarray(a) 78 | cudaMemcpy(self.data, a.ctypes.data, self.nbytes, cudaMemcpyHostToDevice) 79 | return self 80 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_add.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | 4 | # XXX 5 | # -- 6 | # Code forked from python-cuda-2.0_42 © Arno Pähler, 2007-08 7 | # -- 8 | 9 | from ctypes import * 10 | from time import time 11 | 12 | # NP: simple vector addition example 13 | # XXX should be easy ;-) 14 | 15 | from cuda import cuda 16 | print cuda.cudaConfigureCall 17 | print cuda.cudaThreadSynchronize 18 | 19 | #fixedInit(h_X) 20 | 21 | #d_X = getMemory(h_X) 22 | #d_Y = getMemory(h_Y) 23 | 24 | #blockDim = dim3(BLOCK_SIZE,1,1) 25 | #gridDim = dim3(GRID_SIZE,1,1) 26 | #cudaMemcpy(g_Y,d_Y,S4*n2,cudaMemcpyDeviceToHost) 27 | #cudaFree(d_Y) 28 | #cudaThreadExit() 29 | #cudaSetDevice(0) 30 | import sys 31 | sys.exit(0)#raise 32 | 33 | # 34 | #from cpuFunctions import fixedInit,cpuVADD,checkError 35 | from gpuFunctions import gpuVADD 36 | 37 | BLOCK_SIZE = 256 38 | GRID_SIZE = 256 39 | S4 = sizeof(c_float) 40 | checkErrorFlag = False 41 | 42 | def main(vlength = 128,loops = 1): 43 | 44 | n2 = vlength ## Vector length 45 | 46 | h_X = (c_float*n2)() 47 | h_Y = (c_float*n2)() 48 | g_Y = (c_float*n2)() 49 | 50 | fixedInit(h_X) 51 | 52 | d_X = getMemory(h_X) 53 | 
d_Y = getMemory(h_Y) 54 | 55 | blockDim = dim3(BLOCK_SIZE,1,1) 56 | gridDim = dim3(GRID_SIZE,1,1) 57 | 58 | t0 = time() 59 | cudaThreadSynchronize() 60 | for i in range(loops): 61 | cudaConfigureCall(gridDim,blockDim,0,0) 62 | gpuVADD(d_X,d_Y,n2) 63 | cudaThreadSynchronize() 64 | t0 = time()-t0 65 | 66 | flops = (1.e-9*n2)*float(loops) 67 | g_Y = (c_float*n2)() 68 | cudaMemcpy(g_Y,d_Y,S4*n2,cudaMemcpyDeviceToHost) 69 | cudaThreadSynchronize() 70 | 71 | cudaFree(d_X) 72 | cudaFree(d_Y) 73 | 74 | cudaThreadExit() 75 | t1 = time() 76 | for i in range(loops): 77 | cpuVADD(h_X,h_Y) 78 | t1 = time()-t1 79 | print "%10d%6.2f%6.2f" % (vlength,flops/t1,flops/t0) 80 | 81 | if checkErrorFlag: 82 | err,mxe = checkError(h_Y,g_Y) 83 | print "Avg and max rel error = %.2e %.2e" % (err,mxe) 84 | 85 | if __name__ == "__main__": 86 | import sys 87 | 88 | cudaSetDevice(0) 89 | 90 | lmin,lmax = 7,24 91 | if len(sys.argv) > 1: 92 | lmin = lmax = int(sys.argv[1]) 93 | loopx = -1 94 | if len(sys.argv) > 2: 95 | loopx = int(sys.argv[2]) 96 | lmax = min(max(0,lmax),24) 97 | lmin = min(max(0,lmin),lmax) 98 | for l in range(lmin,lmax+1): 99 | if l < 10: 100 | loops = 25000 101 | elif l < 17: 102 | loops = 10000 103 | elif l < 21: 104 | loops = 250 105 | else: 106 | loops = 25 107 | vlength = 1 << l 108 | if loopx > 0: 109 | loops = loopx 110 | print "%5d %5d" % (l,loops), 111 | main(vlength,loops) 112 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_blsc.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cuda.cuda_defs import * 7 | from cuda.cuda_api import * 8 | from cuda.cuda_utils import * 9 | 10 | from cpuFunctions import randInit,checkError 11 | from gpuFunctions import gpuBLSC 12 | 13 | UseVML = False 14 | if UseVML: 15 | from mklMath import cpuBLSC 16 | else: 17 | from 
cpuFunctions import cpuBLSC 18 | 19 | BLOCK_SIZE = 128 20 | GRID_SIZE = 192 21 | checkErrorFlag = False 22 | 23 | S4 = sizeof(c_float) 24 | 25 | def main(vlength = 128,loops = 1): 26 | 27 | n2 = vlength ## Vector length 28 | 29 | h_S = (c_float*n2)() 30 | h_X = (c_float*n2)() 31 | h_T = (c_float*n2)() 32 | h_C = (c_float*n2)() 33 | h_P = (c_float*n2)() 34 | 35 | 36 | randInit(h_S,5.,30.) 37 | randInit(h_X,1.,100.) 38 | randInit(h_T,.25,10.) 39 | R,V = .03,.3 40 | 41 | d_S = getMemory(h_S) 42 | d_X = getMemory(h_X) 43 | d_T = getMemory(h_T) 44 | d_C = getMemory(h_C) 45 | d_P = getMemory(h_P) 46 | 47 | blockDim = dim3(BLOCK_SIZE,1,1) 48 | gridDim = dim3(GRID_SIZE,1,1) 49 | 50 | cudaThreadSynchronize() 51 | t0 = time() 52 | for i in range(loops): 53 | cudaConfigureCall(gridDim,blockDim,0,0) 54 | gpuBLSC(d_C,d_P,d_S,d_X,d_T,R,V,n2) 55 | cudaThreadSynchronize() 56 | t0 = time()-t0 57 | 58 | flops = (2.e-6*n2)*float(loops) 59 | g_C = (c_float*n2)() 60 | g_P = (c_float*n2)() 61 | cudaMemcpy(g_C,d_C,S4*n2,cudaMemcpyDeviceToHost) 62 | cudaMemcpy(g_P,d_P,S4*n2,cudaMemcpyDeviceToHost) 63 | cudaThreadSynchronize() 64 | 65 | cudaFree(d_S) 66 | cudaFree(d_X) 67 | cudaFree(d_T) 68 | cudaFree(d_C) 69 | cudaFree(d_P) 70 | 71 | cudaThreadExit() 72 | t1 = time() 73 | for i in range(loops): 74 | cpuBLSC(h_C,h_P,h_S,h_X,h_T,R,V,n2) 75 | t1 = time()-t1 76 | print "%10d%10.2f%10.2f" % (vlength,flops/t1,flops/t0) 77 | 78 | if checkErrorFlag: 79 | err,mxe = checkError(h_C,g_C) 80 | print "Avg rel error (call) = %.2e" % (err,) 81 | err,mxe = checkError(h_P,g_P) 82 | print "Avg rel error (put) = %.2e" % (err,) 83 | 84 | if __name__ == "__main__": 85 | import sys 86 | 87 | cudaSetDevice(0) 88 | 89 | lmin,lmax = 7,23 90 | if len(sys.argv) > 1: 91 | lmin = lmax = int(sys.argv[1]) 92 | lmax = min(max(0,lmax),23) 93 | lmin = min(max(0,lmin),lmax) 94 | for l in range(lmin,lmax+1): 95 | if l < 10: 96 | loops = 1000 97 | elif l < 13: 98 | loops = 500 99 | elif l < 17: 100 | loops = 100 101 | elif l 
< 21: 102 | loops = 10 103 | else: 104 | loops = 5 105 | loops = 2 106 | vlength = 1 << l 107 | print "%5d %5d" % (l,loops), 108 | main(vlength,loops) 109 | -------------------------------------------------------------------------------- /tests/cu/todo/cu_poly.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cu.cu_defs import * 7 | from cu.cu_api import * 8 | from utils.cu_utils import * 9 | 10 | from cpuFunctions import cpuPOLY5,cpuPOLY10,cpuPOLY20,cpuPOLY40 11 | 12 | BLOCK_SIZE = 144 13 | GRID_SIZE = 192 14 | ##BLOCK_SIZE = 320 15 | ##GRID_SIZE = 8 16 | checkErrorFlag = False 17 | 18 | S4 = sizeof(c_float) 19 | psize = 5 20 | 21 | def main(device,vlength = 128,loops = 1,m1 = 1): 22 | print "%5d %5d %5d" % (l,loops,m1), 23 | 24 | alfa = c_float(.5) 25 | n2 = vlength ## Vector length 26 | 27 | mp = 1 << (m1-1) 28 | print "%5d" % (mp*psize), 29 | fcn = "gpuPOLY%d"%(mp*psize) 30 | gpuPOLY = device.functions[fcn] 31 | h_X = (c_float*n2)() 32 | h_Y = (c_float*n2)() 33 | g_Y = (c_float*n2)() 34 | 35 | vectorInit(h_X) 36 | 37 | d_X = getMemory(h_X) 38 | d_Y = getMemory(h_Y) 39 | 40 | cuFuncSetBlockShape(gpuPOLY,BLOCK_SIZE,1,1) 41 | cuParamSeti(gpuPOLY,0,d_X) 42 | cuParamSeti(gpuPOLY,4,d_Y) 43 | cuParamSeti(gpuPOLY,8,n2) 44 | cuParamSetSize(gpuPOLY,12) 45 | 46 | cuCtxSynchronize() 47 | cuLaunchGrid(gpuPOLY,GRID_SIZE,1) 48 | t0 = time() 49 | for i in range(loops): 50 | cuLaunchGrid(gpuPOLY,GRID_SIZE,1) 51 | cuCtxSynchronize() 52 | t0 = time()-t0 53 | 54 | flops = (2.e-9*m1*n2*(psize-1))*float(loops) 55 | cuMemcpyDtoH(g_Y,d_Y,n2*S4) 56 | cuCtxSynchronize() 57 | 58 | cuMemFree(d_X) 59 | cuMemFree(d_Y) 60 | 61 | cpuPOLY = eval("cpuPOLY%d" % (mp*psize)) 62 | t1 = time() 63 | for i in range(loops): 64 | cpuPOLY(h_X,h_Y) 65 | t1 = time()-t1 66 | print "%10d%6.2f%6.2f" % (vlength,flops/t1,flops/t0) 67 | 68 | if 
checkErrorFlag: 69 | err,mxe = checkError(h_Y,g_Y) 70 | print "Avg and max rel error = %.2e %.2e" % (err,mxe) 71 | 72 | if __name__ == "__main__": 73 | import sys 74 | 75 | lmin,lmax = 7,23 76 | if len(sys.argv) > 1: 77 | lmin = lmax = int(sys.argv[1]) 78 | loopx = -1 79 | if len(sys.argv) > 2: 80 | loopx = int(sys.argv[2]) 81 | m1 = 4 82 | if len(sys.argv) > 3: 83 | m1 = min(4,int(sys.argv[3])) 84 | lmax = min(max(0,lmax),23) 85 | lmin = min(max(0,lmin),lmax) 86 | 87 | mp = 1 << (m1-1) 88 | device = cu_CUDA() 89 | device.getSourceModule("gpuFunctions.cubin") 90 | fcn = "gpuPOLY%d"%(mp*psize) 91 | device.getFunction(fcn) 92 | 93 | for l in range(lmin,lmax+1): 94 | if l < 10: 95 | loops = 10000/m1 96 | elif l < 13: 97 | loops = 5000/m1 98 | elif l < 17: 99 | loops = 500/m1 100 | elif l < 21: 101 | loops = 250/m1 102 | else: 103 | loops = 100/m1 104 | vlength = 1 << l 105 | if loopx > 0: 106 | loops = loopx 107 | main(device,vlength,loops,m1) 108 | cuCtxDetach(device.context) 109 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_sgemm.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cuda.cuda_api import * 7 | from cuda.cuda_defs import * 8 | from cuda.cuda_utils import * 9 | 10 | from cpuFunctions import arrayInit,cpuSGEMM,checkError 11 | from gpuFunctions import gpuSGEMM 12 | from ctypes_array import * 13 | 14 | useSciPy = True 15 | if useSciPy: 16 | from scipy.lib.blas.fblas import sgemm as _sgemm 17 | # C : A*B (on the GPU) 18 | # F : (A*B).T = B.T * A.T (scipy) 19 | def sgemm(z,x,y,m,n,k): 20 | nx = convert(x,(m,k)).T 21 | ny = convert(y,(k,n)).T 22 | nz = _sgemm(1.,ny,nx) 23 | convert(nz,out=z) 24 | return z 25 | else: 26 | # C : A*B (on the CPU) (in C) 27 | sgemm = cpuSGEMM 28 | 29 | BLOCK_SIZE = 1 << 4 30 | S4 = sizeof(c_float) 31 | 32 | def 
main(N = 1024,L = 100): 33 | M = N 34 | K = N >> 1 35 | N = N << 1 36 | flops = (2.e-9*M*N)*float(K*L) 37 | print "M = %d, N = %d, K = %d, L = %d; GFlops = %.1f\n" % (M,N,K,L,flops) 38 | na,nb,nc,alfa,beta = M*K,K*N,M*N,1.,0. 39 | 40 | t0 = time() 41 | 42 | sizeA = M*K 43 | sizeB = K*N 44 | sizeC = M*N 45 | 46 | h_A = (c_float*sizeA)() 47 | h_B = (c_float*sizeB)() 48 | 49 | arrayInit(h_A) 50 | arrayInit(h_B) 51 | 52 | d_A = getMemory(h_A) 53 | d_B = getMemory(h_B) 54 | d_C = getMemory(sizeC) 55 | 56 | blockDim = dim3(BLOCK_SIZE,BLOCK_SIZE,1) 57 | gridDim = dim3(N/BLOCK_SIZE,M/BLOCK_SIZE,1) 58 | sharedMem = S4*2*BLOCK_SIZE*BLOCK_SIZE 59 | tt = t0 = time()-t0 60 | print "Overhead runtime API: %.3f sec\n" % t0 61 | 62 | t0 = time() 63 | cudaThreadSynchronize() 64 | for i in range(L): 65 | cudaConfigureCall(gridDim,blockDim,sharedMem,0) 66 | gpuSGEMM(d_C,d_A,d_B,K,N) 67 | cudaThreadSynchronize() 68 | t0 = time()-t0 69 | tt += t0 70 | 71 | h_C = (c_float*sizeC)() 72 | cudaMemcpy(h_C,d_C,S4*sizeC,cudaMemcpyDeviceToHost) 73 | 74 | cudaThreadSynchronize() 75 | 76 | cudaFree(d_A) 77 | cudaFree(d_B) 78 | cudaFree(d_C) 79 | 80 | cudaThreadExit() 81 | print "Processing time: %.3g (%.3g) sec" % (t0,tt) 82 | print "Gigaflops GPU: %.2f (%.2f)" % (flops/t0,flops/tt) 83 | 84 | ref = (c_float*sizeC)() 85 | 86 | t1 = time() 87 | for i in range(L): 88 | sgemm(ref,h_A,h_B,M,N,K) 89 | t1 = time()-t1 90 | print "\nProcessing time: %.3g sec" % t1 91 | print "Gigaflops CPU : %.2f" % (flops/t1) 92 | print "Speedup GPU/CPU: %.2f" % (t1/t0) 93 | 94 | err,mxe = checkError(ref,h_C) 95 | print "\nAvg and max rel error = %.2e %.2e" % (err,mxe) 96 | 97 | if __name__ == "__main__": 98 | import sys 99 | 100 | cudaSetDevice(0) 101 | 102 | M, L = 1024, 100 103 | if len(sys.argv) > 1: 104 | M = int(sys.argv[1]) 105 | M = (M >> 5) << 5 # multiple of (2*BLOCK_SIZE) 106 | if len(sys.argv) > 2: 107 | L = int(sys.argv[2]) 108 | 109 | print "+-----------------------+" 110 | print "| Matrix Multiplication |" 
111 | print "| using CUDA API |" 112 | print "+-----------------------+\n" 113 | main(M,L) 114 | -------------------------------------------------------------------------------- /tests/cu/todo/cu_blsc.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cu.cu_defs import * 7 | from cu.cu_api import * 8 | from utils.cu_utils import * 9 | 10 | from cpuFunctions import randInit,checkError 11 | 12 | UseVML = False 13 | if UseVML: 14 | from mklMath import cpuBLSC 15 | else: 16 | from cpuFunctions import cpuBLSC 17 | 18 | BLOCK_SIZE = 128 19 | GRID_SIZE = 192 20 | checkErrorFlag = False 21 | 22 | S4 = sizeof(c_float) 23 | 24 | def main(device,vlength = 128,loops = 1): 25 | 26 | n2 = vlength ## Vector length 27 | 28 | gpuBLSC = device.functions["gpuBLSC"] 29 | 30 | h_S = (c_float*n2)() 31 | h_X = (c_float*n2)() 32 | h_T = (c_float*n2)() 33 | h_C = (c_float*n2)() 34 | h_P = (c_float*n2)() 35 | 36 | 37 | randInit(h_S,5.,30.) 38 | randInit(h_X,1.,100.) 39 | randInit(h_T,.25,10.) 
40 | R,V = .03,.3 41 | 42 | d_S = getMemory(h_S) 43 | d_X = getMemory(h_X) 44 | d_T = getMemory(h_T) 45 | d_C = getMemory(h_C) 46 | d_P = getMemory(h_P) 47 | 48 | cuFuncSetBlockShape(gpuBLSC,BLOCK_SIZE,1,1) 49 | cuParamSeti(gpuBLSC, 0,d_C) 50 | cuParamSeti(gpuBLSC, 4,d_P) 51 | cuParamSeti(gpuBLSC, 8,d_S) 52 | cuParamSeti(gpuBLSC,12,d_X) 53 | cuParamSeti(gpuBLSC,16,d_T) 54 | cuParamSetf(gpuBLSC,20,R) 55 | cuParamSetf(gpuBLSC,24,V) 56 | cuParamSeti(gpuBLSC,28,n2) 57 | cuParamSetSize(gpuBLSC,32) 58 | 59 | cuCtxSynchronize() 60 | t0 = time() 61 | for i in range(loops): 62 | cuLaunchGrid(gpuBLSC,GRID_SIZE,1) 63 | cuCtxSynchronize() 64 | t0 = time()-t0 65 | 66 | flops = (2.e-6*n2)*float(loops) 67 | g_C = (c_float*n2)() 68 | g_P = (c_float*n2)() 69 | cuMemcpyDtoH(g_C,d_C,n2*S4) 70 | cuMemcpyDtoH(g_P,d_P,n2*S4) 71 | cuCtxSynchronize() 72 | 73 | cuMemFree(d_S) 74 | cuMemFree(d_X) 75 | cuMemFree(d_T) 76 | cuMemFree(d_C) 77 | cuMemFree(d_P) 78 | 79 | t1 = time() 80 | for i in range(loops): 81 | cpuBLSC(h_C,h_P,h_S,h_X,h_T,R,V,n2) 82 | t1 = time()-t1 83 | print "%10d%10.2f%10.2f" % (vlength,flops/t1,flops/t0) 84 | 85 | if checkErrorFlag: 86 | err,mxe = checkError(h_C,g_C) 87 | print "Avg rel error (call) = %.2e" % (err,) 88 | err,mxe = checkError(h_P,g_P) 89 | print "Avg rel error (put) = %.2e" % (err,) 90 | 91 | if __name__ == "__main__": 92 | import sys 93 | 94 | device = cu_CUDA() 95 | device.getSourceModule("gpuFunctions.cubin") 96 | device.getFunction("gpuBLSC") 97 | 98 | lmin,lmax = 7,23 99 | if len(sys.argv) > 1: 100 | lmin = lmax = int(sys.argv[1]) 101 | lmax = min(max(0,lmax),23) 102 | lmin = min(max(0,lmin),lmax) 103 | for l in range(lmin,lmax+1): 104 | if l < 10: 105 | loops = 1000 106 | elif l < 13: 107 | loops = 500 108 | elif l < 17: 109 | loops = 100 110 | elif l < 21: 111 | loops = 10 112 | else: 113 | loops = 5 114 | loops = 2 115 | vlength = 1 << l 116 | print "%5d %5d" % (l,loops), 117 | main(device,vlength,loops) 118 | cuCtxDetach(device.context) 119 | 
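The byte offsets passed by hand to cuParamSeti/cuParamSetf in cu_blsc.py above (0, 4, ..., 28, with a total size of 32) are what you get by packing eight 4-byte arguments back to back. The sketch below shows one way such a layout could be computed rather than typed in. `param_layout` is a hypothetical helper, not part of this repository, and the argument list assumes that device pointers are passed as 32-bit unsigned values, which is what the use of cuParamSeti for d_C through d_T suggests.

```python
from ctypes import alignment, c_float, c_int, c_uint, sizeof


def param_layout(arg_types):
    """Return (offsets, total_size) for a packed kernel parameter list.

    Hypothetical helper: each argument is placed at the next offset that
    satisfies its own ctypes alignment, mirroring the hand-written
    offsets used with cuParamSeti/cuParamSetf.
    """
    offsets = []
    offset = 0
    for ctype in arg_types:
        align = alignment(ctype)
        # Round the running offset up to a multiple of the alignment.
        offset = (offset + align - 1) // align * align
        offsets.append(offset)
        offset += sizeof(ctype)
    return offsets, offset


# Assumed gpuBLSC argument list: five device pointers (taken to be
# 32-bit here), two floats (R, V) and one int (n2).
blsc_args = [c_uint] * 5 + [c_float, c_float, c_int]
offsets, size = param_layout(blsc_args)
# offsets -> [0, 4, 8, 12, 16, 20, 24, 28], size -> 32
```

With a layout like this, the last value is what would be handed to cuParamSetSize, and the same helper could be reused for the shorter gpuPOLY and gpuSGEMM parameter lists.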
-------------------------------------------------------------------------------- /tests/cu/todo/cu_sgemm.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cu.cu_defs import * 7 | from cu.cu_api import * 8 | from utils.cu_utils import * 9 | 10 | from cpuFunctions import arrayInit,cpuSGEMM,checkError 11 | from ctypes_array import * 12 | 13 | useSciPy = True 14 | if useSciPy: 15 | from scipy.lib.blas.fblas import sgemm as _sgemm 16 | # C : A*B (on the GPU) 17 | # F : (A*B).T = B.T * A.T (scipy) 18 | def sgemm(z,x,y,m,n,k): 19 | nx = convert(x,(m,k),"C").T 20 | ny = convert(y,(k,n),"C").T 21 | nz = _sgemm(1.,ny,nx) 22 | convert(nz,out=z) 23 | return z 24 | else: 25 | # C : A*B (on the CPU) (in C) 26 | sgemm = cpuSGEMM 27 | 28 | BLOCK_SIZE = 1 << 4 29 | S4 = sizeof(c_float) 30 | 31 | def main(N = 1024,L = 100): 32 | M = N 33 | K = N >> 1 34 | N = N << 1 35 | flops = (2.e-9*M*N)*float(K*L) 36 | print "M = %d, N = %d, K = %d, L = %d; GFlops = %.1f\n" % (M,N,K,L,flops) 37 | na,nb,nc,alfa,beta = M*K,K*N,M*N,1.,0. 
38 | 39 | t0 = time() 40 | device = cu_CUDA() 41 | device.getSourceModule("gpuFunctions.cubin") 42 | gpuSGEMM = device.getFunction("gpuSGEMM") 43 | 44 | sizeA = M*K 45 | sizeB = K*N 46 | sizeC = M*N 47 | 48 | h_A = (c_float*sizeA)() 49 | h_B = (c_float*sizeB)() 50 | 51 | arrayInit(h_A) 52 | arrayInit(h_B) 53 | 54 | d_A = getMemory(h_A) 55 | d_B = getMemory(h_B) 56 | d_C = getMemory(sizeC) 57 | 58 | cuFuncSetBlockShape(gpuSGEMM,BLOCK_SIZE,BLOCK_SIZE,1) 59 | cuFuncSetSharedSize(gpuSGEMM,2*BLOCK_SIZE*BLOCK_SIZE*S4) 60 | cuParamSeti(gpuSGEMM,0,d_C) 61 | cuParamSeti(gpuSGEMM,4,d_A) 62 | cuParamSeti(gpuSGEMM,8,d_B) 63 | cuParamSeti(gpuSGEMM,12,K) 64 | cuParamSeti(gpuSGEMM,16,N) 65 | cuParamSetSize(gpuSGEMM,20) 66 | tt = t0 = time()-t0 67 | print "Overhead driver API: %.3f sec\n" % t0 68 | 69 | t0 = time() 70 | cuCtxSynchronize() 71 | for i in range(L): 72 | cuLaunchGrid(gpuSGEMM,N/BLOCK_SIZE,M/BLOCK_SIZE) 73 | cuCtxSynchronize() 74 | t0 = time()-t0 75 | tt += t0 76 | 77 | h_C = (c_float*sizeC)() 78 | cuMemcpyDtoH(h_C,d_C,S4*sizeC) 79 | cuCtxSynchronize() 80 | 81 | cuMemFree(d_A) 82 | cuMemFree(d_B) 83 | cuMemFree(d_C) 84 | cuCtxDetach(device.context) 85 | 86 | print "Processing time: %.3g (%.3g) sec" % (t0,tt) 87 | print "Gigaflops GPU: %.2f (%.2f)" % (flops/t0,flops/tt) 88 | 89 | ref = (c_float*sizeC)() 90 | 91 | t1 = time() 92 | for i in range(L): 93 | sgemm(ref,h_A,h_B,M,N,K) 94 | t1 = time()-t1 95 | print "\nProcessing time: %.3g sec" % t1 96 | print "Gigaflops CPU : %.2f" % (flops/t1) 97 | print "Speedup GPU/CPU: %.2f" % (t1/t0) 98 | 99 | err,mxe = checkError(ref,h_C) 100 | print "\nAvg and max rel error = %.2e %.2e" % (err,mxe) 101 | 102 | if __name__ == "__main__": 103 | import sys 104 | 105 | M, L = 1024, 100 106 | if len(sys.argv) > 1: 107 | M = int(sys.argv[1]) 108 | M = (M >> 5) << 5 # multiple of (2*BLOCK_SIZE) 109 | if len(sys.argv) > 2: 110 | L = int(sys.argv[2]) 111 | 112 | print "+-----------------------+" 113 | print "| Matrix Multiplication |" 114 | 
print "| using CUDA driver API |" 115 | print "+-----------------------+\n" 116 | main(M,L) 117 | -------------------------------------------------------------------------------- /tests/cufft/todo/xfft.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8: © Arno Pähler, 2007-08 2 | 3 | from ctypes import c_double,c_float,c_int 4 | 5 | #cannot mix single and double in one file 6 | #segmentation fault on conflicting loads 7 | #of both single and double versions of fft 8 | #import dfft,sfft 9 | 10 | import sfft 11 | from cpuFunctions import scale 12 | 13 | _cd = c_double 14 | _cf = c_float 15 | _ci = c_int 16 | 17 | ## constants 18 | FORWARD = -1 ## Forward FFT 19 | BACKWARD = 1 ## Inverse FFT 20 | 21 | ESTIMATE = 0 ## for plans 22 | MEASURE = 1 ## for plans 23 | 24 | OUT_OF_PLACE = 0 25 | IN_PLACE = 8 26 | USE_WISDOM = 16 27 | THREADSAFE = 128 28 | 29 | R2C = FORWARD 30 | C2R = BACKWARD 31 | 32 | fftw_plan = _ci 33 | 34 | x_cache = {} 35 | 36 | def getType(item): 37 | """gets the type of ctypes item""" 38 | itype = item._type_._type_ 39 | if itype == 'd': 40 | # return _cd,dfft 41 | return None,None 42 | elif itype == 'f': 43 | return _cf,sfft 44 | else: 45 | return None,None 46 | 47 | def rcfft(r,dims): 48 | global x_cache 49 | if not isinstance(dims,tuple): 50 | dims = tuple(dims) 51 | dimx = list(dims) 52 | dimx[-1] = 2*(dimx[-1]/2+1) 53 | size = reduce(lambda x,y:x*y,dimx) 54 | ndim = len(dims) 55 | x_type,x_fft = getType(r) 56 | if x_type is None: 57 | return None 58 | c = (x_type*size)() 59 | try: 60 | wsave = x_cache[('rc',dims)] 61 | except KeyError: 62 | xdim = (_ci*ndim)(*dims) 63 | wsave = x_fft.CreatePlan_r(ndim,xdim, 64 | R2C,ESTIMATE) 65 | x_cache[('rc',dims)] = wsave 66 | x_fft.Execute_r2c(wsave,r,c) 67 | return c 68 | 69 | def crfft(c,dims): 70 | global x_cache 71 | if not isinstance(dims,tuple): 72 | dims = tuple(dims) 73 | dims = list(dims) 74 | dims = tuple(dims) 75 | size = reduce(lambda 
x,y:x*y,dims) 76 | ndim = len(dims) 77 | x_type,x_fft = getType(c) 78 | if x_type is None: 79 | return None 80 | r = (x_type*size)() 81 | try: 82 | wsave = x_cache[('cr',dims)] 83 | except KeyError: 84 | xdim = (_ci*ndim)(*dims) 85 | wsave = x_fft.CreatePlan_r(ndim,xdim, 86 | C2R,ESTIMATE) 87 | x_cache[('cr',dims)] = wsave 88 | x_fft.Execute_c2r(wsave,c,r) 89 | sc = 1./float(size) 90 | scale(r,sc) 91 | return r 92 | 93 | def ccfft(c,dims): 94 | global x_cache 95 | if not isinstance(dims,tuple): 96 | dims = tuple(dims) 97 | size = reduce(lambda x,y:x*y,dims) 98 | ndim = len(dims) 99 | x_type,x_fft = getType(c) 100 | if x_type is None: 101 | return None 102 | z = (x_type*(size*2))() 103 | try: 104 | wsave = x_cache[('cc',dims)] 105 | except KeyError: 106 | xdim = (_ci*ndim)(*dims) 107 | wsave = x_fft.CreatePlan_c(ndim,xdim, 108 | FORWARD,ESTIMATE) 109 | x_cache[('cc',dims)] = wsave 110 | x_fft.Execute_c2c(wsave,c,z) 111 | return z 112 | 113 | def icfft(z,dims): 114 | global x_cache 115 | if not isinstance(dims,tuple): 116 | dims = tuple(dims) 117 | size = reduce(lambda x,y:x*y,dims) 118 | ndim = len(dims) 119 | x_type,x_fft = getType(z) 120 | if x_type is None: 121 | return None 122 | c = (x_type*(size*2))() 123 | try: 124 | wsave = x_cache[('ic',dims)] 125 | except KeyError: 126 | xdim = (_ci*ndim)(*dims) 127 | wsave = x_fft.CreatePlan_c(ndim,xdim, 128 | BACKWARD,ESTIMATE) 129 | x_cache[('ic',dims)] = wsave 130 | x_fft.Execute_c2c(wsave,z,c) 131 | sc = 1./float(size) 132 | scale(c,sc) 133 | return c 134 | 135 | #shortcuts 136 | 137 | fft = ccfft 138 | ifft = icfft 139 | -------------------------------------------------------------------------------- /oldcode/misc/simple.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | from ctypes import * 4 | from time import time 5 | 6 | from cuda.cu_defs import * 7 | from cuda.cu_api import * 8 | from cuda.cu_utils import * 9 | 
10 | from cpuFunctions import checkError,checkTrig,vectorInit 11 | 12 | UseVML = True 13 | if UseVML: 14 | from mklMath import cpuTRIG 15 | else: 16 | from cpuFunctions import cpuTRIG 17 | 18 | BLOCK_SIZE = 320 19 | GRID_SIZE = 8 20 | 21 | S4 = sizeof(c_float) 22 | 23 | def main(device,vlength = 128,loops = 1): 24 | print "+-----------------------+" 25 | print "| Simple TRIG Test |" 26 | print "| using CUDA driver API |" 27 | print "+-----------------------+" 28 | print "params: %2d %5dK %3d\n" % (log2n,vlength >> 10,loops), 29 | 30 | n2 = vlength ## Vector length 31 | 32 | # TRIGTex is about 1.5x faster than TRIG 33 | # name = "TRIG" 34 | name = "TRIGTex" 35 | 36 | TRIG = device.functions[name] 37 | mod0 = device.modules[0] 38 | 39 | sizeV = S4*n2 40 | h_Arg = (c_float*n2)() 41 | h_Cos = (c_float*n2)() 42 | h_Sin = (c_float*n2)() 43 | 44 | vectorInit(h_Arg) 45 | 46 | d_Arg = getMemory(h_Arg) 47 | d_Cos = getMemory(n2) 48 | d_Sin = getMemory(n2) 49 | 50 | tex = devMemToTex(mod0,"Arg",d_Arg,sizeV) 51 | 52 | cuFuncSetBlockShape(TRIG,BLOCK_SIZE,1,1) 53 | cuParamSeti(TRIG,0,d_Cos) 54 | cuParamSeti(TRIG,4,d_Sin) 55 | if name != "TRIGTex": 56 | cuParamSeti(TRIG,8,d_Arg) 57 | cuParamSeti(TRIG,12,n2) 58 | cuParamSetSize(TRIG,16) 59 | else: 60 | cuParamSetTexRef(TRIG,CU_PARAM_TR_DEFAULT,tex) 61 | cuParamSeti(TRIG,8,n2) 62 | cuParamSetSize(TRIG,12) 63 | cuCtxSynchronize() 64 | 65 | t0 = time() 66 | for i in range(loops): 67 | cuLaunchGrid(TRIG,GRID_SIZE,1) 68 | cuCtxSynchronize() 69 | t0 = time()-t0 70 | 71 | g_Cos = (c_float*n2)() 72 | g_Sin = (c_float*n2)() 73 | cuMemcpyDtoH(g_Cos,d_Cos,sizeV) 74 | cuMemcpyDtoH(g_Sin,d_Sin,sizeV) 75 | cuCtxSynchronize() 76 | 77 | cuMemFree(d_Arg) 78 | cuMemFree(d_Cos) 79 | cuMemFree(d_Sin) 80 | 81 | t1 = time() 82 | for i in range(loops): 83 | cpuTRIG(h_Cos,h_Sin,h_Arg) 84 | t1 = time()-t1 85 | 86 | flopsg = (2.e-6*n2)*float(loops) 87 | flopsc = flopsg 88 | 89 | t0 *= 1.e3; 90 | t1 *= 1.e3; 91 | print "\n time[msec] GFlops\n" 92 | print 
"GPU: %12.1f%10.2f" % (t0,flopsg/t0) 93 | print "CPU: %12.1f%10.2f" % (t1,flopsc/t1) 94 | print " %12.1f" % (t1/t0) 95 | 96 | x = float(1 << 23) 97 | e,m = checkTrig(g_Cos,g_Sin) 98 | print "\n",name, "internal check GPU" 99 | print "%8.1e %8.1e" % (e,m) 100 | print "%8.1f %8.1f" % (e*x,m*x) 101 | 102 | e,m = checkTrig(h_Cos,h_Sin) 103 | print "\n",name, "internal check CPU" 104 | print "%8.1e %8.1e" % (e,m) 105 | print "%8.1f %8.1f" % (e*x,m*x) 106 | 107 | print "\n","check between CPU and GPU" 108 | err,mxe = checkError(h_Cos,g_Cos) 109 | print "Avg and max abs error (cos) = %8.1e %8.1e" % (err,mxe) 110 | print " %8.1f %8.1f" % (err*x,mxe*x) 111 | err,mxe = checkError(h_Sin,g_Sin) 112 | print "Avg and max abs error (sin) = %8.1e %8.1e" % (err,mxe) 113 | print " %8.1f %8.1f" % (err*x,mxe*x) 114 | 115 | if __name__ == "__main__": 116 | import sys 117 | device = cu_CUDA() 118 | device.getSourceModule("simple.cubin") 119 | device.getFunction("TRIG") 120 | device.getFunction("TRIGTex") 121 | 122 | log2n,loops = 15,1 123 | if len(sys.argv) > 1: 124 | log2n = int(sys.argv[1]) 125 | log2n = max(0,min(log2n,25)) 126 | if len(sys.argv) > 2: 127 | loops = int(sys.argv[2]) 128 | vlength = 1 << log2n 129 | main(device,vlength,loops) 130 | cuCtxDetach(device.context) 131 | -------------------------------------------------------------------------------- /cuda/sugar/query/cu_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | from cuda.cu import * 3 | from ctypes import * 4 | 5 | class GPUException(Exception): 6 | pass 7 | 8 | class cu_CUDA(object): 9 | usedDevices = 0 10 | def __init__(self): 11 | flags = 0 # see manual 12 | self.device = None 13 | self.context = None 14 | self.module = None 15 | self.deviceID = -1 16 | cuInit(flags) 17 | device_count = c_int() 18 | cuDeviceGetCount(byref(device_count)) 19 | if cu_CUDA.usedDevices >= device_count.value: 20 | print "No more uninitialized devices available" 21 | 
return 22 | self.device = CUdevice() 23 | self.context = CUcontext() 24 | self.modules = list() 25 | self.functions = dict() 26 | self.deviceID = cu_CUDA.usedDevices 27 | cuDeviceGet(byref(self.device),self.deviceID) 28 | cu_CUDA.usedDevices += 1 29 | status = cuCtxCreate(byref(self.context),0,self.device) 30 | if status != CUDA_SUCCESS: 31 | cuCtxDetach(self.context) 32 | raise GPUException("Failed to create CUDA context") 33 | self.getInfo() 34 | 35 | def getSourceModule(self,name=None): 36 | if name is None: 37 | name = "gpuFunctions.cubin" 38 | module = CUmodule() 39 | status = cuModuleLoad(byref(module),name) 40 | if status != CUDA_SUCCESS: 41 | print "File not found: %s" % name 42 | self.modules.append(module) 43 | return module 44 | 45 | def getFunction(self,name): 46 | missing = True 47 | function = CUfunction() 48 | for module in self.modules: 49 | status = cuModuleGetFunction(function,module,name) 50 | if status != CUDA_SUCCESS: 51 | continue 52 | else: 53 | self.functions[name] = function 54 | missing = False 55 | break 56 | if missing: 57 | print "Function not found: %s" % name 58 | return None 59 | return function 60 | 61 | def getInfo(self): 62 | device = self.device 63 | info = dict() 64 | count = c_int() 65 | cuDeviceGetCount(byref(count)) 66 | info["count"] = count.value 67 | name = (c_char*256)() 68 | cuDeviceGetName(name,256,device) 69 | info["name"] = name.value 70 | memsize = c_uint() 71 | cuDeviceTotalMem(byref(memsize),device) 72 | info["memory"] = memsize.value 73 | free,total = c_uint(),c_uint() 74 | cuMemGetInfo(byref(free),byref(total)) 75 | info["free"] = free.value 76 | major,minor = c_int(),c_int() 77 | cuDeviceComputeCapability(byref(major),byref(minor),device) 78 | info["capability"] = (major.value,minor.value) 79 | props = CUdevprop() 80 | cuDeviceGetProperties(byref(props),device) 81 | info["properties"] = props 82 | self.info = info 83 | 84 | def __str__(self): 85 | s = ["Device Info:\n"] 86 | i = self.info 87 | s.append("%-19s = 
%d" % ( 88 | "number of devices",i["count"])) 89 | s.append("%-19s = %d" % ( 90 | "current device ID",self.deviceID)) 91 | s.append("%-19s = %s" % ( 92 | "device name",i["name"])) 93 | s.append("%-19s = %.f MB" % ( 94 | "memory size",i["memory"]/1024.**2)) 95 | s.append("%-19s = %.f MB" % ( 96 | "memory free",i["free"]/1024.**2)) 97 | s.append("%-19s = %.f MHz" % ( 98 | "clock rate",i["properties"].clockRate/1000.)) 99 | s.append("%-19s = %d" % ( 100 | "major",i["capability"][0])) 101 | s.append("%-19s = %d" % ( 102 | "minor",i["capability"][1])) 103 | s.append(21*"-") 104 | s.append(str(i["properties"])) 105 | return "\n".join(s) 106 | -------------------------------------------------------------------------------- /cuda/sugar/query/cuda_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import cuda.cuda as cuda 3 | from ctypes import byref, c_int 4 | import logging 5 | log = logging.getLogger('python-cuda') 6 | 7 | CUDART_VERSION = 2010 8 | 9 | def cutilSafeCall(retval): 10 | if retval != 0: 11 | log.error( 'error! %s' % retval) 12 | 13 | def get_device_count(): 14 | device_count = c_int() 15 | cutilSafeCall(cuda.cudaGetDeviceCount(byref(device_count))); 16 | return device_count.value 17 | 18 | def has_cuda_device(): 19 | dev_count = get_device_count() 20 | if dev_count > 0: 21 | log.debug("Found %d gpu devices" % dev_count) 22 | else: 23 | log.debug("There is no device supporting CUDA") 24 | return False 25 | 26 | cuda_enabled = False 27 | 28 | for dev in range(0, dev_count): 29 | dev_prop = cuda.cudaDeviceProp() 30 | retval = cuda.cudaGetDeviceProperties(byref(dev_prop), dev) 31 | if dev_prop.major == 9999 and dev_prop.minor == 9999: 32 | log.debug( "Device %s does not support cuda." 
% dev) 33 | continue 34 | cuda_enabled = True 35 | break 36 | 37 | if not cuda_enabled: 38 | log.debug("There is no device supporting CUDA") 39 | return cuda_enabled 40 | 41 | def needs_emulation(): 42 | return not has_cuda_device() 43 | 44 | def get_devices(): 45 | dev_count = get_device_count() 46 | if dev_count > 0: 47 | log.debug("found %d gpu devices" % dev_count) 48 | else: 49 | log.debug("there is no device supporting cuda") 50 | 51 | for dev in range(0, dev_count): 52 | dev_prop = cuda.cudaDeviceProp() 53 | retval = cuda.cudaGetDeviceProperties(byref(dev_prop), dev) 54 | if retval == 3: 55 | log.debug( "there is no device supporting cuda") 56 | break 57 | elif dev == 0: 58 | if dev_prop.major == 9999 and dev_prop.minor == 9999: 59 | log.debug( "there is no device supporting cuda.") 60 | elif dev_count == 1: 61 | log.debug( "there is 1 device supporting cuda") 62 | else: 63 | log.debug( "there are %d devices supporting cuda" % dev_count) 64 | 65 | log.debug('Device %d: "%s"' % (dev, dev_prop.name)) 66 | log.debug("Major revision number: %d" % dev_prop.major) 67 | log.debug("Minor revision number: %d" % dev_prop.minor) 68 | log.debug("Total amount of global memory: %u bytes" % dev_prop.totalGlobalMem) 69 | 70 | if CUDART_VERSION >= 2000: 71 | log.debug("Number of multiprocessors: %d", dev_prop.multiProcessorCount); 72 | log.debug("Number of cores: %d", 8 * dev_prop.multiProcessorCount); 73 | 74 | log.debug( "Total amount of constant memory: %u bytes" % dev_prop.totalConstMem) 75 | log.debug( "Total amount of shared memory per block: %u bytes" % dev_prop.sharedMemPerBlock) 76 | log.debug( "Total number of registers available per block: %d" % dev_prop.regsPerBlock) 77 | log.debug( "Warp size: %d" % dev_prop.warpSize) 78 | log.debug( "Maximum number of threads per block: %d" % dev_prop.maxThreadsPerBlock) 79 | log.debug( "Maximum sizes of each dimension of a block: %d x %d x %d" % (dev_prop.maxThreadsDim[0], dev_prop.maxThreadsDim[1], dev_prop.maxThreadsDim[2])) 80 
| log.debug( "Maximum sizes of each dimension of a grid: %d x %d x %d" % (dev_prop.maxGridSize[0], dev_prop.maxGridSize[1], dev_prop.maxGridSize[2])) 81 | log.debug( "Maximum memory pitch: %u bytes" % dev_prop.memPitch) 82 | log.debug( "Texture alignment: %u bytes" % dev_prop.textureAlignment) 83 | log.debug( "Clock rate: %.2f GHz" % (dev_prop.clockRate * (1e-6))) 84 | 85 | if CUDART_VERSION >= 2000: 86 | log.debug("Concurrent copy and execution: %s" % bool(dev_prop.deviceOverlap)) 87 | -------------------------------------------------------------------------------- /tests/cuda/todo/cuda_GL.py: -------------------------------------------------------------------------------- 1 | #!/bin/env python 2 | # coding:utf-8: © Arno Pähler, 2007-08 3 | # GLUT version 4 | from ctypes import * 5 | 6 | from ogl.gl import * 7 | from OpenGL.GLUT import * 8 | 9 | from cuda.cuda_defs import * 10 | from cuda.cuda_api import * 11 | 12 | lib = CDLL("./libkernelGL.so") 13 | 14 | kernel1 = lib.__device_stub_kernel1 15 | kernel1.restype = None 16 | kernel1.argtypes = [ c_void_p, c_uint, c_uint, c_float ] 17 | 18 | kernel2 = lib.__device_stub_kernel2 19 | kernel2.restype = None 20 | kernel2.argtypes = [ c_void_p, c_uint, c_uint, c_float ] 21 | 22 | window_width = 512 23 | window_height = 512 24 | 25 | mesh_width = 256 26 | mesh_height = 256 27 | 28 | anim = 0.0 29 | mouse_buttons = 0 30 | rotate_x,rotate_y,translate_z = 0.,0.,-3.0 31 | global mouse_old_x,mouse_old_y 32 | 33 | vbo = GLuint() 34 | 35 | kernel = kernel1 36 | 37 | def main(argc,argv): 38 | global vbo 39 | 40 | glutInit(argc,argv) 41 | glutInitDisplayMode(GLUT_RGBA|GLUT_DOUBLE) 42 | glutInitWindowSize(window_width,window_height) 43 | glutCreateWindow("Cuda GL Demo") 44 | 45 | initGL() 46 | 47 | glutDisplayFunc(display) 48 | glutKeyboardFunc(keyboard) 49 | glutMouseFunc(mouse) 50 | glutMotionFunc(motion) 51 | 52 | vbo = createVBO() 53 | runCuda(vbo) 54 | 55 | glutMainLoop() 56 | 57 | def runCuda(vbo): 58 | vptr = c_void_p() 59 | 
status = cudaGLMapBufferObject(byref(vptr),vbo) 60 | 61 | block = dim3(8,8,1) 62 | grid = dim3(mesh_width/block.x,mesh_height/block.y,1) 63 | status = cudaConfigureCall(grid,block,0,0) 64 | kernel(vptr,mesh_width,mesh_height,anim) 65 | 66 | status = cudaGLUnmapBufferObject(vbo) 67 | if status != 0: 68 | exit() 69 | 70 | def initGL(): 71 | glClearColor(0.0,0.0,0.0,1.0) 72 | glDisable(GL_DEPTH_TEST) 73 | 74 | glViewport(0,0,window_width,window_height) 75 | glMatrixMode(GL_PROJECTION) 76 | glLoadIdentity() 77 | ratio = float(window_width)/float(window_height) 78 | glFrustum(-1.,1.,-1.,1.,2.,10.) 79 | 80 | return True 81 | 82 | def createVBO(): 83 | global vbo 84 | glGenBuffers(1,byref(vbo)) 85 | glBindBuffer(GL_ARRAY_BUFFER,vbo) 86 | 87 | size = mesh_width*mesh_height*4*sizeof(c_float) 88 | glBufferData(GL_ARRAY_BUFFER,size,0,GL_DYNAMIC_DRAW) 89 | 90 | glBindBuffer(GL_ARRAY_BUFFER,0) 91 | 92 | status = cudaGLRegisterBufferObject(vbo) 93 | return vbo 94 | 95 | def deleteVBO(): 96 | global vbo 97 | glBindBuffer(GL_ARRAY_BUFFER,vbo) 98 | glDeleteBuffers(1,byref(vbo)) 99 | 100 | status = cudaGLUnregisterBufferObject(vbo) 101 | vbo = 0 102 | 103 | def display(): 104 | global anim,vbo 105 | runCuda(vbo) 106 | 107 | glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT) 108 | 109 | glMatrixMode(GL_MODELVIEW) 110 | glLoadIdentity() 111 | glTranslatef(0.0,0.0,translate_z) 112 | glRotatef(rotate_x,1.0,0.0,0.0) 113 | glRotatef(rotate_y,0.0,1.0,0.0) 114 | 115 | glBindBuffer(GL_ARRAY_BUFFER,vbo) 116 | glVertexPointer(4,GL_FLOAT,0,0) 117 | 118 | glEnableClientState(GL_VERTEX_ARRAY) 119 | glColor3f(1.0,0.0,0.0) 120 | glDrawArrays(GL_POINTS,0,mesh_width*mesh_height) 121 | glDisableClientState(GL_VERTEX_ARRAY) 122 | 123 | glutSwapBuffers() 124 | glutPostRedisplay() 125 | 126 | anim += 0.01 127 | 128 | def keyboard(key,x,y): 129 | if key == chr(27): 130 | deleteVBO() 131 | exit() 132 | 133 | def mouse(button,state,x,y): 134 | global mouse_buttons 135 | global mouse_old_x,mouse_old_y 136 | if state == GLUT_DOWN: 
137 | mouse_buttons |= 1<