├── .gitignore
├── README.md
├── chapel
│   ├── fmtime.log
│   ├── kernel.chpl
│   ├── kernelmatrix.chpl
│   ├── script.chpl
│   ├── script.sh
│   └── time.log
├── charts
│   ├── benchplot.jpg
│   ├── benchplot.svg
│   ├── charts.r
│   ├── fmbenchplot.jpg
│   ├── fmbenchplot.svg
│   ├── ndsliceDiagnostic.jpg
│   └── ndsliceDiagnostic.svg
├── d
│   ├── arrays.d
│   ├── fmtime.log
│   ├── kernel.d
│   ├── mathdemo.d
│   ├── script.d
│   ├── script.sh
│   └── time.log
├── data
│   ├── chapelBench.csv
│   ├── dBench.csv
│   ├── dNDSliceBench.csv
│   ├── juliaBench.csv
│   └── ndsliceTime.log
├── docs
│   └── kernel.pdf
├── fmdata
│   ├── chapelBench.csv
│   ├── dBench.csv
│   └── juliaBench.csv
├── julia
│   ├── KernelMatrix.jl
│   ├── fmtime.log
│   ├── script.jl
│   ├── script.sh
│   └── time.log
├── ndslice
│   ├── dub.json
│   └── source
│       ├── app.d
│       └── ndslice
│           └── kernels.d
└── script.sh

/.gitignore:
--------------------------------------------------------------------------------
*.o
*/.dub
ndslice/ndslice
d/mathdemo
d/script
ndslice/dub.selections.json

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# A look at Chapel, D, and Julia using Kernel Matrix calculations

*Author: Dr Chibisi Chima-Okereke*<br>
*Date: 2020-06-11 (Updated article with IEEE and fast math calculations for Chapel, D, and Julia)*

## Introduction

It seems that each time you turn around there is a new programming language aimed at solving some specific problem set. The increasing proliferation of programming languages and the growth of data are deeply connected, and the rising demand for "data science" computing is a related phenomenon. In the field of scientific computing, Chapel, D, and Julia are highly relevant programming languages; they arise from different needs and are aimed at different problem sets. Chapel focuses on data parallelism on multicore machines and large clusters. D was developed as a productive, safer alternative to C++, and Julia was created for technical and scientific computing. All three languages emphasize performance as a feature. This article benchmarks their performance on kernel matrix calculations, and it presents approaches to performance optimization and other usability features of the languages.

Kernel matrix calculations are the basis of kernel methods in machine learning applications. They scale rather poorly, `O(m n^2)`, where `n` is the number of items and `m` is the number of elements in each item. In our exercise `m` will be constant and we will be looking at execution time in each implementation as `n` increases. Here `m = 784` and `n = 1k, 5k, 10k, 20k, 30k`; each calculation is run 3 times and an average is taken. We disallow any use of BLAS and only allow use of packages or modules from the standard library of each language. D does not have "mathematical"-style arrays as Julia and Chapel do, so a Matrix object is implemented for it. The performance of this Matrix object is then compared with calculations using Mir, a multidimensional array package written in D, to make sure that the implementation reflects the true performance of D. The details for the calculation of the kernel matrix and kernel functions are given [here](https://github.com/dataPulverizer/KernelMatrixBenchmark/blob/master/docs/kernel.pdf).
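Since the kernel matrix is symmetric, only `n(n + 1)/2` kernel evaluations are needed, each costing `O(m)`. As a back-of-the-envelope illustration (the helper functions below are mine, not part of the benchmark code):

```jl
# Illustrative operation count for a symmetric kernel matrix.
kernelEvals(n) = n * (n + 1) ÷ 2          # unique (i, j) pairs with i >= j
flopEstimate(n, m) = kernelEvals(n) * m   # each evaluation is O(m)
flopEstimate(30_000, 784)                 # ≈ 3.5e11 at the largest size here
```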
Two benchmark types are given: one in IEEE mode, the default, and another in fast math mode, which violates the IEEE floating point standard. Fast math allows reassociation transformations of floating point instructions, which can mean getting different results from the IEEE case; it breaks `NaN` and `Inf` handling, uses approximate modes of calculating certain functions such as `log`, `sin`, and `sqrt`, and makes [other concessions to standard practice](https://llvm.org/docs/LangRef.html#fast-math-flags) to boost performance. In most real world applications the IEEE standard rather than fast math would be used. But, as with other compiler safety features such as bounds checking, once production-ready code has been created, fast math can be an added boost to performance, provided the analyst is sure, with tests and perhaps proofs, that it will not adversely affect the results.

While preparing the code for this article, the Chapel, D, and Julia communities were very helpful and patient with all enquiries, so they are acknowledged here.

In terms of bias, going in I was much more familiar with D and Julia than I was with Chapel. However, getting the best performance from each language required a lot of interaction with each programming community, and I have done my best to be aware of my biases and to correct for them. If the reader has any issue with the way this analysis has been conducted, they can raise it at the [GitHub repository](https://github.com/dataPulverizer/KernelMatrixBenchmark) where the code to carry out the calculation is located.

## Language Benchmarks for Kernel Matrix Calculation

The [above chart](https://github.com/dataPulverizer/KernelMatrixBenchmark/blob/master/charts/charts.r) shows the benchmark time taken in seconds (log scale) against the number of items (`n` as above) for nine kernels, all executed in Chapel, D, and Julia with IEEE mathematics calculations. The chart below shows the same benchmark repeated using the fast math calculations in each language.

In the IEEE floating point case, Julia performs better than D and Chapel in all but the `log` and `power` kernels, where D performs better. When fast math is used, the performance of Julia falls behind Chapel and D in all but the power kernel benchmark, where D performs best and Julia second. Chapel and D show very similar performance in all but the `log` and `power` benchmarks.

In the interest of transparency, the mathematics functions used in D were pulled from C's math library, made available in the D compiler through its [`core.stdc.math`](https://dlang.org/library/core/stdc/math.html) module. This was done because the mathematical functions in D's standard library [`std.math`](https://dlang.org/phobos/std_math.html) can be slow. The math functions used are given in the import statement of the [script for calculating kernel functions](https://github.com/dataPulverizer/KernelMatrixBenchmark/blob/master/d/kernel.d). By way of comparison, consider the [mathdemo.d](https://github.com/dataPulverizer/KernelMatrixBenchmark/blob/master/d/mathdemo.d) script comparing the imported C `log` function with D's `log` function from `std.math`:

```bash
$ ldc2 -O --boundscheck=off --mcpu=native mathdemo.d && ./mathdemo
Time taken for c log: 0.58623 seconds.
Time taken for d log: 2.3747 seconds.
```

#### Suitability of Matrix object used

The Matrix object used in the D benchmark was implemented specifically because the use of modules outside language standard libraries was disallowed for this article (discussed later). To make sure that this implementation is competitive, i.e. does not unfairly represent D's performance, it is compared to Mir's ndslice library, also written in D. The chart below shows the difference in execution times of the kernel matrix calculation between the Matrix implementation and ndslice, as a percentage of Matrix's kernel benchmark running time. Negative values mean that ndslice is slower and positive values mean that ndslice is faster. Performance across the kernels is about the same: sometimes ndslice is slightly faster and at other times slightly slower, so the Matrix object used is a fair representation of D's performance.

## Environment

The code was run on a computer with an Ubuntu 20.04 OS, 32 GB memory and an Intel® Core™ i9-8950HK CPU @ 2.90GHz with 6 cores and 12 threads.

```bash
$ julia --version
julia version 1.4.1
```

```bash
$ dmd --version
DMD64 D Compiler v2.090.1
```

```bash
$ ldc2 --version
LDC - the LLVM D compiler (1.18.0):
  based on DMD v2.088.1 and LLVM 9.0.0
```

```bash
$ chpl --version
chpl version 1.22.0
```

### Compilation

Compilation is done with scripts: see the `script.sh` file in each language folder and the `script.sh` script in the [home folder](https://github.com/dataPulverizer/KernelMatrixBenchmark) of the repository.

## Implementations

Efforts were made to avoid non-standard libraries while implementing these kernel functions. The reasons for this are:

* It is completely transparent and shows how each language works. This article is about the programming languages and what the reader can expect in terms of performance and other factors discussed later. Packages can sometimes give a false impression of what using a programming language is like.
* Packages outside standard libraries can go extinct, so avoiding external libraries keeps the article and code relevant.
* It makes it easy for a reader, after installing the language, to copy and run the code. Having to install external libraries can be a bit of a "faff".

### Chapel

Chapel uses a `forall` loop to parallelize over threads, with `guided` iteration over the indices. Array rows and columns are usually indexed in a user-friendly manner, but here they are accessed using pointers, which can be a way of boosting performance.

```chpl
proc calculateKernelMatrix(K, data: [?D] ?T)
{
  var n = D.dim(0).last;
  var p = D.dim(1).last;
  var E: domain(2) = {D.dim(0), D.dim(0)};
  var mat: [E] T;
  var rowPointers: [1..n] c_ptr(T) =
    forall i in 1..n do c_ptrTo(data[i, 1]);

  forall j in guided(1..n by -1) {
    for i in j..n {
      mat[i, j] = K.kernel(rowPointers[i], rowPointers[j], p);
      mat[j, i] = mat[i, j];
    }
  }
  return mat;
}
```

### D

D uses a `taskPool` of threads from its `std.parallelism` package to parallelize code. The D code underwent the least amount of change for performance optimization; a lot of the performance benefit came from the specific compiler used and the flags selected (discussed later). The implementation of `Matrix` allows columns to be selected by reference with `refColumnSelect`.

```d
auto calculateKernelMatrix(alias K, T)(K!(T) kernel, Matrix!(T) data)
{
  long n = data.ncol;
  auto mat = Matrix!(T)(n, n);

  foreach(j; taskPool.parallel(iota(n)))
  {
    auto arrj = data.refColumnSelect(j).array;
    foreach(long i; j..n)
    {
      mat[i, j] = kernel(data.refColumnSelect(i).array, arrj);
      mat[j, i] = mat[i, j];
    }
  }
  return mat;
}
```
### Julia

The Julia code uses the `@threads` macro for parallelising the code and the `@views` macro for referencing arrays. One confusing thing about Julia's arrays is their reference status. Sometimes, as in this case, arrays behave like value objects and have to be referenced using the `@views` macro, otherwise they generate copies; at other times they behave like reference objects, for example when passing them into a function. It can be a little tricky dealing with this because it is not always obvious which set of operations will generate a copy, but where this occurs `@views` provides a good solution.
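A small sketch of the difference (not from the benchmark code):

```jl
data = rand(Float32, 784, 10)
a = data[:, 1]           # a right-hand-side slice allocates a copy
b = @views data[:, 1]    # a SubArray view, no copy is made
a[1] = 0.0f0             # leaves data untouched
b[1] = 0.0f0             # writes through to data[1, 1]
```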
Julia also has a `Symmetric` matrix type, which means that allocating to both sides of the matrix is not necessary.
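A minimal sketch of how `Symmetric` behaves; only the chosen triangle is stored as data, and the other side is mirrored from it:

```jl
using LinearAlgebra
A = [1.0 9.0;
     5.0 2.0]
S = Symmetric(A, :L)  # wrap A, reading only the lower triangle
S[1, 2]               # returns 5.0, mirrored from S[2, 1]; the 9.0 is ignored
```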
```jl
function calculateKernelMatrix(Kernel::K, data::Array{T}) where {K <: AbstractKernel,T <: AbstractFloat}
  n = size(data)[2]
  mat = zeros(T, n, n)
  @threads for j in 1:n
      @views for i in j:n
          mat[i,j] = kernel(Kernel, data[:, i], data[:, j])
      end
  end
  return Symmetric(mat, :L)
end
```

The `@inbounds` and `@simd` macros in the kernel functions were used to turn bounds checking off and to apply SIMD optimization to the calculations:

```jl
struct DotProduct <: AbstractKernel end
@inline function kernel(K::DotProduct, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T,N}
  ret = zero(T)
  m = length(x)
  @inbounds @simd for k in 1:m
      ret += x[k] * y[k]
  end
  return ret
end
```

These optimizations are quite visible but easy to apply. Note that Julia also has the `@fastmath` macro for applying fast math to individual lines or blocks of code, though only the command line option was used in this analysis.
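As a sketch of what that macro looks like in use (the function below is hypothetical, not from the benchmark):

```jl
# @fastmath relaxes IEEE semantics inside this loop, permitting reassociation
# and similar transformations, so the result may differ slightly from sum(x).
function fmsum(x)
    s = zero(eltype(x))
    @fastmath for v in x
        s += v
    end
    return s
end
```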
## Memory Usage

The total time for each benchmark and the memory used were captured using the `/usr/bin/time -v` command. The output for each of the languages is given below.

The complete calculation in Chapel took the longest amount of time to execute but consumed a moderate amount of memory (nearly 9 GB peak):

```
Command being timed: "./script --verbose=true --fastmath=false"
User time (seconds): 114342.55
System time (seconds): 17.96
Percent of CPU this job got: 1192%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:39:53
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 9266328
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2315637
Voluntary context switches: 625
Involuntary context switches: 3419118
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

D consumed the most memory (around 20 GB peak) but took the least amount of time to execute:

```
Command being timed: "./script"
User time (seconds): 69089.75
System time (seconds): 41.91
Percent of CPU this job got: 1181%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:37:29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 20458972
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 15393443
Voluntary context switches: 4884
Involuntary context switches: 2222841
Swaps: 0
File system inputs: 8
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

Julia consumed the least memory (around 7.5 GB peak) and ran the second fastest in total:

```
Command being timed: "julia script.jl data true"
User time (seconds): 49163.19
System time (seconds): 32.08
Percent of CPU this job got: 717%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7501592
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 804
Minor (reclaiming a frame) page faults: 38021184
Voluntary context switches: 2657
Involuntary context switches: 477363
Swaps: 0
File system inputs: 368240
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

## Performance optimization

The process of performance optimization was very different in all three languages, and all three communities were very helpful in the process. But there were some common themes:

* Static dispatching of kernel functions instead of runtime polymorphism. When passing the kernel function, use parametric (static, compile time) polymorphism rather than runtime (dynamic) polymorphism, where dispatch with virtual functions carries a performance penalty.
* Using views/references rather than copying data over multiple threads makes a big difference.
* Parallelising the calculations makes a huge difference.
* Knowing whether your array is row or column major, and using that in your calculation, makes a huge difference (see the sketch after this list).
* Bounds checks and compiler optimizations make a huge difference, especially in Chapel and D.
* Enabling SIMD in D and Julia contributed to the performance. In D this was done using the `-mcpu=native` flag, and in Julia using the `@simd` macro.
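To illustrate the row/column major point: Julia's arrays, like the D `Matrix` implemented here, are column major, so the index that walks down a column should sit in the innermost loop. A sketch (illustrative function, not benchmark code):

```jl
# Column-major traversal: the row index i moves through contiguous memory.
function columnSums(A::Matrix{Float64})
    out = zeros(size(A, 2))
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        out[j] += A[i, j]
    end
    return out
end
```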
In terms of language specific issues, getting to performant code in Chapel was the most challenging, and the Chapel code changed the most, from easy-to-read array operations to pointers and guided iterations. But on the compiler side it was relatively easy to add `--fast` and get a large performance boost.

In D the code changed very little, and most of the performance was gained from the compiler used and the optimization flags. D's LDC compiler is rich in options for performance optimization. It has 8 `-O` optimization levels, though some are repetitions of others, for instance `-O`, `-O3`, and `-O5` are identical, and there is a myriad of other flags that affect performance in various ways. In this case the flags used were `-O5 --boundscheck=off --ffast-math`, representing aggressive compiler optimizations, disabled bounds checking, and LLVM's fast math, plus `-mcpu=native` to enable CPU vectorization instructions.

In Julia, the macro changes discussed previously markedly improved the performance while not being too intrusive.

## Quality of life

This section examines the relative pros and cons of each language around convenience and ease of use. People underestimate the effort it takes to use a language day to day; the support and infrastructure required is substantial, so it is worth comparing various facets of each language. Readers seeking to avoid the TLDR should scroll to the end of this section for the table comparing the language features discussed here. Every effort has been made to be as objective as possible, but comparing programming languages is difficult, bias prone, and contentious, so read this section with that in mind. Some elements looked at, such as arrays, are from the "data science"/technical/scientific computing point of view, and others are more general.

### Interactivity

Programmers want a fast code/compile/result loop during development to quickly observe results and outputs in order to make progress or necessary changes. Julia's interpreter is hands down the best for this and offers a smooth and feature-rich development experience; D comes a close second. The code/compile/result loop can be slow even when compiling small volumes of code. D has three compilers: the standard DMD compiler, the LLVM-based LDC compiler, and the GCC-based GDC. In this development process, the DMD and LDC compilers were used. DMD has **very** fast compilation times, which is great for development. The LDC compiler is great at creating **fast** code. Chapel's compiler is very slow in comparison. To give an example, running Linux's `time` command on DMD versus Chapel's compiler for the kernel matrix code with no optimizations gives us, for D:

```
real    0m0.545s
user    0m0.447s
sys     0m0.101s
```

Compared with Chapel:

```
real    0m5.980s
user    0m5.787s
sys     0m0.206s
```

That's a large actual and *psychological* difference. It can make programmers reluctant to check their work and can delay the development loop if they have to wait for outputs, especially as source code increases in volume and compilation times become significant.

It is worth mentioning, however, that when developing packages in Julia compilation times can be very long, and users have noticed that loading some packages can stretch compilation times, so the development loop experience in Julia can vary; in this specific case the process was seamless.

### Documentation and examples

One way of comparing documentation in the different languages is to compare them all with Python's official documentation, which is *the* gold standard for programming languages: it combines examples with formal definitions and tutorials in a seamless and user friendly way. Since many programmers are familiar with the Python documentation, this approach gives an idea of how the three compare.

Julia's documentation is the closest to Python's in quality and gives the user a very smooth, detailed, and relatively painless transition into the language; it also has a rich ecosystem of blogs, and topics on many aspects of the language are easy to come by. D's official documentation is not as good and can be challenging and frustrating. However, there is a *very* good free book, ["Programming in D"](https://wiki.dlang.org/Books), which is a great introduction to the language; but no single book can cover a programming language, and there are not many sources for advanced topics. Chapel's documentation is quite good for getting things done, though examples vary in presence and quality, and the programmer often needs a lot of prior knowledge to look in the right place. A good topic for comparison is the file i/o libraries of Chapel, D, and Julia. Chapel's i/o library has too few examples but is relatively clear and straightforward; D's i/o is spread across a few modules and its documentation is more difficult to follow; Julia's i/o documentation has lots of examples and is clear and easy to follow.

Perhaps one factor affecting Chapel's adoption is its lack of examples: since its arrays have a non-standard interface, the user has to work hard to become familiar with them. Whereas, even though D's documentation may not be as good in places, the language has many similarities to C/C++ and so gets away with sparser documentation.

### Multi-dimensional Array support

"Arrays" here refer not to the native C/C++ style arrays available in D, but to mathematical arrays. Julia and Chapel ship with array support; D does not, but it has [Mir](http://docs.algorithm.dlang.io/latest/mir_ndslice.html), which provides multidimensional arrays (ndslice). In the implementation of the kernel matrix I wrote my own matrix object in D, which is not difficult if you understand the principle, but it is not something a user wants to do. However, D has a linear algebra library called [Lubeck](https://github.com/kaleidicassociates/lubeck), which has impressive performance characteristics and interfaces with all the usual BLAS implementations. Julia's arrays are by far the easiest and most familiar. Chapel's arrays are more difficult to get started with than Julia's, but they are designed to run on single core, multicore, and computer clusters using the same or very similar code, which is a good unique selling point.

### Language power

Since Julia is a dynamic programming language, some might say, "well, Julia is a dynamic language, which is far more permissive than static programming languages, therefore the debate is over", but it's more complicated than that. There is power in static type systems. Julia has a type system similar in nature to those of static languages, so you can write code as if you were using a static language, but you can also do things reserved for dynamic languages. It has a highly developed generic and meta-programming syntax and powerful macros, as well as a highly flexible object system and multiple dispatch. This mix of features is what makes Julia the most powerful language of the three.

D was intended to be a replacement for C++ and takes very much after C++ (it also borrows from Java), but it makes template programming and compile time evaluation much more user friendly than in C++. It is a single dispatch language (though multi-methods are available in a package); instead of macros, D has string and template "mixins", which serve a similar purpose.

Chapel has generic programming support and nascent support for single dispatch OOP, has no macro support, and is not yet as mature as D or Julia in these terms.

### Concurrency & Parallel Programming

Nowadays, new languages tout support for concurrency and its popular subset, parallelism, but the details vary a lot between languages. Parallelism is more relevant in this example, and all three languages deliver. Writing parallel for loops is straightforward in all three languages.

Chapel's concurrency model puts much more emphasis on data parallelism, but it has tools for task parallelism and ships with support for cluster-based concurrency.

Julia has good support for both concurrency and parallelism.

D has industry strength support for parallelism and concurrency, though its support for threading is much less well documented with examples.

### Standard Library

How good are the standard libraries of the three languages in general? What range of tasks do they allow users to easily tend to? It's a tough question because library quality and documentation both factor in. All three languages have very good standard libraries. D has the most comprehensive standard library, Julia is a great second, and then Chapel, but things are never that simple. For example, a user seeking to write binary i/o may find Julia the easiest to start with: it has the most straightforward, clear interface and documentation, followed by Chapel and then D, and Julia code is easy to write for cases unavailable in the other two languages.
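As a quick sketch of the binary i/o point in Julia (the file name is hypothetical):

```jl
# Write a vector of Float64 as raw bytes, then read it back.
v = rand(Float64, 100)
open("v.bin", "w") do io
    write(io, v)
end
w = Vector{Float64}(undef, 100)
open("v.bin", "r") do io
    read!(io, w)
end
@assert v == w
```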
### Package Managers & Package Ecosystems

In terms of documentation, usage, and features, D's Dub package manager is the most comprehensive. D also has a rich package ecosystem on the [Dub website](https://code.dlang.org/). Julia's package manager is tightly integrated with GitHub and is a good package system with good documentation. Chapel has a package manager but does not yet have a highly developed package ecosystem.

### C Integration

C interop is easy in all three languages; Chapel's has good documentation but is not as well popularised as the others. D's documentation is better, and Julia's documentation is the most comprehensive. Oddly enough, though, none of the language documentations show the commands required to compile your own C code and integrate it with the language, which is an oversight, especially when it comes to novices. It is, however, easy to search for and find examples of the compilation process for D and Julia.
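For instance, the missing step for Julia looks roughly like the following; the file and function names here are hypothetical:

```jl
# Given a file mylog.c containing:
#   #include <math.h>
#   double mylog(double x){ return log(x); }
# first compile it to a shared library, e.g.:
#   gcc -O2 -shared -fPIC -o libmylog.so mylog.c
# then call it directly from Julia:
x = ccall((:mylog, "./libmylog.so"), Cdouble, (Cdouble,), 10.0)
```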
### Community

All three languages have convenient places where users can ask questions. For Chapel, the easiest place is Gitter; for Julia it's Discourse (though there is a Julia Gitter); and for D it's the official website forum. The Julia community is the most active, followed by D and then Chapel. I've found that you'll get good responses from all three communities, but you'll probably get quicker answers from the D and Julia communities.

|                              | Chapel   | D                                | Julia    |
| ---------------------------- |:--------:|:--------------------------------:| --------:|
| Compilation/Interactivity    | Slow     | Fast                             | Best     |
| Documentation & Examples     | Detailed | Patchy                           | Best     |
| Multi-dimensional Arrays     | Yes      | Native Only<br>(library support) | Yes      |
| Language Power               | Good     | Great                            | Best     |
| Concurrency & Parallelism    | Great    | Great                            | Good     |
| Standard Library             | Good     | Great                            | Great    |
| Package Manager & Ecosystem  | Nascent  | Best                             | Great    |
| C Integration                | Great    | Great                            | Great    |
| Community                    | Small    | Vibrant                          | Largest  |

Table for quality of life features in Chapel, D & Julia

## Summary

If you are a novice programmer writing numerical algorithms and doing scientific computing, and you want a fast language that's easy to use, Julia is your best bet. If you are an experienced programmer working in the same space, Julia is still a great option. If you need an "industrial strength" statically compiled high performance language with all the "bells and whistles", but want something more productive, safer, and less painful than C++, then D is your best bet. You can write "anything" in D and get great performance from its compilers. If you need to run array calculations on large clusters while avoiding the pain of writing MPI C++ code, then Chapel is probably the best place to go.

In terms of performance on this task, for IEEE math Julia is the winner, performing better in 7 of the 9 kernels. For applications that use fast math, D and Chapel performed better than Julia in 7 of the 9 kernels, and D performed best in the other 2 kernels (`log` and `power`) in both IEEE and fast math modes. This exercise shows that Julia's label as a high performance language is more than just hype; it has held its own against highly competitive languages. Chapel's and D's performance was very similar to each other in both the IEEE and fast math modes of calculation.

-------------------------------------------------------------------------------- /chapel/fmtime.log: -------------------------------------------------------------------------------- 1 | Command being timed: "./script --verbose=true --fastmath=true" 2 | User time (seconds): 105423.83 3 | System time (seconds): 17.24 4 | Percent of CPU this job got: 1192% 5 | Elapsed (wall clock) time (h:mm:ss or m:ss): 2:27:24 6 | Average shared text size (kbytes): 0 7 | Average unshared data size (kbytes): 0 8 | Average stack size (kbytes): 0 9 | Average total size (kbytes): 0 10 | Maximum resident set size (kbytes): 9266032 11 | Average resident set size (kbytes): 0 12 | Major (requiring I/O) page faults: 0 13 | Minor (reclaiming a frame) page faults: 2315605 14 | Voluntary context switches: 460 15 | Involuntary context switches: 3139168 16 | Swaps: 0 17 | File system inputs: 0 18 | File system outputs: 8 19 | Socket messages sent: 0 20 | Socket messages received: 0 21 | Signals delivered: 0 22 | Page size (bytes): 4096 23 | Exit status: 0 24 | 25 | -------------------------------------------------------------------------------- /chapel/kernel.chpl: -------------------------------------------------------------------------------- 1 | use CPtr; 2 | use Math; 3 | 4 | record DotProduct { 5 | type T; 6 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 7 | { 8 | var dist: T = 0: T; 9 | for i in 0..#p { 10 | dist += xrow[i] * yrow[i]; 11 | } 12 | return dist; 13 | } 14 | } 15 | 16 | record Gaussian { 17 | type T; 18 | const theta: T; 19 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 20 | { 21 | var dist: T = 0: T; 22 | for i in 0..#p { 23 | var tmp = xrow[i] - yrow[i]; 24 | dist += tmp * tmp; 25 | } 26 | return exp(-sqrt(dist)/this.theta); 27 | } 28 | } 29 | 30 | record Polynomial { 31 | type T; 32 | const d: T; 33 |
const offset: T; 34 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 35 | { 36 | var dist: T = 0: T; 37 | for i in 0..#p { 38 | dist += xrow[i]*yrow[i]; 39 | } 40 | return (dist + this.offset)**this.d; 41 | } 42 | } 43 | 44 | record Exponential { 45 | type T; 46 | const theta: T; 47 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 48 | { 49 | var dist: T = 0: T; 50 | for i in 0..#p { 51 | dist -= abs(xrow[i] - yrow[i]); 52 | } 53 | return exp(dist/this.theta); 54 | } 55 | } 56 | 57 | record Log { 58 | type T; 59 | const beta: T; 60 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 61 | { 62 | var dist: T = 0: T; 63 | for i in 0..#p { 64 | dist += abs(xrow[i] - yrow[i])**this.beta; 65 | } 66 | dist = dist**(1/this.beta); 67 | return -log(1 + dist); 68 | } 69 | } 70 | 71 | record Cauchy { 72 | type T; 73 | const theta: T; 74 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 75 | { 76 | var dist: T = 0: T; 77 | for i in 0..#p { 78 | var tmp = xrow[i] - yrow[i]; 79 | dist += tmp * tmp; 80 | } 81 | dist = sqrt(dist)/this.theta; 82 | return 1/(1 + dist); 83 | } 84 | } 85 | 86 | record Power { 87 | type T; 88 | const beta: T; 89 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 90 | { 91 | var dist: T = 0: T; 92 | for i in 0..#p { 93 | dist += abs(xrow[i] - yrow[i])**this.beta; 94 | } 95 | return -dist**(1/this.beta); 96 | } 97 | } 98 | 99 | record Wave { 100 | type T; 101 | const theta: T; 102 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 103 | { 104 | var dist: T = 0: T; 105 | for i in 0..#p { 106 | dist += abs(xrow[i] - yrow[i]); 107 | } 108 | var tmp = this.theta/dist; 109 | return tmp*sin(1/tmp); 110 | } 111 | } 112 | 113 | record Sigmoid { 114 | type T; 115 | const beta0: T; 116 | const beta1: T; 117 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 118 | { 119 | var dist: T = 0: T; 120 | for i in 0..#p { 121 | dist += xrow[i] * yrow[i]; 122 | } 123 | return tanh(this.beta0 * dist + this.beta1); 124 | } 125 | } 126 | 127 | /***************************************************************************/ 128 | use DynamicIters; 129 | proc calculateKernelMatrix(K, data: [?D] ?T) /* : [?E] T */ 130 | { 131 | var n = D.dim(0).last; 132 | var p = D.dim(1).last; 133 | var E: domain(2) = {D.dim(0), D.dim(0)}; 134 | var mat: [E] T; 135 | // code below assumes data starts at 1,1 136 | var rowPointers: [1..n] c_ptr(T) = 137 | forall i in 1..n do c_ptrTo(data[i, 1]); 138 | 139 | forall j in guided(1..n by -1) { 140 | for i in j..n { 141 | mat[i, j] = K.kernel(rowPointers[i], rowPointers[j], p); 142 | mat[j, i] = mat[i, j]; 143 | } 144 | } 145 | return mat; 146 | } 147 | 148 | -------------------------------------------------------------------------------- /chapel/kernelmatrix.chpl: -------------------------------------------------------------------------------- 1 | use CPtr; 2 | use Math; 3 | 4 | record DotProduct { 5 | type T; 6 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 7 | { 8 | var dist: T = 0: T; 9 | for i in 0..#p { 10 | dist += xrow[i] * yrow[i]; 11 | } 12 | return dist; 13 | } 14 | } 15 | 16 | record Gaussian { 17 | type T; 18 | const theta: T; 19 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 20 | { 21 | var dist: T = 0: T; 22 | for i in 0..#p { 23 | var tmp = xrow[i] - yrow[i]; 24 | dist += tmp * tmp; 25 | } 26 | return exp(-sqrt(dist)/this.theta); 27 | } 28 | } 29 | 30 | record Polynomial { 31 | type T; 32 | const d: T; 33 | const offset: T; 34 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 35 | { 36 | var dist: T = 0: T; 37 | for i in 0..#p { 38 
| dist += xrow[i]*yrow[i]; 39 | } 40 | return (dist + this.offset)**this.d; 41 | } 42 | } 43 | 44 | record Exponential { 45 | type T; 46 | const theta: T; 47 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 48 | { 49 | var dist: T = 0: T; 50 | for i in 0..#p { 51 | dist -= abs(xrow[i] - yrow[i]); 52 | } 53 | return exp(dist/this.theta); 54 | } 55 | } 56 | 57 | record Log { 58 | type T; 59 | const beta: T; 60 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 61 | { 62 | var dist: T = 0: T; 63 | for i in 0..#p { 64 | dist += abs(xrow[i] - yrow[i])**this.beta; 65 | } 66 | dist = dist**(1/this.beta); 67 | return -log(1 + dist); 68 | } 69 | } 70 | 71 | record Cauchy { 72 | type T; 73 | const theta: T; 74 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 75 | { 76 | var dist: T = 0: T; 77 | for i in 0..#p { 78 | var tmp = xrow[i] - yrow[i]; 79 | dist += tmp * tmp; 80 | } 81 | dist = sqrt(dist)/this.theta; 82 | return 1/(1 + dist); 83 | } 84 | } 85 | 86 | record Power { 87 | type T; 88 | const beta: T; 89 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 90 | { 91 | var dist: T = 0: T; 92 | for i in 0..#p { 93 | dist += abs(xrow[i] - yrow[i])**this.beta; 94 | } 95 | return -dist**(1/this.beta); 96 | } 97 | } 98 | 99 | record Wave { 100 | type T; 101 | const theta: T; 102 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 103 | { 104 | var dist: T = 0: T; 105 | for i in 0..#p { 106 | dist += abs(xrow[i] - yrow[i]); 107 | } 108 | var tmp = this.theta/dist; 109 | return tmp*sin(1/tmp); 110 | } 111 | } 112 | 113 | record Sigmoid { 114 | type T; 115 | const beta0: T; 116 | const beta1: T; 117 | proc kernel(xrow:c_ptr, yrow:c_ptr(?T), p: int): T 118 | { 119 | var dist: T = 0: T; 120 | for i in 0..#p { 121 | dist += xrow[i] * yrow[i]; 122 | } 123 | return tanh(this.beta0 * dist + this.beta1); 124 | } 125 | } 126 | 127 | /***************************************************************************/ 128 | use DynamicIters; 129 | proc calculateKernelMatrix(K, data: [?D] ?T) /* : [?E] T */ 130 | { 131 | var n = D.dim(0).last; 132 | var p = D.dim(1).last; 133 | var E: domain(2) = {D.dim(0), D.dim(0)}; 134 | var mat: [E] T; 135 | // code below assumes data starts at 1,1 136 | var rowPointers: [1..n] c_ptr(T) = 137 | forall i in 1..n do c_ptrTo(data[i, 1]); 138 | 139 | forall j in guided(1..n by -1) { 140 | for i in j..n { 141 | mat[i, j] = K.kernel(rowPointers[i], rowPointers[j], p); 142 | mat[j, i] = mat[i, j]; 143 | } 144 | } 145 | return mat; 146 | } 147 | 148 | -------------------------------------------------------------------------------- /chapel/script.chpl: -------------------------------------------------------------------------------- 1 | use kernel; 2 | 3 | use IO; 4 | use Time; 5 | use Random; 6 | 7 | config const fastmath: bool = false; 8 | config const verbose: bool = true; 9 | const folder = if fastmath then "fmdata" else "data"; 10 | 11 | record BenchRecord { 12 | var D: domain(1); 13 | var n: [D] int(64); 14 | var times: [D] real(64); 15 | proc init() 16 | { 17 | this.D = {0..1}; 18 | this.n = [0, 1]; 19 | this.times = [0.0, 1.0]; 20 | } 21 | } 22 | 23 | proc bench(type T, Kernel, n: [?D] int(64)) 24 | { 25 | var nitems: int(64) = D.dim(0).last: int(64); 26 | var times: [0..nitems] real(64); 27 | 28 | var result: BenchRecord; 29 | result.D = {0..nitems}; 30 | 31 | for i in 0..nitems { 32 | var _times: [0..2] real(64); 33 | var data: [1..n[i], 1..784] T; 34 | fillRandom(data); 35 | for j in 0..2 { 36 | var sw = new Timer(); 37 | sw.start(); 38 | var mat = 
calculateKernelMatrix(Kernel, data); 39 | sw.stop(); 40 | _times[j] = (sw.elapsed(TimeUnits.microseconds)/1000_000): real(64); 41 | } 42 | times[i] = (_times[0] + _times[1] + _times[2])/3; 43 | if verbose { 44 | writeln("Average time for n = ", n[i], ", ", times[i], " seconds."); 45 | writeln("Detailed times: ", _times); 46 | } 47 | } 48 | result.n = n; 49 | result.times = times; 50 | return result; 51 | } 52 | 53 | proc runKernelBenchmarks(type T, kernels, n: [?D] int(64)) 54 | { 55 | var results: [0..#kernels.size] BenchRecord; 56 | for param i in 0..(kernels.size - 1) { 57 | const kernel = kernels(i); 58 | if verbose { 59 | writeln("\n\nRunning benchmarks for ", kernel.type: string, kernel: string); 60 | } 61 | results[i] = bench(T, kernel, n); 62 | } 63 | return results; 64 | } 65 | 66 | /** 67 | To compile: 68 | chpl script.chpl kernel.chpl --fast && ./script 69 | */ 70 | proc runAllKernelBenchmarks(type T, folder: string) 71 | { 72 | //var n = [100, 500, 1000]; 73 | var n = [1000, 5000, 10000, 20000, 30000]; 74 | 75 | var kernels = (new DotProduct(T), new Gaussian(T, 1: T), new Polynomial(T, 2.5: T, 1: T), 76 | new Exponential(T, 1: T), new Log(T, 3: T), new Cauchy(T, 1: T), 77 | new Power(T, 2.5: T), new Wave(T, 1: T), new Sigmoid(T, 1: T, 1: T)); 78 | var kernelNames = ["DotProduct", "Gaussian", "Polynomial", 79 | "Exponential", "Log", "Cauchy", 80 | "Power", "Wave", "Sigmoid"]; 81 | var results = runKernelBenchmarks(T, kernels, n); 82 | 83 | 84 | var last: int(64) = n.domain.dim(0).last: int(64); 85 | var tabLen = n.size * kernels.size; 86 | var table: [0..tabLen, 0..3] string; 87 | table[0, ..] = ["language,", "kernel,", "nitems,", "time"]; 88 | while (true) 89 | { 90 | var k = 1; 91 | for i in 0..#kernels.size { 92 | var tmp = ["Chapel,", kernelNames[i] + ",", "", ""]; 93 | for j in 0..(n.size - 1) { 94 | tmp[2] = results[i].n[j]: string + ","; 95 | tmp[3] = results[i].times[j]: string; 96 | table[k, ..] 
= tmp; 97 | k += 1; 98 | } 99 | } 100 | if k > tabLen 101 | { 102 | break; 103 | } 104 | } 105 | var file = open("../" + folder + "/chapelBench.csv", iomode.cw); 106 | var _channel = file.writer(); 107 | _channel.write(table); 108 | _channel.close(); 109 | file.close(); 110 | return; 111 | } 112 | 113 | proc main() 114 | { 115 | writeln("folder: ", folder); 116 | runAllKernelBenchmarks(real(32), folder); 117 | } 118 | -------------------------------------------------------------------------------- /chapel/script.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | # Uses "regular" mathematical functions 3 | chpl --fast --ieee-float script.chpl kernel.chpl 4 | /usr/bin/time -v ./script --verbose=true --fastmath=false 5 | # Uses Fast Math 6 | chpl --fast --no-ieee-float script.chpl kernel.chpl 7 | /usr/bin/time -v ./script --verbose=true --fastmath=true 8 | -------------------------------------------------------------------------------- /chapel/time.log: -------------------------------------------------------------------------------- 1 | Command being timed: "./script --verbose=true --fastmath=false" 2 | User time (seconds): 114342.55 3 | System time (seconds): 17.96 4 | Percent of CPU this job got: 1192% 5 | Elapsed (wall clock) time (h:mm:ss or m:ss): 2:39:53 6 | Average shared text size (kbytes): 0 7 | Average unshared data size (kbytes): 0 8 | Average stack size (kbytes): 0 9 | Average total size (kbytes): 0 10 | Maximum resident set size (kbytes): 9266328 11 | Average resident set size (kbytes): 0 12 | Major (requiring I/O) page faults: 0 13 | Minor (reclaiming a frame) page faults: 2315637 14 | Voluntary context switches: 625 15 | Involuntary context switches: 3419118 16 | Swaps: 0 17 | File system inputs: 0 18 | File system outputs: 8 19 | Socket messages sent: 0 20 | Socket messages received: 0 21 | Signals delivered: 0 22 | Page size (bytes): 4096 23 | Exit status: 0 24 | 25 | -------------------------------------------------------------------------------- /charts/benchplot.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dataPulverizer/KernelMatrixBenchmark/28c40acb02497a3052f2d04689ab9c93fd090de2/charts/benchplot.jpg -------------------------------------------------------------------------------- /charts/charts.r: -------------------------------------------------------------------------------- 1 | require(data.table) 2 | require(ggplot2) 3 | require(scales) 4 | 5 | # Plots for language benchmarks 6 | createJPGPlot = function(folder, filename) 7 | { 8 | results = Map(fread, c(paste0("../", folder,"/chapelBench.csv"), 9 | paste0("../", folder,"/dBench.csv"), 10 | paste0("../", folder,"/juliaBench.csv"))) 11 | results = rbindlist(results) 12 | results[, kernel := gsub(" ", "", kernel)] 13 | p = ggplot(results, aes(x = nitems, y = time, color = language)) + geom_line() + 14 | geom_point() + scale_y_continuous(trans = "log10", 15 | labels = trans_format("log10", math_format(10^.x))) + 16 | theme(legend.position="top") + ylab("time (s)") + 17 | xlab("Number Of Items") + facet_wrap(~ kernel, scale = "free_y") 18 | jpeg(file = filename, width = 9, height = 7, units = "in", res = 200) 19 | plot(p) 20 | dev.off() 21 | return(invisible(p)) 22 | } 23 | createSVGPlot = function(folder, filename) 24 | { 25 | results = Map(fread, c(paste0("../", folder,"/chapelBench.csv"), 26 | paste0("../", folder,"/dBench.csv"), 27 | paste0("../", folder,"/juliaBench.csv"))) 28 | 
results = rbindlist(results) 29 | results[, kernel := gsub(" ", "", kernel)] 30 | p = ggplot(results, aes(x = nitems, y = time, color = language)) + geom_line() + 31 | geom_point() + scale_y_continuous(trans = "log10", 32 | labels = trans_format("log10", math_format(10^.x))) + 33 | theme(legend.position="top") + ylab("time (s)") + 34 | xlab("Number Of Items") + facet_wrap(~ kernel, scale = "free_y") 35 | svg(file = filename, width = 9, height = 7) 36 | plot(p) 37 | dev.off() 38 | return(invisible(p)) 39 | } 40 | 41 | createSVGPlot("data", "benchplot.svg") 42 | createSVGPlot("fmdata", "fmbenchplot.svg") 43 | 44 | createJPGPlot("data", "benchplot.jpg") 45 | createJPGPlot("fmdata", "fmbenchplot.jpg") 46 | 47 | 48 | # Difference between NDSlice and My basic matrix implementation 49 | createJPGNDSlicePlot = function(filename) 50 | { 51 | results = Map(fread, c("../data/dNDSliceBench.csv", "../data/dBench.csv")) 52 | results[[1]][, language := "NDSlice"] 53 | results[[2]][, time := 100*(time - results[[1]][,time])/time] 54 | results = results[[2]] 55 | 56 | p = ggplot(results, aes(x = nitems, y = time, fill = kernel)) + geom_col() + 57 | theme(legend.position="none", plot.title = element_text(hjust = 0.5)) + 58 | ylab("% Difference\n(+ive = ndslice is faster)") + 59 | xlab("Number Of Items") + facet_wrap(~ kernel, scale = "free_y") + 60 | ggtitle("Matrix and NDSlice percentage time difference") 61 | 62 | jpeg(file = filename, width = 7, height = 7, units = "in", res = 200) 63 | plot(p) 64 | dev.off() 65 | return(invisible(p)) 66 | } 67 | # "ndsliceDiagnostic.jpg" 68 | createSVGNDSlicePlot = function(filename) 69 | { 70 | results = Map(fread, c("../data/dNDSliceBench.csv", "../data/dBench.csv")) 71 | results[[1]][, language := "NDSlice"] 72 | results[[2]][, time := 100*(time - results[[1]][,time])/time] 73 | results = results[[2]] 74 | 75 | p = ggplot(results, aes(x = nitems, y = time, fill = kernel)) + geom_col() + 76 | theme(legend.position="none", plot.title = element_text(hjust = 0.5)) + 77 | ylab("% Difference\n(+ive = ndslice is faster)") + 78 | xlab("Number Of Items") + facet_wrap(~ kernel, scale = "free_y") + 79 | ggtitle("Matrix and NDSlice percentage time difference") 80 | 81 | svg(file = filename, width = 7, height = 7) 82 | plot(p) 83 | dev.off() 84 | return(invisible(p)) 85 | } 86 | 87 | createJPGNDSlicePlot("ndsliceDiagnostic.jpg") 88 | createSVGNDSlicePlot("ndsliceDiagnostic.svg") 89 | -------------------------------------------------------------------------------- /charts/fmbenchplot.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dataPulverizer/KernelMatrixBenchmark/28c40acb02497a3052f2d04689ab9c93fd090de2/charts/fmbenchplot.jpg -------------------------------------------------------------------------------- /charts/ndsliceDiagnostic.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dataPulverizer/KernelMatrixBenchmark/28c40acb02497a3052f2d04689ab9c93fd090de2/charts/ndsliceDiagnostic.jpg -------------------------------------------------------------------------------- /d/arrays.d: -------------------------------------------------------------------------------- 1 | /* 2 | This module contains implementations for vectors and matrices an altered version from my glmsolverd package 3 | */ 4 | 5 | module arrays; 6 | 7 | import std.conv: to; 8 | import std.format: format; 9 | import std.traits: isFloatingPoint, isIntegral, isNumeric; 10 | import 
std.algorithm: min, max; 11 | import std.math: modf; 12 | import core.memory: GC; 13 | import core.stdc.stdlib: malloc, free; 14 | import std.stdio: writeln; 15 | import std.random; 16 | import std.parallelism; 17 | import std.range : iota; 18 | 19 | /********************************************* Printer Utility Functions *********************************************/ 20 | auto getRange(T)(const(T[]) data) 21 | if(isFloatingPoint!T) 22 | { 23 | real[2] range = [cast(real)data[0], cast(real)data[0]]; 24 | foreach(el; data) 25 | { 26 | range[0] = min(range[0], el); 27 | range[1] = max(range[1], el); 28 | } 29 | return range; 30 | } 31 | string getFormat(real[] range, long maxLength = 8, long gap = 2) 32 | { 33 | writeln("range: ", range); 34 | string form = ""; 35 | if((range[0] > 0.01) & (range[1] < 1000_000)) 36 | { 37 | form = "%" ~ to!(string)(gap + 2 + maxLength) ~ "." ~ to!(string)(maxLength) ~ "g"; 38 | }else if((range[0] < 0.0001) | (range[1] > 1000_000)) 39 | { 40 | form = "%" ~ to!(string)(gap + 1 + maxLength) ~ "." ~ to!(string)(4) ~ "g"; 41 | } 42 | return form; 43 | } 44 | /********************************************* Matrix Class *********************************************/ 45 | 46 | /* 47 | Faster Array Creation 48 | */ 49 | auto newArray(T)(long n) 50 | { 51 | auto data = (cast(T*)GC.malloc(T.sizeof*n, GC.BlkAttr.NO_SCAN))[0..n]; 52 | //auto data = new T[](n); 53 | if(data == null) 54 | assert(0, "Array Allocation Failed!"); 55 | return data; 56 | } 57 | 58 | /* 59 | Matrix will be column major 60 | */ 61 | mixin template MatrixGubbings(T) 62 | { 63 | private: 64 | T[] data; 65 | long[] dim; 66 | 67 | public: 68 | this(T[] _data, long rows, long cols) 69 | { 70 | assert(rows*cols == _data.length, 71 | "dimension of matrix inconsistent with length of array"); 72 | data = _data; dim = [rows, cols]; 73 | } 74 | this(long n, long m) 75 | { 76 | long _len = n*m; 77 | data = newArray!(T)(_len); 78 | dim = [n, m]; 79 | } 80 | this(T[] _data, long[] _dim) 81 | { 82 | long tlen = _dim[0]*_dim[1]; 83 | assert(tlen == _data.length, 84 | "dimension of matrix inconsistent with length of array"); 85 | data = _data; dim = _dim; 86 | } 87 | this(Matrix!(T) mat) 88 | { 89 | data = mat.data.dup; 90 | dim = mat.dim.dup; 91 | } 92 | @property Matrix!(T) dup() const 93 | { 94 | return Matrix!(T)(data.dup, dim.dup); 95 | } 96 | T opIndex(long i, long j) const 97 | { 98 | return data[dim[0]*j + i]; 99 | } 100 | void opIndexAssign(T x, long i, long j) 101 | { 102 | data[dim[0]*j + i] = x; 103 | } 104 | T opIndexOpAssign(string op)(T x, long i, long j) 105 | { 106 | static if((op == "+") | (op == "-") | (op == "*") | (op == "/") | (op == "^^")) 107 | mixin("return data[dim[0]*j + i] " ~ op ~ "= x;"); 108 | else static assert(0, "Operator \"" ~ op ~ "\" not implemented"); 109 | } 110 | Matrix!(T) opBinary(string op)(Matrix!(T) x) 111 | { 112 | assert( data.length == x.array.length, 113 | "Number of rows and columns in matrices not equal."); 114 | long n = data.length; 115 | auto ret = Matrix!(T)(dim[0], dim[1]); 116 | static if((op == "+") | (op == "-") | (op == "*") | (op == "/") | (op == "^^")) 117 | { 118 | for(long i = 0; i < n; ++i) 119 | { 120 | mixin("ret.array[i] = " ~ "data[i] " ~ op ~ " x.array[i];"); 121 | } 122 | }else static assert(0, "Operator \"" ~ op ~ "\" not implemented"); 123 | return ret; 124 | } 125 | Matrix!(T) opBinary(string op)(T rhs) 126 | { 127 | ulong n = data.length; 128 | Matrix!(T) ret = Matrix!(T)(dim[0], dim[1]); 129 | static if((op == "+") | (op == "-") | (op == 
"*") | (op == "/") | (op == "^^")) 130 | { 131 | for(ulong i = 0; i < n; ++i) 132 | { 133 | mixin("ret.array[i] = " ~ "data[i] " ~ op ~ " rhs;"); 134 | } 135 | }else static assert(0, "Operator \"" ~ op ~ "\" not implemented"); 136 | return ret; 137 | } 138 | Matrix!(T) opBinaryRight(string op)(T lhs) 139 | { 140 | long n = data.length; 141 | Matrix!(T) ret = Matrix!(T)(dim[0], dim[1]); 142 | static if((op == "+") | (op == "-") | (op == "*") | (op == "/") | (op == "^^")) 143 | { 144 | for(long i = 0; i < n; ++i) 145 | { 146 | mixin("ret.array[i] = " ~ "lhs " ~ op ~ " data[i];"); 147 | } 148 | }else static assert(0, "Operator \"" ~ op ~ "\" not implemented"); 149 | return ret; 150 | } 151 | void opOpAssign(string op)(Matrix!(T) x) 152 | { 153 | assert( data.length == x.array.length, 154 | "Number of rows and columns in matrices not equal."); 155 | long n = data.length; 156 | static if((op == "+") | (op == "-") | (op == "*") | (op == "/") | (op == "^^")) 157 | { 158 | for(long i = 0; i < n; ++i) 159 | { 160 | mixin("data[i] " ~ op ~ "= x.array[i];"); 161 | } 162 | }else static assert(0, "Operator \"" ~ op ~ "\" not implemented"); 163 | } 164 | /* mat "op"= rhs */ 165 | void opOpAssign(string op)(T rhs) 166 | { 167 | long n = data.length; 168 | static if((op == "+") | (op == "-") | (op == "*") | (op == "/") | (op == "^^")) 169 | { 170 | for(long i = 0; i < n; ++i) 171 | { 172 | mixin("data[i] " ~ op ~ "= rhs;"); 173 | } 174 | }else static assert(0, "Operator \"" ~ op ~ "\" not implemented"); 175 | } 176 | @property long nrow() const 177 | { 178 | return dim[0]; 179 | } 180 | @property long ncol() const 181 | { 182 | return dim[1]; 183 | } 184 | @property T[] array() 185 | { 186 | return data; 187 | } 188 | @property long len() const 189 | { 190 | return data.length; 191 | } 192 | @property long length() const 193 | { 194 | return data.length; 195 | } 196 | @property size() const 197 | { 198 | return dim.dup; 199 | } 200 | /* Returns transposed matrix (duplicated) */ 201 | Matrix!(T) t() const 202 | { 203 | auto _data = data.dup; 204 | long[] _dim = new long[2]; 205 | _dim[0] = dim[1]; _dim[1] = dim[0]; 206 | if((dim[0] == 1) & (dim[1] == 1)){ 207 | } else if(dim[0] != dim[1]) { 208 | for(long j = 0; j < dim[1]; ++j) 209 | { 210 | for(long i = 0; i < dim[0]; ++i) 211 | { 212 | _data[_dim[0]*i + j] = data[dim[0]*j + i]; 213 | } 214 | } 215 | } else if(dim[0] == dim[1]) { 216 | for(long j = 0; j < dim[1]; ++j) 217 | { 218 | for(long i = 0; i < dim[0]; ++i) 219 | { 220 | if(i == j) 221 | continue; 222 | _data[_dim[0]*i + j] = data[dim[0]*j + i]; 223 | } 224 | } 225 | } 226 | return Matrix!(T)(_data, _dim); 227 | } 228 | 229 | /* Appends Vector to the END of the matrix */ 230 | void appendColumn(T[] rhs) 231 | { 232 | assert(rhs.length == nrow, 233 | "Vector is not of the same length as number of rows."); 234 | data ~= rhs; 235 | dim[1] += 1; 236 | return; 237 | } 238 | void appendColumn(Matrix!(T) rhs) 239 | { 240 | assert((rhs.nrow == 1) | (rhs.ncol == 1), 241 | "Matrix does not have 1 row or 1 column"); 242 | appendColumn(rhs.array); 243 | } 244 | void appendColumn(T _rhs) 245 | { 246 | auto rhs = newArray!(T)(nrow); 247 | rhs[] = _rhs; 248 | appendColumn(rhs); 249 | } 250 | /* Prepends Column Vector to the START of the matrix */ 251 | void prependColumn(T[] rhs) 252 | { 253 | assert(rhs.length == nrow, 254 | "Vector is not of the same length as number of rows."); 255 | data = rhs ~ data; 256 | dim[1] += 1; 257 | return; 258 | } 259 | void prependColumn(Matrix!(T) rhs) 260 | { 261 | 
assert((rhs.nrow == 1) | (rhs.ncol == 1), 262 | "Matrix does not have 1 row or 1 column"); 263 | prependColumn(rhs.array); 264 | } 265 | void prependColumn(T _rhs) 266 | { 267 | auto rhs = newArray!(T)(nrow); 268 | rhs[] = _rhs; 269 | prependColumn(rhs); 270 | } 271 | /* Contiguous column select copies the column */ 272 | auto columnSelect(long start, long end) 273 | { 274 | assert(end > start, "Starting column is not less than end column"); 275 | long nCol = end - start; 276 | long _len = nrow * nCol; 277 | auto arr = newArray!(T)(_len); 278 | auto startIndex = start*nrow; 279 | long iStart = 0; 280 | for(long i = 0; i < nCol; ++i) 281 | { 282 | arr[iStart..((iStart + nrow))] = data[startIndex..(startIndex + nrow)]; 283 | startIndex += nrow; 284 | iStart += nrow; 285 | } 286 | return Matrix!(T)(arr, [nrow, nCol]); 287 | } 288 | auto columnSelect(long index) 289 | { 290 | assert(index < ncol, "Selected index is not less than number of columns."); 291 | auto arr = newArray!(T)(nrow); 292 | auto startIndex = index*nrow; 293 | arr[] = data[startIndex..(startIndex + nrow)]; 294 | return Matrix!(T)(arr, [nrow, 1]); 295 | } 296 | auto refColumnSelect(long index) 297 | { 298 | assert(index < ncol, "Selected index is not less than number of columns."); 299 | auto startIndex = index*nrow; 300 | return Matrix!(T)(data[startIndex..(startIndex + nrow)], [nrow, 1]); 301 | } 302 | auto refColumnSelectArr(long index) 303 | { 304 | //assert(index < ncol, "Selected index is not less than number of columns."); 305 | auto startIndex = index*nrow; 306 | return data[startIndex..(startIndex + nrow)]; 307 | } 308 | /* 309 | Function to remove a column from the matrix. 310 | */ 311 | Matrix!(T) refColumnRemove(long index) 312 | { 313 | /* Remove first column */ 314 | if(index == 0) 315 | { 316 | data = data[nrow..$]; 317 | dim[1] -= 1; 318 | return this; 319 | /* Remove last column */ 320 | }else if(index == (ncol - 1)) 321 | { 322 | data = data[0..($ - nrow)]; 323 | dim[1]-= 1; 324 | return this; 325 | /* Remove any other column */ 326 | }else{ 327 | auto start = index*nrow; 328 | long _len = data.length - nrow; 329 | auto _data = newArray!(T)(_len); 330 | _data[0..start] = data[0..start]; 331 | _data[start..$] = data[(start + nrow)..$]; 332 | data = _data; 333 | dim[1] -= 1; 334 | return this; 335 | } 336 | } 337 | /* Assigns vector in-place to a specific column */ 338 | /* Refactor these two methods */ 339 | Matrix!(T) refColumnAssign(T[] col, long index) 340 | { 341 | assert(col.length == nrow, "Length of vector is not the same as number of rows"); 342 | /* Replace first column */ 343 | if(index == 0) 344 | { 345 | data[0..nrow] = col; 346 | return this; 347 | /* Replace last column */ 348 | }else if(index == (ncol - 1)) 349 | { 350 | data[($ - nrow)..$] = col; 351 | return this; 352 | /* Replace any other column */ 353 | }else{ 354 | auto start = index*nrow; 355 | data[start..(start + nrow)] = col; 356 | return this; 357 | } 358 | } 359 | Matrix!(T) refColumnAssign(T col, long index) 360 | { 361 | /* Replace first column */ 362 | if(index == 0) 363 | { 364 | data[0..nrow] = col; 365 | return this; 366 | /* Replace last column */ 367 | }else if(index == (ncol - 1)) 368 | { 369 | data[($ - nrow)..$] = col; 370 | return this; 371 | /* Replace any other column */ 372 | }else{ 373 | auto start = index*nrow; 374 | data[start..(start + nrow)] = col; 375 | return this; 376 | } 377 | } 378 | } 379 | /* Assuming column major */ 380 | struct Matrix(T) 381 | if(isFloatingPoint!T) 382 | { 383 | mixin MatrixGubbings!(T); 
384 | string toString() const 385 | { 386 | string dform = getFormat(getRange(data)); 387 | string repr = format(" Matrix(%d x %d)\n", dim[0], dim[1]); 388 | for(long i = 0; i < dim[0]; ++i) 389 | { 390 | for(long j = 0; j < dim[1]; ++j) 391 | { 392 | repr ~= format(dform, opIndex(i, j)); 393 | } 394 | repr ~= "\n"; 395 | } 396 | return repr; 397 | } 398 | } 399 | 400 | // Create random matrix 401 | /****************************************************************************/ 402 | auto createRNG() 403 | { 404 | Mt19937_64 rng; 405 | rng.seed(unpredictableSeed); 406 | return rng; 407 | } 408 | Matrix!T createRandomMatrix(T)(ulong rows, ulong cols) 409 | { 410 | // workerLocalStorage gives each taskPool worker its own RNG, 411 | auto RNG = taskPool.workerLocalStorage(createRNG()); 412 | // so the parallel fill below never shares generator state 413 | ulong len = rows*cols; 414 | T[] data = newArray!(T)(len); 415 | foreach(i; taskPool.parallel(iota(len))) 416 | data[i] = uniform01!(T)(RNG.get); 417 | return Matrix!T(data, rows, cols); 418 | } 419 | Matrix!T createRandomMatrix(T)(ulong m) 420 | { 421 | return createRandomMatrix!(T)(m, m); 422 | } 423 | Matrix!T createRandomMatrix(T)(ulong[] dim) 424 | { 425 | return createRandomMatrix!(T)(dim[0], dim[1]); 426 | } 427 | 428 | -------------------------------------------------------------------------------- /d/fmtime.log: -------------------------------------------------------------------------------- 1 | Command being timed: "./script" 2 | User time (seconds): 61656.04 3 | System time (seconds): 54.86 4 | Percent of CPU this job got: 1182% 5 | Elapsed (wall clock) time (h:mm:ss or m:ss): 1:26:59 6 | Average shared text size (kbytes): 0 7 | Average unshared data size (kbytes): 0 8 | Average stack size (kbytes): 0 9 | Average total size (kbytes): 0 10 | Maximum resident set size (kbytes): 20458860 11 | Average resident set size (kbytes): 0 12 | Major (requiring I/O) page faults: 0 13 | Minor (reclaiming a frame) page faults: 15898753 14 | Voluntary context switches: 4863 15 | Involuntary context switches: 1999947 16 | Swaps: 0 17 | File system inputs: 0 18 | File system outputs: 8 19 | Socket messages sent: 0 20 | Socket messages received: 0 21 | Signals delivered: 0 22 | Page size (bytes): 4096 23 | Exit status: 0 24 | 25 | -------------------------------------------------------------------------------- /d/kernel.d: -------------------------------------------------------------------------------- 1 | import arrays; 2 | import std.parallelism; 3 | import std.range : iota; 4 | import std.stdio: writeln; 5 | import std.datetime.stopwatch: AutoStart, StopWatch; 6 | 7 | import core.stdc.math: exp, exp = expf, exp = expl, 8 | fabs, fabs = fabsf, fabs = fabsl, 9 | log, log = logf, log = logl, 10 | pow, pow = powf, pow = powl, 11 | sin, sin = sinf, sin = sinl, 12 | sqrt, sqrt = sqrtf, sqrt = sqrtl, 13 | tanh, tanh = tanhf, tanh = tanhl; // renamed imports collect C's float/double/real variants into one overload set per name 14 | 15 | /** 16 | Kernel Function Types: 17 | */ 18 | struct DotProduct(T) 19 | { 20 | public: 21 | this(T _nothing) 22 | {} 23 | T opCall(T[] x, T[] y) const 24 | { 25 | T dist = 0; 26 | auto m = x.length; 27 | for(size_t i = 0; i < m; ++i) 28 | { 29 | dist += x[i] * y[i]; 30 | } 31 | return dist; 32 | } 33 | } 34 | 35 | struct Gaussian(T) 36 | { 37 | private: 38 | T theta; 39 | public: 40 | this(T _theta) 41 | { 42 | theta = _theta; 43 | } 44 | T opCall(T[] x, T[] y) const 45 | { 46 | T dist = 0; 47 | auto m = x.length; 48 | for(size_t i = 0; i < m; ++i) 49 | { 50 | auto tmp = x[i] - y[i]; 51 | dist += tmp * tmp; 52 | } 53 | return exp(-sqrt(dist)/theta); 54 | } 55 | } 56 | 57 | struct 
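/* Editorial note: every kernel below repeats the pattern of DotProduct and Gaussian above — a small struct holding the hyperparameters, with opCall(x, y) evaluating k(x, y) over two vectors — so passing the functor as a compile-time argument lets calculateKernelMatrix inline the kernel body. */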
Polynomial(T) 58 | { 59 | private: 60 | T d; 61 | T offset; 62 | public: 63 | this(T _d, T _offset) 64 | { 65 | d = _d; 66 | offset = _offset; 67 | } 68 | T opCall(T[] x, T[] y) const 69 | { 70 | T dist = 0; 71 | auto m = x.length; 72 | for(size_t i = 0; i < m; ++i) 73 | { 74 | dist += x[i] * y[i]; 75 | } 76 | return pow(dist + offset, d); 77 | } 78 | } 79 | 80 | struct Exponential(T) 81 | { 82 | private: 83 | T theta; 84 | public: 85 | this(T _theta) 86 | { 87 | theta = _theta; 88 | } 89 | T opCall(T[] x, T[] y) const 90 | { 91 | T dist = 0; 92 | auto m = x.length; 93 | for(size_t i = 0; i < m; ++i) 94 | { 95 | dist -= fabs(x[i] - y[i]); 96 | } 97 | return exp(dist/theta); 98 | } 99 | } 100 | 101 | struct Log(T) 102 | { 103 | private: 104 | T beta; 105 | public: 106 | this(T _beta) 107 | { 108 | beta = _beta; 109 | } 110 | T opCall(T[] x, T[] y) const 111 | { 112 | T dist = 0; 113 | auto m = x.length; 114 | for(size_t i = 0; i < m; ++i) 115 | { 116 | dist += pow(fabs(x[i] - y[i]), beta); 117 | } 118 | dist = pow(dist, 1/beta); 119 | return -log(1 + dist); 120 | } 121 | } 122 | 123 | struct Cauchy(T) 124 | { 125 | private: 126 | T theta; 127 | public: 128 | this(T _theta) 129 | { 130 | theta = _theta; 131 | } 132 | T opCall(T[] x, T[] y) const 133 | { 134 | T dist = 0; 135 | auto m = x.length; 136 | for(size_t i = 0; i < m; ++i) 137 | { 138 | auto tmp = x[i] - y[i]; 139 | dist += tmp * tmp; 140 | } 141 | dist = sqrt(dist)/theta; 142 | return 1/(1 + dist); 143 | } 144 | } 145 | 146 | struct Power(T) 147 | { 148 | private: 149 | T beta; 150 | public: 151 | this(T _beta) 152 | { 153 | beta = _beta; 154 | } 155 | T opCall(T[] x, T[] y) const 156 | { 157 | T dist = 0; 158 | auto m = x.length; 159 | for(size_t i = 0; i < m; ++i) 160 | { 161 | dist += pow(fabs(x[i] - y[i]), beta); 162 | } 163 | return -pow(dist, 1/beta); 164 | } 165 | } 166 | 167 | struct Wave(T) 168 | { 169 | private: 170 | T theta; 171 | public: 172 | this(T _theta) 173 | { 174 | theta = _theta; 175 | } 176 | T opCall(T[] x, T[] y) const 177 | { 178 | T dist = 0; 179 | auto m = x.length; 180 | for(size_t i = 0; i < m; ++i) 181 | { 182 | dist += fabs(x[i] - y[i]); 183 | } 184 | auto tmp = theta/dist; 185 | return tmp*sin(1/tmp); 186 | } 187 | } 188 | 189 | struct Sigmoid(T) 190 | { 191 | private: 192 | T beta0; 193 | T beta1; 194 | public: 195 | this(T _beta0, T _beta1) 196 | { 197 | beta0 = _beta0; 198 | beta1 = _beta1; 199 | } 200 | T opCall(T[] x, T[] y) const 201 | { 202 | T dist = 0; 203 | auto m = x.length; 204 | for(size_t i = 0; i < m; ++i) 205 | { 206 | dist += x[i] * y[i]; 207 | } 208 | return tanh(beta0 * dist + beta1); 209 | } 210 | } 211 | 212 | /************************************************************************************/ 213 | 214 | auto calculateKernelMatrix(alias K, T)(K!(T) kernel, Matrix!(T) data) 215 | { 216 | size_t n = data.ncol; 217 | auto mat = Matrix!(T)(n, n); 218 | 219 | foreach(j; taskPool.parallel(iota(n))) 220 | { 221 | auto arrj = data.refColumnSelect(j).array; 222 | foreach(size_t i; j..n) 223 | { 224 | mat[i, j] = kernel(data.refColumnSelect(i).array, arrj); 225 | mat[j, i] = mat[i, j]; 226 | } 227 | } 228 | return mat; 229 | } 230 | -------------------------------------------------------------------------------- /d/mathdemo.d: -------------------------------------------------------------------------------- 1 | import std.math: dlog = log; 2 | import std.stdio: writeln; 3 | import std.random: uniform01; 4 | import core.stdc.math: logf, log, logl; 5 | import std.datetime.stopwatch: 
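/* Editorial note: this demo times C's float log (logf, via the overloads below) against Phobos' std.math.log, which at the time of writing had only a real (80-bit) overload, so float inputs paid for extended precision; that gap is what motivates the renamed core.stdc.math imports in kernel.d. */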
AutoStart, StopWatch; 6 | 7 | T log(T)(T x) 8 | if(is(T == float)) 9 | { 10 | return logf(x); 11 | } 12 | T log(T)(T x) 13 | if(is(T == double)) 14 | { 15 | return core.stdc.math.log(x); // qualified so the call resolves to C's log rather than recursing into this template 16 | } 17 | T log(T)(T x) 18 | if(is(T == real)) 19 | { 20 | return logl(x); 21 | } 22 | 23 | auto makeRandomArray(T)(size_t n) 24 | { 25 | T[] arr = new T[n]; 26 | foreach(ref el; arr) 27 | { 28 | el = uniform01!(T)(); 29 | } 30 | return arr; 31 | } 32 | 33 | auto apply(alias fun, T)(T[] arr) 34 | { 35 | foreach(ref el; arr) 36 | { 37 | el = fun(el); 38 | } 39 | return; 40 | } 41 | 42 | /** 43 | ldc2 -O --boundscheck=off --ffast-math --mcpu=native mathdemo.d && ./mathdemo 44 | Time taken for c log: 0.324789 seconds. 45 | Time taken for d log: 2.30737 seconds. 46 | */ 47 | void main() 48 | { 49 | auto sw = StopWatch(AutoStart.no); 50 | 51 | /* For C's log function */ 52 | auto arr = makeRandomArray!(float)(100_000_000); 53 | sw.start(); 54 | apply!(log)(arr); 55 | sw.stop(); 56 | writeln("Time taken for c log: ", sw.peek.total!"nsecs"/1000_000_000.0, " seconds."); 57 | sw.reset(); 58 | 59 | /* For D's log function */ 60 | arr = makeRandomArray!(float)(100_000_000); 61 | sw.start(); 62 | apply!(dlog)(arr); 63 | sw.stop(); 64 | writeln("Time taken for d log: ", sw.peek.total!"nsecs"/1000_000_000.0, " seconds."); 65 | sw.reset(); 66 | } 67 | -------------------------------------------------------------------------------- /d/script.d: -------------------------------------------------------------------------------- 1 | import arrays; 2 | import kernel; 3 | 4 | import std.conv: to; 5 | import std.meta: AliasSeq; 6 | import std.algorithm : sum; 7 | import std.stdio: File, writeln; 8 | import std.typecons: tuple, Tuple; 9 | import std.datetime.stopwatch: AutoStart, StopWatch; 10 | 11 | /** 12 | To compile: 13 | ldc2 script.d kernel.d arrays.d -O --boundscheck=off --ffast-math -mcpu=native 14 | /usr/bin/time -v ./script 15 | 16 | ldc2 --mcpu=help 17 | ldc2 script.d kernel.d arrays.d -O --boundscheck=off --ffast-math --mcpu=core-avx2 -mattr=+avx2,+sse4.1,+sse4.2 18 | /usr/bin/time -v ./script 19 | */ 20 | 21 | auto bench(alias K, T)(K!T kernel, long[] n) 22 | { 23 | auto times = new double[n.length]; 24 | auto sw = StopWatch(AutoStart.no); 25 | foreach(i; 0..n.length) 26 | { 27 | double[3] _times; 28 | auto data = createRandomMatrix!T(784L, n[i]); 29 | foreach(ref t; _times[]) 30 | { 31 | sw.start(); 32 | auto mat = calculateKernelMatrix!(K, T)(kernel, data); 33 | sw.stop(); 34 | t = sw.peek.total!"nsecs"/1000_000_000.0; 35 | sw.reset(); 36 | } 37 | times[i] = sum(_times[])/3.0; 38 | version(verbose) 39 | { 40 | writeln("Average time for n = ", n[i], ", ", times[i], " seconds."); 41 | writeln("Detailed times: ", _times, "\n"); 42 | } 43 | } 44 | return tuple(n, times); 45 | } 46 | 47 | auto runKernelBenchmark(KS)(KS kernels, long[] n) 48 | { 49 | auto tmp = bench(kernels[0], n); 50 | alias R = typeof(tmp); 51 | R[kernels.length] results; 52 | results[0] = tmp; 53 | static foreach(i; 1..kernels.length) 54 | { 55 | version(verbose) 56 | { 57 | writeln("Running benchmarks for ", kernels[i]); 58 | } 59 | results[i] = bench(kernels[i], n); 60 | } 61 | return results; 62 | } 63 | 64 | void writeRow(File file, string[] row) 65 | { 66 | string line = ""; 67 | foreach(i; 0..(row.length - 1)) 68 | line ~= row[i] ~ ","; 69 | line ~= row[row.length - 1] ~ "\n"; 70 | file.write(line); 71 | return; 72 | } 73 | 74 | void runAllKernelBenchmarks(T = float)() 75 | { 76 | auto kernels = tuple(DotProduct!(T)(), 
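/* one instance of each kernel functor, with the hyperparameters used throughout the benchmark */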
Gaussian!(T)(1), Polynomial!(T)(2.5f, 1), 77 | Exponential!(T)(1), Log!(T)(3), Cauchy!(T)(1), 78 | Power!(T)(2.5f), Wave!(T)(1), Sigmoid!(T)(1, 1)); 79 | auto kernelNames = ["DotProduct", "Gaussian", "Polynomial", 80 | "Exponential", "Log", "Cauchy", 81 | "Power", "Wave", "Sigmoid"]; 82 | //long[] n = [100L, 500L, 1000L]; 83 | long[] n = [1000L, 5000L, 10_000L, 20_000L, 30_000L]; 84 | 85 | auto results = runKernelBenchmark(kernels, n); 86 | 87 | auto table = new string[][] (n.length * kernels.length + 1, 4); 88 | table[0][] = ["language", "kernel", "nitems", "time"]; 89 | auto tmp = ["D", "", "", ""]; 90 | while(true) 91 | { 92 | auto k = 1; 93 | foreach(i; 0..kernels.length) 94 | { 95 | tmp = ["D", kernelNames[i], "", ""]; 96 | foreach(j; 0..n.length) 97 | { 98 | tmp[2] = to!(string)(results[i][0][j]); 99 | tmp[3] = to!(string)(results[i][1][j]); 100 | table[k][] = tmp.dup; 101 | k += 1; 102 | } 103 | } 104 | if(k > (table.length - 1)) 105 | { 106 | break; 107 | } 108 | } 109 | version(fastmath) 110 | { 111 | auto file = File("../fmdata/dBench.csv", "w"); 112 | }else{ 113 | auto file = File("../data/dBench.csv", "w"); 114 | } 115 | foreach(row; table) 116 | file.writeRow(row); 117 | } 118 | 119 | void main() 120 | { 121 | runAllKernelBenchmarks(); 122 | } 123 | -------------------------------------------------------------------------------- /d/script.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | # Uses "regular" mathematical functions 3 | ldc2 script.d kernel.d arrays.d --release -O --d-version=verbose --boundscheck=off --mcpu=native 4 | /usr/bin/time -v ./script 5 | # Uses Fast Math 6 | ldc2 script.d kernel.d arrays.d --release -O --d-version=verbose --d-version=fastmath --ffast-math --boundscheck=off --mcpu=native 7 | /usr/bin/time -v ./script 8 | -------------------------------------------------------------------------------- /d/time.log: -------------------------------------------------------------------------------- 1 | Command being timed: "./script" 2 | User time (seconds): 69089.75 3 | System time (seconds): 41.91 4 | Percent of CPU this job got: 1181% 5 | Elapsed (wall clock) time (h:mm:ss or m:ss): 1:37:29 6 | Average shared text size (kbytes): 0 7 | Average unshared data size (kbytes): 0 8 | Average stack size (kbytes): 0 9 | Average total size (kbytes): 0 10 | Maximum resident set size (kbytes): 20458972 11 | Average resident set size (kbytes): 0 12 | Major (requiring I/O) page faults: 0 13 | Minor (reclaiming a frame) page faults: 15393443 14 | Voluntary context switches: 4884 15 | Involuntary context switches: 2222841 16 | Swaps: 0 17 | File system inputs: 8 18 | File system outputs: 8 19 | Socket messages sent: 0 20 | Socket messages received: 0 21 | Signals delivered: 0 22 | Page size (bytes): 4096 23 | Exit status: 0 24 | 25 | -------------------------------------------------------------------------------- /data/chapelBench.csv: -------------------------------------------------------------------------------- 1 | language, kernel, nitems, time 2 | Chapel, DotProduct, 1000, 0.0327997 3 | Chapel, DotProduct, 5000, 0.828653 4 | Chapel, DotProduct, 10000, 3.46129 5 | Chapel, DotProduct, 20000, 15.6243 6 | Chapel, DotProduct, 30000, 37.36 7 | Chapel, Gaussian, 1000, 0.0378753 8 | Chapel, Gaussian, 5000, 0.980372 9 | Chapel, Gaussian, 10000, 3.95018 10 | Chapel, Gaussian, 20000, 16.7415 11 | Chapel, Gaussian, 30000, 39.4446 12 | Chapel, Polynomial, 1000, 0.03782 13 | Chapel, Polynomial, 5000, 0.967143 
14 | Chapel, Polynomial, 10000, 4.00965 15 | Chapel, Polynomial, 20000, 16.9653 16 | Chapel, Polynomial, 30000, 39.7892 17 | Chapel, Exponential, 1000, 0.037031 18 | Chapel, Exponential, 5000, 0.959345 19 | Chapel, Exponential, 10000, 3.90207 20 | Chapel, Exponential, 20000, 16.5751 21 | Chapel, Exponential, 30000, 39.0656 22 | Chapel, Log, 1000, 0.985141 23 | Chapel, Log, 5000, 24.4549 24 | Chapel, Log, 10000, 98.1639 25 | Chapel, Log, 20000, 394.476 26 | Chapel, Log, 30000, 878.653 27 | Chapel, Cauchy, 1000, 0.0356847 28 | Chapel, Cauchy, 5000, 0.929519 29 | Chapel, Cauchy, 10000, 3.83651 30 | Chapel, Cauchy, 20000, 16.2932 31 | Chapel, Cauchy, 30000, 38.3297 32 | Chapel, Power, 1000, 0.975878 33 | Chapel, Power, 5000, 24.2018 34 | Chapel, Power, 10000, 96.7752 35 | Chapel, Power, 20000, 387.347 36 | Chapel, Power, 30000, 871.812 37 | Chapel, Wave, 1000, 0.0342877 38 | Chapel, Wave, 5000, 0.950036 39 | Chapel, Wave, 10000, 3.92775 40 | Chapel, Wave, 20000, 16.5319 41 | Chapel, Wave, 30000, 38.7444 42 | Chapel, Sigmoid, 1000, 0.035638 43 | Chapel, Sigmoid, 5000, 0.928199 44 | Chapel, Sigmoid, 10000, 3.82996 45 | Chapel, Sigmoid, 20000, 16.1571 46 | Chapel, Sigmoid, 30000, 38.0366 -------------------------------------------------------------------------------- /data/dBench.csv: -------------------------------------------------------------------------------- 1 | language,kernel,nitems,time 2 | D,DotProduct,1000,0.0331662 3 | D,DotProduct,5000,0.863256 4 | D,DotProduct,10000,3.92915 5 | D,DotProduct,20000,17.1248 6 | D,DotProduct,30000,40.4689 7 | D,Gaussian,1000,0.0364082 8 | D,Gaussian,5000,1.02408 9 | D,Gaussian,10000,4.2395 10 | D,Gaussian,20000,18.0227 11 | D,Gaussian,30000,41.9824 12 | D,Polynomial,1000,0.0336679 13 | D,Polynomial,5000,0.992885 14 | D,Polynomial,10000,4.03298 15 | D,Polynomial,20000,17.307 16 | D,Polynomial,30000,40.8675 17 | D,Exponential,1000,0.0348232 18 | D,Exponential,5000,1.00611 19 | D,Exponential,10000,4.20423 20 | D,Exponential,20000,17.814 21 | D,Exponential,30000,41.7958 22 | D,Log,1000,0.960558 23 | D,Log,5000,24.7134 24 | D,Log,10000,94.2969 25 | D,Log,20000,378.366 26 | D,Log,30000,852.784 27 | D,Cauchy,1000,0.0338152 28 | D,Cauchy,5000,1.01084 29 | D,Cauchy,10000,4.08566 30 | D,Cauchy,20000,17.6364 31 | D,Cauchy,30000,41.2217 32 | D,Power,1000,0.961671 33 | D,Power,5000,24.1544 34 | D,Power,10000,96.2633 35 | D,Power,20000,383.808 36 | D,Power,30000,863.304 37 | D,Wave,1000,0.034339 38 | D,Wave,5000,0.960398 39 | D,Wave,10000,4.14267 40 | D,Wave,20000,17.9007 41 | D,Wave,30000,41.9656 42 | D,Sigmoid,1000,0.0337945 43 | D,Sigmoid,5000,0.901038 44 | D,Sigmoid,10000,4.0042 45 | D,Sigmoid,20000,16.9423 46 | D,Sigmoid,30000,39.9905 47 | -------------------------------------------------------------------------------- /data/dNDSliceBench.csv: -------------------------------------------------------------------------------- 1 | language,kernel,nitems,time 2 | D,DotProduct,1000,0.0347107 3 | D,DotProduct,5000,0.858201 4 | D,DotProduct,10000,4.01725 5 | D,DotProduct,20000,17.5919 6 | D,DotProduct,30000,41.5271 7 | D,Gaussian,1000,0.0344868 8 | D,Gaussian,5000,0.969041 9 | D,Gaussian,10000,4.25197 10 | D,Gaussian,20000,18.2775 11 | D,Gaussian,30000,42.3535 12 | D,Polynomial,1000,0.0328877 13 | D,Polynomial,5000,0.902159 14 | D,Polynomial,10000,3.99975 15 | D,Polynomial,20000,17.4276 16 | D,Polynomial,30000,41.2833 17 | D,Exponential,1000,0.0341596 18 | D,Exponential,5000,0.954447 19 | D,Exponential,10000,4.09684 20 | D,Exponential,20000,17.6142 21 | 
D,Exponential,30000,42.0758 22 | D,Log,1000,0.518967 23 | D,Log,5000,13.3401 24 | D,Log,10000,53.1602 25 | D,Log,20000,212.737 26 | D,Log,30000,480.446 27 | D,Cauchy,1000,0.0328838 28 | D,Cauchy,5000,0.959192 29 | D,Cauchy,10000,4.08631 30 | D,Cauchy,20000,17.912 31 | D,Cauchy,30000,42.1615 32 | D,Power,1000,0.507569 33 | D,Power,5000,12.902 34 | D,Power,10000,51.4925 35 | D,Power,20000,205.891 36 | D,Power,30000,464.023 37 | D,Wave,1000,0.0335611 38 | D,Wave,5000,0.906904 39 | D,Wave,10000,4.11685 40 | D,Wave,20000,17.9911 41 | D,Wave,30000,42.3834 42 | D,Sigmoid,1000,0.0326368 43 | D,Sigmoid,5000,0.898398 44 | D,Sigmoid,10000,3.90918 45 | D,Sigmoid,20000,17.2436 46 | D,Sigmoid,30000,41.0899 47 | -------------------------------------------------------------------------------- /data/juliaBench.csv: -------------------------------------------------------------------------------- 1 | language,kernel,nitems,time 2 | Julia,DotProduct,1000,0.028618017832438152 3 | Julia,DotProduct,5000,0.335107962290446 4 | Julia,DotProduct,10000,2.273205359776815 5 | Julia,DotProduct,20000,12.081494728724161 6 | Julia,DotProduct,30000,29.30661424001058 7 | Julia,Gaussian,1000,0.02318231264750163 8 | Julia,Gaussian,5000,0.36138232549031574 9 | Julia,Gaussian,10000,2.553627332051595 10 | Julia,Gaussian,20000,13.242238680521647 11 | Julia,Gaussian,30000,32.783209005991615 12 | Julia,Polynomial,1000,0.022588332494099934 13 | Julia,Polynomial,5000,0.37053394317626953 14 | Julia,Polynomial,10000,2.513085683186849 15 | Julia,Polynomial,20000,13.059705018997192 16 | Julia,Polynomial,30000,31.536365350087483 17 | Julia,Exponential,1000,0.02099498112996419 18 | Julia,Exponential,5000,0.3473227024078369 19 | Julia,Exponential,10000,2.622964064280192 20 | Julia,Exponential,20000,13.915411392847696 21 | Julia,Exponential,30000,32.7377610206604 22 | Julia,Log,1000,0.6652313073476156 23 | Julia,Log,5000,17.037853320439655 24 | Julia,Log,10000,67.28145058949788 25 | Julia,Log,20000,284.03041768074036 26 | Julia,Log,30000,640.7916380564371 27 | Julia,Cauchy,1000,0.020720958709716797 28 | Julia,Cauchy,5000,0.3785683314005534 29 | Julia,Cauchy,10000,2.387653350830078 30 | Julia,Cauchy,20000,12.377931674321493 31 | Julia,Cauchy,30000,31.607904354731243 32 | Julia,Power,1000,0.6242409547170004 33 | Julia,Power,5000,17.334346691767376 34 | Julia,Power,10000,69.57900404930115 35 | Julia,Power,20000,265.5375280380249 36 | Julia,Power,30000,588.9969596862793 37 | Julia,Wave,1000,0.0548706849416097 38 | Julia,Wave,5000,0.4465626080830892 39 | Julia,Wave,10000,2.4781373341878257 40 | Julia,Wave,20000,13.298714955647787 41 | Julia,Wave,30000,33.43664868672689 42 | Julia,Sigmoid,1000,0.023600339889526367 43 | Julia,Sigmoid,5000,0.35213859875996906 44 | Julia,Sigmoid,10000,2.459144671758016 45 | Julia,Sigmoid,20000,12.380852301915487 46 | Julia,Sigmoid,30000,30.208860715230305 47 | -------------------------------------------------------------------------------- /data/ndsliceTime.log: -------------------------------------------------------------------------------- 1 | Command being timed: "./ndslice" 2 | User time (seconds): 68998.50 3 | System time (seconds): 36.82 4 | Percent of CPU this job got: 1181% 5 | Elapsed (wall clock) time (h:mm:ss or m:ss): 1:37:24 6 | Average shared text size (kbytes): 0 7 | Average unshared data size (kbytes): 0 8 | Average stack size (kbytes): 0 9 | Average total size (kbytes): 0 10 | Maximum resident set size (kbytes): 20461888 11 | Average resident set size (kbytes): 0 12 | Major (requiring I/O) page 
faults: 0 13 | Minor (reclaiming a frame) page faults: 15345124 14 | Voluntary context switches: 3961 15 | Involuntary context switches: 1315598 16 | Swaps: 0 17 | File system inputs: 0 18 | File system outputs: 8 19 | Socket messages sent: 0 20 | Socket messages received: 0 21 | Signals delivered: 0 22 | Page size (bytes): 4096 23 | Exit status: 0 24 | -------------------------------------------------------------------------------- /docs/kernel.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dataPulverizer/KernelMatrixBenchmark/28c40acb02497a3052f2d04689ab9c93fd090de2/docs/kernel.pdf -------------------------------------------------------------------------------- /fmdata/chapelBench.csv: -------------------------------------------------------------------------------- 1 | language, kernel, nitems, time 2 | Chapel, DotProduct, 1000, 0.00924367 3 | Chapel, DotProduct, 5000, 0.221611 4 | Chapel, DotProduct, 10000, 1.42578 5 | Chapel, DotProduct, 20000, 8.38982 6 | Chapel, DotProduct, 30000, 22.7254 7 | Chapel, Gaussian, 1000, 0.00718267 8 | Chapel, Gaussian, 5000, 0.276599 9 | Chapel, Gaussian, 10000, 1.50735 10 | Chapel, Gaussian, 20000, 8.65949 11 | Chapel, Gaussian, 30000, 24.1533 12 | Chapel, Polynomial, 1000, 0.00828233 13 | Chapel, Polynomial, 5000, 0.261301 14 | Chapel, Polynomial, 10000, 1.52145 15 | Chapel, Polynomial, 20000, 8.84253 16 | Chapel, Polynomial, 30000, 24.1945 17 | Chapel, Exponential, 1000, 0.0122707 18 | Chapel, Exponential, 5000, 0.252101 19 | Chapel, Exponential, 10000, 1.53333 20 | Chapel, Exponential, 20000, 8.68634 21 | Chapel, Exponential, 30000, 23.6449 22 | Chapel, Log, 1000, 0.957172 23 | Chapel, Log, 5000, 23.7193 24 | Chapel, Log, 10000, 94.9719 25 | Chapel, Log, 20000, 380.074 26 | Chapel, Log, 30000, 855.543 27 | Chapel, Cauchy, 1000, 0.008833 28 | Chapel, Cauchy, 5000, 0.255776 29 | Chapel, Cauchy, 10000, 1.51067 30 | Chapel, Cauchy, 20000, 8.53653 31 | Chapel, Cauchy, 30000, 23.9485 32 | Chapel, Power, 1000, 0.959575 33 | Chapel, Power, 5000, 23.7316 34 | Chapel, Power, 10000, 94.9118 35 | Chapel, Power, 20000, 379.592 36 | Chapel, Power, 30000, 854.344 37 | Chapel, Wave, 1000, 0.00928133 38 | Chapel, Wave, 5000, 0.271076 39 | Chapel, Wave, 10000, 1.54933 40 | Chapel, Wave, 20000, 8.69982 41 | Chapel, Wave, 30000, 24.2359 42 | Chapel, Sigmoid, 1000, 0.009228 43 | Chapel, Sigmoid, 5000, 0.249445 44 | Chapel, Sigmoid, 10000, 1.47198 45 | Chapel, Sigmoid, 20000, 8.46401 46 | Chapel, Sigmoid, 30000, 23.3408 -------------------------------------------------------------------------------- /fmdata/dBench.csv: -------------------------------------------------------------------------------- 1 | language,kernel,nitems,time 2 | D,DotProduct,1000,0.00758157 3 | D,DotProduct,5000,0.227808 4 | D,DotProduct,10000,1.28944 5 | D,DotProduct,20000,9.12727 6 | D,DotProduct,30000,26.5607 7 | D,Gaussian,1000,0.0090272 8 | D,Gaussian,5000,0.265671 9 | D,Gaussian,10000,1.42508 10 | D,Gaussian,20000,9.2337 11 | D,Gaussian,30000,27.3075 12 | D,Polynomial,1000,0.0068673 13 | D,Polynomial,5000,0.23578 14 | D,Polynomial,10000,1.31602 15 | D,Polynomial,20000,8.99742 16 | D,Polynomial,30000,26.939 17 | D,Exponential,1000,0.008833 18 | D,Exponential,5000,0.27474 19 | D,Exponential,10000,1.4347 20 | D,Exponential,20000,9.3266 21 | D,Exponential,30000,27.2397 22 | D,Log,1000,0.966266 23 | D,Log,5000,24.1955 24 | D,Log,10000,96.2422 25 | D,Log,20000,384.62 26 | D,Log,30000,864.477 27 | D,Cauchy,1000,0.00608963 28 | 
D,Cauchy,5000,0.212442 29 | D,Cauchy,10000,1.25913 30 | D,Cauchy,20000,8.69336 31 | D,Cauchy,30000,26.7323 32 | D,Power,1000,0.959982 33 | D,Power,5000,24.1416 34 | D,Power,10000,96.0607 35 | D,Power,20000,384.133 36 | D,Power,30000,854.033 37 | D,Wave,1000,0.0075854 38 | D,Wave,5000,0.263024 39 | D,Wave,10000,1.57122 40 | D,Wave,20000,10.0207 41 | D,Wave,30000,29.9925 42 | D,Sigmoid,1000,0.0056798 43 | D,Sigmoid,5000,0.251254 44 | D,Sigmoid,10000,1.47337 45 | D,Sigmoid,20000,8.76444 46 | D,Sigmoid,30000,28.8495 47 | -------------------------------------------------------------------------------- /fmdata/juliaBench.csv: -------------------------------------------------------------------------------- 1 | language,kernel,nitems,time 2 | Julia,DotProduct,1000,0.028934717178344727 3 | Julia,DotProduct,5000,0.3384213447570801 4 | Julia,DotProduct,10000,2.285879373550415 5 | Julia,DotProduct,20000,12.012665033340454 6 | Julia,DotProduct,30000,29.216094970703125 7 | Julia,Gaussian,1000,0.022055943806966145 8 | Julia,Gaussian,5000,0.3577253818511963 9 | Julia,Gaussian,10000,2.520039637883504 10 | Julia,Gaussian,20000,13.040017366409302 11 | Julia,Gaussian,30000,32.2067592938741 12 | Julia,Polynomial,1000,0.02389367421468099 13 | Julia,Polynomial,5000,0.3974326451619466 14 | Julia,Polynomial,10000,2.538082679112752 15 | Julia,Polynomial,20000,13.033354997634888 16 | Julia,Polynomial,30000,31.670559326807656 17 | Julia,Exponential,1000,0.02152903874715169 18 | Julia,Exponential,5000,0.36496035257975257 19 | Julia,Exponential,10000,2.6344366868336992 20 | Julia,Exponential,20000,13.753678719202677 21 | Julia,Exponential,30000,33.27915596961975 22 | Julia,Log,1000,0.6249533494313557 23 | Julia,Log,5000,15.74955932299296 24 | Julia,Log,10000,61.271198670069374 25 | Julia,Log,20000,247.89898459116617 26 | Julia,Log,30000,565.0516930421193 27 | Julia,Cauchy,1000,0.020629008611043293 28 | Julia,Cauchy,5000,0.3662620385487874 29 | Julia,Cauchy,10000,2.3178850015004473 30 | Julia,Cauchy,20000,12.884659051895142 31 | Julia,Cauchy,30000,31.82168634732564 32 | Julia,Power,1000,0.6112387180328369 33 | Julia,Power,5000,16.487519025802612 34 | Julia,Power,10000,61.17924292882283 35 | Julia,Power,20000,247.75332935651141 36 | Julia,Power,30000,559.8640073140461 37 | Julia,Wave,1000,0.05431739489237467 38 | Julia,Wave,5000,0.4737757047017415 39 | Julia,Wave,10000,2.6081150372823076 40 | Julia,Wave,20000,13.91323169072469 41 | Julia,Wave,30000,33.91305796305338 42 | Julia,Sigmoid,1000,0.021110375722249348 43 | Julia,Sigmoid,5000,0.32453298568725586 44 | Julia,Sigmoid,10000,2.39143697420756 45 | Julia,Sigmoid,20000,12.46296731630961 46 | Julia,Sigmoid,30000,30.015175342559814 47 | -------------------------------------------------------------------------------- /julia/KernelMatrix.jl: -------------------------------------------------------------------------------- 1 | using Base.Threads: @threads, @spawn 2 | using Random: shuffle! 
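#=
Editorial note: each kernel is a tiny parametric struct subtyping
AbstractKernel{T}, and `kernel(K, x, y)` is overloaded per struct, so Julia's
multiple dispatch plays the role of the D opCall functors. A new kernel needs
only the same two parts — e.g. this hypothetical linear kernel (illustrative,
not part of the benchmark):

    struct Linear{T} <: AbstractKernel{T} end
    @inline kernel(K::Linear{T}, x, y) where {T} = sum(x[i]*y[i] for i in eachindex(x))
=#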
3 | using LinearAlgebra: Symmetric 4 | 5 | # Kernel Function Types 6 | #======================# 7 | abstract type AbstractKernel{T <: AbstractFloat} end 8 | 9 | struct DotProduct{T} <: AbstractKernel{T} end 10 | @inline function kernel(K::DotProduct{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T,N} 11 | dist = T(0) 12 | m = length(x) 13 | @inbounds @simd for i in 1:m 14 | dist += x[i] * y[i] 15 | end 16 | return dist 17 | end 18 | 19 | struct Gaussian{T} <: AbstractKernel{T} 20 | theta::T 21 | end 22 | @inline function kernel(K::Gaussian{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T,N} 23 | dist::T = T(0) 24 | tmp::T = T(0) 25 | m = length(x) 26 | @inbounds @simd for i in 1:m 27 | tmp = x[i] - y[i] 28 | dist += tmp * tmp 29 | end 30 | return exp(-sqrt(dist)/K.theta) 31 | end 32 | 33 | struct Polynomial{T} <: AbstractKernel{T} 34 | d::T 35 | offset::T 36 | end 37 | @inline function kernel(K::Polynomial{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N} 38 | dist::T = T(0) 39 | m = length(x) 40 | @inbounds @simd for i = 1:m 41 | dist += x[i] * y[i] 42 | end 43 | return (dist + K.offset)^K.d 44 | end 45 | 46 | struct Exponential{T} <: AbstractKernel{T} 47 | theta::T 48 | end 49 | @inline function kernel(K::Exponential{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N} 50 | dist::T = T(0) 51 | m = length(x) 52 | @inbounds @simd for i in 1:m 53 | dist -= abs(x[i] - y[i]) 54 | end 55 | return exp(dist/K.theta) 56 | end 57 | 58 | struct Log{T} <: AbstractKernel{T} 59 | beta::T 60 | end 61 | @inline function kernel(K::Log{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N} 62 | dist::T = T(0) 63 | m = length(x) 64 | @inbounds @simd for i in 1:m 65 | dist += abs(x[i] - y[i])^K.beta 66 | end 67 | dist ^= (1/K.beta) 68 | return -log(1 + dist) 69 | end 70 | 71 | struct Cauchy{T} <: AbstractKernel{T} 72 | theta::T 73 | end 74 | @inline function kernel(K::Cauchy{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N} 75 | dist::T = T(0) 76 | tmp::T = T(0) 77 | m = length(x) 78 | @inbounds @simd for i in 1:m 79 | tmp = x[i] - y[i] 80 | dist += tmp*tmp 81 | end 82 | dist = sqrt(dist)/K.theta 83 | return 1/(1 + dist) 84 | end 85 | 86 | struct Power{T} <: AbstractKernel{T} 87 | beta::T 88 | end 89 | @inline function kernel(K::Power{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N} 90 | dist::T = T(0) 91 | m = length(x) 92 | @inbounds @simd for i in 1:m 93 | dist += abs(x[i] - y[i])^K.beta 94 | end 95 | return -dist^(1/K.beta) 96 | end 97 | 98 | struct Wave{T} <: AbstractKernel{T} 99 | theta::T 100 | end 101 | @inline function kernel(K::Wave{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N} 102 | dist::T = T(0) 103 | m = length(x) 104 | @inbounds @simd for i in 1:m 105 | dist += abs(x[i] - y[i]) 106 | end 107 | tmp = K.theta/dist; 108 | return tmp*sin(1/tmp); 109 | end 110 | 111 | struct Sigmoid{T} <: AbstractKernel{T} 112 | beta0::T 113 | beta1::T 114 | end 115 | @inline function kernel(K::Sigmoid{T}, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T, N} 116 | dist::T = T(0) 117 | m = length(x) 118 | @inbounds @simd for i = 1:m 119 | dist += x[i] * y[i] 120 | end 121 | return tanh(K.beta0 * dist + K.beta1) 122 | end 123 | 124 | #=======================================================================================# 125 | 126 | function calculateKernelMatrix(Kernel::AbstractKernel{T}, data::AbstractArray{T, N}) where {T, N} 127 | n = size(data)[2] 128 | mat::Array{T, 2} = zeros(T, n, n) 129 | 
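#= Editorial note: the outer loop over columns is threaded and only the lower triangle (i >= j) is filled; Symmetric(mat, :L) below mirrors it into the full matrix, halving the kernel evaluations. =#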
@threads for j in 1:n 130 | @views for i in j:n 131 | mat[i,j] = kernel(Kernel, data[:, i], data[:, j]) 132 | end 133 | end 134 | return Symmetric(mat, :L) 135 | end 136 | -------------------------------------------------------------------------------- /julia/fmtime.log: -------------------------------------------------------------------------------- 1 | Command being timed: "julia --math-mode=fast script.jl fmdata true" 2 | User time (seconds): 45388.59 3 | System time (seconds): 31.50 4 | Percent of CPU this job got: 716% 5 | Elapsed (wall clock) time (h:mm:ss or m:ss): 1:45:36 6 | Average shared text size (kbytes): 0 7 | Average unshared data size (kbytes): 0 8 | Average stack size (kbytes): 0 9 | Average total size (kbytes): 0 10 | Maximum resident set size (kbytes): 7489840 11 | Average resident set size (kbytes): 0 12 | Major (requiring I/O) page faults: 1 13 | Minor (reclaiming a frame) page faults: 38016102 14 | Voluntary context switches: 1583 15 | Involuntary context switches: 437027 16 | Swaps: 0 17 | File system inputs: 96 18 | File system outputs: 8 19 | Socket messages sent: 0 20 | Socket messages received: 0 21 | Signals delivered: 0 22 | Page size (bytes): 4096 23 | Exit status: 0 24 | 25 | -------------------------------------------------------------------------------- /julia/script.jl: -------------------------------------------------------------------------------- 1 | include("KernelMatrix.jl") 2 | using DelimitedFiles: writedlm; 3 | using InteractiveUtils: @code_warntype; 4 | 5 | const folder, _verbose = ARGS 6 | const verbose = _verbose == "true" ? true : false; 7 | 8 | function bench(Kernel::AbstractKernel{T}, n::Array{Int64, 1}) where {T} 9 | times::Array{Float64, 1} = zeros(Float64, length(n)) 10 | for i in 1:length(n) 11 | _times::Array{Float64, 1} = zeros(Float64, 3) 12 | data = rand(T, (784, n[i])) 13 | for j in 1:3 14 | t1 = time() 15 | mat = calculateKernelMatrix(Kernel, data); 16 | t2 = time() 17 | _times[j] = t2 - t1 18 | end 19 | times[i] = (_times[1] + _times[2] + _times[3])/3 20 | if verbose 21 | println("Average time for n = ", n[i], ", ", times[i], " seconds.") 22 | println("Detailed times: ", _times); 23 | end 24 | end 25 | return times 26 | end 27 | 28 | function precompileKernel(Kernel::AbstractKernel{T}, n::Array{Int64, 1}) where {T} 29 | precompile(kernel, (typeof(Kernel), Array{T, 1}, Array{T, 1})) 30 | precompile(calculateKernelMatrix, (typeof(Kernel), Array{T, 2})) 31 | precompile(bench, (typeof(Kernel), Array{Int64, 1})) 32 | 33 | times = bench(Kernel, n) 34 | if verbose 35 | println("\n\nBenchmark for kernel: ", repr(Kernel), "\ntimes: ", times) 36 | end 37 | 38 | return (n, times) 39 | end 40 | 41 | function runKernelBenchmarks(kernels::NTuple{N, AbstractKernel{T}}, n::Array{Int64, 1}) where {N, T} 42 | results = Array{Tuple{Array{Int64, 1}, Array{Float64, 1}}, 1}(undef, length(kernels)) 43 | for i in 1:length(results) 44 | if verbose # to check types are known at compilation 45 | @code_warntype precompileKernel(kernels[i], n) 46 | end 47 | results[i] = precompileKernel(kernels[i], n) 48 | end 49 | return results 50 | end 51 | 52 | function main(::Type{T}) where {T} 53 | # n = [100, 500, 1000] 54 | n = [1000, 5000 , 10_000, 20_000, 30_000]; 55 | kernels = (DotProduct{T}(), Gaussian{T}(1), Polynomial{T}(2.5, 1), 56 | Exponential{T}(1), Log{T}(3), Cauchy{T}(1), 57 | Power{T}(2.5), Wave{T}(1), Sigmoid{T}(1, 1)); 58 | kernelNames = ["DotProduct", "Gaussian", "Polynomial", 59 | "Exponential", "Log", "Cauchy", 60 | "Power", "Wave", "Sigmoid"]; 61 | 
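#= Editorial note: precompileKernel (above) forces compilation of kernel, calculateKernelMatrix, and bench for each kernel type before timing, so the reported times exclude Julia's JIT compilation cost. =#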
outputs = runKernelBenchmarks(kernels, n) 62 | 63 | table = Array{String, 2}(undef, (length(n)*length(kernels) + 1, 4)) 64 | table[1, :] = ["language", "kernel", "nitems", "time"] 65 | while true 66 | k = 2 67 | for i in 1:length(kernelNames) 68 | tmp = ["Julia", kernelNames[i], "", ""] 69 | for j in 1:length(n) 70 | tmp[3] = repr(outputs[i][1][j]) 71 | tmp[4] = repr(outputs[i][2][j]) 72 | table[k, :] = tmp 73 | k += 1 74 | end 75 | end 76 | if k > size(table)[1] 77 | break 78 | end 79 | end 80 | 81 | writedlm("../" * folder * "/juliaBench.csv", table, ',') 82 | 83 | return 84 | end 85 | 86 | #= 87 | To run: 88 | /usr/bin/time -v julia script.jl 89 | =# 90 | main(Float32) 91 | -------------------------------------------------------------------------------- /julia/script.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | # Uses "regular" mathematical functions 3 | /usr/bin/time -v julia script.jl data true 4 | # Uses Fast Math 5 | /usr/bin/time -v julia --math-mode=fast script.jl fmdata true 6 | -------------------------------------------------------------------------------- /julia/time.log: -------------------------------------------------------------------------------- 1 | Command being timed: "julia script.jl data true" 2 | User time (seconds): 49163.19 3 | System time (seconds): 32.08 4 | Percent of CPU this job got: 717% 5 | Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:18 6 | Average shared text size (kbytes): 0 7 | Average unshared data size (kbytes): 0 8 | Average stack size (kbytes): 0 9 | Average total size (kbytes): 0 10 | Maximum resident set size (kbytes): 7501592 11 | Average resident set size (kbytes): 0 12 | Major (requiring I/O) page faults: 804 13 | Minor (reclaiming a frame) page faults: 38021184 14 | Voluntary context switches: 2657 15 | Involuntary context switches: 477363 16 | Swaps: 0 17 | File system inputs: 368240 18 | File system outputs: 8 19 | Socket messages sent: 0 20 | Socket messages received: 0 21 | Signals delivered: 0 22 | Page size (bytes): 4096 23 | Exit status: 0 24 | 25 | 26 | -------------------------------------------------------------------------------- /ndslice/dub.json: -------------------------------------------------------------------------------- 1 | { 2 | "authors": [ 3 | "Dr Chibisi Chima-Okereke" 4 | ], 5 | "copyright": "Copyright © 2020, Dr Chibisi Chima-Okereke", 6 | "dependencies": { 7 | "mir-algorithm": "~>3.8.12", 8 | "mir-random": "~>2.2.14" 9 | }, 10 | "description": "Kernel Matrix Calculations using D's Mir Algorithm Package", 11 | "license": "MIT", 12 | "name": "ndslice", 13 | "dflags": ["--boundscheck=off", "-mcpu=native"], 14 | "toolchainRequirements": { 15 | "dmd": ">=2.090.1", 16 | "gdc": "no", 17 | "ldc2": ">=1.18.0" 18 | }, 19 | "targetType": "executable" 20 | } -------------------------------------------------------------------------------- /ndslice/source/app.d: -------------------------------------------------------------------------------- 1 | import ndslice.kernels; 2 | 3 | import mir.math.sum; 4 | import mir.ndslice; 5 | import mir.random.algorithm: randomSlice; 6 | import mir.random.variable: UniformVariable; 7 | 8 | import std.conv: to; 9 | import std.datetime.stopwatch: AutoStart, StopWatch; 10 | import std.meta: AliasSeq; 11 | import std.stdio: File, writeln; 12 | import std.typecons: tuple, Tuple; 13 | 14 | /** 15 | To compile: 16 | dub run --compiler=ldc2 --build=release 17 | */ 18 | 19 | auto bench(alias K, T)(K!(T) kernel, long[] n, bool verbose = 
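/* Editorial note: this harness mirrors d/script.d; the substantive differences are that the data are generated with Mir's randomSlice as an n x 784 row-major Slice (the kernels then compare rows rather than columns) and the results are written to dNDSliceBench.csv. */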
true) 20 | { 21 | auto times = new double[n.length]; 22 | auto sw = StopWatch(AutoStart.no); 23 | foreach(i; 0..n.length) 24 | { 25 | double[3] _times; 26 | auto data = UniformVariable!T(0, 1).randomSlice(n[i], 784L); 27 | foreach(ref t; _times[]) 28 | { 29 | sw.start(); 30 | auto mat = calculateKernelMatrix!(K, T)(kernel, data); 31 | sw.stop(); 32 | t = sw.peek.total!"nsecs"/1000_000_000.0; 33 | sw.reset(); 34 | } 35 | times[i] = sum!"naive"(_times[])/3.0; 36 | if(verbose) 37 | { 38 | writeln("Average time for n = ", n[i], ", ", times[i], " seconds."); 39 | writeln("Detailed times: ", _times, "\n"); 40 | } 41 | } 42 | return tuple(n, times); 43 | } 44 | 45 | auto runKernelBenchmark(KS)(KS kernels, long[] n, bool verbose = true) 46 | { 47 | auto tmp = bench(kernels[0], n, verbose); 48 | alias R = typeof(tmp); 49 | R[kernels.length] results; 50 | results[0] = tmp; 51 | static foreach(i; 1..kernels.length) 52 | { 53 | if(verbose) 54 | { 55 | writeln("Running benchmarks for ", kernels[i]); 56 | } 57 | results[i] = bench(kernels[i], n, verbose); 58 | } 59 | return results; 60 | } 61 | 62 | void writeRow(File file, string[] row) 63 | { 64 | string line = ""; 65 | foreach(i; 0..(row.length - 1)) 66 | line ~= row[i] ~ ","; 67 | line ~= row[row.length - 1] ~ "\n"; 68 | file.write(line); 69 | return; 70 | } 71 | 72 | 73 | void runAllKernelBenchmarks(T = float)(bool verbose = true) 74 | { 75 | auto kernels = tuple(DotProduct!(T)(), Gaussian!(T)(1), Polynomial!(T)(2.5f, 1), 76 | Exponential!(T)(1), Log!(T)(3), Cauchy!(T)(1), 77 | Power!(T)(2.5f), Wave!(T)(1), Sigmoid!(T)(1, 1)); 78 | auto kernelNames = ["DotProduct", "Gaussian", "Polynomial", 79 | "Exponential", "Log", "Cauchy", 80 | "Power", "Wave", "Sigmoid"]; 81 | //long[] n = [100L, 500L, 1000L]; 82 | long[] n = [1000L, 5000L, 10_000L, 20_000L, 30_000L]; 83 | auto results = runKernelBenchmark(kernels, n, verbose); 84 | 85 | auto table = new string[][] (n.length * kernels.length + 1, 4); 86 | table[0][] = ["language", "kernel", "nitems", "time"]; 87 | auto tmp = ["D", "", "", ""]; 88 | while(true) 89 | { 90 | auto k = 1; 91 | foreach(i; 0..kernels.length) 92 | { 93 | tmp = ["D", kernelNames[i], "", ""]; 94 | foreach(j; 0..n.length) 95 | { 96 | tmp[2] = to!(string)(results[i][0][j]); 97 | tmp[3] = to!(string)(results[i][1][j]); 98 | table[k][] = tmp.dup; 99 | k += 1; 100 | } 101 | } 102 | if(k > (table.length - 1)) 103 | { 104 | break; 105 | } 106 | } 107 | auto file = File("../data/dNDSliceBench.csv", "w"); 108 | foreach(row; table) 109 | file.writeRow(row); 110 | 111 | writeln("table: ", table); 112 | } 113 | 114 | void main() 115 | { 116 | runAllKernelBenchmarks(); 117 | } 118 | -------------------------------------------------------------------------------- /ndslice/source/ndslice/kernels.d: -------------------------------------------------------------------------------- 1 | module ndslice.kernels; 2 | import core.stdc.tgmath: tanh; 3 | import mir.algorithm.iteration; 4 | import mir.math.common; 5 | import mir.ndslice; 6 | 7 | import std.parallelism; 8 | 9 | /** 10 | Kernel Function Types: 11 | */ 12 | struct DotProduct(T) 13 | { 14 | public: 15 | this(T _nothing) 16 | {} 17 | T opCall(Slice!(T*) x, Slice!(T*) y) const 18 | { 19 | T dist = 0; 20 | auto m = x.length; 21 | for(size_t i = 0; i < m; ++i) 22 | { 23 | dist += x[i] * y[i]; 24 | } 25 | return dist; 26 | } 27 | } 28 | 29 | struct Gaussian(T) 30 | { 31 | private: 32 | T theta; 33 | public: 34 | this(T _theta) 35 | { 36 | theta = _theta; 37 | } 38 | T opCall(Slice!(T*) x, Slice!(T*) y) 
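/* Editorial note: the kernels in this module are line-for-line the same as in d/kernel.d; only the argument type changes from T[] to Mir's Slice!(T*), and the math functions come from mir.math.common (with tanh from core.stdc.tgmath). */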
const 39 | { 40 | T dist = 0; 41 | auto m = x.length; 42 | for(size_t i = 0; i < m; ++i) 43 | { 44 | auto tmp = x[i] - y[i]; 45 | dist += tmp * tmp; 46 | } 47 | return exp(-sqrt(dist)/theta); 48 | } 49 | } 50 | 51 | struct Polynomial(T) 52 | { 53 | private: 54 | T d; 55 | T offset; 56 | public: 57 | this(T _d, T _offset) 58 | { 59 | d = _d; 60 | offset = _offset; 61 | } 62 | T opCall(Slice!(T*) x, Slice!(T*) y) const 63 | { 64 | T dist = 0; 65 | auto m = x.length; 66 | for(size_t i = 0; i < m; ++i) 67 | { 68 | dist += x[i] * y[i]; 69 | } 70 | return pow(dist + offset, d); 71 | } 72 | } 73 | 74 | struct Exponential(T) 75 | { 76 | private: 77 | T theta; 78 | public: 79 | this(T _theta) 80 | { 81 | theta = _theta; 82 | } 83 | T opCall(Slice!(T*) x, Slice!(T*) y) const 84 | { 85 | T dist = 0; 86 | auto m = x.length; 87 | for(size_t i = 0; i < m; ++i) 88 | { 89 | dist -= fabs(x[i] - y[i]); 90 | } 91 | return exp(dist/theta); 92 | } 93 | } 94 | 95 | struct Log(T) 96 | { 97 | private: 98 | T beta; 99 | public: 100 | this(T _beta) 101 | { 102 | beta = _beta; 103 | } 104 | T opCall(Slice!(T*) x, Slice!(T*) y) const 105 | { 106 | T dist = 0; 107 | auto m = x.length; 108 | for(size_t i = 0; i < m; ++i) 109 | { 110 | dist += pow(fabs(x[i] - y[i]), beta); 111 | } 112 | dist = pow(dist, 1/beta); 113 | return -log(1 + dist); 114 | } 115 | } 116 | 117 | struct Cauchy(T) 118 | { 119 | private: 120 | T theta; 121 | public: 122 | this(T _theta) 123 | { 124 | theta = _theta; 125 | } 126 | T opCall(Slice!(T*) x, Slice!(T*) y) const 127 | { 128 | T dist = 0; 129 | auto m = x.length; 130 | for(size_t i = 0; i < m; ++i) 131 | { 132 | auto tmp = x[i] - y[i]; 133 | dist += tmp * tmp; 134 | } 135 | dist = sqrt(dist)/theta; 136 | return 1/(1 + dist); 137 | } 138 | } 139 | 140 | struct Power(T) 141 | { 142 | private: 143 | T beta; 144 | public: 145 | this(T _beta) 146 | { 147 | beta = _beta; 148 | } 149 | T opCall(Slice!(T*) x, Slice!(T*) y) const 150 | { 151 | T dist = 0; 152 | auto m = x.length; 153 | for(size_t i = 0; i < m; ++i) 154 | { 155 | dist += pow(fabs(x[i] - y[i]), beta); 156 | } 157 | return -pow(dist, 1/beta); 158 | } 159 | } 160 | 161 | struct Wave(T) 162 | { 163 | private: 164 | T theta; 165 | public: 166 | this(T _theta) 167 | { 168 | theta = _theta; 169 | } 170 | T opCall(Slice!(T*) x, Slice!(T*) y) const 171 | { 172 | T dist = 0; 173 | auto m = x.length; 174 | for(size_t i = 0; i < m; ++i) 175 | { 176 | dist += fabs(x[i] - y[i]); 177 | } 178 | auto tmp = theta/dist; 179 | return tmp*sin(1/tmp); 180 | } 181 | } 182 | 183 | struct Sigmoid(T) 184 | { 185 | private: 186 | T beta0; 187 | T beta1; 188 | public: 189 | this(T _beta0, T _beta1) 190 | { 191 | beta0 = _beta0; 192 | beta1 = _beta1; 193 | } 194 | T opCall(Slice!(T*) x, Slice!(T*) y) const 195 | { 196 | T dist = 0; 197 | auto m = x.length; 198 | for(size_t i = 0; i < m; ++i) 199 | { 200 | dist += x[i] * y[i]; 201 | } 202 | return tanh(beta0 * dist + beta1); 203 | } 204 | } 205 | 206 | /************************************************************************************/ 207 | 208 | auto calculateKernelMatrix(alias K, T)(K!(T) kernel, Slice!(T*, 2) data) 209 | { 210 | size_t n = data.length!0; 211 | auto mat = uninitSlice!(T)(n, n); 212 | foreach(j, arrj; taskPool.parallel(data)) 213 | foreach (i; j .. 
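/* lower triangle only: the statement below assigns both mat[i, j] and its mirror mat[j, i] from a single kernel evaluation */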
n) 214 | mat[j, i] = mat[i, j] = kernel(data[i], arrj); 215 | return mat; 216 | } 217 | -------------------------------------------------------------------------------- /script.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/bash 2 | cd d 3 | printf "\n#============== Running D Benchmark ==============#\n" 4 | ./script.sh 5 | cd ../chapel 6 | printf "\n#============== Running Chapel Benchmark ==============#\n" 7 | ./script.sh 8 | cd ../julia 9 | printf "\n#============== Running Julia Benchmark ==============#\n" 10 | ./script.sh 11 | printf "\n#============== Running Mir NDSlice Benchmark ==============#\n" 12 | cd ../ndslice 13 | dub build --compiler=ldc2 --build=release --force 14 | /usr/bin/time -v ./ndslice 15 | --------------------------------------------------------------------------------