├── EULA ├── FAQ ├── README.md ├── doc ├── SASSI-Tutorial-Micro2015.pptx ├── sassi-isca-2015.pdf ├── sassi-user-guide.pdf └── tex │ ├── Makefile │ ├── background.tex │ ├── command-line.tex │ ├── discussion.tex │ ├── environment.tex │ ├── example1.tex │ ├── example2.tex │ ├── example3.tex │ ├── example4.tex │ ├── figures │ └── sassi-flow.pdf │ ├── installing.tex │ ├── instrumenting.tex │ ├── introduction.tex │ ├── required.tex │ ├── restrictions.tex │ └── sassi.tex ├── example ├── Makefile ├── inc │ ├── exception.h │ ├── helper_functions.h │ ├── helper_image.h │ ├── helper_string.h │ └── helper_timer.h └── matrixMul.cu └── instlibs ├── LICENSE ├── Makefile ├── env.mk ├── src ├── Makefile ├── branch.cu ├── cfg.cu ├── memdiverge.cu ├── opcode.cu ├── ophist.cu ├── ophist_fermi.cu ├── stubs.cu └── valueprof.cu └── utils ├── sassi_cuptiwrapper.hpp ├── sassi_dictionary.hpp ├── sassi_intrinsics.h ├── sassi_lazyallocator.hpp ├── sassi_managed.hpp ├── sassi_srcmap.hpp └── utiltest ├── Makefile └── test_dictionary.cu /FAQ: -------------------------------------------------------------------------------- 1 | 2 | 1. Why can't I get the source code for SASSI? 3 | 4 | SASSI is cooked into the production backend compiler, which is not 5 | open source, and probably never will be. If you really want to see 6 | the code, please come work for NVIDIA. :) 7 | 8 | 2. Why is the installation script crashing with the following error?: 9 | 10 | Traceback (most recent call last): 11 | File "setup", line 114, in 12 | install(cuda7Root, sassi7Root) 13 | File "setup", line 56, in install 14 | print "COPYING files from {}...".format(cuda_dir) 15 | ValueError: zero length field name in format 16 | 17 | The installation script is written in Python, and it uses a 18 | command, `format`, that requires Python 2.7 or newer. Please 19 | upgrade your version of Python and try again if you run into 20 | this problem. 21 | 22 | 3. Why don't the instrumentation libraries build for Fermi? 
23 | 24 | Most of the instrumentation libraries use features that are only 25 | available in 3.0+ architectures. There are a couple of very simple 26 | example libraries that work with Fermi. Try `make fermi-only` 27 | when building the instrumentation libraries to build them. 28 | 29 | 4. Why is my instrumented program generating "illegal memory access" 30 | errors? 31 | 32 | We have seen this for a number of reasons, including: 33 | 34 | 1. Your driver version is not at least 346.41. You can test your 35 | driver version with the "nvidia-smi" command. 36 | 37 | 2. You built your instrumentation libraries on one platform, but 38 | are trying to link them in on another platform. I have 39 | encountered this when I built my libraries on an old version of 40 | CentOS, but accidentally linked them in on a newer Debian 41 | platform. 42 | 43 | 3. You're using an old version of CUPTI. If you're going to use 44 | CUPTI, you should use the CUDA 7 version. 45 | 46 | 5. Why doesn't the SASSI version of ptxas generate exactly the same 47 | code as the version distributed in the official toolkit? 48 | 49 | The backend compiler in which SASSI is embedded is based on a fork 50 | of the production compiler, and thus in some cases, code generation 51 | may differ from that of the official ptxas. But in all cases, the 52 | performance of the generated code should be highly competitive. 53 | 54 | 6. Why am I seeing a bunch of CUPTI-related link errors? 55 | 56 | Link order is very important with nvcc, with library functions only 57 | being linked in "as-needed". Therefore, if your instrumentation 58 | library depends on CUPTI, make sure to link in CUPTI *after* your 59 | library. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | SASSI is deprecated! 2 | ========================================== 3 | 4 | SASSI is now officially too old to use. 
For your current NVIDIA GPU instrumentation needs, please use [NVBit!](https://github.com/NVlabs/NVBit) 5 | 6 | 7 | 8 | SASSI Instrumentation Tool for NVIDIA GPUs 9 | ========================================== 10 | 11 | This project contains the SASSI instrumentation tool. SASSI is not 12 | part of the official CUDA toolkit, but instead is a research prototype 13 | from the Architecture Research Group at NVIDIA. 14 | 15 | SASSI is a selective instrumentation framework 16 | for NVIDIA GPUs. SASSI stands for SASS Instrumenter, where SASS is 17 | NVIDIA's name for its native ISA. SASSI is a pass in NVIDIA's backend 18 | compiler, ptxas, that selectively inserts instrumentation 19 | code. The purpose of SASSI is to allow users to measure or modify 20 | ptxas-generated SASS by injecting instrumentation code 21 | *during* code generation. 22 | 23 | NVIDIA has many excellent development tools. Why the need for another 24 | tool? NVIDIA's tools such as cuda-memcheck and nvvp provide excellent, 25 | but *fixed-function* inspection of programs. While they are great at 26 | what they are designed for, the user has to choose from a fixed menu 27 | of program characteristics to measure. If you want to measure some 28 | aspect of program execution outside the purview of those tools you are 29 | out of luck. SASSI allows users to flexibly inject their own 30 | instrumentation to measure novel aspects of GPGPU execution. 31 | 32 | SASSI consists of two main components: 33 | * A closed-source fork of NVIDIA's PTX assembler, ptxas, that is capable of 34 | injecting instrumentation code during compilation. SASSI's version of 35 | ptxas is distributed on GitHub via "Releases". 36 | * Several realistic samples that demonstrate SASSI's operation. 37 | 38 | Newest release notes 39 | ========================================== 40 | 41 | * We have added some new features. 
There is a new instrumentation 42 | library that demonstrates how to map a SASS instruction with a given 43 | PUPC (SASSI's version of a PC) to the CUDA source. See the "branch" 44 | library for its usage. Also see the `branch` target in 45 | `example/Makefile` for the compiler flags necessary to use the new 46 | feature. 47 | 48 | * Support for emulating novel SASS instructions for ISA exploration is 49 | more stable. We have not yet documented this feature because we are 50 | still working out the kinks, but if you are interested in this 51 | feature, please contact me. 52 | 53 | * Bug fix. The PUPC was invalid for functions with long names. This 54 | fix requires installing the latest SASSI binaries. 55 | 56 | Prerequisites 57 | ------------------ 58 | 59 | SASSI has the following system prerequisites: 60 | 61 | 1. Platform requirement: SASSI requires an X86 64-bit host; a Fermi-, 62 | Kepler-, or Maxwell-based GPU; and at the time of this writing we 63 | have generated SASSI for Ubuntu (12, 14, and 15), Debian 7 and 8, and CentOS 6 and 7. 64 | 2. Install CUDA 7: At the time of this writing, CUDA 7 can be 65 | fetched [from here](https://developer.nvidia.com/cuda-toolkit-70). 66 | 3. Make sure you have a 346.41 driver or newer: The CUDA 7 67 | installation script can install a new driver for you that meets this 68 | requirement. If you already have a newer driver, that should be 69 | fine. You can test your driver version with the `nvidia-smi` 70 | command. 71 | 4. The installation script requires Python 2.7 or newer. 72 | 73 | Installation 74 | ------------------ 75 | 76 | After you have fulfilled your prerequisites, install SASSI by doing the following: 77 | 78 | 1. Find the release for your platform by clicking on the "release" tab on the 79 | GitHub project page, or by [navigating 80 | here](https://github.com/NVlabs/SASSI/releases). Find your 81 | architecture in the "Downloads" list and download. This download is a 82 | very simple binary installer. 
83 | 2. Run the installer via `sh`, for example, `sh SASSI_x86_64_centos_6.run`. 84 | 85 | You might need to run the installer as root, depending on where you 86 | plan to install SASSI. 87 | 88 | Usage 89 | ------------------ 90 | 91 | For usage, please follow the instructions in the user guide, which you 92 | can find in `doc/sassi-user-guide.pdf`. 93 | 94 | Additionally, `ptxas -h` lists SASSI's supported options. 95 | 96 | Restrictions and caveats 97 | ------------------ 98 | 99 | 1. 32-bit architectures are not supported. 100 | 101 | This was an early design decision to reduce the large cross product of 102 | possible configurations. Please let us know if 32-bit support would 103 | be useful though, because it probably wouldn't be too hard to 104 | support. 105 | 106 | 2. Programs currently have to be compiled with `-rdc=true`, which 107 | affects performance. 108 | 109 | SASSI allows users to instrument code by injecting function calls to 110 | user-defined functions that are later linked in. In order to perform 111 | cross-module function calls in CUDA one must use the "relocatable 112 | device code" option, `-rdc=true`. Future versions of SASSI may remove 113 | this restriction. 114 | 115 | 3. Minimum driver required is 346.41. 116 | 117 | This version of SASSI is designed to work with the CUDA 7 toolchain, 118 | which also has that requirement. 
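To check the driver requirement above from a script, a minimal sketch along the following lines may help. It assumes GNU `sort -V` is available and that your `nvidia-smi` supports the `--query-gpu=driver_version` query. The helper name `version_at_least` is made up for this illustration and is not part of SASSI.

```shell
# Hypothetical helper (not part of SASSI): succeeds when version $1 >= version $2.
# Relies on `sort -V` (version sort) to order dotted version strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Query the installed driver only when nvidia-smi is on the PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if version_at_least "$driver" 346.41; then
        echo "driver $driver meets the 346.41 minimum"
    else
        echo "driver $driver is older than 346.41"
    fi
fi
```

On a machine without `nvidia-smi` the script prints nothing; the comparison helper can still be reused on a version string obtained some other way.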
119 | 120 | 121 | 122 | -------------------------------------------------------------------------------- /doc/SASSI-Tutorial-Micro2015.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVlabs/SASSI/8c758417c7f1e18af0b35aaddfa1d9f10782004b/doc/SASSI-Tutorial-Micro2015.pptx -------------------------------------------------------------------------------- /doc/sassi-isca-2015.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVlabs/SASSI/8c758417c7f1e18af0b35aaddfa1d9f10782004b/doc/sassi-isca-2015.pdf -------------------------------------------------------------------------------- /doc/sassi-user-guide.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVlabs/SASSI/8c758417c7f1e18af0b35aaddfa1d9f10782004b/doc/sassi-user-guide.pdf -------------------------------------------------------------------------------- /doc/tex/Makefile: -------------------------------------------------------------------------------- 1 | 2 | sassi-guide.pdf: sassi.tex 3 | pdflatex sassi.tex 4 | pdflatex sassi.tex 5 | pdflatex sassi.tex 6 | mv sassi.pdf ../sassi-user-guide.pdf 7 | 8 | clean: 9 | $(RM) -f *~ *.log *.aux *.lot *.lof *.toc -------------------------------------------------------------------------------- /doc/tex/background.tex: -------------------------------------------------------------------------------- 1 | \section{NVIDIA's architectures and compiler flow} 2 | 3 | This section provides background on basic GPU architecture terminology 4 | and the NVIDIA GPU compilation flow. Some of the examples later in 5 | this guide require a baseline level of knowledge about GPU 6 | architectures. 7 | 8 | \subsection{Architecture Terminology} 9 | 10 | GPU programming models allow the creation of thousands of threads that 11 | each execute the same code. 
Threads are grouped into 32-element 12 | vectors called \emph{warps} to improve efficiency. The threads in each 13 | warp execute in a SIMT (\emph{single instruction, multiple thread}) 14 | fashion, all fetching from a single Program Counter (PC) in the 15 | absence of control flow. Many warps are then assigned to execute 16 | concurrently on a single GPU core, or \emph{streaming multiprocessor} 17 | (SM). A GPU consists of multiple such SM building blocks along with a 18 | memory hierarchy including SM-local scratchpad memories and L1 caches, 19 | a shared L2 cache, and multiple memory controllers. Different GPUs 20 | deploy differing numbers of SMs. 21 | 22 | \subsection{GPU Software Stack} 23 | 24 | Historically, NVIDIA has referred to units of code that run on the GPU 25 | as \emph{shaders}. There are several broad categories of shaders, 26 | including DirectX shaders, OpenGL shaders, and compute shaders 27 | (\emph{e.g.}, CUDA kernels). SASSI currently only handles compute 28 | shaders. For compute shaders, a \emph{front-end} compiler can be used 29 | to simplify the task of writing a shader. For example, a user can 30 | write parallel programs using high-level programming languages such as 31 | CUDA or OpenCL, and use a front-end compiler, such as NVIDIA's NVVM, 32 | to generate intermediate code in a virtual ISA called parallel thread 33 | execution (PTX)\@. 34 | 35 | PTX exposes the GPU as a data-parallel computing device by providing a 36 | stable programming model and instruction set for general purpose 37 | parallel programming, but PTX does not run directly on the GPU\@. A 38 | \emph{backend} compiler optimizes and translates PTX instructions into 39 | machine code that can run on the device. 
For compute shaders, the 40 | backend compiler can be invoked in two ways: (1) NVIDIA supports 41 | ahead-of-time compilation of compute kernels via a PTX assembler 42 | (\texttt{ptxas}), and (2) a JIT-time compiler in the \emph{display 43 | driver} can compile a PTX representation of the kernel if it is 44 | available in the binary. SASSI instrumentation relies on 45 | ahead-of-time compilation because it is embedded in \texttt{ptxas}. 46 | 47 | \begin{figure}[t] 48 | \center 49 | \includegraphics[width=0.60\textwidth]{figures/sassi-flow.pdf} 50 | \caption{SASSI's instrumentation flow.} 51 | \label{fig:sassi-comp-flow} 52 | \end{figure} 53 | 54 | Figure~\ref{fig:sassi-comp-flow} shows the compiler tool flow that 55 | includes the SASSI instrumentation process. Shaders are first 56 | compiled to an intermediate representation by a \emph{front-end} 57 | compiler. Before they can run on the GPU, however, the \emph{backend} 58 | compiler must read the intermediate representation and generate 59 | SASS\@. When more than one module is compiled, the modules have to be 60 | linked together using \texttt{nvlink} before running on the GPU. 61 | 62 | SASSI is implemented as the final compiler pass in \texttt{ptxas}, and 63 | as such it does not disrupt the perceived final instruction schedule 64 | or register usage. Furthermore as part of \texttt{ptxas}, SASSI is 65 | capable of instrumenting programs written in languages that target 66 | PTX, which includes CUDA and OpenCL. Apart from the injected 67 | instrumentation code, the original SASS code ordering remains 68 | unaffected. With the SASSI prototype we use \texttt{nvlink} to link 69 | the instrumented applications with user-level instrumentation 70 | handlers. SASSI could at some point in the future also be embedded in 71 | the driver to JIT compile PTX inputs, as shown by dotted lines in 72 | Figure~\ref{fig:sassi-comp-flow}. 
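The stages of this flow can be walked through by hand with the stock CUDA tools. The sketch below only prints the commands for each stage (a dry run), so it can be read or run without a GPU. The function name \texttt{show\_flow} and the output file names are hypothetical, and the SASSI-specific \texttt{ptxas} flags covered elsewhere in this guide are deliberately left out.

```shell
# Hypothetical dry-run helper: print one command per stage of the flow.
#   $1 = CUDA source file, $2 = real architecture (for example sm_50)
show_flow() {
    src=$1
    arch=$2
    base=${src%.cu}
    # Front end: CUDA source to PTX (the virtual ISA).
    echo "nvcc -ptx -o $base.ptx $src"
    # Backend: PTX to SASS for a real architecture, ahead of time.
    echo "ptxas -arch=$arch -o $base.cubin $base.ptx"
    # Inspect the SASS that the backend generated.
    echo "cuobjdump -sass $base.cubin"
}

show_flow matrixMul.cu sm_50
```

Printing the commands instead of executing them keeps the sketch independent of any particular toolkit installation; paste the output into a shell once the tools are on your path.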
73 | -------------------------------------------------------------------------------- /doc/tex/command-line.tex: -------------------------------------------------------------------------------- 1 | \pagebreak 2 | \section{Other command-line arguments for instrumentation} 3 | \label{sec:command-line-args} 4 | 5 | This section describes a few additional flags that guide how 6 | instrumentation is performed. 7 | 8 | \subsection{--sassi-debug} 9 | 10 | This flag performs two functions. First, it inserts canary 11 | values on the stack before calling a handler function; then, upon 12 | return, it performs a check to make sure the canary values are 13 | intact. If the canary values differ from expectation, the check 14 | traps. This option is meant as a sanity check to help guard against 15 | unknown stack corruption, which is a risk because SASSI manipulates the 16 | stack. 17 | 18 | This flag can also be used to selectively disable some instrumentation 19 | points. When you assemble a PTX file with this option enabled, SASSI 20 | will dump the SASS that was generated to a file named 21 | \texttt{sassi-instrumented-.txt}, and it will show you where the 22 | instrumentation was inserted. For example, look at this snippet from 23 | the example code, with the branch example 24 | (Section~\ref{sec:example2}): 25 | \begin{verbatim} 26 | . 27 | . 28 | . 29 | # BAR 30 | # ==== BASIC BLOCK HEADER ==== 31 | # IADD 32 | # IADD 33 | # ISETP 34 | +_Z13matrixMulCUDAILi16EEvPfS0_S0_ii, 111, _Z20sassi_before_handlerP17SASSIBeforeParams... 35 | # BRA 36 | # ==== BASIC BLOCK HEADER ==== 37 | # SYNC 38 | . 39 | . 40 | . 41 | \end{verbatim} 42 | The dump shows the instructions that were processed by SASSI, 43 | including the ones around which SASSI added instrumentation. The 44 | lines that begin with ``+'' show where instrumentation was inserted. 
45 | The tuple on each ``+'' line shows the kernel, the unique function ID 46 | of the instrumented instruction, and the handler that was called for 47 | each instrumentation site. So, the first ``+'' line shows that a call 48 | to \texttt{sassi\_before\_handler} was inserted before the \texttt{BRA} 49 | instruction. 50 | 51 | If you take this output and put it in a file called 52 | \texttt{sassi-skips.txt}, you can selectively disable instrumentation 53 | for certain points by changing the ``+'' to a ``-''. This selective 54 | skipping of instrumentation points is only enabled in 55 | \texttt{--sassi-debug} mode. This simple interface has allowed us to 56 | binary search for problems with SASSI's code generation. Users 57 | may also find it useful, though tedious, for finer-grained control over 58 | instrumentation sites. 59 | 60 | \subsection{--sassi-inst-inline} 61 | 62 | By default SASSI adds instrumentation via a code trampoline, which 63 | better preserves the original code layout, i.e., the offset of each 64 | instruction from the beginning of the function/kernel. More 65 | specifically, SASSI replaces the instrumented instruction with a jump to 66 | an instrumentation trampoline, which performs the instrumentation, 67 | executes the instruction, and if necessary, branches back to the next 68 | instruction to be executed. When this option is set to true, SASSI will 69 | instead forgo the code trampoline and inject the instrumentation 70 | directly inline. 71 | 72 | \subsection{--sassi-iff-true-predicate-handler-call} 73 | 74 | NVIDIA GPUs support predication. Most instructions can have a 75 | boolean source argument that dictates whether or not that instruction 76 | will actually execute. If this predicate operand is false, should the 77 | instrumentation handler be called? By default, the handler is always 78 | called, regardless of the predicate's value. 
For ``before'' 79 | instrumentation, the handler can determine whether or not the 80 | instruction \emph{will execute}; in the case of ``after'' 81 | instrumentation, the instrumentation handler can inspect the value of 82 | the guarding predicate. 83 | 84 | With this flag enabled, the handler will be called if and only 85 | if the instruction is actually executed. 86 | 87 | -------------------------------------------------------------------------------- /doc/tex/discussion.tex: -------------------------------------------------------------------------------- 1 | \section{Discussion} 2 | 3 | SASSI has some clear shortcomings. First, as we mentioned in the 4 | introduction, it is not a binary instrumenter, so source code is 5 | required. Furthermore, the additional complexity of the build process 6 | has proven to be fairly difficult for some users. A true binary 7 | instrumentation approach like that of \texttt{cuda-memcheck} would 8 | make the process seamless. On the other hand, there are some 9 | advantages: we are investigating porting the tool to the driver, which 10 | would allow us to handle graphics applications (which are 11 | JITted)\footnote{For this case, it does not make sense to have the 12 | driver compile the application, then disassemble it, patch it, then 13 | re-apply.}; in 14 | addition, SASSI can gather potentially useful properties that the 15 | compiler knows about such as register liveness, operand datatypes, 16 | basic block boundaries, etc. Related to this, the instrumentation 17 | code should be much less intrusive than that of binary patching. 18 | -------------------------------------------------------------------------------- /doc/tex/environment.tex: -------------------------------------------------------------------------------- 1 | \section{Setting up your environment for the examples} 2 | \label{sec:environment} 3 | 4 | The next four sections walk users through complete examples. 
First, 5 | let's compile all the instrumentation handlers that come with SASSI, 6 | including the ones the next few sections cover. Start by navigating to 7 | the instrumentation library directory: 8 | \begin{lstlisting}[style=BashInputStyle] 9 | cd $SASSI_REPO/instlibs 10 | \end{lstlisting} 11 | Next, we need to edit the file \texttt{env.mk}. There are four 12 | environment variables we need to set: 13 | \begin{itemize} 14 | \item \texttt{CUDA\_HOME}: This is the location of your CUDA 7 root. 15 | \item \texttt{SASSI\_HOME}: This is the location of your SASSI 16 | installation root (not the repository root). 17 | \item \texttt{CCBIN}: The location of the C++-11 compiler you want to 18 | use. 19 | \item \texttt{GENCODE}: What GPUs you want to compile your 20 | instrumentation libraries for. \hl{Remember you must specify a 21 | \emph{real} architecture, not a virtual one.} 22 | \end{itemize} 23 | 24 | After editing the file, you \emph{should} be able to simply invoke 25 | \texttt{make} (in \texttt{\$SASSI\_REPO/instlibs}), which will compile 26 | the instrumentation libraries and install them in 27 | \texttt{\$SASSI\_REPO/instlibs/lib}. \hl{For users that are targeting 28 | Fermi cards (sm\_20, sm\_21), it is very important to note that not 29 | all of the instrumentation libraries will compile. Most of the 30 | instrumentation libraries use features that are only available in 31 | 3.0+ architectures. 
There is a special make target, \texttt{make 32 | fermi-only}, that will only make the instrumentation libraries 33 | that will run on Fermi.} 34 | 35 | The examples in the next several sections all rely heavily on NVIDIA's 36 | CUPTI library, so let's add it to our \texttt{LD\_LIBRARY\_PATH}: 37 | \begin{lstlisting}[style=BashInputStyle] 38 | export LD_LIBRARY_PATH=$SASSI_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH 39 | \end{lstlisting} 40 | -------------------------------------------------------------------------------- /doc/tex/example1.tex: -------------------------------------------------------------------------------- 1 | \section{Example 1: Creating a histogram of opcodes} 2 | \label{sec:example1} 3 | 4 | This simple example creates a histogram of the opcodes that are 5 | encountered during the execution of a program. As we have already 6 | mentioned, the compiler flow for SASSI instrumentation involves 7 | 1) compiling the application so that callbacks to instrumentation 8 | handlers are inserted, and 2) defining the instrumentation handler, 9 | compiling it, and linking it in with the application. 10 | 11 | Before looking at these two steps in detail, let's just build and run 12 | the example. If you have not done so, please set up your environment 13 | per the instructions in Section~\ref{sec:environment}. Now, do: 14 | \begin{lstlisting}[style=BashInputStyle] 15 | # Go to the example directory. 16 | cd $SASSI_REPO/example 17 | make clean 18 | # Instrument the application and link with the instrumentation library. 19 | make ophist 20 | # Run the example. (Alternatively, type 'make run') 21 | ./matrixMul 22 | \end{lstlisting} 23 | 24 | This can take several seconds to minutes depending on the horsepower 25 | of your GPU. If all went well, you should now have a file in that 26 | directory, called \texttt{sassi-ophist.txt}, that contains a histogram 27 | of the encountered opcodes. 
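Once \texttt{sassi-ophist.txt} exists, a quick way to see which opcodes dominate is to sort it by count. The sketch below assumes each line of the file has the form \texttt{OPCODE : COUNT} (an opcode name, a colon, then a decimal count). The function name \texttt{top\_opcodes} is made up for this illustration.

```shell
# Hypothetical helper: print the N most frequently executed opcodes.
#   $1 = histogram file, $2 = how many rows to show (defaults to 10)
# Assumes each line looks like "OPCODE    : COUNT".
top_opcodes() {
    file=$1
    n=${2:-10}
    # Split fields on ':' and sort numerically, descending, on the count.
    sort -t: -k2,2nr "$file" | head -n "$n"
}
```

For example, \texttt{top\_opcodes sassi-ophist.txt 5} lists the five most frequently executed opcodes.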
28 | 29 | \subsection{How to compile your application} 30 | 31 | Now let's take a look at how the makefile builds the 32 | \texttt{matrixMul} application. For this example, we want to track 33 | all instructions that execute, so we'll need to tell SASSI to add 34 | instrumentation callbacks before \emph{all} instructions: 35 | \begin{lstlisting}[style=BashInputStyle] 36 | /usr/local/sassi7/bin/nvcc -I./inc -c -O3 \ 37 | -gencode arch=compute_50,code=sm_50 \ 38 | -Xptxas --sassi-inst-before=''all'' \ 39 | -dc \ 40 | -o matrixMul.o matrixMul.cu 41 | \end{lstlisting} 42 | 43 | Lines (1) and (5) are the options that you would ordinarily compile 44 | your CUDA application with. Line (2) is required because we need the 45 | compiler to generate \emph{actual}, not virtual, code. In this case, we 46 | are targeting a first-generation Maxwell card, with \texttt{sm\_50}. 47 | Line (3) instructs SASSI to inject instrumentation before all SASS 48 | instructions, and line (4) is required because we are going to link in 49 | the instrumentation handler momentarily, and cross-module calls 50 | require ``relocatable device code.'' In fact, let's look at the 51 | message that SASSI spits out: 52 | \begin{verbatim} 53 | ****************************************************************************** 54 | * 55 | * SASSI Instrumentation Details 56 | * 57 | * For the settings you passed in, you'll need to make sure that you have 58 | * an instrumentation library with the following properties: 59 | * - It MUST BE compiled using only 16 registers!! 
To accomplish this 60 | * simply compile your library with the nvcc flag, --maxrregcount=16 61 | * - It must define the following functions: 62 | * __device__ void sassi_before_handler(SASSIBeforeParams*) 63 | * 64 | ****************************************************************************** 65 | \end{verbatim} 66 | 67 | This tells us that we have to define and link in a function with the 68 | signature,\\ 69 | \lstinline{__device__ void sassi_before_handler(SASSIBeforeParams*)}, 70 | which we have defined in a library called 71 | \texttt{libophist.a}. Don't worry, we'll eventually show you what's in the 72 | instrumentation library, and how we build it, but for now, let's just 73 | link it in: 74 | \begin{lstlisting}[style=BashInputStyle] 75 | /usr/local/sassi7/bin/nvcc -o matrixMul matrixMul.o \ 76 | -gencode arch=compute_50,code=sm_50 \ 77 | -lcudadevrt \ 78 | -L/usr/local/sassi7/extras/CUPTI/lib64 -lcupti \ 79 | -L../instlibs/lib -lophist 80 | \end{lstlisting} 81 | 82 | This example uses \texttt{nvcc} to link the instrumented application 83 | with the instrumentation handler and CUPTI. The instrumentation 84 | library, as we'll see soon, uses CUPTI to 85 | initialize on-device counters and copy them off the device before it 86 | is reset. Line (2) again specifies the NVIDIA architecture that we are 87 | targeting. Lines (3-5) link in the various required libraries for 88 | this application. Of note, line (4) links in CUPTI, and line (5) 89 | links in the instrumentation library, \texttt{ophist}. 90 | 91 | It is a good idea to use \texttt{cuobjdump} to dump the code that 92 | was generated, and make sure that your application was instrumented: 93 | \begin{lstlisting}[style=BashInputStyle] 94 | /usr/local/sassi/bin/cuobjdump -sass matrixMul 95 | \end{lstlisting} 96 | 97 | You should see \emph{a lot} of code bloat! Each original instruction 98 | should be supplanted with a branch to an instrumentation trampoline. 
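One rough way to put a number on that bloat is to count instruction lines in the \texttt{cuobjdump} listing for an uninstrumented build and for the instrumented one. The sketch below assumes that every instruction line in the listing carries a hexadecimal address comment such as \texttt{/*0008*/}; that is an assumption about the listing format, and the function name \texttt{count\_sass\_instrs} is made up for this illustration.

```shell
# Hypothetical helper: count SASS instructions in a listing read from stdin.
# Assumes instruction lines carry an address comment such as /*0008*/;
# lines holding only an encoding comment (/* 0x... */) are not counted.
count_sass_instrs() {
    # grep -c exits non-zero when nothing matches; keep the function's
    # exit status clean so it is safe under `set -e`.
    grep -c '/\*[0-9a-f]\{4,\}\*/' || true
}
```

A possible use is \texttt{cuobjdump -sass matrixMul | count\_sass\_instrs}, run once on each build and compared.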
99 | 100 | \subsection{Writing the instrumentation library} 101 | 102 | The instrumentation library that corresponds to this example is fully 103 | implemented in \\ 104 | \texttt{\$SASSI\_REPO/instlibs/src/ophist.cu}. The library code is shown in 105 | Figure~\ref{fig:handler-example1}. 106 | 107 | \begin{figure*}[h!] 108 | \begin{lstlisting} 109 | #include 110 | #include 111 | #include 112 | #include 113 | #include "sassi_intrinsics.h" 114 | #include "sassi_lazyallocator.hpp" 115 | #include 116 | 117 | // Keep track of all the opcodes that were executed. 118 | __managed__ unsigned long long dynamic_instr_counts[SASSI_NUM_OPCODES]; 119 | 120 | /////////////////////////////////////////////////////////////////////////////////// 121 | /// This is a SASSI handler that handles only basic information about each 122 | /// instrumented instruction. The calls to this handler are placed by 123 | /// convention *before* each instrumented instruction. 124 | /////////////////////////////////////////////////////////////////////////////////// 125 | __device__ void sassi_before_handler(SASSIBeforeParams* bp) 126 | { 127 | if (bp->GetInstrWillExecute()) 128 | { 129 | SASSIInstrOpcode op = bp->GetOpcode(); 130 | atomicAdd(&(dynamic_instr_counts[op]), 1ULL); 131 | } 132 | } 133 | 134 | /////////////////////////////////////////////////////////////////////////////////// 135 | /// Write out the statistics we've gathered. 
136 | /////////////////////////////////////////////////////////////////////////////////// 137 | static void sassi_finalize() 138 | { 139 | cudaDeviceSynchronize(); 140 | 141 | FILE *resultFile = fopen("sassi-ophist.txt", "w"); 142 | for (unsigned i = 0; i < SASSI_NUM_OPCODES; i++) { 143 | if (dynamic_instr_counts[i] > 0) { 144 | fprintf(resultFile, "%-10.10s: %llu\n", SASSIInstrOpcodeStrings[i], dynamic_instr_counts[i]); 145 | } 146 | } 147 | fclose(resultFile); 148 | } 149 | 150 | /////////////////////////////////////////////////////////////////////////////////// 151 | /// Lazily initialize the counters before the first kernel launch. 152 | /////////////////////////////////////////////////////////////////////////////////// 153 | static sassi::lazy_allocator counterInitializer( 154 | /* Initialize the counters. */ 155 | []() { 156 | bzero(dynamic_instr_counts, sizeof(dynamic_instr_counts)); 157 | }, sassi_finalize); 158 | \end{lstlisting} 159 | \caption{Instrumentation library for creating a histogram of opcodes.} 160 | \label{fig:handler-example1} 161 | \end{figure*} 162 | 163 | Notice, the handler, \texttt{sassi\_before\_handler} is a 164 | \texttt{\_\_device\_\_} function, which means it will run on the GPU. 165 | As you've seen, with the right instrumentation flags, we can get SASSI 166 | to inject calls to this function before every instruction. 167 | 168 | This example is extremely simple, and only requires the basic 169 | information SASSI extracts about each instruction, which is passed in 170 | as an argument of type \texttt{SASSIBeforeParams*}. The handler first checks whether 171 | the instruction will execute with the \texttt{GetInstrWillExecute()} 172 | method. Note this is because an instruction might have a guarding 173 | predicate that is \texttt{false} for this invocation of the handler. 
174 | 175 | If the instruction will execute, the handler simply gets the 176 | instruction's opcode with \texttt{GetOpcode()} and then uses CUDA's 177 | \texttt{atomicAdd} function to increment the opcode's associated 178 | counter. 179 | 180 | Even though there is a performance penalty (which our instrumentation 181 | overheads dwarf) for using Unified Virtual Memory, we find that UVM is 182 | quite convenient for writing instrumentation handlers. For this 183 | example, \texttt{dynamic\_instr\_counts[]} is declared as a 184 | \texttt{\_\_managed\_\_} type, which means that we can access it on 185 | the host \emph{and} the device. 186 | 187 | In \texttt{\$SASSI\_REPO/instlibs/utils} there is a convenience class, 188 | \lstinline{sassi::lazy_allocator}, that allows us to initialize 189 | variables before the first kernel launch, and be notified before the 190 | program exits or the device is reset (so we can dump counters to a 191 | file, for instance). This class relies on CUPTI to respond to these 192 | events. 193 | 194 | \subsubsection{Building the handler as a library} 195 | 196 | The makefile in \texttt{\$SASSI\_REPO/instlibs/src} shows the standard 197 | way to build a library with the NVIDIA CUDA toolchain. The only way to 198 | instrument with SASSI involves creating a standalone library, or even 199 | an object file, and later linking it in with your SASSI-compiled 200 | application. 
201 | 202 | Let's take a look at what the makefile does for the \texttt{ophist} 203 | library: 204 | \begin{lstlisting}[style=BashInputStyle] 205 | /usr/local/cuda-7.0/bin/nvcc \ 206 | -ccbin /usr/local/gcc-4.8.4/bin/ -std=c++11 \ 207 | -O3 \ 208 | -gencode arch=compute_50,code=sm_50 \ 209 | -c -rdc=true -o ophist.o \ 210 | ophist.cu \ 211 | --maxrregcount=16 \ 212 | -I/usr/local/sassi7/include \ 213 | -I/usr/local/sassi7/extras/CUPTI/include/ \ 214 | -I/home/mstephenson/Projects/sassi/instlibs/utils/ \ 215 | -I/home/mstephenson/Projects/sassi/instlibs/include 216 | ar r /home/mstephenson/Projects/sassi/instlibs/lib/libophist.a ophist.o 217 | ranlib /home/mstephenson/Projects/sassi/instlibs/lib/libophist.a 218 | \end{lstlisting} 219 | 220 | The instrumentation libraries in the \texttt{instlibs} directory rely on C++11 support, 221 | hence line (2). We also build the instrumentation library to support 222 | the architecture we plan to run on in (4). Change this to match your 223 | architecture. Most importantly, we instruct NVCC to 224 | use only 16 registers for the handler in (7). We include 225 | \emph{include} paths in (8)-(11). Finally, lines (12) and (13) bundle 226 | the single object file into a library. 227 | -------------------------------------------------------------------------------- /doc/tex/example2.tex: -------------------------------------------------------------------------------- 1 | \pagebreak 2 | \section{Example 2: Conditional Branch} 3 | \label{sec:example2} 4 | 5 | This example examines the branch behavior of an application. We 6 | introduce a couple of features not required for the last example. 7 | First, we need to instruct SASSI to pass an additional argument to our 8 | instrumentation handler, one that conveys information about the 9 | behavior of conditional branches; and second, we use a hash map to 10 | record per-branch statistics. 11 | 12 | Before walking through the specifics of the example, let's just build 13 | and run it.
If you have not done so, please set up your environment 14 | per the instructions in Section~\ref{sec:environment}. Now, do: 15 | \begin{lstlisting}[style=BashInputStyle] 16 | # Go to the example directory. 17 | cd $SASSI_REPO/example 18 | make clean 19 | # Instrument the application and link with the instrumentation library. 20 | make branch 21 | # Run the example. (Alternatively, type 'make run') 22 | ./matrixMul 23 | \end{lstlisting} 24 | 25 | This can take some time to finish running. Assuming everything went 26 | well, you should now have a file in that directory called 27 | \texttt{sassi-branch.txt}. Here are the contents of the file for 28 | \texttt{sm\_50}: 29 | \begin{verbatim} 30 | Address Total/32 Dvrge/32 Active Taken NTaken 31 | c2375281000000d0 1926400 61644800 61644800 32 | c237528100000668 19264000 616448000 554803200 61644800 33 | \end{verbatim} 34 | For \texttt{sm\_50}, the simple application that we are testing 35 | this instrumentation library with has only one kernel, which only has 36 | two \emph{conditional} branches (i.e., the branch direction depends on 37 | some statically unknown condition). The file separates out the branch 38 | statistics per conditional branch. The columns show 1) the number 39 | of branches encountered per warp; 2) the number of those branches that 40 | \emph{diverge} (a condition where within a warp, some threads take the 41 | branch, and others fall through); 3) the total number of active 42 | threads that executed the branch; 4) the number of active threads that 43 | take the branch; 5) the number of active threads that fall through; 44 | and 6) the type of the branch. For complete coverage of this 45 | example, please see ``Case Study I'' in the accompanying ISCA paper 46 | in the \texttt{doc} directory. 47 | 48 | \subsection{How to compile your application} 49 | 50 | Now let's take a look at how the makefile builds the 51 | \texttt{matrixMul} application for this example.
For this 52 | example, we are only interested in conditional branches, so while we 53 | could still add instrumentation before all instructions, it will be 54 | more efficient to limit our instrumentation to only conditional branches: 55 | \begin{lstlisting}[style=BashInputStyle] 56 | /usr/local/sassi7/bin/nvcc -I./inc -c -O3 \ 57 | -gencode arch=compute_50,code=sm_50 \ 58 | -Xptxas --sassi-inst-before=''cond-branches'' \ 59 | -Xptxas --sassi-before-args=''cond-branch-info'' \ 60 | -dc \ 61 | -o matrixMul.o matrixMul.cu 62 | \end{lstlisting} 63 | 64 | Lines (1) and (6) are the options that you would ordinarily compile 65 | your CUDA application with. Line (2) is required because we need the 66 | compiler to generate \emph{actual}, not virtual, code. In this case, we 67 | are targeting a first-generation Maxwell card, with \texttt{sm\_50}. 68 | Line (3) instructs SASSI to inject instrumentation before conditional 69 | branch SASS instructions only. The instrumentation library for this 70 | example needs to know the direction that the conditional branch will 71 | take (e.g., TAKEN or NOT TAKEN), and this information is available in 72 | objects of \texttt{SASSICondBranchParams}. The 73 | \texttt{cond-branch-info} flag will direct SASSI to pass such objects 74 | to the instrumentation handler. This is done in line (4). Line (5) 75 | is required because we are going to link in the instrumentation 76 | handler momentarily, and cross-module calls require ``relocatable 77 | device code.'' Here is the message that SASSI emits for this example: 78 | \begin{verbatim} 79 | ****************************************************************************** 80 | * 81 | * SASSI Instrumentation Details 82 | * 83 | * For the settings you passed in, you'll need to make sure that you have 84 | * an instrumentation library with the following properties: 85 | * - It MUST BE compiled using only 16 registers!!
To accomplish this 86 | * simply compile your library with the nvcc flag, --maxrregcount=16 87 | * - It must define the following functions: 88 | * __device__ void sassi_before_handler(SASSIBeforeParams*,SASSICondBranchParams*) 89 | * 90 | ****************************************************************************** 91 | \end{verbatim} 92 | 93 | Notice that SASSI tells us that we have to make sure to link in a 94 | handler with this signature: 95 | \begin{lstlisting} 96 | __device__ void sassi_before_handler(SASSIBeforeParams*,SASSICondBranchParams*); 97 | \end{lstlisting} 98 | which we have defined in a library called \texttt{libbranch.a}, which 99 | we'll describe shortly. We link this in exactly like we did with the 100 | last example: 101 | \begin{lstlisting}[style=BashInputStyle] 102 | /usr/local/sassi7/bin/nvcc -o matrixMul matrixMul.o \ 103 | -gencode arch=compute_50,code=sm_50 \ 104 | -lcudadevrt \ 105 | -L/usr/local/sassi7/extras/CUPTI/lib64 -lcupti \ 106 | -L../instlibs/lib -lbranch 107 | \end{lstlisting} 108 | The only difference is linking in \texttt{libbranch.a} versus 109 | \texttt{libophist.a}. 110 | 111 | \subsection{Writing the instrumentation library} 112 | 113 | The instrumentation library that corresponds to this example is fully 114 | implemented in \\ \texttt{\$SASSI\_REPO/instlibs/src/branch.cu}. The 115 | instrumentation handler portion of the library is shown in 116 | Figure~\ref{fig:handler-example2}. In the last example, we first 117 | checked to see if the instruction would execute using 118 | \texttt{bp->GetInstrWillExecute()}. We don't do that here, because 119 | the branch instruction's guarding predicate is often used to determine 120 | the branch direction. 121 | 122 | \begin{figure*}[h!] 123 | \begin{lstlisting}[numbers=left,numbersep=4pt] 124 | /////////////////////////////////////////////////////////////////////////////////// 125 | // 126 | /// This function will be inserted before every conditional branch instruction. 
127 | // 128 | /////////////////////////////////////////////////////////////////////////////////// 129 | __device__ void sassi_before_handler(SASSIBeforeParams *bp, SASSICondBranchParams *brp) 130 | { 131 | // Find out thread index within the warp. 132 | int threadIdxInWarp = get_laneid(); 133 | 134 | // Get masks and counts of 1) active threads in this warp, 135 | // 2) threads that take the branch, and 136 | // 3) threads that do not take the branch. 137 | int active = __ballot(1); 138 | bool dir = brp->GetDirection(); 139 | int taken = __ballot(dir == true); 140 | int ntaken = __ballot(dir == false); 141 | int numActive = __popc(active); 142 | int numTaken = __popc(taken); 143 | int numNotTaken = __popc(ntaken); 144 | bool divergent = (numTaken != numActive && numNotTaken != numActive); 145 | 146 | // The first active thread in each warp gets to write results. 147 | if ((__ffs(active)-1) == threadIdxInWarp) { 148 | // Get the address, we'll use it for hashing. 149 | uint64_t inst_addr = bp->GetPUPC(); 150 | 151 | // Looks up the counters associated with 'inst_addr', but if no such entry 152 | // exists, initialize the counters in the lambda. 153 | BranchCounter *stats = (*sassi_stats).getOrInit(inst_addr, [inst_addr,brp](BranchCounter* v) { 154 | v->address = inst_addr; 155 | v->branchType = brp->GetType(); 156 | v->taggedUnanimous = brp->IsUnanimous(); 157 | }); 158 | 159 | // Why not sanity check the hash map? 160 | assert(stats->address == inst_addr); 161 | assert(numTaken + numNotTaken == numActive); 162 | 163 | // Increment the various counters that are associated 164 | // with this instruction appropriately.
165 | atomicAdd(&(stats->totalBranches), 1ULL); 166 | atomicAdd(&(stats->activeThreads), numActive); 167 | atomicAdd(&(stats->takenThreads), numTaken); 168 | atomicAdd(&(stats->takenNotThreads), numNotTaken); 169 | atomicAdd(&(stats->divergentBranches), divergent); 170 | } 171 | } 172 | \end{lstlisting} 173 | \caption{Instrumentation handler portion of the conditional branch 174 | behavior profiling library. See the library's source code for the 175 | full example.} 176 | \label{fig:handler-example2} 177 | \end{figure*} 178 | 179 | The handler first determines the thread index within the warp (line 180 | 9), the bitmask of active threads (line 14), and the direction in 181 | which the thread is going to branch (line 15). CUDA provides several 182 | warp-wide broadcast and reduction operations that NVIDIA's 183 | architectures efficiently support. For example, many of the handlers 184 | we write use the \texttt{\_\_ballot(predicate)} instruction, which 185 | ``evaluates \texttt{predicate} for all active threads of the warp and 186 | returns an integer whose $N^{th}$ bit is set if and only if 187 | \texttt{predicate} evaluates to non-zero for the $N^{th}$ thread of 188 | the warp \emph{and} the $N^{th}$ thread is active.''[CUDA Programming Guide]. 189 | 190 | The handler also uses \texttt{\_\_ballot} on lines 16 and 17 to set 191 | masks corresponding to threads that are going to take the branch 192 | (\texttt{taken}), and the threads that are going to fall through 193 | (\texttt{ntaken}). With these masks, the handler uses the 194 | \emph{population count} instruction (\texttt{\_\_popc}) to efficiently 195 | determine the number of threads in each respective category 196 | (\texttt{numActive}, \texttt{numTaken}, \texttt{numNotTaken}). 197 | 198 | On line 24 the handler elects the first active thread in the warp 199 | (using the \emph{find first set} CUDA intrinsic, \texttt{\_\_ffs}) to 200 | record the results. 
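The warp-wide bookkeeping described above can also be modelled on the host, which makes it easy to check by hand. The following C++ sketch is neither SASSI nor CUDA code; \texttt{WarpBranchSummary}, \texttt{popcount32}, and \texttt{summarize} are hypothetical names introduced only for illustration. It treats the active mask and the per-lane branch direction as plain 32-bit integers and derives the same quantities that the handler computes with \texttt{\_\_ballot}, \texttt{\_\_popc}, and \texttt{\_\_ffs}.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one warp-wide execution of a conditional branch.
struct WarpBranchSummary {
    int numActive;
    int numTaken;
    int numNotTaken;
    bool divergent;
    int writerLane;  // lowest-numbered active lane, which records the result
};

// Portable population count (stands in for __popc).
int popcount32(uint32_t x) {
    int n = 0;
    for (; x; x &= x - 1) ++n;
    return n;
}

// 'active' has bit i set if lane i executes the branch; 'direction' has
// bit i set if lane i would take it. Inactive lanes never contribute,
// which mirrors how __ballot only collects votes from active threads.
WarpBranchSummary summarize(uint32_t active, uint32_t direction) {
    assert(active != 0);  // a handler only runs if some lane is active
    uint32_t taken  = active & direction;
    uint32_t ntaken = active & ~direction;

    WarpBranchSummary s;
    s.numActive   = popcount32(active);
    s.numTaken    = popcount32(taken);
    s.numNotTaken = popcount32(ntaken);
    s.divergent   = (s.numTaken != s.numActive) && (s.numNotTaken != s.numActive);

    // Find the lowest set bit (stands in for __ffs(active) - 1).
    s.writerLane = 0;
    while (!((active >> s.writerLane) & 1u)) ++s.writerLane;
    return s;
}
```

For a fully active warp in which only the lower 16 lanes take the branch, \texttt{summarize(0xFFFFFFFFu, 0x0000FFFFu)} reports 32 active, 16 taken, and 16 not-taken threads, and flags the branch as divergent.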
Because this handler records 201 | per-branch statistics, it stores its counters in a hash table. 202 | Line 30 finds the hash table entry for the instrumented 203 | branch (using the templatized hash table in 204 | \texttt{\$SASSI\_REPO/instlibs/utils}). Lines 42-46 update the 205 | counters. 206 | 207 | Under the covers, this library relies on the CUPTI library to register 208 | callbacks for kernel \emph{launch} and \emph{exit} events. Using 209 | these callbacks, which run on the host, we can appropriately marshal 210 | data to initialize and record the values in the device-side hash 211 | table. Please see the source code of the instrumentation handler for 212 | full details of the example. 213 | 214 | \subsubsection{Building the handler as a library} 215 | 216 | We build the library exactly the way we did for the \texttt{ophist} 217 | library. 218 | 219 | \begin{lstlisting}[style=BashInputStyle] 220 | /usr/local/cuda-7.0/bin/nvcc \ 221 | -ccbin /usr/local/gcc-4.8.4/bin/ -std=c++11 \ 222 | -O3 \ 223 | -gencode arch=compute_50,code=sm_50 \ 224 | -c -rdc=true -o branch.o \ 225 | branch.cu \ 226 | --maxrregcount=16 \ 227 | -I/usr/local/sassi7/include \ 228 | -I/usr/local/sassi7/extras/CUPTI/include/ \ 229 | -I/home/mstephenson/Projects/sassi/instlibs/utils/ \ 230 | -I/home/mstephenson/Projects/sassi/instlibs/include 231 | ar r /home/mstephenson/Projects/sassi/instlibs/lib/libbranch.a branch.o 232 | ranlib /home/mstephenson/Projects/sassi/instlibs/lib/libbranch.a 233 | \end{lstlisting} 234 | 235 | The instrumentation libraries in the \texttt{instlibs} directory rely on C++11 support, 236 | hence line (2). We also build the instrumentation library to support 237 | the architecture we plan to run on in (4). Change this to match your 238 | architecture. Most importantly, we instruct NVCC to 239 | use only 16 registers for the handler in (7). We include 240 | \emph{include} paths in (8)-(11).
Finally, lines (12) and (13) bundle 241 | the single object file into a library. 242 | 243 | 244 | -------------------------------------------------------------------------------- /doc/tex/example3.tex: -------------------------------------------------------------------------------- 1 | \pagebreak 2 | \section{Example 3: Memory Divergence} 3 | \label{sec:example3} 4 | 5 | This example, as with the previous example, performs \emph{before} 6 | instrumentation, and requests that SASSI pass an extra parameter to 7 | the instrumentation handler. However, in this case, we inject 8 | instrumentation before operations that touch (read or write) memory, 9 | and we pass along information about instruction memory usage. 10 | 11 | Before walking through the specifics of the example, let's just build 12 | and run it. If you have not done so, please set up your environment 13 | per the instructions in Section~\ref{sec:environment}. Now, do: 14 | \begin{lstlisting}[style=BashInputStyle] 15 | # Go to the example directory. 16 | cd $SASSI_REPO/example 17 | make clean 18 | # Instrument the application and link with the instrumentation library. 19 | make memdiverge 20 | # Run the example. (Alternatively, type 'make run') 21 | ./matrixMul 22 | \end{lstlisting} 23 | 24 | This can take some time to finish running. Assuming everything went 25 | well, you should now have a file in that directory called 26 | \texttt{sassi-memdiverge.txt}. The memory behavior of the sample 27 | application is so regular that these results are fairly boring. For 28 | \texttt{sm\_50}, these results show that for every load and store to 29 | \emph{global} memory, all 32 threads in a warp were active, and within 30 | each warp, four cache lines were touched. The rationale for printing 31 | out a matrix will become clearer later in this section. For 32 | complete coverage of this example, please see ``Case Study II'' in the 33 | accompanying ISCA paper in the \texttt{doc} directory.
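To see where a figure such as ``four cache lines'' can come from, the following host-side C++ sketch mimics the per-warp counting that this example performs. It is an illustrative model rather than SASSI code: \texttt{count\_unique\_lines}, \texttt{kWarpSize}, and \texttt{kLineBits} are hypothetical names, and the 32-byte line size (\texttt{kLineBits = 5}) is an assumption that may not match the library's own \texttt{LINE\_BITS}. Under that assumption, 32 lanes that each access one consecutive 4-byte element span 128 bytes, or four lines.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for illustration only: 32-byte lines, i.e. 5 offset bits.
constexpr unsigned kLineBits = 5;
constexpr int kWarpSize = 32;

// Host-side model of leader election across a warp. 'active' is a bitmask
// of participating lanes and addr[i] is the address accessed by lane i.
// Returns the number of distinct lines touched by the active lanes.
unsigned count_unique_lines(uint32_t active, const uint64_t addr[kWarpSize]) {
    unsigned unique = 0;
    uint32_t workset = active;
    while (workset) {
        // Elect the lowest-numbered lane still in the workset as leader.
        int leader = 0;
        while (!((workset >> leader) & 1u)) ++leader;
        uint64_t leaderLine = addr[leader] >> kLineBits;

        // Every lane whose line matches the leader's is now accounted for.
        for (int lane = 0; lane < kWarpSize; ++lane) {
            bool inWorkset = ((workset >> lane) & 1u) != 0;
            if (inWorkset && (addr[lane] >> kLineBits) == leaderLine) {
                workset &= ~(1u << lane);
            }
        }
        ++unique;
        assert(unique <= static_cast<unsigned>(kWarpSize));
    }
    return unique;
}
```

Calling \texttt{count\_unique\_lines} with a full active mask and addresses \texttt{0x1000 + 4 * lane} yields four, whereas having every lane read the same address yields one.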
34 | 35 | \subsection{How to compile your application} 36 | 37 | Now let's take a look at how the makefile builds the 38 | \texttt{matrixMul} application for this example. For this 39 | example, we are only interested in memory operations, so while we 40 | could still add instrumentation before all instructions, it will be 41 | more efficient to limit our instrumentation to only memory operations: 42 | \begin{lstlisting}[style=BashInputStyle] 43 | /usr/local/sassi7/bin/nvcc -I./inc -c -O3 \ 44 | -gencode arch=compute_50,code=sm_50 \ 45 | -Xptxas --sassi-inst-before=''memory'' \ 46 | -Xptxas --sassi-before-args=''mem-info'' \ 47 | -dc \ 48 | -o matrixMul.o matrixMul.cu 49 | \end{lstlisting} 50 | 51 | Lines (1) and (6) are the options that you would ordinarily compile 52 | your CUDA application with. Line (2) is required because we need the 53 | compiler to generate \emph{actual}, not virtual, code. In this case, we 54 | are targeting a first-generation Maxwell card, with \texttt{sm\_50}. 55 | Line (3) instructs SASSI to inject instrumentation before SASS 56 | instructions that touch memory only. It is worth noting that SASSI 57 | does not consider loads from the constant memory window to be memory 58 | operations. The instrumentation library for this 59 | example needs to know the address that the memory operation targets, 60 | and this information is available in objects of 61 | \texttt{SASSIMemoryParams}. The 62 | \texttt{mem-info} flag will direct SASSI to pass such objects 63 | to the instrumentation handler. This is done in line (4).
Line (5) 64 | is required because we are going to link in the instrumentation 65 | handler momentarily, and cross-module calls require ``relocatable 66 | device code.'' Here is the message that SASSI emits for this example: 67 | \begin{verbatim} 68 | ****************************************************************************** 69 | * 70 | * SASSI Instrumentation Details 71 | * 72 | * For the settings you passed in, you'll need to make sure that you have 73 | * an instrumentation library with the following properties: 74 | * - It MUST BE compiled using only 16 registers!! To accomplish this 75 | * simply compile your library with the nvcc flag, --maxrregcount=16 76 | * - It must define the following functions: 77 | * __device__ void sassi_before_handler(SASSIBeforeParams*,SASSIMemoryParams*) 78 | * 79 | ****************************************************************************** 80 | \end{verbatim} 81 | 82 | Notice that SASSI tells us that we have to make sure to link in a 83 | handler with this signature: 84 | \begin{lstlisting} 85 | __device__ void sassi_before_handler(SASSIBeforeParams*,SASSIMemoryParams*); 86 | \end{lstlisting} 87 | which we have defined in a library called \texttt{libmemdiverge.a}, which 88 | we'll describe shortly. We link this in exactly like we did with the 89 | last example: 90 | \begin{lstlisting}[style=BashInputStyle] 91 | /usr/local/sassi7/bin/nvcc -o matrixMul matrixMul.o \ 92 | -gencode arch=compute_50,code=sm_50 \ 93 | -lcudadevrt \ 94 | -L/usr/local/sassi7/extras/CUPTI/lib64 -lcupti \ 95 | -L../instlibs/lib -lmemdiverge 96 | \end{lstlisting} 97 | 98 | \subsection{Writing the instrumentation library} 99 | 100 | The instrumentation library that corresponds to this example is fully 101 | implemented in \\ \texttt{\$SASSI\_REPO/instlibs/src/memdiverge.cu}. 102 | The instrumentation handler portion of the library is shown in 103 | Figure~\ref{fig:handler-example3}. 
This example aims to create a 104 | $32\times32$ matrix, where the rows of the matrix tally the number of 105 | threads that were active for each warp's invocation of a memory 106 | operation, and the columns tally the number of cache lines touched by 107 | the memory operation. The code then simply determines, for each 108 | memory operation, 1) how many threads in the warp are active 109 | (\texttt{numActive}), and 2) how many unique cache lines are touched 110 | by the active threads (\texttt{unique}). At the end of the handler on 111 | line (44), we simply increment the counter associated with 112 | \texttt{numActive} and \texttt{unique}. 113 | 114 | On line (13), we get the address of the memory operation using the 115 | \texttt{GetAddress()} method of the \texttt{SASSIMemoryParams} object 116 | passed in. On line (15), we filter out all memory accesses except 117 | global memory requests (i.e., we don't consider \emph{shared} or 118 | \emph{local} requests), though it would be trivial to consider other 119 | memory windows instead. On line (20), we find out the cache line to 120 | which the address maps. 121 | 122 | Line (22) uses \texttt{\_\_ballot()}, a commonly used warp-collective 123 | CUDA function, to determine which threads in the warp are actively 124 | participating. Line (24) uses \texttt{\_\_popc()}, another built-in 125 | CUDA function, to count the number of set bits and report the number 126 | of active threads. The while loop in (25)-(36) uses 127 | \texttt{\_\_ffs()}, \texttt{\_\_ballot()}, and \texttt{\_\_broadcast()} 128 | to determine the number of unique values stored in the 129 | \texttt{lineAddr} variable across the warp. Every thread in the warp 130 | computes the same value for \texttt{unique}, but we only let the first 131 | active thread commit the results on line (43). 132 | 133 | \begin{figure*}[h!] 134 | \begin{lstlisting}[numbers=left,numbersep=4pt] 135 | /// The counters that we will use to record our statistics.
136 | __managed__ unsigned long long sassi_counters[WARP_SIZE + 1][WARP_SIZE + 1]; 137 | 138 | /////////////////////////////////////////////////////////////////////////////////// 139 | /// 140 | /// This is the function that will be inserted before every memory operation. 141 | /// 142 | /////////////////////////////////////////////////////////////////////////////////// 143 | __device__ void sassi_before_handler(SASSIBeforeParams *bp, SASSIMemoryParams *mp) 144 | { 145 | if (bp->GetInstrWillExecute()) 146 | { 147 | intptr_t addrAsInt = mp->GetAddress(); 148 | // Don't look at shared or local memory. 149 | if (__isGlobal((void*)addrAsInt)) { 150 | // The number of unique addresses across the warp 151 | unsigned unique = 0; // for the instrumented instruction. 152 | 153 | // Shift off the offset bits into the cache line. 154 | intptr_t lineAddr = addrAsInt >> LINE_BITS; 155 | 156 | int workset = __ballot(1); 157 | int firstActive = __ffs(workset) - 1; 158 | int numActive = __popc(workset); 159 | while (workset) { 160 | // Elect a leader, get its line, see who all matches it. 161 | int leader = __ffs(workset) - 1; 162 | intptr_t leadersAddr = __broadcast(lineAddr, leader); 163 | int notMatchesLeader = __ballot(leadersAddr != lineAddr); 164 | 165 | // We have accounted for all values that match the leader's. 166 | // Let's remove them all from the workset. 167 | workset = workset & notMatchesLeader; 168 | unique++; 169 | assert(unique <= 32); 170 | } 171 | 172 | assert(unique > 0 && unique <= 32); 173 | 174 | // Each thread independently computed 'numActive' and 'unique'. 175 | // Let's let the first active thread actually tally the result. 176 | int threadsLaneId = get_laneid(); 177 | if (threadsLaneId == firstActive) { 178 | atomicAdd(&(sassi_counters[numActive][unique]), 1LL); 179 | } 180 | } 181 | } 182 | } 183 | \end{lstlisting} 184 | \caption{Instrumentation handler portion of the memory divergence 185 | library. 
See the library's source code for the full example.} 186 | \label{fig:handler-example3} 187 | \end{figure*} 188 | 189 | \vfill\eject 190 | \subsubsection{Building the handler as a library} 191 | 192 | We build the library exactly the way we did for the \texttt{ophist} 193 | library. 194 | 195 | \begin{lstlisting}[style=BashInputStyle] 196 | /usr/local/cuda-7.0/bin/nvcc \ 197 | -ccbin /usr/local/gcc-4.8.4/bin/ -std=c++11 \ 198 | -O3 \ 199 | -gencode arch=compute_50,code=sm_50 \ 200 | -c -rdc=true -o memdiverge.o \ 201 | memdiverge.cu \ 202 | --maxrregcount=16 \ 203 | -I/usr/local/sassi7/include \ 204 | -I/usr/local/sassi7/extras/CUPTI/include/ \ 205 | -I/home/mstephenson/Projects/sassi/instlibs/utils/ \ 206 | -I/home/mstephenson/Projects/sassi/instlibs/include 207 | ar r /home/mstephenson/Projects/sassi/instlibs/lib/libmemdiverge.a memdiverge.o 208 | ranlib /home/mstephenson/Projects/sassi/instlibs/lib/libmemdiverge.a 209 | \end{lstlisting} 210 | 211 | The instrumentation libraries in the \texttt{instlibs} directory rely on C++11 support, 212 | hence line (2). We also build the instrumentation library to support 213 | the architecture we plan to run on in (4). Change this to match your 214 | architecture. Most importantly, we instruct NVCC to 215 | use only 16 registers for the handler in (7). We include 216 | \emph{include} paths in (8)-(11). Finally, lines (12) and (13) bundle 217 | the single object file into a library. 218 | -------------------------------------------------------------------------------- /doc/tex/example4.tex: -------------------------------------------------------------------------------- 1 | \pagebreak 2 | \section{Example 4: Value Profiling} 3 | \label{sec:example4} 4 | 5 | This example is the most involved example, and performs fairly 6 | obtrusive instrumentation to profile the values assigned to the 7 | general-purpose registers during the execution of a program.
This 8 | example uses \emph{after} instrumentation to inject instrumentation 9 | after every instruction that writes an ISA-visible register. We then 10 | instruct SASSI to extract and pass along information about each SASS 11 | instruction's register usage, which our handler uses to inspect 12 | register values. 13 | 14 | As with our other examples, before walking through the specifics of the 15 | example, let's just build and run it. If you have not done so, please 16 | set up your environment per the instructions in 17 | Section~\ref{sec:environment}. 18 | \begin{lstlisting}[style=BashInputStyle] 19 | # Go to the example directory. 20 | cd $SASSI_REPO/example 21 | make clean 22 | # Instrument the application and link with the instrumentation library. 23 | make valueprof 24 | # Run the example. (Alternatively, type 'make run') 25 | ./matrixMul 26 | \end{lstlisting} 27 | 28 | This can take a really long time to finish running because the 29 | slowdowns of instrumentation for this example are fairly severe 30 | ($1000\times$). Assuming everything went well, you should now have a 31 | file in that directory called \texttt{sassi-valueprof.txt}. Here are 32 | the abridged contents of the file for \texttt{sm\_50}: 33 | \begin{verbatim} 34 | Value profiling results 35 | ADDRESS | WEIGHT | [regnum, type, scalarness, bitstring]* 36 | --------------------------------------------------------- 37 | [c237528100000008, 61644800, [[1, ``uint'', SCALAR , [00000000111111111111101110100000]]]] 38 | [c237528100000010, 61644800, [[16, ``uint'', SCALAR , [0000000000000000000000000000TTTT]]]] 39 | [c237528100000018, 61644800, [[2, ``uint'', SCALAR , [000000000000000000000000000TTTTT]]]] 40 | ...
41 | [c237528100000208, 616448000, [[5, ``int'', SCALAR , [00000000000000000000000000000101]]]] 42 | [c237528100000210, 616448000, [[4, ``float'', SCALAR , [00111100001000111101011100001010]]]] 43 | [c237528100000238, 616448000, [[26, ``float'', SCALAR , [00111100001000111101011100001010]]]] 44 | ... 45 | \end{verbatim} 46 | 47 | The results dump the general-purpose register usage of each SASS 48 | instruction. For the first instruction above, we see that it writes 49 | to register \texttt{1}, which is of type \texttt{uint}, and it always 50 | writes the same constant \texttt{[00000000111111111111101110100000]}. 51 | Because it writes a constant, it is also marked as \emph{scalar}, 52 | which means every participating thread is writing the same value. The 53 | next instruction writes to register \texttt{16}. Its writes are 54 | always \emph{scalar} too, but slightly more interestingly it does not 55 | write a constant. Instead, as the \texttt{T} represents, the bottom 56 | four bits take on both \texttt{0} and \texttt{1}. The top 28 bits are still 57 | always \texttt{0}, however. Looking at the CUDA source, we see that 58 | the matrices in this example are artificially initialized to constant 59 | values, leading to the very high number of scalar and constant bits. 60 | For complete coverage of this example, please see ``Case Study III'' 61 | in the accompanying ISCA paper in the \texttt{doc} directory. 62 | 63 | \subsection{How to compile your application} 64 | 65 | Now let's take a look at how the makefile builds the 66 | \texttt{matrixMul} application for this example. For this 67 | example, we instrument after all instructions that write registers.
68 | \begin{lstlisting}[style=BashInputStyle] 69 | /usr/local/sassi7/bin/nvcc -I./inc -c -O3 \ 70 | -gencode arch=compute_50,code=sm_50 \ 71 | -Xptxas --sassi-inst-after=''reg-writes'' \ 72 | -Xptxas --sassi-after-args=''reg-info'' \ 73 | -Xptxas --sassi-iff-true-predicate-handler-call \ 74 | -dc \ 75 | -o matrixMul.o matrixMul.cu 76 | \end{lstlisting} 77 | 78 | Lines (1) and (7) are the options that you would ordinarily compile 79 | your CUDA application with. Line (2) is required because we need the 80 | compiler to generate \emph{actual}, not virtual, code. In this case, we 81 | are targeting a first-generation Maxwell card, with \texttt{sm\_50}. 82 | Line (3) instructs SASSI to inject instrumentation \emph{after} SASS 83 | instructions that write registers. The instrumentation library for 84 | this example needs to know what registers are used along with the 85 | values in the registers, which it can obtain in objects of type 86 | \texttt{SASSIRegisterParams}. On line (4), the \texttt{reg-info} flag 87 | will direct SASSI to pass such objects to the instrumentation 88 | handler. Line (5) directs SASSI to only call the instrumentation 89 | handler if the instruction is going to execute (i.e., the instruction 90 | is unpredicated, or its guarding predicate is $true$). For 91 | \emph{before} instrumentation, we can easily use the 92 | \texttt{GetInstrWillExecute()} method to determine whether the 93 | instruction will execute. This is trickier for \emph{after} 94 | instrumentation, where the instruction's guarding predicate could have 95 | been clobbered by the instruction itself. For this example, we only 96 | care about instructions that executed, so this is the right option to 97 | use.
98 | 99 | Line (6) is required because we are going to link in the 100 | instrumentation handler momentarily, and cross-module calls require 101 | ``relocatable device code.'' Please note, that the SASSI flags on 102 | lines (3) and (4) both contain the word \emph{after}, unlike in the 103 | previous examples. Here is the message that SASSI emits for this 104 | example: 105 | \begin{verbatim} 106 | ****************************************************************************** 107 | * 108 | * SASSI Instrumentation Details 109 | * 110 | * For the settings you passed in, you'll need to make sure that you have 111 | * an instrumentation library with the following properties: 112 | * - It MUST BE compiled using only 16 registers!! To accomplish this 113 | * simply compile your library with the nvcc flag, --maxrregcount=16 114 | * - It must define the following functions: 115 | * __device__ void sassi_after_handler(SASSIAfterParams*,SASSIRegisterParams*) 116 | * 117 | ****************************************************************************** 118 | \end{verbatim} 119 | 120 | Notice that SASSI tells us that we have to make sure to link in a 121 | handler with this signature: 122 | \begin{lstlisting} 123 | __device__ void sassi_after_handler(SASSIAfterParams*,SASSIRegisterParams*); 124 | \end{lstlisting} 125 | which we have defined in a library called \texttt{libvalueprof.a}, which 126 | we'll describe shortly. We link this in exactly like we did with the 127 | last example: 128 | \begin{lstlisting}[style=BashInputStyle] 129 | /usr/local/sassi7/bin/nvcc -o matrixMul matrixMul.o \ 130 | -gencode arch=compute_50,code=sm_50 \ 131 | -lcudadevrt \ 132 | -L/usr/local/sassi7/extras/CUPTI/lib64 -lcupti \ 133 | -L../instlibs/lib -lvalueprof 134 | \end{lstlisting} 135 | 136 | \subsection{Writing the instrumentation library} 137 | 138 | The instrumentation library that corresponds to this example is fully 139 | implemented in \\ \texttt{\$SASSI\_REPO/instlibs/src/valueprof.cu}. 
The 140 | instrumentation handler portion of the library is shown in 141 | Figure~\ref{fig:handler-example4}. Note that, because we specified 142 | the \texttt{--sassi-iff-true-predicate-handler-call} flag, this 143 | handler will only be called if the instruction was actually executed. 144 | 145 | Line (9) gets the lane in the warp that the current thread is running 146 | on. Line (10) finds the first active thread; this thread is going to 147 | be the one that all other threads compare themselves against when 148 | looking for \emph{scalar} values. Line (13) gets the probably unique 149 | virtual PC of the SASS instruction. We use this PC to index into a 150 | hashmap on line (18). If the PC isn't in the hashmap already, it is 151 | added and initialized on line (19). Thus, on line (22), we will have 152 | an initialized statistics counter for the instruction. Line (23) 153 | simply increments a counter we maintain for this PC to track how many 154 | total times it has executed. 155 | 156 | Line (24) iterates over all the destination registers in the SASS 157 | instruction. For each destination: 158 | \begin{itemize} 159 | \item Line (26) retrieves information about the register, including type 160 | information and register number. 161 | \item Line (27) gets the 32-bit value in the register. 162 | \item Line (30) uses CUDA's \texttt{atomicAnd} operation to keep track 163 | of the bits that are always 1. If a bit in the operand ever becomes 0 at 164 | any point during the program, the \texttt{atomicAnd} clears the 165 | corresponding tracking bit, which then stays 0. 166 | \item Line (31) uses CUDA's \texttt{atomicAnd} operation to keep track 167 | of the bits that are always 0. Note the bitwise negation of bits, ``$\sim$''. If a bit in the operand ever becomes 1 at 168 | any point during the program, the \texttt{atomicAnd} clears the 169 | corresponding tracking bit, which then stays 0. 170 | \item Lines (33)-(38) check and record the scalarness of the value. 171 | \end{itemize} 172 | 173 | \begin{figure*}[h!]
174 | \begin{lstlisting}[numbers=left,numbersep=4pt] 175 | /////////////////////////////////////////////////////////////////////////////////// 176 | // 177 | // This example uses the atomic bitwise operations to keep track of the constant 178 | // bits produced by each instruction. 179 | // 180 | /////////////////////////////////////////////////////////////////////////////////// 181 | __device__ void sassi_after_handler(SASSIAfterParams* ap, SASSIRegisterParams *rp) 182 | { 183 | int threadIdxInWarp = get_laneid(); 184 | int firstActiveThread = (__ffs(__ballot(1))-1); /*leader*/ 185 | 186 | // Get the "probably unique" PC. 187 | uint64_t pupc = ap->GetPUPC(); 188 | 189 | // The dictionary will return the SASSOp associated with this PC, or insert 190 | // it if it does not exist. If it does not exist, the lambda passed as 191 | // the second argument to getOrInit is used to initialize the SASSOp. 192 | SASSOp *stats = sassi_stats->getOrInit(pupc, [&rp](SASSOp *v) { 193 | SASSOp::init(v, rp); 194 | }); 195 | 196 | // Record the number of times the instruction executes. 197 | atomicAdd(&(stats->weight), 1); 198 | for (int d = 0; d < rp->GetNumGPRDsts(); d++) { 199 | // Get the value in each destination register. 200 | SASSIRegisterParams::GPRRegInfo regInfo = rp->GetGPRDst(d); 201 | SASSIRegisterParams::GPRRegValue regVal = rp->GetRegValue(ap, regInfo); 202 | 203 | // Use atomic AND operations to track constant bits. 204 | atomicAnd(&(stats->operands[d].constantOnes), regVal.asInt); 205 | atomicAnd(&(stats->operands[d].constantZeros), ~regVal.asInt); 206 | 207 | int leaderValue = __shfl(regVal.asInt, firstActiveThread); 208 | int allSame = (__all(regVal.asInt == leaderValue) != 0); 209 | // The warp leader gets to write results. 210 | if (threadIdxInWarp == firstActiveThread) { 211 | atomicAnd(&(stats->operands[d].isScalar), allSame); 212 | } 213 | } 214 | } 215 | \end{lstlisting} 216 | \caption{Instrumentation handler portion of the value profiling 217 | library. 
See the library's source code for the full example.} 218 | \label{fig:handler-example4} 219 | \end{figure*} 220 | 221 | \subsubsection{Building the handler as a library} 222 | 223 | We build the library exactly the way we did for the \texttt{ophist} 224 | library. 225 | 226 | \begin{lstlisting}[style=BashInputStyle] 227 | /usr/local/cuda-7.0/bin/nvcc \ 228 | -ccbin /usr/local/gcc-4.8.4/bin/ -std=c++11 \ 229 | -O3 \ 230 | -gencode arch=compute_50,code=sm_50 \ 231 | -c -rdc=true -o valueprof.o \ 232 | valueprof.cu \ 233 | --maxrregcount=16 \ 234 | -I/usr/local/sassi7/include \ 235 | -I/usr/local/sassi7/extras/CUPTI/include/ \ 236 | -I/home/mstephenson/Projects/sassi/instlibs/utils/ \ 237 | -I/home/mstephenson/Projects/sassi/instlibs/include 238 | ar r /home/mstephenson/Projects/sassi/instlibs/lib/libvalueprof.a valueprof.o 239 | ranlib /home/mstephenson/Projects/sassi/instlibs/lib/libvalueprof.a 240 | \end{lstlisting} 241 | 242 | The instrumentation libraries in the \texttt{instlibs} directory rely on C++11 support, 243 | hence line (2). We also build the instrumentation library to support 244 | the architecture we plan to run on in (4). Change this to match your 245 | architecture. Most importantly, we instruct NVCC to 246 | use only 16 registers for the handler in (7). We add 247 | \emph{include} paths in (8)-(11). Finally, lines (12) and (13) bundle 248 | the single object file into a library.
249 | -------------------------------------------------------------------------------- /doc/tex/figures/sassi-flow.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NVlabs/SASSI/8c758417c7f1e18af0b35aaddfa1d9f10782004b/doc/tex/figures/sassi-flow.pdf -------------------------------------------------------------------------------- /doc/tex/installing.tex: -------------------------------------------------------------------------------- 1 | \section{Prerequisites, Getting, and Installing SASSI} 2 | 3 | SASSI is, in a nutshell, a specialized version of \texttt{ptxas} that 4 | has a few extra options for instrumenting applications. We distribute 5 | this specialized \texttt{ptxas} as a (closed-source) binary blob using 6 | GitHub's ``releases'' functionality. However, to make SASSI more 7 | usable, we also distribute several sample instrumentation libraries 8 | and utilities that we have found to be useful. 9 | 10 | Throughout the rest of the guide, we will refer to the following 11 | environment variables: 12 | \begin{itemize} 13 | \item \texttt{\$SASSI\_REPO}: The base directory of the SASSI project, 14 | which you will clone from GitHub. 15 | \item \texttt{\$CUDA\_HOME}: The location of your CUDA 7 installation 16 | (which is a prerequisite for installing SASSI). 17 | \item \texttt{\$SASSI\_HOME}: The location of your SASSI installation, 18 | which is essentially a deep copy of your CUDA 7 installation, with a 19 | few added files and a different version of \texttt{ptxas}. 20 | \end{itemize} 21 | 22 | \subsection{Prerequisites} 23 | 24 | First, let's walk through SASSI's prerequisites. 
25 | 26 | \begin{enumerate} 27 | \item {\bf Platform requirements}: SASSI requires an x86 64-bit host; 28 | a Fermi-, Kepler-, or Maxwell-based GPU; and, at the time of this 29 | writing, we have generated SASSI for Ubuntu (12 and 15), Debian 30 | 7, and CentOS 6.\footnote{Please let us know if your platform 31 | is not listed, and we'll see what we can do.} 32 | \item {\bf Python requirement}: The SASSI installation script requires 33 | Python 2.7 or newer. 34 | \item {\bf Install CUDA 7}: At the time of this writing, CUDA 7 can 35 | be fetched 36 | from:\\ \\ \texttt{https://developer.nvidia.com/cuda-downloads} \\ 37 | \item {\bf Make sure you have a 346.41 driver or newer:} The CUDA 7 38 | installation script can install a new driver for you that meets this 39 | requirement. If you already have a newer driver, that should be 40 | fine. You can test your driver version with the \texttt{nvidia-smi} 41 | command. 42 | \item {\bf Get the SASSI project}: You can get the SASSI project from 43 | GitHub via the following command: \\ \\ \texttt{git clone 44 | https://github.com/NVlabs/SASSI} \\ \\ 45 | The project that you get will be your \texttt{\$SASSI\_REPO}. The 46 | project contains several sample instrumentation libraries, 47 | documentation, and useful utilities. 48 | \item {\bf Compilers [Optional]}: To use the instrumentation libraries 49 | we provide, you will need a C++11 compiler. We have been using GCC 50 | 4.8.2. For details about the host compiler that you can use on your 51 | platform for CUDA 7, consult the following 52 | guide:\\ \\ \texttt{http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/} 53 | \end{enumerate} 54 | 55 | \subsection{SASSI Project Structure} 56 | 57 | The SASSI project that is hosted on GitHub has the directory structure 58 | shown in Figure~\ref{fig:structure}. The \texttt{doc} directory 59 | contains this document and an ISCA paper that describes SASSI's inner 60 | workings.
61 | 62 | The \texttt{example} directory contains a \texttt{Makefile} and a 63 | single application from the CUDA SDK, which this guide uses for the 64 | examples in Sections~\ref{sec:example1}-\ref{sec:example4}. 65 | 66 | The \texttt{instlibs} directory contains sample SASSI instrumentation 67 | libraries that range in complexity from very simple to fairly complex. 68 | There are examples that profile branch behavior, memory behavior, and 69 | operand value behavior. In addition, there are a couple of 70 | pedagogical examples that demonstrate SASSI's API. We will continue 71 | to add instrumentation libraries to this directory over time. The 72 | \texttt{instlibs/utils} directory contains a templatized device-side 73 | hash map to facilitate bookkeeping. 74 | 75 | \begin{figure*} 76 | \center 77 | \begin{verbatim} 78 | $SASSI_REPO 79 | | EULA 80 | | FAQ 81 | | README 82 | | doc 83 | | sassi-user-guide.pdf 84 | | sassi-isca-2015.pdf 85 | | example 86 | | Makefile 87 | | matrixMul.cu 88 | | instlibs 89 | | env.mk 90 | | LICENSE 91 | | Makefile 92 | | src 93 | | branch.cu 94 | | memdiverge.cu 95 | | ... 96 | | utils 97 | | sassi-dictionary.hpp 98 | | ... 99 | \end{verbatim} 100 | \caption{The directory structure of the SASSI project.} 101 | \label{fig:structure} 102 | \end{figure*} 103 | 104 | \subsection{Installation} 105 | 106 | We are using GitHub's ``releases'' functionality to release the 107 | closed-source portion of SASSI. Specifically, we are distributing 108 | SASSI as a self-extracting installer that installs SASSI's specialized 109 | version of \texttt{ptxas} and header files that describe 110 | SASSI's API. To get a binary installer for your architecture, go to:\\ 111 | \begin{center} 112 | \texttt{https://github.com/NVlabs/SASSI/releases} 113 | \end{center} 114 | Choose the installer for your architecture, download it, and execute 115 | it. The installer is simply a shell script. 
For example, to install SASSI 116 | on an Ubuntu 15 x86-64 host, get the installer, 117 | \texttt{SASSI\_x86\_64\_ubuntu\_vivid.run}, then execute it: 118 | \begin{center} 119 | \texttt{sh SASSI\_x86\_64\_ubuntu\_vivid.run} 120 | \end{center} 121 | 122 | The installation script performs the following very simple tasks: 123 | \begin{enumerate} 124 | \item {\bf Has the end user agree to the End User License Agreement:} 125 | The license agreement is exactly the same as the EULA for the CUDA 126 | 7 toolkit. 127 | \item {\bf Asks the user for the location of the CUDA 7 toolkit:} 128 | This guide refers to this location as \texttt{\$CUDA\_HOME}. 129 | \item {\bf Asks the user where it should install SASSI}: Note that this 130 | directory should be different from your CUDA 7 location. Also note that 131 | you may need to run the script as root if you're planning on 132 | installing SASSI in the default location of 133 | \texttt{/usr/local/sassi}. This guide refers to this location as 134 | \texttt{\$SASSI\_HOME}. 135 | \item {\bf Copies the whole CUDA 7 directory to the SASSI install 136 | location:} This step can take a while depending on the details of 137 | your host and filesystem. 138 | \item {\bf Copies the SASSI header files to the SASSI install 139 | location.} The header files are the best documentation for SASSI's 140 | interface, so you should know that they are installed in 141 | \lstinline[style=BashInputStyle]´$SASSI_HOME/include/sassi/´ 142 | \item {\bf Copies SASSI's \texttt{ptxas} to the SASSI install 143 | location.} More specifically, SASSI's \texttt{ptxas} is moved to 144 | \lstinline[style=BashInputStyle]´$SASSI_HOME/bin/´ 145 | \end{enumerate} 146 | 147 | Before going any further, you should make sure that the installation 148 | script successfully installed SASSI. If there were no warnings, 149 | errors, or abort messages, it probably worked. Nevertheless, let's 150 | check.
First, we'll make sure that the header files were installed in 151 | the right place: 152 | \begin{lstlisting}[style=BashInputStyle] 153 | ls -l $SASSI_HOME/include/sassi 154 | total 64 155 | -rw-r--r-- 1 mstephenson mstephenson 4759 Jul 17 08:05 sassi-branch.hpp 156 | -rw-r--r-- 1 mstephenson mstephenson 11402 Jul 17 08:05 sassi-core.hpp 157 | -rw-r--r-- 1 mstephenson mstephenson 2818 Jul 17 08:05 sassi.hpp 158 | -rw-r--r-- 1 mstephenson mstephenson 3396 Jul 17 08:05 sassi-ins-properties.h 159 | -rw-r--r-- 1 mstephenson mstephenson 3994 Jul 17 08:05 sassi-kernel.hpp 160 | -rw-r--r-- 1 mstephenson mstephenson 4090 Jul 17 08:05 sassi-memory.hpp 161 | -rw-r--r-- 1 mstephenson mstephenson 6499 Jul 17 08:05 sassi-opcodes.h 162 | -rw-r--r-- 1 mstephenson mstephenson 10999 Jul 17 08:05 sassi-regs.hpp 163 | -rw-r--r-- 1 mstephenson mstephenson 3636 Jul 17 08:05 sassi-types.h 164 | \end{lstlisting} 165 | 166 | Now we can check to see if the SASSI-enabled \texttt{ptxas} was 167 | installed. All SASSI instrumentation is turned off by default, so 168 | without providing any SASSI-specific options, SASSI's \texttt{ptxas} 169 | will behave like \texttt{ptxas} from your CUDA 7 installation. 170 | Issuing \texttt{ptxas -h} will print out all options available, 171 | including the optional arguments for instrumenting with SASSI. Let's 172 | see if the SASSI options are available: 173 | \begin{lstlisting}[style=BashInputStyle] 174 | $SASSI_HOME/bin/ptxas -h | grep sassi 175 | --sassi-after-args (-sassi-after-args) 176 | --sassi-before-args (-sassi-before-args) 177 | --sassi-debug (-sassi-debug) 178 | ... 179 | \end{lstlisting} 180 | 181 | {\bf{Note:} If at any point you upgrade or patch your version of CUDA 7, you 182 | should re-run the installation script.} The SASSI installation 183 | script simply performs a deep copy of the CUDA 7 installation. 
184 | 185 | -------------------------------------------------------------------------------- /doc/tex/introduction.tex: -------------------------------------------------------------------------------- 1 | \section{Introduction} 2 | 3 | This document describes SASSI, a selective instrumentation framework 4 | for NVIDIA GPUs. SASSI stands for SASS Instrumenter, where SASS is 5 | NVIDIA's name for its native ISA. SASSI is a pass in NVIDIA's backend 6 | compiler, \texttt{ptxas}, that selectively inserts instrumentation 7 | code. The purpose of SASSI is to allow users to measure or modify 8 | \texttt{ptxas}-generated SASS by injecting instrumentation code 9 | \emph{during} code generation. 10 | 11 | NVIDIA has many excellent development tools. Why the need for another 12 | tool? NVIDIA's tools such as \texttt{cuda-memcheck} and \texttt{nvvp} 13 | provide excellent, but \emph{fixed-function} inspection of programs. 14 | Tools like \texttt{cuda-memcheck} perform binary patching to insert 15 | instrumentation code. While they are great at what they are designed 16 | for, the user has to choose from a fixed menu of program 17 | characteristics to measure. If you want to measure some aspect of 18 | program execution outside the purview of those tools you are out of 19 | luck. One example that we provide in this guide --- in 20 | Section~\ref{sec:example3} --- is a perfect example of something 21 | that SASSI is good at, that NVIDIA's production tools currently cannot 22 | handle. 23 | 24 | SASSI is invoked extremely late in the compiler's flow to minimize any 25 | disruption the instrumentation code has on the code the user wants to 26 | measure. Because SASSI is part of the compiler, it is important for us 27 | to clearly state up front that SASSI is not a \emph{binary 28 | instrumenter}. SASSI instrumentation of programs requires the source 29 | code (either CUDA, OpenCL, or PTX). 
As a consequence, any device 30 | libraries (e.g., \texttt{cuFFT}) that your application links in will 31 | not be instrumented unless the user has explicitly recompiled the 32 | library with SASSI. We also note here that SASSI is a fork of the 33 | production backend compiler, and therefore, the code it generates 34 | may deviate from that of the official \texttt{ptxas}. 35 | 36 | As a small research effort, however, working within the backend 37 | compiler affords us stability and portability. NVIDIA's architectures 38 | change dramatically from generation to generation, yet because SASSI 39 | is part of the compiler flow, it can target Fermi, Kepler, and Maxwell 40 | devices, and it runs on Windows and Linux. \hl{NVIDIA Research is 41 | releasing SASSI ``as-is'', and users should realize that it is a 42 | \emph{research prototype} with no promise of support or continuity.} 43 | -------------------------------------------------------------------------------- /doc/tex/required.tex: -------------------------------------------------------------------------------- 1 | \section{\hl{IMPORTANT: Required compiler flags}} 2 | \label{sec:important-required} 3 | 4 | This section discusses some flags and constraints that SASSI 5 | requires for proper operation. While the complete examples that we 6 | show in the next several sections will discuss these flags and 7 | constraints, it is important to highlight them here, because \hl{ 8 | failure to follow this recipe will cause your instrumented programs 9 | to fail:} 10 | 11 | \begin{itemize} 12 | \item \hl{You must compile your instrumentation handlers using only 13 | 16 registers:} GPGPU programs often use a staggering number of 14 | registers. The SASSI approach of injecting function calls implies 15 | that live registers must be saved and restored across call sites.
16 | By forcing the instrumentation handlers to use only 16 registers, 17 | SASSI can drastically limit the number of saves and restores 18 | required across handler call boundaries. SASSI instrumentation 19 | will only save and restore the first 16 live registers, so if your 20 | handler uses registers outside this set, there's a good chance 21 | your instrumented program will break. \hl{This restriction only 22 | applies to the instrumentation handlers, not the programs you 23 | instrument; they can still use as many registers as the user deems 24 | necessary.} We can trivially set the number of registers to 16 25 | with the following \texttt{nvcc} flag: 26 | \texttt{--maxrregcount=16}. 27 | \item \hl{You must explicitly specify what \emph{code} you want to 28 | generate:} The CUDA toolchain allows users to generate two forms 29 | of code: virtual and real ISA code. In the case that you generate 30 | virtual ISA code (e.g., \texttt{compute\_35}, \texttt{compute\_50}), the display 31 | driver will just-in-time compile your code when it is loaded. 32 | This will not work for SASSI because SASSI is not in the driver's 33 | compiler. \hl{Instead, you must specify a real ISA, which always 34 | starts with an ``sm'' (e.g., \texttt{sm\_35}, \texttt{sm\_50}). This will cause the 35 | toolchain to precompile your code with \texttt{ptxas}.} For 36 | example, if we are targeting \texttt{sm\_50}, we would 37 | supply the following option to \texttt{nvcc} to generate the 38 | correct code: \texttt{-gencode arch=compute\_50,}\hl{\texttt{code=sm\_50}}. 39 | Please run \texttt{nvcc -h} for details about this option. 40 | \item \hl{You must generate position independent code with 41 | \texttt{-dc} (or equivalently, \texttt{-rdc=true -c}):} SASSI 42 | instrumentation handlers are linked in with an instrumented 43 | application. To make ``cross-module'' calls with the CUDA 44 | toolchain, you must use this flag.
While position independent 45 | code adds modularity to GPU programs and brings CUDA more in line 46 | with general purpose computing, it can affect the performance of 47 | the generated code, sometimes significantly. Future work will try 48 | to remove this requirement. 49 | \end{itemize} 50 | -------------------------------------------------------------------------------- /doc/tex/restrictions.tex: -------------------------------------------------------------------------------- 1 | \section{What can go in an instrumentation handler?} 2 | \label{sec:what-can-go} 3 | 4 | Not all CUDA is legal within SASSI instrumentation handlers. For example, 5 | thread barriers (\texttt{syncthreads}) cannot be used because the 6 | instrumentation function may be called when the threads in a warp are 7 | diverged; {\tt syncthreads} executed by diverged warps precludes all 8 | threads from reaching the common barrier. Finally, SASSI 9 | instrumentation libraries that use shared resources, such as stack, shared 10 | and constant memory, not only risk affecting occupancy, but they could 11 | also cause instrumented programs to fail. For instance, it is not 12 | uncommon for programs to use \emph{all} of shared memory, leaving 13 | nothing for the instrumentation library. In practice, we have not been 14 | limited by these restrictions. 15 | 16 | We should also point out that designing instrumentation handlers for 17 | SASSI requires the user to carefully consider synchronization and data 18 | sharing. Writing parallel programs is tricky business, and SASSI 19 | instrumentation code is highly parallel by design. 
20 | 21 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /doc/tex/sassi.tex: -------------------------------------------------------------------------------- 1 | \documentclass{article} 2 | 3 | \usepackage[letterpaper, portrait, margin=1in]{geometry} 4 | \usepackage{soul} 5 | \usepackage{graphicx} 6 | \usepackage{caption} 7 | \usepackage{subcaption} 8 | \usepackage{listings} 9 | \usepackage{xcolor} 10 | \usepackage{array} 11 | \usepackage{fancyhdr} 12 | \newcolumntype{L}[1]{>{\raggedright\let\newline\\\arraybackslash\hspace{0pt}}m{#1}} 13 | 14 | \pagestyle{fancy} 15 | \lhead{} 16 | \chead{} 17 | \rhead{NVIDIA Research} 18 | \lfoot{} 19 | \cfoot{\thepage} 20 | \rfoot{} 21 | \renewcommand\headrulewidth{0pt} 22 | \renewcommand\footrulewidth{0pt} 23 | 24 | \definecolor{teagreen}{rgb}{0.82, 0.94, 0.75} 25 | 26 | \lstdefinestyle{BashInputStyle}{ 27 | language=bash, 28 | basicstyle=\ttfamily\normalsize, 29 | numbers=left, 30 | numberstyle=\tiny, 31 | numbersep=3pt, 32 | frame=tb, 33 | columns=fullflexible, 34 | keywordstyle=\color{black}, 35 | backgroundcolor=\color{teagreen}, 36 | } 37 | 38 | \lstset{language=C++, 39 | basicstyle=\ttfamily\small, 40 | keywordstyle=\color{blue}, 41 | backgroundcolor=\color{teagreen}, 42 | commentstyle=\bfseries, 43 | morekeywords={__device__}, 44 | frame=tb, 45 | columns=fullflexible 46 | } 47 | 48 | \title{SASSI User Guide} 49 | \author{Mark Stephenson (mstephenson@nvidia.com)} 50 | 51 | \begin{document} 52 | \maketitle 53 | 54 | \pagenumbering{Roman} 55 | \tableofcontents 56 | \newpage 57 | \listoffigures 58 | \newpage 59 | \listoftables 60 | \newpage 61 | \pagenumbering{arabic} 62 | 63 | \input{introduction} 64 | \input{installing} 65 | \input{background} 66 | \input{instrumenting} 67 | \input{required} 68 | \input{environment} 69 | \input{example1} 70 | \input{example2} 71 | \input{example3} 72 | \input{example4} 73 | \input{command-line} 74 | \input{restrictions} 75 | 
\input{discussion} 76 | 77 | \section{Bug Reports} 78 | 79 | We plan to track issues using GitHub's issue tracking features. 80 | 81 | \end{document} 82 | -------------------------------------------------------------------------------- /example/Makefile: -------------------------------------------------------------------------------- 1 | TARGET := matrixMul 2 | OBJECTS := matrixMul.o 3 | 4 | # Copy the environment from env.mk 5 | include ../instlibs/env.mk 6 | 7 | # Make sure NVCC uses the C++ compiler we want it to. 8 | export PATH := $(CCBIN):$(PATH) 9 | 10 | # What example do you want to build? 11 | EXAMPLES := branch cfg none memdiverge opcode ophist ophist-fermi valueprof 12 | 13 | # Point to SASSI's copy of nvcc. 14 | NVCC := $(SASSI_HOME)/bin/nvcc 15 | 16 | # Definitions for the insertion sites. 17 | AFTER_REG := -Xptxas --sassi-inst-after="reg-writes" 18 | BEFORE_ALL := -Xptxas --sassi-inst-before="all" 19 | BEFORE_COND_BRANCHES := -Xptxas --sassi-inst-before="cond-branches" 20 | BEFORE_MEM := -Xptxas --sassi-inst-before="memory" 21 | BEFORE_REGS := -Xptxas --sassi-inst-before="reg-writes,reg-reads" 22 | 23 | # Definitions for the arguments to pass to handlers. 24 | AFTER_REG_INFO := -Xptxas --sassi-after-args="reg-info" 25 | BEFORE_COND_BRANCH_INFO := -Xptxas --sassi-before-args="cond-branch-info" 26 | BEFORE_MEM_INFO := -Xptxas --sassi-before-args="mem-info" 27 | BEFORE_REG_INFO := -Xptxas --sassi-before-args="reg-info" 28 | 29 | # If we want to skip predicated off instructions, include this option. 30 | IFF := -Xptxas --sassi-iff-true-predicate-handler-call 31 | 32 | # If we want to debug. 33 | DEBUG := -Xptxas --sassi-debug 34 | 35 | # The location of our instrumentation libraries. 36 | INST_LIB_DIR = ../instlibs/lib 37 | 38 | # We rely heavily on CUPTI to let us know when kernels are launched. 
39 | CUPTI_LIB_DIR = $(SASSI_HOME)/extras/CUPTI/lib64 40 | CUPTI = -L$(CUPTI_LIB_DIR) -lcupti 41 | 42 | LINK_FLAGS += -lcudadevrt -Xlinker -rpath,$(CUPTI_LIB_DIR) 43 | NVCC_FLAGS += -g -O3 -dc 44 | CXX_FLAGS += -g -O3 45 | 46 | # If you want to "intercept" main and exit, use this GCC-specific option. 47 | # We only use this option for Fermi examples, where we don't have the luxury 48 | # of using UVM. 49 | LDWRAP := -Xlinker "--wrap=main" -Xlinker "--wrap=exit" 50 | 51 | # Depending on the experiment, let's choose different SASSI options. 52 | branch: EXTRA_NVCC_FLAGS = $(LDWRAP) -lineinfo $(BEFORE_COND_BRANCHES) $(BEFORE_COND_BRANCH_INFO) $(NVCC_FLAGS) 53 | branch: EXTRA_LINK_FLAGS = $(LDWRAP) -L$(INST_LIB_DIR) -lbranch $(CUPTI) $(LINK_FLAGS) -L$(BOOST_HOME)/lib -lboost_regex -lcrypto -Xlinker -rpath,$(BOOST_HOME)/lib 54 | 55 | cfg: EXTRA_NVCC_FLAGS = -Xptxas --sassi-function-entry -Xptxas --sassi-bb-entry $(NVCC_FLAGS) 56 | cfg: EXTRA_LINK_FLAGS = $(LINK_FLAGS) -L$(INST_LIB_DIR) -lcfg $(CUPTI) 57 | 58 | none: EXTRA_NVCC_FLAGS = $(NVCC_FLAGS) 59 | none: EXTRA_LINK_FLAGS = $(LINK_FLAGS) 60 | 61 | memdiverge: EXTRA_NVCC_FLAGS = $(BEFORE_MEM) $(BEFORE_MEM_INFO) $(NVCC_FLAGS) 62 | memdiverge: EXTRA_LINK_FLAGS = -L$(INST_LIB_DIR) -lmemdiverge $(CUPTI) $(LINK_FLAGS) 63 | 64 | opcode: EXTRA_NVCC_FLAGS = $(BEFORE_ALL) $(BEFORE_REG_INFO) $(NVCC_FLAGS) 65 | opcode: EXTRA_LINK_FLAGS = -L$(INST_LIB_DIR) -lopcode $(CUPTI) $(LINK_FLAGS) 66 | 67 | ophist: EXTRA_NVCC_FLAGS = $(BEFORE_ALL) $(NVCC_FLAGS) 68 | ophist: EXTRA_LINK_FLAGS = -L$(INST_LIB_DIR) -lophist $(CUPTI) $(LINK_FLAGS) 69 | 70 | ophist-fermi: EXTRA_NVCC_FLAGS = $(LDWRAP) $(BEFORE_ALL) $(NVCC_FLAGS) 71 | ophist-fermi: EXTRA_LINK_FLAGS = $(LDWRAP) -L$(INST_LIB_DIR) -lophist_fermi $(CUPTI) $(LINK_FLAGS) 72 | 73 | valueprof: EXTRA_NVCC_FLAGS = $(AFTER_REG) $(AFTER_REG_INFO) $(IFF) $(NVCC_FLAGS) 74 | valueprof: EXTRA_LINK_FLAGS = -L$(INST_LIB_DIR) -lvalueprof $(CUPTI) $(LINK_FLAGS) 75 | 76 | $(EXAMPLES): $(TARGET) 77 | 
78 | $(TARGET): $(OBJECTS) 79 | $(NVCC) -o $@ $^ $(GENCODE) $(EXTRA_LINK_FLAGS) 80 | 81 | %.o:%.cu 82 | $(NVCC) -I./inc -c $(GENCODE) $(EXTRA_NVCC_FLAGS) -o $@ $^ 83 | 84 | %.o:%.cpp 85 | $(NVCC) -I./inc -c $(CXX_FLAGS) -o $@ $^ 86 | 87 | run: $(TARGET) 88 | ./$(TARGET) 89 | 90 | clean: 91 | $(RM) -f $(TARGET) *.o 92 | 93 | -------------------------------------------------------------------------------- /example/inc/exception.h: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 1993-2013 NVIDIA Corporation. All rights reserved. 3 | * 4 | * Please refer to the NVIDIA end user license agreement (EULA) associated 5 | * with this source code for terms and conditions that govern your use of 6 | * this software. Any use, reproduction, disclosure, or distribution of 7 | * this software and related documentation outside the terms of the EULA 8 | * is strictly prohibited. 9 | * 10 | */ 11 | 12 | /* CUda UTility Library */ 13 | #ifndef _EXCEPTION_H_ 14 | #define _EXCEPTION_H_ 15 | 16 | // includes, system 17 | #include <exception> 18 | #include <stdexcept> 19 | #include <iostream> 20 | #include <stdlib.h> 21 | 22 | //! Exception wrapper. 23 | //! @param Std_Exception Exception out of namespace std for easy typing. 24 | template<class Std_Exception> 25 | class Exception : public Std_Exception 26 | { 27 | public: 28 | 29 | //! @brief Static construction interface 30 | //! @return Always throws ( Located_Exception) 31 | //! @param file file in which the Exception occurs 32 | //! @param line line in which the Exception occurs 33 | //! @param detailed details on the code fragment causing the Exception 34 | static void throw_it(const char *file, 35 | const int line, 36 | const char *detailed = "-"); 37 | 38 | //! Static construction interface 39 | //! @return Always throws ( Located_Exception) 40 | //! @param file file in which the Exception occurs 41 | //! @param line line in which the Exception occurs 42 | //!
@param detailed details on the code fragment causing the Exception 43 | static void throw_it(const char *file, 44 | const int line, 45 | const std::string &detailed); 46 | 47 | //! Destructor 48 | virtual ~Exception() throw(); 49 | 50 | private: 51 | 52 | //! Constructor, default (private) 53 | Exception(); 54 | 55 | //! Constructor, standard 56 | //! @param str string returned by what() 57 | Exception(const std::string &str); 58 | 59 | }; 60 | 61 | //////////////////////////////////////////////////////////////////////////////// 62 | //! Exception handler function for arbitrary exceptions 63 | //! @param ex exception to handle 64 | //////////////////////////////////////////////////////////////////////////////// 65 | template<class Exception_Typ> 66 | inline void 67 | handleException(const Exception_Typ &ex) 68 | { 69 | std::cerr << ex.what() << std::endl; 70 | 71 | exit(EXIT_FAILURE); 72 | } 73 | 74 | //! Convenience macros 75 | 76 | //! Exception caused by dynamic program behavior, e.g. file does not exist 77 | #define RUNTIME_EXCEPTION( msg) \ 78 | Exception<std::runtime_error>::throw_it( __FILE__, __LINE__, msg) 79 | 80 | //! Logic exception in program, e.g. an assert failed 81 | #define LOGIC_EXCEPTION( msg) \ 82 | Exception<std::logic_error>::throw_it( __FILE__, __LINE__, msg) 83 | 84 | //! Out of range exception 85 | #define RANGE_EXCEPTION( msg) \ 86 | Exception<std::range_error>::throw_it( __FILE__, __LINE__, msg) 87 | 88 | //////////////////////////////////////////////////////////////////////////////// 89 | //! Implementation 90 | 91 | // includes, system 92 | #include <sstream> 93 | 94 | //////////////////////////////////////////////////////////////////////////////// 95 | //! Static construction interface. 96 | //! @param Exception causing code fragment (file and line) and detailed infos.
97 | //////////////////////////////////////////////////////////////////////////////// 98 | /*static*/ template<class Std_Exception> 99 | void 100 | Exception<Std_Exception>:: 101 | throw_it(const char *file, const int line, const char *detailed) 102 | { 103 | std::stringstream s; 104 | 105 | // Quite heavy-weight but exceptions are not for 106 | // performance / release versions 107 | s << "Exception in file '" << file << "' in line " << line << "\n" 108 | << "Detailed description: " << detailed << "\n"; 109 | 110 | throw Exception(s.str()); 111 | } 112 | 113 | //////////////////////////////////////////////////////////////////////////////// 114 | //! Static construction interface. 115 | //! @param Exception causing code fragment (file and line) and detailed infos. 116 | //////////////////////////////////////////////////////////////////////////////// 117 | /*static*/ template<class Std_Exception> 118 | void 119 | Exception<Std_Exception>:: 120 | throw_it(const char *file, const int line, const std::string &msg) 121 | { 122 | throw_it(file, line, msg.c_str()); 123 | } 124 | 125 | //////////////////////////////////////////////////////////////////////////////// 126 | //! Constructor, default (private). 127 | //////////////////////////////////////////////////////////////////////////////// 128 | template<class Std_Exception> 129 | Exception<Std_Exception>::Exception() : 130 | Std_Exception("Unknown Exception.\n") 131 | { } 132 | 133 | //////////////////////////////////////////////////////////////////////////////// 134 | //! Constructor, standard (private). 135 | //! String returned by what(). 136 | //////////////////////////////////////////////////////////////////////////////// 137 | template<class Std_Exception> 138 | Exception<Std_Exception>::Exception(const std::string &s) : 139 | Std_Exception(s) 140 | { } 141 | 142 | //////////////////////////////////////////////////////////////////////////////// 143 | //!
Destructor 144 | //////////////////////////////////////////////////////////////////////////////// 145 | template<class Std_Exception> 146 | Exception<Std_Exception>::~Exception() throw() { } 147 | 148 | // functions, exported 149 | 150 | #endif // #ifndef _EXCEPTION_H_ 151 | 152 | -------------------------------------------------------------------------------- /example/inc/helper_functions.h: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright 1993-2013 NVIDIA Corporation. All rights reserved. 3 | * 4 | * Please refer to the NVIDIA end user license agreement (EULA) associated 5 | * with this source code for terms and conditions that govern your use of 6 | * this software. Any use, reproduction, disclosure, or distribution of 7 | * this software and related documentation outside the terms of the EULA 8 | * is strictly prohibited. 9 | * 10 | */ 11 | 12 | // These are helper functions for the SDK samples (string parsing, timers, image helpers, etc) 13 | #ifndef HELPER_FUNCTIONS_H 14 | #define HELPER_FUNCTIONS_H 15 | 16 | #ifdef WIN32 17 | #pragma warning(disable:4996) 18 | #endif 19 | 20 | // includes, project 21 | #include 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | 28 | #include 29 | #include 30 | #include 31 | #include 32 | 33 | // includes, timer, string parsing, image helpers 34 | #include <helper_timer.h> // helper functions for timers 35 | #include <helper_string.h> // helper functions for string parsing 36 | #include <helper_image.h> // helper functions for image compare, dump, data comparisons 37 | 38 | #ifndef EXIT_WAIVED 39 | #define EXIT_WAIVED 2 40 | #endif 41 | 42 | #endif // HELPER_FUNCTIONS_H 43 | -------------------------------------------------------------------------------- /instlibs/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2015, NVIDIA open source projects 2 | All rights reserved. 3 | 4 | This license covers the instrumentation libraries that we include in the 5 | SASSI project.
6 | 7 | Redistribution and use in source and binary forms, with or without 8 | modification, are permitted provided that the following conditions are met: 9 | 10 | * Redistributions of source code must retain the above copyright notice, this 11 | list of conditions and the following disclaimer. 12 | 13 | * Redistributions in binary form must reproduce the above copyright notice, 14 | this list of conditions and the following disclaimer in the documentation 15 | and/or other materials provided with the distribution. 16 | 17 | * Neither the name of SASSI nor the names of its 18 | contributors may be used to endorse or promote products derived from 19 | this software without specific prior written permission. 20 | 21 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 22 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 23 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 24 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 25 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 26 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 27 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 28 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 29 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 30 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 31 | 32 | -------------------------------------------------------------------------------- /instlibs/Makefile: -------------------------------------------------------------------------------- 1 | 2 | # Setup your variables in env.mk before building! 3 | include env.mk 4 | 5 | export PATH:=$(PATH):$(CUDA_HOME)/bin/ 6 | export INSTALL_DIR := $(PWD) 7 | 8 | all: instlibs 9 | 10 | # This target will build all the example libraries, 11 | # but it requires 3.0+ devices. 
12 | instlibs: lib 13 | make -C src install 14 | 15 | # Use this target if you are targeting a Fermi device. 16 | fermi-only: lib 17 | make -C src install-fermi-only 18 | 19 | lib: 20 | mkdir lib 21 | 22 | clean: 23 | make -C src clean 24 | $(RM) -r lib 25 | -------------------------------------------------------------------------------- /instlibs/env.mk: -------------------------------------------------------------------------------- 1 | # Point this toward your CUDA 7 compiler. 2 | export CUDA_HOME ?= /usr/local/cuda-7.0/ 3 | export SASSI_HOME ?= /usr/local/sassi7/ 4 | 5 | # Point this toward a C++11-capable compiler (not the compiler 6 | # itself, just its location). 7 | export CCBIN ?= /usr/local/gcc-4.8.4/bin/ 8 | 9 | # Set this to target your specific GPU. Note some libraries use 10 | # CUDA features that are only supported for >= compute_30. 11 | # IMPORTANT: YOU MUST SPECIFY A REAL ARCHITECTURE. IF YOUR 12 | # code SETTING DOES NOT HAVE THE "sm" PREFIX, YOUR INSTRUMENTATION 13 | # WILL NOT WORK! 14 | export GENCODE ?= -gencode arch=compute_35,code=sm_35 \ 15 | -gencode arch=compute_50,code=sm_50 16 | 17 | # You might want to debug an instrumentation handler. If so, 18 | # uncomment the line below. Be aware that CUPTI and cuda-gdb do 19 | # not play nicely together, so you'll want to make sure you're 20 | # not using any CUPTI-related libraries if you want to debug an 21 | # instrumentation handler... 22 | #export DEBUG = -G -g 23 | 24 | # If you use the src_map example, it relies on boost. You can 25 | # set the location here. 26 | export BOOST_HOME ?= -------------------------------------------------------------------------------- /instlibs/src/Makefile: -------------------------------------------------------------------------------- 1 | ############################################################### 2 | # 3 | # Build and install the SASSI libraries.
4 | # 5 | ############################################################### 6 | 7 | # Setup your variables in env.mk before building! 8 | include ../env.mk 9 | 10 | # These libraries work with Fermi, the others do not. 11 | export FERMI_LIBS := libophist_fermi.a libstubs.a 12 | 13 | export PATH := $(CCBIN):$(PATH) 14 | 15 | # Decide which libraries you want to compile. 16 | export LIBS := libstubs.a \ 17 | libbranch.a \ 18 | libcfg.a \ 19 | libmemdiverge.a \ 20 | libopcode.a \ 21 | libophist.a \ 22 | libvalueprof.a \ 23 | 24 | ifndef INSTALL_DIR 25 | INSTALL_DIR := ../ 26 | endif 27 | 28 | NVCC = $(CUDA_HOME)/bin/nvcc 29 | CUPTI = $(CUDA_HOME)/extras/CUPTI 30 | 31 | INSTALLED_LIBS := $(addprefix $(INSTALL_DIR)/lib/, $(LIBS)) 32 | INSTALLED_FERMI_LIBS := $(addprefix $(INSTALL_DIR)/lib/, $(FERMI_LIBS)) 33 | 34 | install: $(INSTALLED_LIBS) $(INSTALLED_FERMI_LIBS) 35 | 36 | install-fermi-only: $(INSTALLED_FERMI_LIBS) 37 | 38 | $(INSTALL_DIR)/lib/lib%.a: %.o 39 | ar r $@ $^ 40 | ranlib $@ 41 | 42 | %.o: %.cu 43 | $(NVCC) -ccbin $(CCBIN) -std=c++11 \ 44 | --compiler-options -Wall \ 45 | $(DEBUG) \ 46 | -O3 $(GENCODE) \ 47 | -lineinfo -c -rdc=true -o $@ $^ \ 48 | --maxrregcount=16 \ 49 | --compiler-options -Wall \ 50 | -I$(SASSI_HOME)/include \ 51 | -I$(BOOST_HOME)/include \ 52 | -I$(CUPTI)/include/ \ 53 | -I$(INSTALL_DIR)/utils/ \ 54 | -I$(INSTALL_DIR)/include 55 | 56 | clean: 57 | $(RM) *.o *.a *~ 58 | $(RM) $(INSTALLED_LIBS) 59 | -------------------------------------------------------------------------------- /instlibs/src/branch.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 
4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This is a SASSI instrumentation library for gathering branch statistics. It 31 | * corresponds to Case Study I in, 32 | * 33 | * "Flexible Software Profiling of GPU Architectures" 34 | * Stephenson et al., ISCA 2015. 
35 | * 36 | * The application code the user instruments should be instrumented with the 37 | * following SASSI flags: -Xptxas --sassi-inst-before="cond-branches" \ 38 | * -Xptxas --sassi-before-args="cond-branch-info". 39 | * 40 | * In addition, be sure to link your application with flags necessary to 41 | * hijack "main" and "exit". You can trivially do this using GNU tools with 42 | * 43 | * -Xlinker "--wrap=main" -Xlinker "--wrap=exit" 44 | * 45 | * This will cause calls to main and exit to be replaced by calls to 46 | * __wrap_main(int argc, char **argv) and __wrap_exit(int status), which we have 47 | * defined below. This allows us to do initialization and finalization without 48 | * having to worry about object constructor and destructor orders. 49 | * 50 | * This version of the library also lets us correlate SASS locations to the 51 | * corresponding CUDA source locations. To use this feature, you must 52 | * compile your application with the "-lineinfo" option. 53 | * 54 | * See the branch example in example/Makefile for all the flags you should use. 55 | * 56 | \***********************************************************************************/ 57 | 58 | #define __STDC_FORMAT_MACROS 59 | #include 60 | #include 61 | #include 62 | #include 63 | #include 64 | #include 65 | #include 66 | #include 67 | #include "sassi_intrinsics.h" 68 | #include "sassi_dictionary.hpp" 69 | #include "sassi_srcmap.hpp" 70 | #include 71 | #include 72 | 73 | struct BranchCounter { 74 | uint64_t address; 75 | int32_t branchType; // The branch type. 76 | int32_t taggedUnanimous; // Branch had .U modifier, so compiler knows... 77 | unsigned long long totalBranches; 78 | unsigned long long takenThreads; 79 | unsigned long long takenNotThreads; 80 | unsigned long long divergentBranches; // Not all branches go the same way. 81 | unsigned long long activeThreads; // Number of active threads.
82 | }; 83 | 84 | static sassi::src_mapper *sassiMapper; 85 | 86 | // The actual dictionary of counters, where the key is a branch's PC, and 87 | // the value is the set of counters associated with it. 88 | static __managed__ sassi::dictionary *sassiStats; 89 | 90 | // Convert the SASSIBranchType to a string that we can print. See the 91 | // CUDA binary utilities webpage for more information about these types. 92 | const char *SASSIBranchTypeAsString[] = { 93 | "BRX", "BRA", "RET", "EXIT", "SYNC", "OTHER" 94 | }; 95 | 96 | 97 | /////////////////////////////////////////////////////////////////////////////////// 98 | /// 99 | /// Collect the stats and print them out before the device counters are reset. 100 | /// 101 | /////////////////////////////////////////////////////////////////////////////////// 102 | static void sassi_finalize(__attribute__((unused)) sassi::cupti_wrapper *wrapper, 103 | __attribute__((unused)) const CUpti_CallbackData *cb) 104 | { 105 | // This function will be called either when 1) the device is reset, or 2) 106 | // the program is about to exit. Let's check to see whether the sassiStats 107 | // map is still valid. For instance, the user could have reset the device 108 | // before the program exited, which would essentially invalidate all device 109 | // data. (In fact, explicitly resetting the device before program exit is 110 | // considered best practice.) 111 | if (sassiMapper->is_device_state_valid()) 112 | { 113 | FILE *fRes = fopen("sassi-branch.txt", "w"); 114 | 115 | fprintf(fRes, "%-16.16s %-10.10s %-10.10s %-10.10s %-10.10s %-10.10s %-8.8s %-8.8s Location\n", 116 | "Address", "Total/32", "Dvrge/32", "Active", "Taken", "NTaken", 117 | "Type", ".U"); 118 | 119 | // Get the SASS PUPC to source code line mapping.
120 | auto const locMapper = sassiMapper->get_location_map(); 121 | 122 | sassiStats->map([fRes,&locMapper](uint64_t& pupc, BranchCounter& val) { 123 | assert(val.address == pupc); 124 | 125 | fprintf(fRes, "%-16.16" PRIx64 126 | " %-10.llu %-10.llu %-10.llu %-10.llu %-10.llu %-8.4s %-8.d ", 127 | pupc, 128 | val.totalBranches, 129 | val.divergentBranches, 130 | val.activeThreads, 131 | val.takenThreads, 132 | val.takenNotThreads, 133 | SASSIBranchTypeAsString[val.branchType], 134 | val.taggedUnanimous 135 | ); 136 | 137 | // See if there is a source code mapping for this PUPC. If you 138 | // compiled your code with "-lineinfo" there should be a valid 139 | // mapping. 140 | auto it = locMapper.find(pupc); 141 | if (it != locMapper.end()) { 142 | fprintf(fRes, "%s, line %d\n", it->second.file_name->c_str(), it->second.line_num); 143 | } else { 144 | fprintf(fRes, "\n"); 145 | } 146 | }); 147 | 148 | fclose(fRes); 149 | } 150 | } 151 | 152 | /////////////////////////////////////////////////////////////////////////////////// 153 | /// 154 | /// We will compile our application using ld's --wrap option, which in this 155 | /// case lets us replace calls to "exit" with calls to "__wrap_exit". See 156 | /// the make target "ophist-fermi" in ./example/Makefile to see how this 157 | /// is done. 158 | /// 159 | /// This should allow us to perform CUDA operations before the CUDA runtime 160 | /// starts shutting down. In particular, we want to copy our 161 | /// "dynamic_instr_counts" off the device. If we used UVM, this would happen 162 | /// automatically for us. But since we don't have the luxury of using UVM 163 | /// for Fermi, we have to make sure that the CUDA runtime is still up and 164 | /// running before trying to issue a cudaMemcpy. Hence these shenanigans. 
165 | /// 166 | /////////////////////////////////////////////////////////////////////////////////// 167 | extern "C" void __real_exit(int status); 168 | extern "C" void __wrap_exit(int status) 169 | { 170 | sassi_finalize(NULL, NULL); 171 | __real_exit(status); 172 | } 173 | 174 | /////////////////////////////////////////////////////////////////////////////////// 175 | /// 176 | /// For programs that don't call exit explicitly, let's catch the fallthrough. 177 | /// 178 | /////////////////////////////////////////////////////////////////////////////////// 179 | extern "C" int __real_main(int argc, char **argv); 180 | extern "C" int __wrap_main(int argc, char **argv) 181 | { 182 | // Initialize a src_mapper to give us SASS PC->CUDA line mappings. 183 | sassiMapper = new sassi::src_mapper(); 184 | 185 | // Initialize a hashmap to keep track of statistics of branches. The key 186 | // is the PC, the value is a BranchCounter. 187 | sassiStats = new sassi::dictionary(); 188 | 189 | // Whenever the device is reset, be sure to print out the counters before 190 | // they are clobbered. 191 | sassiMapper->register_callback(sassi::cupti_wrapper::event_type::DEVICE_RESET, 192 | sassi::cupti_wrapper::callback_before, 193 | sassi_finalize); 194 | 195 | int ret = __real_main(argc, argv); 196 | sassi_finalize(NULL, NULL); 197 | return ret; 198 | } 199 | 200 | /////////////////////////////////////////////////////////////////////////////////// 201 | // 202 | /// This function will be inserted before every conditional branch instruction. 203 | // 204 | /////////////////////////////////////////////////////////////////////////////////// 205 | __device__ void sassi_before_handler(SASSIBeforeParams *bp, SASSICondBranchParams *brp) 206 | { 207 | // Find out thread index within the warp. 
208 | int threadIdxInWarp = get_laneid(); 209 | 210 | // Get masks and counts of 1) active threads in this warp, 211 | // 2) threads that take the branch, and 212 | // 3) threads that do not take the branch. 213 | int active = __ballot(1); 214 | bool dir = brp->GetDirection(); 215 | int taken = __ballot(dir == true); 216 | int ntaken = __ballot(dir == false); 217 | int numActive = __popc(active); 218 | int numTaken = __popc(taken); 219 | int numNotTaken = __popc(ntaken); 220 | bool divergent = (numTaken != numActive && numNotTaken != numActive); 221 | 222 | // The first active thread in each warp gets to write results. 223 | if ((__ffs(active)-1) == threadIdxInWarp) { 224 | // Get the address, we'll use it for hashing. 225 | uint64_t instAddr = bp->GetPUPC(); 226 | 227 | // Look up the counters associated with 'instAddr'; if no such entry 228 | // exists, initialize the counters in the lambda. 229 | BranchCounter *stats = (*sassiStats).getOrInit(instAddr, [instAddr,brp](BranchCounter* v) { 230 | v->address = instAddr; 231 | v->branchType = brp->GetType(); 232 | v->taggedUnanimous = brp->IsUnanimous(); 233 | }); 234 | 235 | // Why not sanity check the hash map? 236 | assert(stats->address == instAddr); 237 | assert(numTaken + numNotTaken == numActive); 238 | 239 | // Increment the various counters that are associated 240 | // with this instruction appropriately.
241 | atomicAdd(&(stats->totalBranches), 1ULL); 242 | atomicAdd(&(stats->activeThreads), numActive); 243 | atomicAdd(&(stats->takenThreads), numTaken); 244 | atomicAdd(&(stats->takenNotThreads), numNotTaken); 245 | atomicAdd(&(stats->divergentBranches), divergent); 246 | } 247 | } 248 | 249 | 250 | 251 | -------------------------------------------------------------------------------- /instlibs/src/cfg.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This example shows how to use SASSI to inspect the control flow graph. 31 | * 32 | * The application code the user instruments should be instrumented with the 33 | * following SASSI flags: -Xptxas --sassi-function-entry -Xptxas --sassi-bb-entry 34 | * 35 | \***********************************************************************************/ 36 | 37 | #include 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #include "sassi_intrinsics.h" 44 | #include "sassi_lazyallocator.hpp" 45 | #include "sassi_dictionary.hpp" 46 | #include 47 | 48 | // 8Mb of space for CFG information. 49 | #define POOLSIZE (1024 * 1024 * 8) 50 | #define MAX_FN_STR_LEN 64 51 | 52 | // Create a memory pool that we can populate on the device and read on the host. 53 | static __managed__ uint8_t sassi_mempool[POOLSIZE]; 54 | static __managed__ int sassi_mempool_cur; 55 | 56 | // A structure to record a basic block. We will perform a deep copy 57 | // of SASSI's SASSIBasicBlockParams for each basic block. 58 | struct BLOCK { 59 | int id; 60 | unsigned long long weight; 61 | bool isEntry; 62 | bool isExit; 63 | int numInstrs; 64 | int numSuccs; 65 | int succs[2]; 66 | }; 67 | 68 | // A structure to record a function's CFG. 69 | struct CFG { 70 | char fnName[MAX_FN_STR_LEN]; 71 | int numBlocks; 72 | BLOCK *blocks; 73 | }; 74 | 75 | // A dictionary of CFGs. 
76 | static __managed__ sassi::dictionary *sassi_cfg; 77 | 78 | // A dictionary of basic blocks. 79 | static __managed__ sassi::dictionary *sassi_cfg_blocks; 80 | 81 | /////////////////////////////////////////////////////////////////////////////////// 82 | /// 83 | /// Allocate data from the UVM mempool. 84 | /// 85 | /////////////////////////////////////////////////////////////////////////////////// 86 | __device__ void *simple_malloc(size_t sz) 87 | { 88 | int ptr = atomicAdd(&sassi_mempool_cur, sz); 89 | assert ((ptr + sz) <= POOLSIZE); 90 | return (void*) &(sassi_mempool[ptr]); 91 | } 92 | 93 | /////////////////////////////////////////////////////////////////////////////////// 94 | /// 95 | /// A simple string copy to copy from device memory to our UVM malloc'd region. 96 | /// 97 | /////////////////////////////////////////////////////////////////////////////////// 98 | __device__ void simple_strncpy(char *dest, const char *src) 99 | { 100 | int i; 101 | for (i = 0; i < MAX_FN_STR_LEN-1; i++) { 102 | char c = src[i]; 103 | if (c == 0) break; 104 | dest[i] = c; 105 | } 106 | dest[i] = '\0'; 107 | } 108 | 109 | /////////////////////////////////////////////////////////////////////////////////// 110 | /// 111 | /// A call to this function will be inserted at the beginning of every 112 | /// CUDA function or kernel. We will essentially perform a deep copy of the 113 | /// CFG SASSI presents. All of the copied data only contains static information 114 | /// about the CFG. In the sassi_basic_block_entry handler, below, we will 115 | /// record the dynamic number of times the basic block is invoked. 
116 | /// 117 | /////////////////////////////////////////////////////////////////////////////////// 118 | __device__ void sassi_function_entry(SASSIFunctionParams* fp) 119 | { 120 | int numBlocks = fp->GetNumBlocks(); 121 | const SASSIBasicBlockParams * const * blocks = fp->GetBlocks(); 122 | 123 | CFG *cPtr = *(sassi_cfg->getOrInit((int64_t)fp, [numBlocks, blocks, fp](CFG **cfg) { 124 | CFG *cPtr = (CFG*) simple_malloc(sizeof(CFG)); 125 | simple_strncpy(cPtr->fnName, fp->GetFnName()); 126 | cPtr->numBlocks = numBlocks; 127 | cPtr->blocks = (BLOCK*) simple_malloc(sizeof(BLOCK) * numBlocks); 128 | *cfg = cPtr; 129 | })); 130 | 131 | __threadfence(); 132 | 133 | for (int bb = 0; bb < numBlocks; bb++) { 134 | const SASSIBasicBlockParams *blockParam = blocks[bb]; 135 | BLOCK *blockPtr = &(cPtr->blocks[bb]); 136 | sassi_cfg_blocks->getOrInit((int64_t)blockParam, [blockParam, blockPtr](BLOCK **bpp) { 137 | *bpp = blockPtr; 138 | blockPtr->id = blockParam->GetID(); 139 | blockPtr->weight = 0; 140 | blockPtr->isEntry = blockParam->IsEntryBlock(); 141 | blockPtr->isExit = blockParam->IsExitBlock(); 142 | blockPtr->numInstrs = blockParam->GetNumInstrs(); 143 | blockPtr->numSuccs = blockParam->GetNumSuccs(); 144 | assert(blockParam->GetNumSuccs() <= 2); 145 | const SASSIBasicBlockParams * const * succs = blockParam->GetSuccs(); 146 | for (int s = 0; s < blockParam->GetNumSuccs(); s++) { 147 | blockPtr->succs[s] = succs[s]->GetID(); 148 | } 149 | }); 150 | } 151 | } 152 | 153 | /////////////////////////////////////////////////////////////////////////////////// 154 | /// 155 | /// Simply look up the basic block in our dictionary, get its "weight" field 156 | /// and increment it.
157 | /// 158 | /////////////////////////////////////////////////////////////////////////////////// 159 | __device__ void sassi_basic_block_entry(SASSIBasicBlockParams *bb) 160 | { 161 | BLOCK **blockStr = sassi_cfg_blocks->getOrInit((int64_t)bb, [](BLOCK **bpp) { assert(0); }); 162 | atomicAdd(&((*blockStr)->weight), 1); 163 | } 164 | 165 | /////////////////////////////////////////////////////////////////////////////////// 166 | /// 167 | /// Print the graph out in "dot" format. 168 | /// E.g., use: 169 | /// 170 | /// dot -Tps -o graph.ps sassi-cfg.dot 171 | /// 172 | /// to render the graph in postscript. 173 | /// 174 | /////////////////////////////////////////////////////////////////////////////////// 175 | static void sassi_finalize(sassi::lazy_allocator::device_reset_reason unused) 176 | { 177 | cudaDeviceSynchronize(); 178 | FILE *cfgFile = fopen("sassi-cfg.dot", "w"); 179 | sassi_cfg->map([cfgFile](int64_t k, CFG* &cfg) { 180 | fprintf(cfgFile, "digraph %s {\n", cfg->fnName); 181 | double weightMax = 0.0; 182 | for (int bb = 0; bb < cfg->numBlocks; bb++) { 183 | BLOCK *block = &(cfg->blocks[bb]); 184 | weightMax = std::max(weightMax, (double)block->weight); 185 | } 186 | for (int bb = 0; bb < cfg->numBlocks; bb++) { 187 | BLOCK *block = &(cfg->blocks[bb]); 188 | int per = block->isExit ? 3 : 1; 189 | int boxWeight = 100 - std::round(100.0 * ((double)block->weight / weightMax)); 190 | int fontWeight = boxWeight > 40 ? 
0 : 100; 191 | fprintf(cfgFile, "\tBB%d [style=filled,fontcolor=gray%d,shape=box," 192 | "peripheries=%d,color=gray%d,label=\"BB%d : %d ins\"];\n", 193 | block->id, fontWeight, per, boxWeight, block->id, block->numInstrs); 194 | } 195 | for (int bb = 0; bb < cfg->numBlocks; bb++) { 196 | BLOCK *block = &(cfg->blocks[bb]); 197 | for (int s = 0; s < block->numSuccs; s++) { 198 | fprintf(cfgFile, "\tBB%d -> BB%d;\n", block->id, block->succs[s]); 199 | } 200 | } 201 | fprintf(cfgFile, "}\n"); 202 | }); 203 | fclose(cfgFile); 204 | } 205 | 206 | /////////////////////////////////////////////////////////////////////////////////// 207 | /// 208 | /// Initialize the UVM memory pool and our two dictionaries. 209 | /// 210 | /////////////////////////////////////////////////////////////////////////////////// 211 | static void sassi_init() 212 | { 213 | sassi_mempool_cur = 0; 214 | bzero(sassi_mempool, sizeof(sassi_mempool)); 215 | sassi_cfg = new sassi::dictionary(601); 216 | sassi_cfg_blocks = new sassi::dictionary(7919); 217 | } 218 | 219 | 220 | /////////////////////////////////////////////////////////////////////////////////// 221 | /// 222 | /// 223 | /// 224 | /////////////////////////////////////////////////////////////////////////////////// 225 | static sassi::lazy_allocator mapAllocator(sassi_init, sassi_finalize); 226 | -------------------------------------------------------------------------------- /instlibs/src/memdiverge.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 
4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This example tracks memory divergence, and is based on case study II in, 31 | * 32 | * "Flexible Software Profiling of GPU Architectures" 33 | * Stephenson et al., ISCA 2015. 
34 | * 35 | * The application code the user instruments should be instrumented with the 36 | * following SASSI flag: -Xptxas --sassi-inst-before="memory" 37 | * -Xptxas --sassi-before-args="mem-info" 38 | * 39 | \***********************************************************************************/ 40 | 41 | #include 42 | #include 43 | #include 44 | #include 45 | #include 46 | #include 47 | #include 48 | #include "sassi_intrinsics.h" 49 | #include "sassi_lazyallocator.hpp" 50 | #include 51 | #include 52 | 53 | /// The number of bits we need to shift off to get the cache line address. 54 | #define LINE_BITS 5 55 | 56 | // The width of a warp. 57 | #define WARP_SIZE 32 58 | 59 | /// The counters that we will use to record our statistics. 60 | __managed__ unsigned long long sassi_counters[WARP_SIZE + 1][WARP_SIZE + 1]; 61 | 62 | 63 | /////////////////////////////////////////////////////////////////////////////////// 64 | /// 65 | /// Lazily initialize the counters before the first kernel launch. 66 | /// 67 | /////////////////////////////////////////////////////////////////////////////////// 68 | static sassi::lazy_allocator counterInitializer( 69 | /* Initialize the counters. */ 70 | []() { 71 | bzero(sassi_counters, sizeof(sassi_counters)); 72 | }, 73 | 74 | /* Dump the stats. */ 75 | [](sassi::lazy_allocator::device_reset_reason reason) { 76 | FILE *rf = fopen("sassi-memdiverge.txt", "a"); 77 | fprintf(rf, "Active x Diverged:\n"); 78 | for (unsigned m = 0; m <= WARP_SIZE; m++) { 79 | fprintf(rf, "%d ", m); 80 | for (unsigned u = 0; u <= WARP_SIZE; u++) 81 | { 82 | fprintf(rf, "%llu ", sassi_counters[m][u]); 83 | } 84 | fprintf(rf, "\n"); 85 | } 86 | fprintf(rf, "\n"); 87 | fclose(rf); 88 | } 89 | ); 90 | 91 | 92 | /////////////////////////////////////////////////////////////////////////////////// 93 | /// 94 | /// This is the function that will be inserted before every memory operation. 
95 | /// 96 | /////////////////////////////////////////////////////////////////////////////////// 97 | __device__ void sassi_before_handler(SASSIBeforeParams *bp, SASSIMemoryParams *mp) 98 | { 99 | if (bp->GetInstrWillExecute()) 100 | { 101 | intptr_t addrAsInt = mp->GetAddress(); 102 | // Don't look at shared or local memory. 103 | if (__isGlobal((void*)addrAsInt)) { 104 | // The number of unique addresses across the warp 105 | unsigned unique = 0; // for the instrumented instruction. 106 | 107 | // Shift off the offset bits into the cache line. 108 | intptr_t lineAddr = addrAsInt >> LINE_BITS; 109 | 110 | int workset = __ballot(1); 111 | int firstActive = __ffs(workset) - 1; 112 | int numActive = __popc(workset); 113 | while (workset) { 114 | // Elect a leader, get its line, see who all matches it. 115 | int leader = __ffs(workset) - 1; 116 | intptr_t leadersAddr = __broadcast(lineAddr, leader); 117 | int notMatchesLeader = __ballot(leadersAddr != lineAddr); 118 | 119 | // We have accounted for all values that match the leader's. 120 | // Let's remove them all from the workset. 121 | workset = workset & notMatchesLeader; 122 | unique++; 123 | assert(unique <= 32); 124 | } 125 | 126 | assert(unique > 0 && unique <= 32); 127 | 128 | // Each thread independently computed 'numActive' and 'unique'. 129 | // Let's let the first active thread actually tally the result. 130 | int threadsLaneId = get_laneid(); 131 | if (threadsLaneId == firstActive) { 132 | atomicAdd(&(sassi_counters[numActive][unique]), 1LL); 133 | } 134 | } 135 | } 136 | } 137 | -------------------------------------------------------------------------------- /instlibs/src/opcode.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 
4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This example shows how to use SASSI to extract SASS opcodes and register 31 | * information. It will dump a file that shows all of the instructions that 32 | * were executed at least once, and for each instruction, it will dump all 33 | * of the registers that were sourced and defined. 
34 | * 35 | * The application code the user instruments should be instrumented with the 36 | * following SASSI flags: -Xptxas --sassi-inst-before="all" \ 37 | * -Xptxas --sassi-before-args="reg-info" 38 | * 39 | \***********************************************************************************/ 40 | 41 | #define __STDC_FORMAT_MACROS 42 | #include 43 | #include 44 | #include 45 | #include 46 | #include 47 | #include 48 | #include 49 | #include 50 | #include "sassi_intrinsics.h" 51 | #include "sassi_dictionary.hpp" 52 | #include "sassi_lazyallocator.hpp" 53 | #include 54 | #include 55 | #include 56 | 57 | 58 | // A struct to record the opcode and registers used by each instruction. 59 | struct SASSIInfo { 60 | SASSIInstrOpcode opcode; 61 | int32_t numGPRDstPtrs; 62 | int32_t numGPRSrcPtrs; 63 | int32_t GPRDsts[SASSI_NUMGPRDSTS]; 64 | int32_t GPRSrcs[SASSI_NUMGPRSRCS]; 65 | bool CCDst; 66 | bool CCSrc; 67 | int32_t PRDst; 68 | int32_t PRSrc; 69 | uint64_t pupc; 70 | unsigned long long weight; 71 | }; 72 | 73 | 74 | /// The actual dictionary, declared as a UVM managed type. 75 | static __managed__ sassi::dictionary<uint64_t, SASSIInfo> *sassi_stats; 76 | 77 | 78 | /////////////////////////////////////////////////////////////////////////////////// 79 | /// 80 | /// Print out predicate usage. 81 | /// 82 | /////////////////////////////////////////////////////////////////////////////////// 83 | static void print_predicates(FILE *resultsFile, int32_t mask) 84 | { 85 | int i = 0; 86 | while (mask != 0) { 87 | if (mask & 0x1) { 88 | fprintf(resultsFile, "(p%d) ", i); 89 | } 90 | mask = mask >> 1; 91 | i++; 92 | } 93 | } 94 | 95 | /////////////////////////////////////////////////////////////////////////////////// 96 | /// 97 | /// Prints out CC register usage.
98 | /// 99 | /////////////////////////////////////////////////////////////////////////////////// 100 | static void print_cc(FILE *resultsFile, bool cc) 101 | { 102 | if (cc) { 103 | fprintf(resultsFile, "(CC) "); 104 | } else { 105 | fprintf(resultsFile, " "); 106 | } 107 | } 108 | 109 | /////////////////////////////////////////////////////////////////////////////////// 110 | /// 111 | /// Write out the statistics we've gathered. 112 | /// 113 | /////////////////////////////////////////////////////////////////////////////////// 114 | static void sassi_finalize(sassi::lazy_allocator::device_reset_reason reason) 115 | { 116 | struct KVTuple { 117 | uint64_t pupc; 118 | SASSIInfo *v; 119 | }; 120 | 121 | FILE *resultsFile = fopen("sassi-opcode-map.txt", "a"); 122 | 123 | // Let's first sort the instructions according to PC. 124 | std::vector<KVTuple> ops; 125 | sassi_stats->map([&ops](uint64_t& key, SASSIInfo &val) { 126 | ops.push_back({key, &val}); 127 | }); 128 | 129 | std::sort(ops.begin(), ops.end(), [](KVTuple a, const KVTuple b) { 130 | return (a.pupc < b.pupc); 131 | }); 132 | 133 | for (KVTuple t : ops) { 134 | assert(t.pupc == (t.v)->pupc); // Consistency check. 135 | 136 | fprintf(resultsFile, "%-10.llu ", (t.v)->weight); // Print the dynamic weight. 137 | fprintf(resultsFile, "%-16.16" PRIx64 " ", t.pupc); // Print the virtual PC. 138 | print_predicates(resultsFile, (t.v)->PRDst); // Predicate dests. 139 | print_cc(resultsFile, (t.v)->CCDst); 140 | for (int d = 0; d < (t.v)->numGPRDstPtrs; d++) // GPR dests. 141 | fprintf(resultsFile, "R%d ", (t.v)->GPRDsts[d]); 142 | // Print the opcode. 143 | fprintf(resultsFile, "%s ", SASSIInstrOpcodeStrings[(t.v)->opcode]); 144 | print_predicates(resultsFile, (t.v)->PRSrc); // Predicate srcs. 145 | print_cc(resultsFile, (t.v)->CCSrc); 146 | for (int s = 0; s < (t.v)->numGPRSrcPtrs; s++) // GPR srcs.
147 | fprintf(resultsFile, "R%d ", (t.v)->GPRSrcs[s]); 148 | fprintf(resultsFile, "\n"); 149 | } 150 | 151 | fclose(resultsFile); 152 | } 153 | 154 | 155 | /////////////////////////////////////////////////////////////////////////////////// 156 | /// 157 | /// Lazily allocate a dictionary before the first kernel launch. 158 | /// 159 | /////////////////////////////////////////////////////////////////////////////////// 160 | static sassi::lazy_allocator mapAllocator([]() { 161 | sassi_stats = new sassi::dictionary<uint64_t, SASSIInfo>(); 162 | }, sassi_finalize); 163 | 164 | 165 | /////////////////////////////////////////////////////////////////////////////////// 166 | /// 167 | /// Records static information about each instruction. Also records the 168 | /// dynamic weight associated with each instruction. 169 | /// 170 | /////////////////////////////////////////////////////////////////////////////////// 171 | __device__ void sassi_before_handler(SASSIBeforeParams* bp, SASSIRegisterParams *rp) 172 | { 173 | int threadIdxInWarp = get_laneid(); 174 | int active = __ballot(1); 175 | int firstActiveThread = (__ffs(active)-1); /*leader*/ 176 | 177 | // The warp leader gets to write results. 178 | if (threadIdxInWarp == firstActiveThread) { 179 | // Get the "probably unique" PC. 180 | uint64_t pupc = bp->GetPUPC(); 181 | 182 | // When an instruction is allocated a spot in the hashmap, we will go 183 | // ahead and initialize the slot with this instruction's static information.
184 | SASSIInfo *stats = sassi_stats->getOrInit(pupc, [bp,rp,pupc](SASSIInfo *v) { 185 | v->opcode = bp->GetOpcode(); 186 | v->numGPRDstPtrs = rp->numGPRDstPtrs; 187 | v->numGPRSrcPtrs = rp->numGPRSrcPtrs; 188 | v->CCDst = rp->IsCCDefined(); 189 | v->CCSrc = rp->IsCCUsed(); 190 | v->PRDst = rp->GetPredicateDstMask(); 191 | v->PRSrc = rp->GetPredicateSrcMask(); 192 | v->pupc = pupc; 193 | v->weight = 0; 194 | for (unsigned d = 0; d < rp->numGPRDstPtrs; d++) 195 | v->GPRDsts[d] = rp->GetRegNum(rp->GetGPRDst(d)); 196 | for (unsigned s = 0; s < rp->numGPRSrcPtrs; s++) 197 | v->GPRSrcs[s] = rp->GetRegNum(rp->GetGPRSrc(s)); 198 | }); 199 | 200 | // Every time this instruction is executed, let's increment its weight. 201 | atomicAdd(&(stats->weight), 1); 202 | } 203 | } 204 | -------------------------------------------------------------------------------- /instlibs/src/ophist.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 
18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This example shows how to use SASSI to create a histogram of opcodes 31 | * encountered during the execution of a program. 32 | * 33 | * The application code the user instruments should be instrumented with the 34 | * following SASSI flag: "-Xptxas --sassi-inst-before=all". 35 | * 36 | \***********************************************************************************/ 37 | 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #include "sassi_intrinsics.h" 44 | #include "sassi_lazyallocator.hpp" 45 | #include 46 | 47 | // Keep track of all the opcodes that were executed. 48 | __managed__ unsigned long long dynamic_instr_counts[SASSI_NUM_OPCODES]; 49 | 50 | /////////////////////////////////////////////////////////////////////////////////// 51 | /// 52 | /// This is a SASSI handler that handles only basic information about each 53 | /// instrumented instruction. The calls to this handler are placed by 54 | /// convention *before* each instrumented instruction. 
55 | /// 56 | /////////////////////////////////////////////////////////////////////////////////// 57 | __device__ void sassi_before_handler(SASSIBeforeParams* bp) 58 | { 59 | if (bp->GetInstrWillExecute()) 60 | { 61 | SASSIInstrOpcode op = bp->GetOpcode(); 62 | atomicAdd(&(dynamic_instr_counts[op]), 1ULL); 63 | } 64 | } 65 | 66 | /////////////////////////////////////////////////////////////////////////////////// 67 | /// 68 | /// Write out the statistics we've gathered. 69 | /// 70 | /////////////////////////////////////////////////////////////////////////////////// 71 | void sassi_finalize(sassi::lazy_allocator::device_reset_reason reason) 72 | { 73 | FILE *resultFile = fopen("sassi-ophist.txt", "w"); 74 | for (unsigned i = 0; i < SASSI_NUM_OPCODES; i++) { 75 | if (dynamic_instr_counts[i] > 0) { 76 | fprintf(resultFile, "%-10.10s: %llu\n", SASSIInstrOpcodeStrings[i], dynamic_instr_counts[i]); 77 | } 78 | } 79 | fclose(resultFile); 80 | } 81 | 82 | /////////////////////////////////////////////////////////////////////////////////// 83 | /// 84 | /// Lazily initialize the counters before the first kernel launch. 85 | /// 86 | /////////////////////////////////////////////////////////////////////////////////// 87 | static sassi::lazy_allocator counterInitializer( 88 | /* Initialize the counters. */ 89 | []() { 90 | bzero(dynamic_instr_counts, sizeof(dynamic_instr_counts)); 91 | }, sassi_finalize); 92 | 93 | -------------------------------------------------------------------------------- /instlibs/src/ophist_fermi.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 
4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This example shows how to use SASSI to create a histogram of opcodes 31 | * encountered during the execution of a program. Unlike many of the other 32 | * examples we include, this example does not use Unified Virtual Memory (UVM), 33 | * and is intended as an example that applies across all NVIDIA's architectures 34 | * from Fermi to Maxwell. 
35 | * 36 | * The application code the user instruments should be instrumented with the 37 | * following SASSI flag: "-Xptxas --sassi-inst-before=all". 38 | * 39 | * Additionally, we rely on ld's "--wrap" functionality to "intercept" 40 | * main and exit. SASSI issues #4 and #6 prompted this approach. The reason 41 | * we intercept main and exit is commented below. 42 | * 43 | \***********************************************************************************/ 44 | 45 | #include 46 | #include 47 | #include 48 | #include 49 | #include 50 | #include 51 | #include "sassi_intrinsics.h" 52 | #include "sassi_lazyallocator.hpp" 53 | #include 54 | 55 | // Keep track of all the opcodes that were executed. Normally, we would declare 56 | // this array to be __managed__ so that UVM would take care of copying data 57 | // back and forth. In this example, we will just declare this array to reside 58 | // on the device, and we will explicitly copy the data back and forth. 59 | __device__ unsigned long long dynamic_instr_counts[SASSI_NUM_OPCODES]; 60 | 61 | /////////////////////////////////////////////////////////////////////////////////// 62 | /// 63 | /// This is a SASSI handler that handles only basic information about each 64 | /// instrumented instruction. The calls to this handler are placed by 65 | /// convention *before* each instrumented instruction. 66 | /// 67 | /////////////////////////////////////////////////////////////////////////////////// 68 | __device__ void sassi_before_handler(SASSIBeforeParams* bp) 69 | { 70 | if (bp->GetInstrWillExecute()) 71 | { 72 | SASSIInstrOpcode op = bp->GetOpcode(); 73 | atomicAdd(&(dynamic_instr_counts[op]), 1ULL); 74 | } 75 | } 76 | 77 | /////////////////////////////////////////////////////////////////////////////////// 78 | // The file that we will dump results to. 
We will initialize this in our 79 | // __wrap_main() function, which will be called before the instrumented program's 80 | // main if we compile the application correctly. 81 | /////////////////////////////////////////////////////////////////////////////////// 82 | static FILE *resultFile = NULL; 83 | 84 | /////////////////////////////////////////////////////////////////////////////////// 85 | /// 86 | /// Get the counters off the device and dump them. 87 | /// 88 | /////////////////////////////////////////////////////////////////////////////////// 89 | static void collect_and_dump_counters(const char *when) 90 | { 91 | unsigned long long instr_counts[SASSI_NUM_OPCODES]; 92 | 93 | assert(resultFile != NULL); 94 | 95 | // Copy the data off of the device. 96 | CHECK_CUDA_ERROR(cudaMemcpyFromSymbol(&instr_counts, dynamic_instr_counts, sizeof(instr_counts))); 97 | 98 | fprintf(resultFile, "----- Results %s -----\n", when); 99 | for (unsigned i = 0; i < SASSI_NUM_OPCODES; i++) { 100 | if (instr_counts[i] > 0) { 101 | fprintf(resultFile, "%-10.10s: %llu\n", SASSIInstrOpcodeStrings[i], instr_counts[i]); 102 | } 103 | } 104 | } 105 | 106 | /////////////////////////////////////////////////////////////////////////////////// 107 | /// 108 | /// We will compile our application using ld's --wrap option, which in this 109 | /// case lets us replace calls to "exit" with calls to "__wrap_exit". See 110 | /// the make target "ophist-fermi" in ./example/Makefile to see how this 111 | /// is done. 112 | /// 113 | /// This should allow us to perform CUDA operations before the CUDA runtime 114 | /// starts shutting down. In particular, we want to copy our 115 | /// "dynamic_instr_counts" off the device. If we used UVM, this would happen 116 | /// automatically for us. But since we don't have the luxury of using UVM 117 | /// for Fermi, we have to make sure that the CUDA runtime is still up and 118 | /// running before trying to issue a cudaMemcpy. Hence these shenanigans. 
119 | /// 120 | /////////////////////////////////////////////////////////////////////////////////// 121 | extern "C" void __real_exit(int status); 122 | extern "C" void __wrap_exit(int status) 123 | { 124 | collect_and_dump_counters("before program explicitly ended"); 125 | fclose(resultFile); 126 | __real_exit(status); 127 | } 128 | 129 | 130 | /////////////////////////////////////////////////////////////////////////////////// 131 | /// 132 | /// For programs that don't call exit explicitly, let's catch the fallthrough. 133 | /// 134 | /////////////////////////////////////////////////////////////////////////////////// 135 | extern "C" int __real_main(int argc, char **argv); 136 | extern "C" int __wrap_main(int argc, char **argv) 137 | { 138 | resultFile = fopen("sassi-ophist-fermi.txt", "w"); 139 | 140 | int ret = __real_main(argc, argv); 141 | collect_and_dump_counters("before program implicitly ended"); 142 | fclose(resultFile); 143 | return ret; 144 | } 145 | 146 | 147 | /////////////////////////////////////////////////////////////////////////////////// 148 | /// 149 | /// We can still use a CUPTI callback to find out when the device has been 150 | /// explicitly reset with a call to cudaDeviceReset(). We want to intercept 151 | /// that call and dump our counters before they are obliterated. 152 | /// 153 | /////////////////////////////////////////////////////////////////////////////////// 154 | void sassi_finalize(sassi::lazy_allocator::device_reset_reason reason) 155 | { 156 | // Unlike in the UVM case, we can't dump the counters at exit because 157 | // there is no way for us to guarantee that the CUDA runtime is still 158 | // available. See issues #4 and #6.
159 | if (reason == sassi::lazy_allocator::device_reset_reason::DEVICE_RESET) { 160 | collect_and_dump_counters("before device explicitly reset"); 161 | } 162 | } 163 | 164 | /////////////////////////////////////////////////////////////////////////////////// 165 | /// 166 | /// Lazily initialize the counters before the first kernel launch. 167 | /// 168 | /////////////////////////////////////////////////////////////////////////////////// 169 | static sassi::lazy_allocator counterInitializer( 170 | /* Initialize the counters. */ 171 | []() { 172 | unsigned long long instr_counts[SASSI_NUM_OPCODES]; 173 | bzero(instr_counts, sizeof(instr_counts)); 174 | // Initialize the array we allocated on the device. 175 | CHECK_CUDA_ERROR(cudaMemcpyToSymbol(dynamic_instr_counts, &instr_counts, sizeof(instr_counts))); 176 | }, sassi_finalize); 177 | 178 | -------------------------------------------------------------------------------- /instlibs/src/stubs.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 
18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This file just contains stubbed out instrumentation callback functions. 31 | * It's useful for debugging and measuring the baseline overhead of 32 | * instrumentation. 33 | * 34 | \***********************************************************************************/ 35 | 36 | #include 37 | #include 38 | 39 | /////////////////////////////////////////////////////////////////////////////////// 40 | /// 41 | /// This is a SASSI handler that handles only basic information about each 42 | /// instrumented instruction. The calls to this handler are placed by 43 | /// convention *before* each instrumented instruction. 
44 | /// 45 | /////////////////////////////////////////////////////////////////////////////////// 46 | __device__ void sassi_before_handler(SASSIBeforeParams* bp) 47 | { 48 | } 49 | 50 | -------------------------------------------------------------------------------- /instlibs/src/valueprof.cu: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This example computes, for each instruction, for each destination operand, 31 | * for each bit in the operand, whether or not the bit is constant over all 32 | * threads and throughout the course of execution of the program. It also 33 | * tracks whether an operand is scalar. 34 | * 35 | * The example is based on case study III in the paper, 36 | * 37 | * "Flexible Software Profiling of GPU Architectures" 38 | * Stephenson et al., ISCA 2015. 39 | * 40 | * The application code the user instruments should be instrumented with the 41 | * following SASSI flag: -Xptxas --sassi-inst-after="reg-writes" 42 | * -Xptxas --sassi-after-args="reg-info" 43 | * [optional] -Xptxas --sassi-iff-true-predicate-handler-call 44 | * 45 | \***********************************************************************************/ 46 | 47 | #define __STDC_FORMAT_MACROS 48 | #include 49 | #include 50 | #include 51 | #include 52 | #include 53 | #include 54 | #include 55 | #include 56 | #include 57 | #include 58 | #include "sassi_intrinsics.h" 59 | #include "sassi_dictionary.hpp" 60 | #include "sassi_lazyallocator.hpp" 61 | #include 62 | #include 63 | 64 | 65 | /////////////////////////////////////////////////////////////////////////////////// 66 | /// 67 | /// Each SASS operation has a number of DSTOperands. This class 68 | /// keeps track of the stats for a given operand. 
69 | /// 70 | /////////////////////////////////////////////////////////////////////////////////// 71 | class DSTOperand { 72 | public: 73 | /////////////////////////////////////////////////////////////////// 74 | /// 75 | /// Initialize the counters for this operand. 76 | /// 77 | /////////////////////////////////////////////////////////////////// 78 | static __device__ void init(DSTOperand *op) 79 | { 80 | op->isScalar = -1; 81 | op->constantOnes = -1; 82 | op->constantZeros = -1; 83 | } 84 | 85 | /////////////////////////////////////////////////////////////////// 86 | /// 87 | /// Prints the bitvector to the given file. 88 | /// 89 | /////////////////////////////////////////////////////////////////// 90 | void print_bits(FILE *f) const 91 | { 92 | for (int bn = 31; bn >= 0; bn--) { 93 | int oneBit = ((constantOnes >> bn) & 0x1); 94 | int zeroBit = ((constantZeros >> bn) & 0x1); 95 | if (oneBit == 0 && zeroBit == 0) { 96 | fprintf(f, "T"); 97 | } 98 | else if (oneBit == 0 && zeroBit == 1) { 99 | fprintf(f, "0"); 100 | } 101 | else if (oneBit == 1 && zeroBit == 0) { 102 | fprintf(f, "1"); 103 | } 104 | else if (oneBit == 1 && zeroBit == 1) { 105 | fprintf(f, "X"); 106 | } 107 | } 108 | } 109 | 110 | /////////////////////////////////////////////////////////////////// 111 | /// 112 | /// Prints the operand stats to the given file. 113 | /// 114 | /////////////////////////////////////////////////////////////////// 115 | void print(FILE *f) 116 | { 117 | fprintf(f, "[%d, \"%s\", %s, [", 118 | regNum, 119 | SASSITypeAsString[regType], 120 | isScalar ? "SCALAR " : "VARIANT"); 121 | print_bits(f); 122 | fprintf(f, "]]"); 123 | } 124 | 125 | int regNum; 126 | int isScalar; 127 | SASSIType regType; 128 | int constantOnes; 129 | int constantZeros; 130 | }; 131 | 132 | 133 | /////////////////////////////////////////////////////////////////////////////////// 134 | /// 135 | /// Keep the statistics for each SASS operation. 
136 | /// 137 | /////////////////////////////////////////////////////////////////////////////////// 138 | class SASSOp { 139 | public: 140 | // This is a bit dirty, and deserves some explanation. Until 141 | // device-side allocation of memory matures, we are going to 142 | // pre-allocate space for the SASSOp statistics on the host. 143 | // In so doing, we will need to account for the worst-case 144 | // SASS instruction with regard to number of destination 145 | // operands. 146 | #define MAX_DST_OPERANDS 4 147 | 148 | /////////////////////////////////////////////////////////////////// 149 | /// 150 | /// Initialize the SASSOp passed in. 151 | /// 152 | /////////////////////////////////////////////////////////////////// 153 | __device__ static void init(SASSOp *op, SASSIRegisterParams *rp) 154 | { 155 | op->weight = 0; 156 | op->numDsts = rp->GetNumGPRDsts(); 157 | assert(op->numDsts <= MAX_DST_OPERANDS); 158 | 159 | // Initialize all of the fields appropriately. 160 | for (int i = 0; i < op->numDsts; i++) { 161 | DSTOperand::init(&(op->operands[i])); 162 | SASSIRegisterParams::GPRRegInfo regInfo = rp->GetGPRDst(i); 163 | op->operands[i].regNum = rp->GetRegNum(regInfo); 164 | op->operands[i].regType = rp->GetRegType(regInfo); 165 | } 166 | } 167 | 168 | /////////////////////////////////////////////////////////////////// 169 | /// 170 | /// Prints the operation stats to the given file. 171 | /// 172 | /////////////////////////////////////////////////////////////////// 173 | void print(FILE *f) 174 | { 175 | fprintf(f, "%llu, [", weight); 176 | for (int i = 0; i < numDsts; i++) { 177 | operands[i].print(f); 178 | } 179 | fprintf(f, "]"); 180 | } 181 | 182 | unsigned long long weight; 183 | int numDsts; 184 | DSTOperand operands[MAX_DST_OPERANDS]; 185 | }; 186 | 187 | 188 | /// The actual dictionary, declared as a UVM managed type so we can access it on 189 | /// the device and host.
190 | static __managed__ sassi::dictionary<uint64_t, SASSOp> *sassi_stats; 191 | 192 | 193 | /////////////////////////////////////////////////////////////////////////////////// 194 | /// 195 | /// We will register this function to be called whenever the device is reset, 196 | /// or when the program is about to exit. The function will print out the 197 | /// aggregated statistics. 198 | /// 199 | /////////////////////////////////////////////////////////////////////////////////// 200 | static void sassi_finalize(sassi::lazy_allocator::device_reset_reason reason) 201 | { 202 | struct KVTuple { 203 | uint64_t k; 204 | SASSOp *v; 205 | }; 206 | 207 | FILE *resultsFile = fopen("sassi-valueprof.txt", "w"); 208 | 209 | fprintf(resultsFile, "\nValue profiling results\n"); 210 | fprintf(resultsFile, "ADDRESS | WEIGHT | [regnum, type, scalarness, bitstring]*\n"); 211 | fprintf(resultsFile, "---------------------------------------------------------\n"); 212 | 213 | std::vector<KVTuple> ops; 214 | sassi_stats->map([&ops](uint64_t& key, SASSOp &val) { 215 | ops.push_back({key, &val}); 216 | }); 217 | 218 | std::sort(ops.begin(), ops.end(), [](KVTuple a, const KVTuple b) { 219 | return a.k < b.k; 220 | }); 221 | 222 | for (KVTuple t : ops) { 223 | fprintf(resultsFile, "[%.16" PRIx64 ", ", t.k); 224 | t.v->print(resultsFile); 225 | fprintf(resultsFile, "]\n"); 226 | } 227 | 228 | cudaDeviceSynchronize(); 229 | fclose(resultsFile); 230 | } 231 | 232 | /////////////////////////////////////////////////////////////////////////////////// 233 | /// 234 | /// Lazily allocate a dictionary before the first kernel launch.
235 | /// 236 | /////////////////////////////////////////////////////////////////////////////////// 237 | static sassi::lazy_allocator mapAllocator([]() { 238 | sassi_stats = new sassi::dictionary<uint64_t, SASSOp>(); 239 | }, sassi_finalize); 240 | 241 | 242 | /////////////////////////////////////////////////////////////////////////////////// 243 | // 244 | // This example uses the atomic bitwise operations to keep track of the constant 245 | // bits produced by each instruction. 246 | // 247 | /////////////////////////////////////////////////////////////////////////////////// 248 | __device__ void sassi_after_handler(SASSIAfterParams* ap, SASSIRegisterParams *rp) 249 | { 250 | int threadIdxInWarp = get_laneid(); 251 | int firstActiveThread = (__ffs(__ballot(1))-1); /*leader*/ 252 | 253 | // Get the "probably unique" PC. 254 | uint64_t pupc = ap->GetPUPC(); 255 | 256 | // The dictionary will return the SASSOp associated with this PC, or insert 257 | // it if it does not exist. If it does not exist, the lambda passed as 258 | // the second argument to getOrInit is used to initialize the SASSOp. 259 | SASSOp *stats = sassi_stats->getOrInit(pupc, [&rp](SASSOp *v) { 260 | SASSOp::init(v, rp); 261 | }); 262 | 263 | // Record the number of times the instruction executes. 264 | atomicAdd(&(stats->weight), 1); 265 | for (int d = 0; d < rp->GetNumGPRDsts(); d++) { 266 | // Get the value in each destination register. 267 | SASSIRegisterParams::GPRRegInfo regInfo = rp->GetGPRDst(d); 268 | SASSIRegisterParams::GPRRegValue regVal = rp->GetRegValue(ap, regInfo); 269 | 270 | // Use atomic AND operations to track constant bits. 271 | atomicAnd(&(stats->operands[d].constantOnes), regVal.asInt); 272 | atomicAnd(&(stats->operands[d].constantZeros), ~regVal.asInt); 273 | 274 | int leaderValue = __shfl(regVal.asInt, firstActiveThread); 275 | int allSame = (__all(regVal.asInt == leaderValue) != 0); 276 | // The warp leader gets to write results.
277 | if (threadIdxInWarp == firstActiveThread) { 278 | atomicAnd(&(stats->operands[d].isScalar), allSame); 279 | } 280 | } 281 | } 282 | 283 | 284 | 285 | 286 | -------------------------------------------------------------------------------- /instlibs/utils/sassi_cuptiwrapper.hpp: -------------------------------------------------------------------------------- 1 | /****************************************************************************\ 2 | Copyright (c) 2015, NVIDIA open source projects 3 | All rights reserved. 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | * Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | * Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | * Neither the name of SASSI nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | 30 | This is a wrapper around the CUPTI library. It should eventually supplant 31 | the sassi_lazyallocator. 32 | 33 | \****************************************************************************/ 34 | 35 | #ifndef __SASSI_CUPTIWRAPPER_HPP___ 36 | #define __SASSI_CUPTIWRAPPER_HPP___ 37 | 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #include "sassi_intrinsics.h" 44 | 45 | namespace sassi { 46 | 47 | class cupti_wrapper { 48 | public: 49 | 50 | typedef void (*cupti_cb_type)(cupti_wrapper*, const CUpti_CallbackData*); 51 | 52 | enum class event_type { 53 | MODULE_LOAD, // The driver is loading code onto the device. 54 | MODULE_UNLOAD, // The driver is unloading code from the device. 55 | DEVICE_RESET, // Device state is about to die. 56 | KERNEL_LAUNCH // Kernel enter or exit from launch. 57 | }; 58 | 59 | static const bool callback_before = true; 60 | static const bool callback_after = false; 61 | 62 | ///////////////////////////////////////////////////////////////////////////// 63 | // 64 | // Create a new cupti_wrapper, which essentially just creates a managed 65 | // CUPTI subscriber. 
66 | // 67 | ///////////////////////////////////////////////////////////////////////////// 68 | cupti_wrapper() 69 | { 70 | CHECK_CUPTI_ERROR(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)cupti_cb, this), 71 | "cuptiSubscribe"); 72 | } 73 | 74 | ///////////////////////////////////////////////////////////////////////////// 75 | // 76 | // Destroy our cupti_wrapper by unsubscribing our CUPTI subscriber. 77 | // 78 | ///////////////////////////////////////////////////////////////////////////// 79 | ~cupti_wrapper() 80 | { 81 | CHECK_CUPTI_ERROR(cuptiUnsubscribe(subscriber), "cuptiUnsubscribe"); 82 | } 83 | 84 | ///////////////////////////////////////////////////////////////////////////// 85 | // 86 | // Register for another event type with a new callback function. 87 | // 88 | ///////////////////////////////////////////////////////////////////////////// 89 | void register_callback(event_type cbid, bool before, cupti_cb_type callback) 90 | { 91 | register_callback_h(cbid, before, callback); 92 | } 93 | 94 | protected: 95 | 96 | struct callback_st { 97 | bool callBefore; 98 | cupti_cb_type callbackFn; 99 | }; 100 | 101 | void assign(CUpti_CallbackDomain domain, CUpti_CallbackId cbid, callback_st cbs) 102 | { 103 | uCallbackMap.insert(std::make_pair(cbid, cbs)); 104 | CHECK_CUPTI_ERROR(cuptiEnableCallback(true, subscriber, domain, cbid), 105 | "Problem enabling callback"); 106 | } 107 | 108 | void register_callback_h(event_type cbid, bool before, cupti_cb_type callback) 109 | { 110 | callback_st cbs = { before, callback }; 111 | switch (cbid) { 112 | case event_type::MODULE_LOAD: 113 | assign(CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_MODULE_LOADED, cbs); 114 | break; 115 | case event_type::MODULE_UNLOAD: 116 | assign(CUPTI_CB_DOMAIN_RESOURCE, CUPTI_CBID_RESOURCE_MODULE_UNLOAD_STARTING, cbs); 117 | break; 118 | case event_type::DEVICE_RESET: 119 | assign(CUPTI_CB_DOMAIN_RUNTIME_API, CUPTI_RUNTIME_TRACE_CBID_cudaDeviceReset_v3020, cbs); 120 | 
assign(CUPTI_CB_DOMAIN_RUNTIME_API, CUPTI_RUNTIME_TRACE_CBID_cudaThreadExit_v3020, cbs); 121 | break; 122 | case event_type::KERNEL_LAUNCH: 123 | assign(CUPTI_CB_DOMAIN_RUNTIME_API, CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020, cbs); 124 | break; 125 | } 126 | } 127 | 128 | protected: 129 | 130 | typedef std::multimap<CUpti_CallbackId, callback_st> event_map; 131 | 132 | ///////////////////////////////////////////////////////////////////////////// 133 | // 134 | // Our main CUPTI callback for handling launches and cases where the 135 | // device is reset. 136 | // 137 | ///////////////////////////////////////////////////////////////////////////// 138 | static void CUPTIAPI cupti_cb(void *userdata, 139 | CUpti_CallbackDomain domain, 140 | CUpti_CallbackId cbid, 141 | const CUpti_CallbackData *cbInfo) 142 | { 143 | cupti_wrapper *t = (cupti_wrapper*)userdata; 144 | event_map::const_iterator lb = t->uCallbackMap.lower_bound(cbid); 145 | event_map::const_iterator ub = t->uCallbackMap.upper_bound(cbid); 146 | 147 | while (lb != ub) 148 | { 149 | if (domain == CUPTI_CB_DOMAIN_RUNTIME_API) { 150 | if (((cbInfo->callbackSite == CUPTI_API_ENTER) && lb->second.callBefore) || 151 | ((cbInfo->callbackSite == CUPTI_API_EXIT) && !(lb->second.callBefore))) { 152 | lb->second.callbackFn(t, cbInfo); 153 | } 154 | } else if (domain == CUPTI_CB_DOMAIN_RESOURCE) { 155 | lb->second.callbackFn(t, cbInfo); 156 | } 157 | ++lb; 158 | } 159 | } 160 | 161 | private: 162 | 163 | CUpti_SubscriberHandle subscriber; 164 | event_map uCallbackMap; 165 | }; 166 | 167 | } // End namespace. 168 | 169 | #endif 170 | -------------------------------------------------------------------------------- /instlibs/utils/sassi_dictionary.hpp: -------------------------------------------------------------------------------- 1 | /****************************************************************************\ 2 | Copyright (c) 2015, NVIDIA open source projects 3 | All rights reserved.
4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | * Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | * Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | * Neither the name of SASSI nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | 30 | A dictionary implementation. 
31 | \****************************************************************************/ 32 | 33 | #ifndef __SASSI_DICTIONARY_HPP___ 34 | #define __SASSI_DICTIONARY_HPP___ 35 | 36 | #include 37 | #include 38 | #include "sassi_managed.hpp" 39 | 40 | 41 | #ifdef __CUDACC__ 42 | #define CUDA_CALLABLE __host__ __device__ 43 | #else 44 | #define CUDA_CALLABLE 45 | #endif 46 | 47 | 48 | namespace sassi { 49 | 50 | template <typename K, typename V> 51 | class dictionary: public managed 52 | { 53 | public: 54 | 55 | typedef unsigned size_type; 56 | typedef dictionary<K, V> self_type; 57 | 58 | public: 59 | 60 | ///////////////////////////////////////////////////////////////////////////// 61 | // 62 | // Create a new dictionary. The dictionary is created on the host, and 63 | // uses UVM (cudaMallocManaged) so that we can view the hashtable on the 64 | // host after our kernels finish executing. 65 | // 66 | ///////////////////////////////////////////////////////////////////////////// 67 | dictionary(size_type slots = 1088723, unsigned maxRetries = 8): 68 | m_size(0), m_slots(slots), m_maxRetries(maxRetries) 69 | { 70 | cudaMallocManaged(&m_keytable, sizeof(keytable) * slots); 71 | cudaMallocManaged(&m_valtable, sizeof(V) * slots); 72 | bzero(m_keytable, sizeof(keytable) * slots); 73 | bzero(m_valtable, sizeof(V) * slots); 74 | } 75 | 76 | ///////////////////////////////////////////////////////////////////////////// 77 | // 78 | // Gets a pointer to the value associated with the key. If there is no 79 | // such key in the dictionary, a key-value mapping is created, and the 80 | // "init" function is called to initialize the value. 81 | // 82 | ///////////////////////////////////////////////////////////////////////////// 83 | __device__ V *getOrInit(K key, nvstd::function<void(V*)> init) 84 | { 85 | V *retVal = NULL; 86 | 87 | uint64_t h1 = key * 1076279; 88 | uint64_t h2 = (key * 99809343) + 1; 89 | unsigned attempts = 1; 90 | 91 | // We will try several times to find an empty slot for the key-value 92 | // pair.
On each try, there are three cases we need to consider. 93 | for (; attempts <= m_maxRetries; attempts++) 94 | { 95 | unsigned h = (unsigned)((h1 + attempts * h2) % m_slots); 96 | int32_t *metaPtr = &(m_keytable[h].metadata); 97 | int32_t old = atomicCAS(metaPtr, 0, LOCKED_ENTRY); 98 | 99 | if (old == 0) { 100 | // 1. The entry is empty and we got the lock. 101 | // Let's initialize the key and call the user's init function. 102 | m_keytable[h].key = key; 103 | init(&(m_valtable[h])); 104 | *metaPtr = VALID_ENTRY; 105 | atomicAdd(&m_size, 1); 106 | __threadfence(); 107 | 108 | retVal = &(m_valtable[h]); 109 | break; 110 | } 111 | 112 | if (old == LOCKED_ENTRY) { 113 | // 2. The entry is locked. We need to wait for it to be valid. 114 | // We are guaranteed that all other threads in this warp 115 | // would have already unlocked their entries. 116 | do { 117 | __threadfence(); 118 | old = *metaPtr; 119 | } while (old == LOCKED_ENTRY); 120 | } 121 | 122 | if (old == VALID_ENTRY) { 123 | // 3. The entry is valid according to the metadata. In this case, 124 | // we need to compare the key to that in the table. If it is the 125 | // same, then we can return the value associated with the tuple. 126 | // Otherwise, we need to rehash. 127 | bool compares = (key == m_keytable[h].key); 128 | if (compares) { 129 | retVal = &(m_valtable[h]); 130 | break; 131 | } 132 | } 133 | } 134 | 135 | assert(attempts <= m_maxRetries); 136 | return retVal; 137 | } 138 | 139 | ///////////////////////////////////////////////////////////////////////////// 140 | // 141 | // Return the number of valid entries. 142 | // 143 | ///////////////////////////////////////////////////////////////////////////// 144 | CUDA_CALLABLE size_type size() const { return m_size; } 145 | 146 | ///////////////////////////////////////////////////////////////////////////// 147 | // 148 | // Returns true if the dictionary is empty, false otherwise.
149 | // 150 | ///////////////////////////////////////////////////////////////////////////// 151 | CUDA_CALLABLE bool empty() const { return m_size == 0; } 152 | 153 | ///////////////////////////////////////////////////////////////////////////// 154 | // 155 | // Applies 'fun' to every key and value in the dictionary. 156 | // 157 | ///////////////////////////////////////////////////////////////////////////// 158 | void map(std::function<void(K&, V&)> fun) 159 | { 160 | for (unsigned entry = 0; entry < m_slots; entry++) { 161 | if (m_keytable[entry].metadata == VALID_ENTRY) { 162 | fun(m_keytable[entry].key, m_valtable[entry]); 163 | } 164 | } 165 | } 166 | 167 | ///////////////////////////////////////////////////////////////////////////// 168 | // 169 | // Empties the dictionary. 170 | // 171 | ///////////////////////////////////////////////////////////////////////////// 172 | void clear() 173 | { 174 | bzero(m_keytable, sizeof(keytable) * m_slots); 175 | bzero(m_valtable, sizeof(V) * m_slots); 176 | m_size = 0; 177 | } 178 | 179 | protected: 180 | 181 | ///////////////////////////////////////////////////////////////////////////// 182 | // 183 | // This host-side destructor frees up the dynamically allocated memory 184 | // on the device. 185 | // 186 | ///////////////////////////////////////////////////////////////////////////// 187 | virtual ~dictionary() 188 | { 189 | cudaFree(m_keytable); 190 | cudaFree(m_valtable); 191 | } 192 | 193 | protected: 194 | 195 | struct keytable { 196 | int32_t metadata; 197 | K key; 198 | }; 199 | 200 | const int32_t UNLOCKED_ENTRY = 0x0; 201 | const int32_t VALID_ENTRY = 0x1; 202 | const int32_t LOCKED_ENTRY = 0x2; 203 | 204 | size_type m_size; 205 | const unsigned m_slots; 206 | const unsigned m_maxRetries; 207 | 208 | keytable *m_keytable; 209 | V *m_valtable; 210 | }; 211 | } // End namespace.
212 | 213 | #endif /* __SASSI_DICTIONARY_HPP__ */ 214 | -------------------------------------------------------------------------------- /instlibs/utils/sassi_intrinsics.h: -------------------------------------------------------------------------------- 1 | /****************************************************************************\ 2 | Copyright (c) 2015, NVIDIA open source projects 3 | All rights reserved. 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | * Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | * Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | * Neither the name of SASSI nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
29 | 30 | Some functions that are common enough that they should be intrinsics. 31 | \****************************************************************************/ 32 | 33 | #ifndef __INTRINSICS_H____ 34 | #define __INTRINSICS_H____ 35 | 36 | #include 37 | 38 | ///////////////////////////////////////////////////////////////////////////// 39 | // 40 | // Check for a CUDA error. 41 | // 42 | ///////////////////////////////////////////////////////////////////////////// 43 | #define CHECK_CUDA_ERROR(err) \ 44 | if (err != cudaSuccess) { \ 45 | printf("Error: %s\n", cudaGetErrorString(err)); \ 46 | } 47 | 48 | ///////////////////////////////////////////////////////////////////////////// 49 | // 50 | // Check for a CUPTI error. 51 | // 52 | ///////////////////////////////////////////////////////////////////////////// 53 | #define CHECK_CUPTI_ERROR(err, cuptifunc) \ 54 | if (err != CUPTI_SUCCESS) \ 55 | { \ 56 | const char *errstr; \ 57 | cuptiGetResultString(err, &errstr); \ 58 | printf ("%s:%d:Error %s for CUPTI API function '%s'.\n", \ 59 | __FILE__, __LINE__, errstr, cuptifunc); \ 60 | exit(-1); \ 61 | } 62 | 63 | ///////////////////////////////////////////////////////////////////////////// 64 | // 65 | // Get a thread's CTA ID. 66 | // 67 | ///////////////////////////////////////////////////////////////////////////// 68 | __device__ __forceinline__ int4 get_ctaid(void) { 69 | int4 ret; 70 | asm("mov.u32 %0, %ctaid.x;" : "=r"(ret.x)); 71 | asm("mov.u32 %0, %ctaid.y;" : "=r"(ret.y)); 72 | asm("mov.u32 %0, %ctaid.z;" : "=r"(ret.z)); 73 | return ret; 74 | } 75 | 76 | ///////////////////////////////////////////////////////////////////////////// 77 | // 78 | // Get the number of CTA ids per grid. 
79 | // 80 | ///////////////////////////////////////////////////////////////////////////// 81 | __device__ __forceinline__ int4 get_nctaid(void) { 82 | int4 ret; 83 | asm("mov.u32 %0, %nctaid.x;" : "=r"(ret.x)); 84 | asm("mov.u32 %0, %nctaid.y;" : "=r"(ret.y)); 85 | asm("mov.u32 %0, %nctaid.z;" : "=r"(ret.z)); 86 | return ret; 87 | } 88 | 89 | ///////////////////////////////////////////////////////////////////////////// 90 | // 91 | // Get a thread's SM ID. 92 | // 93 | ///////////////////////////////////////////////////////////////////////////// 94 | __device__ __forceinline__ unsigned int get_smid(void) { 95 | unsigned int ret; 96 | asm("mov.u32 %0, %smid;" : "=r"(ret) ); 97 | return ret; 98 | } 99 | 100 | ///////////////////////////////////////////////////////////////////////////// 101 | // 102 | // Get a thread's warp ID. 103 | // 104 | ///////////////////////////////////////////////////////////////////////////// 105 | __device__ __forceinline__ unsigned int get_warpid(void) { 106 | unsigned int ret; 107 | asm("mov.u32 %0, %warpid;" : "=r"(ret) ); 108 | return ret; 109 | } 110 | 111 | ///////////////////////////////////////////////////////////////////////////// 112 | // 113 | // Get a thread's lane ID. 114 | // 115 | ///////////////////////////////////////////////////////////////////////////// 116 | __device__ __forceinline__ unsigned int get_laneid(void) { 117 | unsigned int laneid; 118 | asm volatile ("mov.u32 %0, %laneid;" : "=r"(laneid)); 119 | return laneid; 120 | } 121 | 122 | ///////////////////////////////////////////////////////////////////////////// 123 | // 124 | // Returns true if the pointer points to shared memory. Similar to the 125 | // isGlobal() CUDA intrinsic. 
126 | // 127 | ///////////////////////////////////////////////////////////////////////////// 128 | __device__ unsigned int __isShared(const void *ptr) 129 | { 130 | unsigned int ret; 131 | asm volatile ("{ \n\t" 132 | " .reg .pred p; \n\t" 133 | " isspacep.shared p, %1; \n\t" 134 | " selp.u32 %0, 1, 0, p; \n\t" 135 | #if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__) 136 | "} \n\t" : "=r"(ret) : "l"(ptr)); 137 | #else 138 | "} \n\t" : "=r"(ret) : "r"(ptr)); 139 | #endif 140 | 141 | return ret; 142 | } 143 | 144 | ///////////////////////////////////////////////////////////////////////////// 145 | // 146 | // Returns true if the pointer points to local memory. Similar to the 147 | // isGlobal() CUDA intrinsic. 148 | // 149 | ///////////////////////////////////////////////////////////////////////////// 150 | __device__ unsigned int __isLocal(const void *ptr) 151 | { 152 | unsigned int ret; 153 | asm volatile ("{ \n\t" 154 | " .reg .pred p; \n\t" 155 | " isspacep.local p, %1; \n\t" 156 | " selp.u32 %0, 1, 0, p; \n\t" 157 | #if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__) 158 | "} \n\t" : "=r"(ret) : "l"(ptr)); 159 | #else 160 | "} \n\t" : "=r"(ret) : "r"(ptr)); 161 | #endif 162 | 163 | return ret; 164 | } 165 | 166 | ///////////////////////////////////////////////////////////////////////////// 167 | // 168 | // Returns true if the pointer points to constant memory. Similar to the 169 | // isGlobal() CUDA intrinsic. 
170 | // 171 | ///////////////////////////////////////////////////////////////////////////// 172 | __device__ unsigned int __isConst(const void *ptr) 173 | { 174 | unsigned int ret; 175 | asm volatile ("{ \n\t" 176 | " .reg .pred p; \n\t" 177 | " isspacep.const p, %1; \n\t" 178 | " selp.u32 %0, 1, 0, p; \n\t" 179 | #if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__) 180 | "} \n\t" : "=r"(ret) : "l"(ptr)); 181 | #else 182 | "} \n\t" : "=r"(ret) : "r"(ptr)); 183 | #endif 184 | 185 | return ret; 186 | } 187 | 188 | ///////////////////////////////////////////////////////////////////////////// 189 | // 190 | // A semi-generic warp broadcast function. 191 | // 192 | ///////////////////////////////////////////////////////////////////////////// 193 | template <typename T> 194 | __device__ T __broadcast(T t, int fromWhom) 195 | { 196 | union { 197 | int32_t shflVals[sizeof(T)]; 198 | T t; 199 | } p; 200 | 201 | p.t = t; 202 | #pragma unroll 203 | for (int i = 0; i < sizeof(T); i++) { 204 | int32_t shfl = (int32_t)p.shflVals[i]; 205 | p.shflVals[i] = __shfl(shfl, fromWhom); 206 | } 207 | return p.t; 208 | } 209 | 210 | #endif 211 | -------------------------------------------------------------------------------- /instlibs/utils/sassi_lazyallocator.hpp: -------------------------------------------------------------------------------- 1 | /****************************************************************************\ 2 | Copyright (c) 2015, NVIDIA open source projects 3 | All rights reserved. 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | * Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer.
10 | 11 | * Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | * Neither the name of SASSI nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | 30 | This class allows us to lazily call an allocation/initialization 31 | function once and only once before the first kernel launch. It uses 32 | CUPTI to catch thread launches, and on the first thread launch, it 33 | calls the callback the user registered with. We do this because for 34 | SASSI libraries we generally don't have control over when things are 35 | initialized. If we try to do CUDA or CUPTI related things in a static 36 | constructor, it is very possible that the CUDA runtime has not even 37 | been initialized. 38 | This class also provides an interface to call a function before and/or after 39 | kernel calls. 
40 | \****************************************************************************/ 41 | 42 | #ifndef __SASSI_LAZYALLOCATOR_HPP___ 43 | #define __SASSI_LAZYALLOCATOR_HPP___ 44 | 45 | #include 46 | #include 47 | #include 48 | #include 49 | #include "sassi_intrinsics.h" 50 | 51 | namespace sassi { 52 | 53 | ///////////////////////////////////////////////////////////////////////////// 54 | // 55 | // Lazily calls an allocation/initialization function once before the 56 | // first kernel launch. 57 | // 58 | ///////////////////////////////////////////////////////////////////////////// 59 | class lazy_allocator { 60 | public: 61 | 62 | enum class device_reset_reason { 63 | DEVICE_RESET, 64 | PROGRAM_EXIT 65 | }; 66 | 67 | typedef lazy_allocator self_type; 68 | typedef void (*device_init_cb_type)(); 69 | typedef void (*device_reset_cb_type)(device_reset_reason); 70 | typedef void (*kernel_ee_cb_type)(const CUpti_CallbackData*); 71 | 72 | ///////////////////////////////////////////////////////////////////////////// 73 | // 74 | // Registers the user's callbacks. 75 | // 76 | ///////////////////////////////////////////////////////////////////////////// 77 | lazy_allocator(device_init_cb_type init_cb, 78 | device_reset_cb_type reset_cb, 79 | kernel_ee_cb_type kernel_entry_cb = [](const CUpti_CallbackData*){}, 80 | kernel_ee_cb_type kernel_exit_cb = [](const CUpti_CallbackData*){}): 81 | init_cb(init_cb), reset_cb(reset_cb), 82 | entry_cb(kernel_entry_cb), exit_cb(kernel_exit_cb), 83 | valid_data(false) 84 | { 85 | CHECK_CUPTI_ERROR(cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)cupti_cb, this), 86 | "cuptiSubscribe"); 87 | setupLaunchCB(); 88 | setupResetCB(); 89 | } 90 | 91 | ~lazy_allocator() 92 | { 93 | if (valid_data && reset_cb) { 94 | reset_cb(device_reset_reason::PROGRAM_EXIT); 95 | } 96 | } 97 | 98 | protected: 99 | 100 | ///////////////////////////////////////////////////////////////////////////// 101 | // 102 | // Sets up a one-time callback for kernel launches. 
If the device is 103 | // reset, then this callback will be set up again. 104 | // 105 | ///////////////////////////////////////////////////////////////////////////// 106 | void setupLaunchCB() 107 | { 108 | if (init_cb) { 109 | CHECK_CUPTI_ERROR(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API, 110 | CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020), 111 | "Problem enabling callback"); 112 | } 113 | } 114 | 115 | ///////////////////////////////////////////////////////////////////////////// 116 | // 117 | // Sets up callbacks for when the device is reset. 118 | // 119 | ///////////////////////////////////////////////////////////////////////////// 120 | void setupResetCB() 121 | { 122 | if (reset_cb) { 123 | CHECK_CUPTI_ERROR(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API, 124 | CUPTI_RUNTIME_TRACE_CBID_cudaDeviceReset_v3020), 125 | "Problem enabling callback"); 126 | CHECK_CUPTI_ERROR(cuptiEnableCallback(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API, 127 | CUPTI_RUNTIME_TRACE_CBID_cudaThreadExit_v3020), 128 | "Problem enabling callback"); 129 | } 130 | } 131 | 132 | ///////////////////////////////////////////////////////////////////////////// 133 | // 134 | // Our main CUPTI callback for handling launches and cases where the 135 | // device is reset.
136 | // 137 | ///////////////////////////////////////////////////////////////////////////// 138 | static void CUPTIAPI cupti_cb(void *userdata, 139 | CUpti_CallbackDomain domain, 140 | CUpti_CallbackId cbid, 141 | const CUpti_CallbackData *cbInfo) 142 | { 143 | self_type *ld = (self_type*) userdata; 144 | 145 | if ((cbid == CUPTI_RUNTIME_TRACE_CBID_cudaDeviceReset_v3020) || 146 | (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaThreadExit_v3020)) 147 | { 148 | if (ld->valid_data && ld->reset_cb) { 149 | ld->reset_cb(device_reset_reason::DEVICE_RESET); 150 | ld->valid_data = false; 151 | } 152 | } 153 | else if (cbid == CUPTI_RUNTIME_TRACE_CBID_cudaLaunch_v3020) 154 | { 155 | cudaDeviceSynchronize(); 156 | 157 | // Call the init_cb function only once. 158 | if(!ld->valid_data) { 159 | ld->valid_data = true; 160 | ld->init_cb(); 161 | } 162 | 163 | if (cbInfo->callbackSite == CUPTI_API_ENTER) { 164 | // Call the user defined entry_cb function on every kernel entry 165 | ld->entry_cb(cbInfo); 166 | } else if (cbInfo->callbackSite == CUPTI_API_EXIT) { 167 | // Call the user defined exit_cb function on every kernel exit 168 | ld->exit_cb(cbInfo); 169 | } 170 | } 171 | } 172 | 173 | private: 174 | 175 | CUpti_SubscriberHandle subscriber; 176 | bool valid_data; 177 | const device_init_cb_type init_cb; 178 | const device_reset_cb_type reset_cb; 179 | const kernel_ee_cb_type entry_cb; 180 | const kernel_ee_cb_type exit_cb; 181 | }; 182 | 183 | } // End namespace. 184 | 185 | #endif 186 | -------------------------------------------------------------------------------- /instlibs/utils/sassi_managed.hpp: -------------------------------------------------------------------------------- 1 | /****************************************************************************\ 2 | Copyright (c) 2015, NVIDIA open source projects 3 | All rights reserved. 
4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | * Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | * Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | * Neither the name of SASSI nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | 30 | Allows our classes to be managed by UVM, which frees us from the 31 | tedium of explicitly copying data back and forth. 
32 | \****************************************************************************/ 33 | 34 | #ifndef __SASSI_MANAGED_HPP___ 35 | #define __SASSI_MANAGED_HPP___ 36 | 37 | namespace sassi { 38 | 39 | ///////////////////////////////////////////////////////////////////////////// 40 | // 41 | // Use UVM for host-side allocations that we want accessible on the device 42 | // as well. 43 | // 44 | ///////////////////////////////////////////////////////////////////////////// 45 | class managed { 46 | public: 47 | void *operator new(size_t len) { 48 | void *ptr; 49 | cudaMallocManaged(&ptr, len); 50 | return ptr; 51 | } 52 | 53 | void operator delete(void *ptr) { 54 | cudaFree(ptr); 55 | } 56 | }; 57 | 58 | } // End namespace. 59 | 60 | #endif 61 | -------------------------------------------------------------------------------- /instlibs/utils/sassi_srcmap.hpp: -------------------------------------------------------------------------------- 1 | /*********************************************************************************** \ 2 | * Copyright (c) 2015, NVIDIA open source projects 3 | * All rights reserved. 4 | * 5 | * Redistribution and use in source and binary forms, with or without 6 | * modification, are permitted provided that the following conditions are met: 7 | * 8 | * - Redistributions of source code must retain the above copyright notice, this 9 | * list of conditions and the following disclaimer. 10 | * 11 | * - Redistributions in binary form must reproduce the above copyright notice, 12 | * this list of conditions and the following disclaimer in the documentation 13 | * and/or other materials provided with the distribution. 14 | * 15 | * - Neither the name of SASSI nor the names of its 16 | * contributors may be used to endorse or promote products derived from 17 | * this software without specific prior written permission. 
18 | * 19 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | * 30 | * This header file provides support for mapping SASSI's PUPC (program counter) 31 | * to source code lines. The class, src_mapper, derives from cupti_wrapper, 32 | * which allows us to intercept CUDA runtime events. In particular, src_mapper 33 | * tracks "module loads", "kernel launches", and "device resets". When the 34 | * module is loaded, src_mapper gets the "cubin" (the device-side object file), 35 | * disassembles it via "nvdisasm", and then parses the result to get source 36 | * mappings. 37 | * 38 | * Dependencies: 1) Be sure to have "nvdisasm" in your path. This is often in 39 | * /usr/local/cuda7/bin or you can use the version in your 40 | * SASSI install location. 41 | * 2) OpenSSL crypto library (libcrypto): To get a hash of the 42 | * function name, which SASSI uses to generate PUPCs. 43 | * 3) Boost regex: To parse the disassembled cubin.
44 | * 45 | * Be sure to link with -lcrypto -lboost_regex 46 | * 47 | \***********************************************************************************/ 48 | 49 | #ifndef __SASSI_SRCMAP_HPP___ 50 | #define __SASSI_SRCMAP_HPP___ 51 | 52 | #include 53 | #include 54 | #include 55 | #include 56 | #include 57 | #include 58 | #include 59 | #include 60 | #include 61 | #include "sassi_cuptiwrapper.hpp" 62 | 63 | namespace sassi { 64 | 65 | struct loc_info { 66 | std::string sass; // The actual SASS at this location. 67 | const std::string *file_name; // A pointer to the filename. 68 | const std::string *function_name; // A pointer to the function name. 69 | unsigned line_num; // The line of source. 70 | }; 71 | 72 | class src_mapper : public cupti_wrapper { 73 | public: 74 | 75 | //////////////////////////////////////////////////////////////////////////////// 76 | // 77 | // Creates a src_mapper, which maps a SASSI PUPC to a line of CUDA source. 78 | // As usual, to have access to the line information embedded in the code, 79 | // you must compile with the "-lineinfo" option. 80 | // 81 | //////////////////////////////////////////////////////////////////////////////// 82 | src_mapper(): device_state_valid(false) 83 | { 84 | register_callback(event_type::MODULE_LOAD, callback_after, dump_pc_mapping); 85 | register_callback(event_type::KERNEL_LAUNCH, callback_before, set_valid_data); 86 | register_callback(event_type::DEVICE_RESET, callback_after, reset_valid_data); 87 | } 88 | 89 | //////////////////////////////////////////////////////////////////////////////// 90 | // 91 | // Gets an ordered map (std::map) from PUPCs to locations.
92 | // 93 | //////////////////////////////////////////////////////////////////////////////// 94 | const std::map<uint64_t, loc_info>& get_location_map() const { return location_map; } 95 | 96 | //////////////////////////////////////////////////////////////////////////////// 97 | // 98 | // The source mapper keeps track of whether state you have stored in device 99 | // memory should be valid. If the return is true, your device-side state 100 | // (e.g., SASSI counters) should be valid; otherwise, your device-side state 101 | // is probably garbage. 102 | // 103 | //////////////////////////////////////////////////////////////////////////////// 104 | bool is_device_state_valid() const { return device_state_valid; } 105 | 106 | protected: 107 | 108 | //////////////////////////////////////////////////////////////////////////////// 109 | // 110 | // Uses CUPTI's facilities to dump the CUBIN associated with this module. 111 | // 112 | //////////////////////////////////////////////////////////////////////////////// 113 | static std::string dump_cumodule(void *resourceDescriptor) 114 | { 115 | const char *pCubin; 116 | size_t cubinSize; 117 | 118 | CUpti_ModuleResourceData *moduleResourceData = 119 | (CUpti_ModuleResourceData *)resourceDescriptor; 120 | 121 | std::string cubinFileName = tmpnam(nullptr); 122 | pCubin = moduleResourceData->pCubin; 123 | cubinSize = moduleResourceData->cubinSize; 124 | 125 | FILE *cubin; 126 | cubin = fopen(cubinFileName.c_str(), "wb"); 127 | fwrite(pCubin, sizeof(uint8_t), cubinSize, cubin); 128 | fclose(cubin); 129 | return cubinFileName; 130 | } 131 | 132 | 133 | //////////////////////////////////////////////////////////////////////////////// 134 | // 135 | // Uses off-the-shelf NVIDIA commands to disassemble the CUBIN and dump it 136 | // to a file. Make sure that nvdisasm is in your path. It will be in the 137 | // same location as nvcc, ptxas, etc.
138 | // 139 | //////////////////////////////////////////////////////////////////////////////// 140 | static std::string create_sass(const std::string &fname) 141 | { 142 | std::string sfname = fname + ".sass"; 143 | std::string command = "nvdisasm -g -ndf -c " + fname + " > " + sfname; 144 | 145 | int status = system(command.c_str()); 146 | if (status != 0) { 147 | fprintf(stderr, "::: ERROR: Make sure nvdisasm is in your PATH :::\n"); 148 | assert(0); 149 | } 150 | 151 | return sfname; 152 | } 153 | 154 | 155 | //////////////////////////////////////////////////////////////////////////////// 156 | // 157 | // Parses the dumped file to get the SASS correlations. 158 | // 159 | //////////////////////////////////////////////////////////////////////////////// 160 | void get_sass_corrs(std::string sass_filename) 161 | { 162 | std::ifstream ifs(sass_filename.c_str()); 163 | boost::regex funNameRegex("\\.text\\.(\\w+):"); 164 | boost::regex fileNameRegex("\\s*\\/\\/## File \"([^\"]+)\", line (\\d+)$"); 165 | boost::regex SASSLineRegex("\\s*\\/\\*([0-9a-f]+)\\*\\/\\s+([^$]+)$"); 166 | 167 | uint64_t addrBase = 0; 168 | std::string functionName; 169 | std::string fileName = "Location not known"; 170 | std::string lineNumber = "-1"; 171 | std::string line; 172 | 173 | while (getline(ifs, line)) 174 | { 175 | // Check for a ".text.NAME:" label and set functionName to NAME. 176 | // The first 32 bits of the SHA1 hash of functionName form the PUPC's upper half. 177 | boost::cmatch match; 178 | if (boost::regex_match(line.c_str(), match, funNameRegex)) 179 | { 180 | unsigned char md[SHA_DIGEST_LENGTH]; 181 | functionName = std::string(match[1].first, match[1].second); 182 | if (SHA1((const unsigned char*)functionName.c_str(), 183 | functionName.length(), md)) { 184 | void *mdv = (void*)md; 185 | uint32_t v = *((uint32_t*)mdv); 186 | addrBase = ((uint64_t)v << 32); 187 | } else { 188 | fprintf(stderr, "::: ERROR: During SHA1 hashing :::\n"); 189 | assert(0); 190 | } 191 | } 192 | 193 | // Get the file name and line number.
194 | if (boost::regex_match(line.c_str(), match, fileNameRegex)) 195 | { 196 | fileName = std::string(match[1].first, match[1].second); 197 | lineNumber = std::string(match[2].first, match[2].second); 198 | } 199 | 200 | // Gets the canonical file and function name. 201 | std::pair<std::set<std::string>::iterator,bool> file_it = 202 | file_names.insert(fileName); 203 | std::pair<std::set<std::string>::iterator,bool> function_it = 204 | function_names.insert(functionName); 205 | 206 | // Look for /*[a-f0-9]+*/, put the hex value in offset 207 | // Also, extract the SASS, which is everything after this match 208 | if (boost::regex_match(line.c_str(), match, SASSLineRegex)) 209 | { 210 | loc_info info; 211 | std::stringstream offsetStream; 212 | uint64_t offset, pupc; 213 | 214 | // Compute the PUPC using hash and offset. 215 | std::string offsetString = std::string(match[1].first, match[1].second); 216 | offsetStream << std::hex << offsetString; 217 | offsetStream >> offset; 218 | pupc = addrBase | offset; 219 | 220 | // Put the loc and SASS in the map using PUPC as the key. 221 | info.sass = std::string(match[2].first, match[2].second); 222 | info.file_name = &(*(file_it.first)); 223 | info.function_name = &(*(function_it.first)); 224 | info.line_num = std::stoi(lineNumber); 225 | location_map.insert(std::pair<uint64_t, loc_info>(pupc, info)); 226 | } 227 | } 228 | } 229 | 230 | //////////////////////////////////////////////////////////////////////////////// 231 | // 232 | // When the module is loaded, this callback will be called and we can get 233 | // the source code mappings.
234 | // 235 | //////////////////////////////////////////////////////////////////////////////// 236 | static void 237 | dump_pc_mapping(cupti_wrapper* wrapper, const CUpti_CallbackData *cbdata) 238 | { 239 | CUpti_ResourceData *rdata = (CUpti_ResourceData*)cbdata; 240 | std::string cubinFileName = dump_cumodule(rdata->resourceDescriptor); 241 | std::string sassFileName = create_sass(cubinFileName); 242 | ((src_mapper*)wrapper)->get_sass_corrs(sassFileName); 243 | } 244 | 245 | //////////////////////////////////////////////////////////////////////////////// 246 | // 247 | // Go ahead and set the device_state_valid flag to true when a kernel is 248 | // launched. 249 | // 250 | //////////////////////////////////////////////////////////////////////////////// 251 | static void 252 | set_valid_data(cupti_wrapper* wrapper, const CUpti_CallbackData *cbdata) 253 | { 254 | ((src_mapper*)wrapper)->device_state_valid = true; 255 | } 256 | 257 | //////////////////////////////////////////////////////////////////////////////// 258 | // 259 | // This callback is fired when the device is reset. We set the 260 | // device_state_valid flag to false. 261 | // 262 | //////////////////////////////////////////////////////////////////////////////// 263 | static void 264 | reset_valid_data(cupti_wrapper* wrapper, const CUpti_CallbackData *cbdata) 265 | { 266 | ((src_mapper*)wrapper)->device_state_valid = false; 267 | } 268 | 269 | protected: 270 | 271 | bool device_state_valid; 272 | std::set<std::string> file_names; 273 | std::set<std::string> function_names; 274 | std::map<uint64_t, loc_info> location_map; 275 | }; 276 | 277 | } 278 | 279 | #endif 280 | -------------------------------------------------------------------------------- /instlibs/utils/utiltest/Makefile: -------------------------------------------------------------------------------- 1 | 2 | include ../../env.mk 3 | 4 | NVCC := $(CUDA_HOME)/bin/nvcc 5 | 6 | all: test_dictionary 7 | 8 | # Notice we are testing with maxrregcount set at 16. This is important for SASSI.
9 | # We also need CUPTI for the lazy allocator that we are testing. 10 | test_dictionary: test_dictionary.cu ../sassi_dictionary.hpp ../sassi_managed.hpp ../sassi_lazyallocator.hpp 11 | $(NVCC) --compiler-options -Wall \ 12 | -O3 -lineinfo -ccbin $(CCBIN) -I$(CUDA_HOME)/extras/CUPTI/include \ 13 | -std=c++11 -maxrregcount=16 -rdc=true \ 14 | $(GENCODE) -o $@ test_dictionary.cu -L$(CUDA_HOME)/extras/CUPTI/lib64 -lcupti 15 | 16 | clean: 17 | $(RM) *~ test_dictionary 18 | -------------------------------------------------------------------------------- /instlibs/utils/utiltest/test_dictionary.cu: -------------------------------------------------------------------------------- 1 | /****************************************************************************\ 2 | Copyright (c) 2015, NVIDIA open source projects 3 | All rights reserved. 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions are met: 7 | 8 | * Redistributions of source code must retain the above copyright notice, this 9 | list of conditions and the following disclaimer. 10 | 11 | * Redistributions in binary form must reproduce the above copyright notice, 12 | this list of conditions and the following disclaimer in the documentation 13 | and/or other materials provided with the distribution. 14 | 15 | * Neither the name of SASSI nor the names of its 16 | contributors may be used to endorse or promote products derived from 17 | this software without specific prior written permission. 18 | 19 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 20 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 21 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 22 | DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 23 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 24 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 25 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 26 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 27 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 28 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 29 | 30 | Tests out the dictionary. 31 | \****************************************************************************/ 32 | 33 | #include 34 | #include 35 | #include 36 | #include 37 | #include 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #include "../sassi_dictionary.hpp" 44 | #include "../sassi_intrinsics.h" 45 | #include "../sassi_lazyallocator.hpp" 46 | 47 | typedef sassi::dictionary<int, unsigned> IUMap; 48 | 49 | __managed__ IUMap *map; 50 | 51 | std::map<std::string, int> kCountMap; // count number of kernel invocations 52 | bool verbose = false; 53 | 54 | //////////////////////////////////////////////////////////////////////////////////// 55 | // 56 | // Let's test out the lazy allocator while we are at it, which will let us 57 | // allocate our data structures before the first kernel launch. 58 | // 59 | //////////////////////////////////////////////////////////////////////////////////// 60 | static void mapAllocate() { 61 | map = new IUMap(5800079, 128); 62 | } 63 | 64 | //////////////////////////////////////////////////////////////////////////////////// 65 | // 66 | // User-defined host function to be executed on kernel entry. 67 | // User has access to const CUpti_CallbackData* cbInfo.
68 | // 69 | //////////////////////////////////////////////////////////////////////////////////// 70 | static void onKernelEntry(const CUpti_CallbackData* cbInfo) { 71 | std::string kName = cbInfo->symbolName; 72 | if (kCountMap.find(kName) != kCountMap.end()) { // kernel name found in kCountMap 73 | kCountMap[kName] += 1; 74 | } else { // kernel is seen for the first time 75 | kCountMap[kName] = 1; 76 | } 77 | 78 | if (verbose) { 79 | printf("\nKernelEntry: Name=%s, Invocation count=%d\n", kName.c_str(), kCountMap[kName]); 80 | } 81 | } 82 | 83 | //////////////////////////////////////////////////////////////////////////////////// 84 | // 85 | // User-defined host function to be executed on every kernel exit. 86 | // User has access to const CUpti_CallbackData* cbInfo. 87 | // 88 | //////////////////////////////////////////////////////////////////////////////////// 89 | static void onKernelExit(const CUpti_CallbackData* cbInfo) { 90 | cudaError_t * error = (cudaError_t*) cbInfo->functionReturnValue; 91 | if ( (*error) != cudaSuccess ) { 92 | printf("Kernel Exit Error: %d\n", (*error)); 93 | } 94 | 95 | if (verbose) { 96 | printf("KernelExit: Name=%s, Return value=%d\n", cbInfo->symbolName, *error); 97 | } 98 | } 99 | 100 | //////////////////////////////////////////////////////////////////////////////////// 101 | // 102 | // This user-defined host function will be called after all kernels have been executed. 103 | // This example prints the list of kernel names that were executed during a particular 104 | // run along with the number of invocations.
105 | // 106 | //////////////////////////////////////////////////////////////////////////////////// 107 | static void finalize() { 108 | std::map<std::string, int>::iterator it; 109 | for(it=kCountMap.begin(); it!=kCountMap.end(); ++it) { 110 | printf("Kernel Name: %s, Num invocations: %d\n", it->first.c_str(), it->second); 111 | } 112 | } 113 | 114 | static sassi::lazy_allocator mapAllocator(mapAllocate, finalize, onKernelEntry, onKernelExit); 115 | 116 | 117 | 118 | //////////////////////////////////////////////////////////////////////////////////// 119 | // 120 | // Computes the elapsed wall-clock time between t1 and t2, in 121 | // milliseconds. 122 | // 123 | //////////////////////////////////////////////////////////////////////////////////// 124 | double computeElapsed(timeval &t1, timeval &t2) 125 | { 126 | return ((t2.tv_sec - t1.tv_sec) * 1000.0) 127 | + ((t2.tv_usec - t1.tv_usec) / 1000.0); 128 | } 129 | 130 | //////////////////////////////////////////////////////////////////////////////////// 131 | // 132 | // Inserts a key from 'random' and sets the value of the associated node to 133 | // the thread's tid. 134 | // 135 | //////////////////////////////////////////////////////////////////////////////////// 136 | __global__ void random_key_tid_value_kernel(int *random) 137 | { 138 | unsigned tid = threadIdx.x + blockIdx.x * blockDim.x; 139 | unsigned *t = map->getOrInit(random[tid], [&tid](unsigned *v) { 140 | *v = tid; 141 | }); 142 | assert (t != NULL); 143 | } 144 | 145 | //////////////////////////////////////////////////////////////////////////////////// 146 | // 147 | // Inserts a key from 'random' and increments the value of the associated node.
148 | // 149 | //////////////////////////////////////////////////////////////////////////////////// 150 | __global__ void agg_kernel(int *random) 151 | { 152 | unsigned tid = threadIdx.x + blockIdx.x * blockDim.x; 153 | unsigned *t = map->getOrInit(random[tid], [&tid](unsigned *v) { 154 | *v = 0; 155 | }); 156 | assert (t != NULL); 157 | atomicAdd(t, 1); 158 | } 159 | 160 | //////////////////////////////////////////////////////////////////////////////////// 161 | // 162 | // Inserts M * N random nodes. 163 | // 164 | //////////////////////////////////////////////////////////////////////////////////// 165 | bool randomInsertTest(unsigned M, unsigned N, double &kt) 166 | { 167 | std::vector<int> rngs(N * M); 168 | int *dev_rngs; 169 | timeval t1, t2; 170 | 171 | CHECK_CUDA_ERROR(cudaMalloc((void**)&dev_rngs, M * N * sizeof(int))); 172 | 173 | srand48(0); 174 | for (unsigned i = 0; i < M * N; i++) rngs[i] = lrand48(); 175 | 176 | CHECK_CUDA_ERROR(cudaMemcpy(dev_rngs, &rngs[0], M * N * sizeof(int), cudaMemcpyHostToDevice)); 177 | 178 | gettimeofday(&t1, NULL); 179 | random_key_tid_value_kernel<<<M, N>>>(dev_rngs); 180 | cudaDeviceSynchronize(); 181 | gettimeofday(&t2, NULL); 182 | kt = computeElapsed(t1, t2); 183 | CHECK_CUDA_ERROR(cudaFree(dev_rngs)); 184 | 185 | bool failed = false; 186 | std::set<unsigned> values; 187 | map->map([&](int &k, unsigned &v) { 188 | failed |= (k != rngs[v]); 189 | values.insert(v); 190 | }); 191 | 192 | if (map->size() != values.size()) { 193 | printf("values.size() %zu != map->size() %d\n", values.size(), map->size()); 194 | failed = true; 195 | } 196 | 197 | for (unsigned v : values) { 198 | failed |= (v >= (M * N)); 199 | } 200 | 201 | return !failed; 202 | } 203 | 204 | //////////////////////////////////////////////////////////////////////////////////// 205 | // 206 | // Inserts M * N nodes in sorted order.
207 | // 208 | //////////////////////////////////////////////////////////////////////////////////// 209 | bool linearInsertTest(unsigned M, unsigned N, double &kt) 210 | { 211 | std::vector<int> rngs(N * M); 212 | int *dev_rngs; 213 | timeval t1, t2; 214 | 215 | CHECK_CUDA_ERROR(cudaMalloc((void**)&dev_rngs, M * N * sizeof(int))); 216 | 217 | for (unsigned i = 0; i < M * N; i++) rngs[i] = i; 218 | 219 | CHECK_CUDA_ERROR(cudaMemcpy(dev_rngs, &rngs[0], M * N * sizeof(int), cudaMemcpyHostToDevice)); 220 | 221 | gettimeofday(&t1, NULL); 222 | random_key_tid_value_kernel<<<M, N>>>(dev_rngs); 223 | cudaDeviceSynchronize(); 224 | gettimeofday(&t2, NULL); 225 | kt = computeElapsed(t1, t2); 226 | CHECK_CUDA_ERROR(cudaFree(dev_rngs)); 227 | 228 | bool failed = false; 229 | std::set<unsigned> values; 230 | map->map([&](int &k, unsigned &v) { 231 | failed |= ((k != rngs[v]) || ((unsigned)k != v)); 232 | values.insert(v); 233 | }); 234 | 235 | failed |= map->size() != values.size(); 236 | failed |= map->size() != (M * N); 237 | for (unsigned i = 0; i < M * N; i++) { 238 | failed |= (values.find(i) == values.end()); 239 | } 240 | 241 | return !failed; 242 | } 243 | 244 | //////////////////////////////////////////////////////////////////////////////////// 245 | // 246 | // Inserts at most modulus unique keys in linear order. If M * N > modulus, 247 | // then we will insert some duplicate keys.
248 | // 249 | //////////////////////////////////////////////////////////////////////////////////// 250 | bool linearAggTest(int M, int N, unsigned modulus, double &kt) 251 | { 252 | std::vector<int> rngs(N * M); 253 | std::vector<unsigned> sums(modulus); 254 | int *dev_rngs; 255 | timeval t1, t2; 256 | 257 | CHECK_CUDA_ERROR(cudaMalloc((void**)&dev_rngs, M * N * sizeof(int))); 258 | 259 | for (int i = (M * N) - 1; i >= 0; i--) { 260 | int m = i % modulus; 261 | rngs[i] = m; 262 | sums[m]++; 263 | } 264 | 265 | CHECK_CUDA_ERROR(cudaMemcpy(dev_rngs, &rngs[0], M * N * sizeof(int), cudaMemcpyHostToDevice)); 266 | 267 | gettimeofday(&t1, NULL); 268 | agg_kernel<<<M, N>>>(dev_rngs); 269 | cudaDeviceSynchronize(); 270 | gettimeofday(&t2, NULL); 271 | kt = computeElapsed(t1, t2); 272 | CHECK_CUDA_ERROR(cudaFree(dev_rngs)); 273 | 274 | bool failed = false; 275 | std::set<unsigned> values; 276 | map->map([&](int &k, unsigned &v) { 277 | failed |= (sums[k] != v); 278 | values.insert(k); 279 | }); 280 | 281 | failed |= (modulus != values.size()); 282 | failed |= (modulus != map->size()); 283 | for (unsigned i = 0; i < modulus; i++) { 284 | failed |= (values.find(i) == values.end()); 285 | } 286 | 287 | return !failed; 288 | } 289 | 290 | //////////////////////////////////////////////////////////////////////////////////// 291 | // 292 | // Clears the hash table so we can run the next test. 293 | // 294 | //////////////////////////////////////////////////////////////////////////////////// 295 | void clearTable() 296 | { 297 | cudaDeviceSynchronize(); 298 | map->clear(); 299 | } 300 | 301 | //////////////////////////////////////////////////////////////////////////////////// 302 | // 303 | // Run the tests.
304 | // 305 | //////////////////////////////////////////////////////////////////////////////////// 306 | int main(void) 307 | { 308 | int const tests[][2] = {{1, 4}, {1, 20}, {1, 256}, {1, 1024}, 309 | {4, 1}, {19, 1}, {256, 1}, {1024, 1}, 310 | {4, 4}, {256, 256}, {300, 300}, 311 | {512, 512},{1024, 1024},{1900, 1024}, 312 | {2048, 1}, {2048, 1024},{1000000,2}, 313 | {5679, 234}, {9876, 186}}; 314 | 315 | int passed = 0; 316 | int failed = 0; 317 | int total = 0; 318 | 319 | for (unsigned i = 0; i < sizeof(tests)/sizeof(int[2]); i++) 320 | { 321 | double kernel; 322 | 323 | int M = tests[i][0]; 324 | int N = tests[i][1]; 325 | printf("Testing %dx%d kernels\n", M, N); 326 | 327 | printf("1. Inserting %d random keys: ", M * N); fflush(stdout); 328 | bool rit = randomInsertTest(M, N, kernel); 329 | printf("%s (%f kernel)\n", 330 | rit ? " PASSED" : "FAILED", kernel); 331 | fflush(stdout); 332 | clearTable(); 333 | 334 | printf("2. Inserting %d linear keys: ", M * N); fflush(stdout); 335 | bool lit = linearInsertTest(M, N, kernel); 336 | printf("%s (%f kernel)\n", 337 | lit ? " PASSED" : "FAILED", kernel); 338 | fflush(stdout); 339 | clearTable(); 340 | 341 | printf("3. Inserting %d items with repetition: ", M * N); fflush(stdout); 342 | bool agg = linearAggTest(M, N, min(M, N), kernel); 343 | printf("%s (%f kernel)\n", 344 | agg ? " PASSED" : "FAILED", kernel); 345 | fflush(stdout); 346 | clearTable(); 347 | 348 | total += 3; 349 | if (rit) passed++; else failed++; 350 | if (lit) passed++; else failed++; 351 | if (agg) passed++; else failed++; 352 | } 353 | 354 | printf("Total tests : %d\n", total); 355 | printf("Total passed: %d\n", passed); 356 | printf("Total failed: %d\n", failed); 357 | } 358 | --------------------------------------------------------------------------------
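The PUPC construction in `get_sass_corrs` (sassi_srcmap.hpp) can be studied in isolation. The following is a minimal host-only sketch, not part of SASSI: the helper names `make_pupc` and `parse_sass_line` are hypothetical, `std::regex` is swapped in for Boost.Regex, and the 32-bit hash prefix is passed in as a plain number rather than computed with OpenSSL's `SHA1`. It assumes, as the original code implies, that instruction offsets fit in the lower 32 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <regex>
#include <string>

// Hypothetical helper (not in SASSI): place a 32-bit function-name hash
// prefix in the upper half of the PUPC and the instruction offset in the
// lower half, mirroring `pupc = addrBase | offset` in get_sass_corrs.
inline uint64_t make_pupc(uint32_t hash_prefix, uint64_t offset) {
  // Assumption: offsets stay below 2^32, so they never overlap the hash bits.
  assert(offset <= 0xffffffffULL);
  return (static_cast<uint64_t>(hash_prefix) << 32) | offset;
}

// Hypothetical helper: split a disassembly line shaped like
//   "        /*0048*/    MOV R1, c[0x0][0x44];"
// into its hexadecimal offset and the SASS text that follows it.
// Returns false when the line is not an instruction line.
inline bool parse_sass_line(const std::string &line,
                            uint64_t &offset, std::string &sass) {
  // std::regex stands in for boost::regex here; the pattern follows
  // SASSLineRegex from the header in spirit, not character for character.
  static const std::regex sass_line("\\s*/\\*([0-9a-f]+)\\*/\\s+(.+)$");
  std::smatch m;
  if (!std::regex_match(line, m, sass_line)) return false;
  offset = std::stoull(m[1].str(), nullptr, 16);
  sass = m[2].str();
  return true;
}
```

A caller would feed each disassembly line to `parse_sass_line` and, on success, use `make_pupc(prefix, offset)` as the key of a `std::map<uint64_t, ...>`, much as `location_map` is keyed in the header.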