├── README.md
├── examples
    └── basic.py
├── exebench
    ├── __init__.py
    ├── clib
    │   ├── Makefile
    │   ├── fft_synth
    │   │   ├── lib.c
    │   │   └── lib.h
    │   ├── synthesizer.c
    │   ├── synthesizer.h
    │   └── test
    │   │   ├── Makefile
    │   │   ├── test_fft
    │   │   └── test_fft.c
    └── nlohmann
    │   └── json.hpp
├── legacy
    └── legacy.py
├── requirements.txt
└── requirements_examples.txt


/README.md:
--------------------------------------------------------------------------------
 1 | # ExeBench: an ML-scale dataset of executable C functions
 2 | 
 3 | ExeBench is a dataset of millions of C functions paired with dependencies and metadata such that at least a subset of it can be executed with IO pairs. It is mainly intended for machine learning applications, but it is application-agnostic enough to have other uses.
 4 | Please read the paper for more information: https://dl.acm.org/doi/abs/10.1145/3520312.3534867
 5 | 
 6 | ## Usage
 7 | 
 8 | ### Option 1: Using the helpers in this repo
 9 | 
10 | 
11 | ```
12 | git clone https://github.com/jordiae/exebench.git
13 | cd exebench/
14 | python -m venv venv
15 | source venv/bin/activate
16 | pip install -r requirements_examples.txt
17 | PYTHONPATH="${PYTHONPATH}:$(pwd)" python examples/basic.py
18 | ```
19 | 
20 | ### Option 2: Directly using the Hugging Face Datasets library
21 | 
22 | 
23 | ```
24 | # Requires: pip install datasets zstandard
25 | from datasets import load_dataset
26 | 
27 | # Load dataset split. In this case, synthetic test split
28 | dataset = load_dataset('jordiae/exebench', split='test_synth')
29 | for e in dataset:
30 |   ...
31 | ```
32 | 
33 | ### Option 3: Directly download the dataset
34 | 
35 | Take a look at the files at: https://huggingface.co/datasets/jordiae/exebench/tree/main
36 | The dataset consists of directories compressed with TAR. Inside each TAR, there is a series of JSON Lines files compressed with zstandard.
37 | 
38 | ## Statistics and versions
39 | 
40 | This release corresponds to ExeBench v1.01, a version with some improvements with respect to the original one presented in the paper. The statistics and studies presented in the paper remain consistent with respect to the new ones. The final splits of the new version consist of the following functions:
41 | 
42 | 
43 | ```
44 | train_not_compilable: 2.357M
45 | train_synth_compilable: 2.308373M
46 | train_real_compilable: 0.675074M
47 | train_synth_simple_io: 0.550116M
48 | train_real_simple_io: 0.043769M
49 | train_synth_rich_io: 0.097250M
50 | valid_synth: 5k
51 | valid_real: 2.133k
52 | test_synth: 5k
53 | test_real: 2.134k
54 | ```
55 | 
56 | The original dataset (v1.00) with the exact same data studied in the paper can be accessed on request at: https://huggingface.co/datasets/jordiae/exebench_legacy (please reach out for access)
57 | 
58 | 
59 | ## License
60 | 
61 | All C functions keep the original license as per their original GitHub repository (available in the metadata). All ExeBench contributions (I/O examples, boilerplate to run functions, etc.) are released under an MIT license.
62 | 
63 | ## Citation
64 | 
65 | ```
66 | @inproceedings{10.1145/3520312.3534867,
67 | author = {Armengol-Estap\'{e}, Jordi and Woodruff, Jackson and Brauckmann, Alexander and Magalh\~{a}es, Jos\'{e} Wesley de Souza and O'Boyle, Michael F. P.},
68 | title = {ExeBench: An ML-Scale Dataset of Executable C Functions},
69 | year = {2022},
70 | isbn = {9781450392730},
71 | publisher = {Association for Computing Machinery},
72 | address = {New York, NY, USA},
73 | url = {https://doi.org/10.1145/3520312.3534867},
74 | doi = {10.1145/3520312.3534867},
75 | abstract = {Machine-learning promises to transform compilation and software engineering, yet is frequently limited by the scope of available datasets. In particular, there is a lack of runnable, real-world datasets required for a range of tasks ranging from neural program synthesis to machine learning-guided program optimization. We introduce a new dataset, ExeBench, which attempts to address this. It tackles two key issues with real-world code: references to external types and functions and scalable generation of IO examples. ExeBench is the first publicly available dataset that pairs real-world C code taken from GitHub with IO examples that allow these programs to be run. We develop a toolchain that scrapes GitHub, analyzes the code, and generates runnable snippets of code. We analyze our benchmark suite using several metrics, and show it is representative of real-world code. ExeBench contains 4.5M compilable and 700k executable C functions. This scale of executable, real functions will enable the next generation of machine learning-based programming tasks.},
76 | booktitle = {Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming},
77 | pages = {50–59},
78 | numpages = {10},
79 | keywords = {Code Dataset, Program Synthesis, Mining Software Repositories, C, Machine Learning for Code, Compilers},
80 | location = {San Diego, CA, USA},
81 | series = {MAPS 2022}
82 | }
83 | ```
84 | 
85 | ## Credits
86 | 
87 | We thank the AnghaBench authors for their type-inference-based generation of synthetic dependencies for C functions. This software, Psyche-C, can be found at: https://github.com/ltcmelo/psychec
88 | 
89 | ## Contact
90 | 
91 | ```
92 | jordi.armengol.estape at ed.ac.uk
93 | ```
94 | 


--------------------------------------------------------------------------------
/examples/basic.py:
--------------------------------------------------------------------------------
 1 | from datasets import load_dataset
 2 | from exebench import Wrapper, diff_io, exebench_dict_to_dict
 3 | 
 4 | 
 5 | def main():
 6 |     # 1) Load dataset split. In this case, synthetic test split
 7 |     dataset = load_dataset('jordiae/exebench', split='test_synth') # , use_auth_token=True)
 8 |     # 2) Iterate over dataset
 9 |     for row in dataset:
10 |         # Do stuff with each row
11 |         # 3) Option A: Manually access row fields. For instance, access the function definition:
12 |         print(row['func_head_types'])
13 |         print(row['func_def'])  # print function definition
14 |         print(row['asm']['code'][0])  # print assembly with the first target, angha_gcc_x86_O0
15 |         # print first I/O example (synthetic)
16 |         print('Input:', exebench_dict_to_dict(row['synth_io_pairs']['input'][0]))
17 |         print('Output:', exebench_dict_to_dict(row['synth_io_pairs']['output'][0]))
18 |         print(row['synth_exe_wrapper'][0])  # print C++ wrapper to run function with IO
19 |         # You can manually compile, run with IO, etc
20 |         # TODO! sometimes row['func_head'] is not OK!
21 |         # 3) Option B: Use the Wrapper class.
22 |         try:
23 |             synth_wrapper = Wrapper(c_deps=row['synth_deps'] + '\n' + row['synth_io_pairs']['dummy_funcs'][0] + '\n',
24 |                                     func_c_signature=row['func_head_types'].replace('extern', ''), func_assembly=row['asm']['code'][0],
25 |                                     cpp_wrapper=row['synth_exe_wrapper'])
26 |             observed_output = synth_wrapper(exebench_dict_to_dict(row['synth_io_pairs']['input'][0]))  # Run synthetic example number 0
27 |             print('Input', exebench_dict_to_dict(row['synth_io_pairs']['input'][0]))
28 |             print('Observed Output:', observed_output)
29 |             print('Does this output coincide with the expected one?',
30 |                   'Yes' if diff_io(observed_output=observed_output,
31 |                                    expected_output=exebench_dict_to_dict(row['synth_io_pairs']['output'][0])) else 'No')
32 |         except Exception:
33 |             # Very occasionally the compilation using func_assembly=row['asm']['code'][0] seems to fail.
34 |             # My best guess at this moment is that the self-contained function assembly is not "self-contained enough"
35 |             # in a few cases, and in these cases it's better to recompile everything and run it all together.
36 |             # TODO: fix, or find a better explanation
37 |             pass
38 | 
39 |         # Run with custom IO (variable names must match):
40 |         # inp = {'a': 3, ...}
41 | 
42 |         break
43 | 
44 | 
45 | if __name__ == '__main__':
46 |     main()
47 | 
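
A minimal, self-contained sketch of the conversion that `exebench_dict_to_dict` applies to the I/O pairs printed in the example above. It assumes each entry stores parallel `var` and `value` lists whose leaves are string-encoded literals, which is what the package's use of `literal_eval` suggests. The helper name `pairs_to_dict` and the sample row are hypothetical and are not taken from the package or the dataset.

```python
from ast import literal_eval


def pairs_to_dict(exebench_dict):
    """Zip the parallel 'var'/'value' lists and parse string leaves into Python values."""
    def parse(node):
        if isinstance(node, dict):
            return {key: parse(value) for key, value in node.items()}
        if isinstance(node, list):
            return [parse(value) for value in node]
        # Assumption: every leaf is a string such as '3' or '1.5'.
        return literal_eval(node)

    return parse(dict(zip(exebench_dict['var'], exebench_dict['value'])))


# Hypothetical row fragment, shaped like row['synth_io_pairs']['input'][0].
sample = {'var': ['n', 'xs'], 'value': ['3', ['1.5', '2.0', '4']]}
result = pairs_to_dict(sample)
print(result)
```

The resulting plain dict is the shape that `Wrapper.__call__` serializes to JSON before running the compiled function, so custom inputs can be written directly in this form.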


--------------------------------------------------------------------------------
/exebench/__init__.py:
--------------------------------------------------------------------------------
  1 | import math
  2 | import json
  3 | from pathlib import Path
  4 | import subprocess
  5 | from typing import Optional, Tuple
  6 | import tempfile
  7 | import contextlib
  8 | import os
  9 | import shutil
 10 | import glob
 11 | import re
 12 | from ast import literal_eval
 13 | 
 14 | __all__ = ['diff_io', 'Wrapper', 'exebench_dict_to_dict']
 15 | 
 16 | __version__ = '0.1'
 17 | 
 18 | # UTILS (in a self-contained file to ease deployment)
 19 | 
 20 | _DEFAULT_CMD_TIMEOUT = 5
 21 | _ROOT_PATH_FOR_JSON_HPP = os.path.dirname(__file__)
 22 | _SYNTH_LIBS_PATH = os.path.dirname(__file__)
 23 | 
 24 | 
 25 | def _run_command(command: str, stdin: Optional[str] = None, timeout: Optional[int] = _DEFAULT_CMD_TIMEOUT) -> Tuple[str, str]:
 26 |     output = subprocess.run(command.split(), capture_output=True, text=True, input=stdin, timeout=timeout)
 27 |     stdout = output.stdout.decode('utf-8') if isinstance(output.stdout, bytes) else output.stdout
 28 |     stderr = output.stderr.decode('utf-8') if isinstance(output.stderr, bytes) else output.stderr
 29 |     return stdout, stderr
 30 | 
 31 | 
 32 | def _get_host_process_id():
 33 |     process_id = 'exebench_' + os.uname()[1] + '_' + str(os.getpid())
 34 |     return process_id
 35 | 
 36 | 
 37 | def _cleanup(path_pattern):
 38 |     for path in glob.glob(path_pattern):
 39 |         try:
 40 |             shutil.rmtree(path)
 41 |         except OSError:
 42 |             pass
 43 | 
 44 | 
 45 | @contextlib.contextmanager
 46 | def _get_tmp_path(content: Optional[str] = None, suffix: Optional[str] = None, delete=True) -> str:
 47 |     prefix = _get_host_process_id()
 48 |     try:
 49 |         with tempfile.NamedTemporaryFile(prefix=prefix, suffix=suffix, delete=delete, mode='w+') as ntf:
 50 |             if content:
 51 |                 ntf.write(content)
 52 |                 ntf.flush()
 53 |             yield ntf.name
 54 |     except OSError:
 55 |         _cleanup(os.path.join(tempfile.gettempdir(), prefix, '*'))
 56 |         with tempfile.NamedTemporaryFile(prefix=prefix, suffix=suffix, delete=delete, mode='w+') as ntf:
 57 |             if content:
 58 |                 ntf.write(content)
 59 |                 ntf.flush()
 60 |             yield ntf.name
 61 | 
 62 | 
 63 | class _Assembler:
 64 |     def __call__(self, c_deps, func_c_signature, func_assembly, cpp_wrapper) -> Path:
 65 |         raise NotImplementedError
 66 | 
 67 | 
 68 | class _DefaultAssembler(_Assembler):
 69 |     def __call__(self, c_deps, func_c_signature, func_assembly, cpp_wrapper) -> Path:
 70 |         with _get_tmp_path(content=None, suffix='.x', delete=False) as executable_path:
 71 |             c_deps += f'\nextern {func_c_signature};\n'
 72 |             with _get_tmp_path(content=c_deps, suffix='.c') as c_deps_path:
 73 |                 cpp_wrapper = re.sub(
 74 |                     r'extern\s\"C\"\s\{\s.*\s\}', 'extern "C" \n{\n#include "' + c_deps_path + '"\n}\n',
 75 |                     cpp_wrapper) # replace tmp path
 76 |                 with _get_tmp_path(content=cpp_wrapper, suffix='.cpp') as cpp_path, \
 77 |                         _get_tmp_path(content=func_assembly, suffix='.s') as s_path:
 78 | 
 79 |                     cmd = f'g++ -fpermissive -O0 -o {executable_path} {cpp_path} {s_path} -I {_ROOT_PATH_FOR_JSON_HPP} -I{_SYNTH_LIBS_PATH}'
 80 | 
 81 |                     stdout, stderr = _run_command(cmd)
 82 | 
 83 |         return Path(executable_path)
 84 | 
 85 | 
 86 | def _compile_exe_path(c_deps, func_c_signature, func_assembly, cpp_wrapper, assembler_backend):
 87 |     return assembler_backend(c_deps, func_c_signature, func_assembly, cpp_wrapper)
 88 | 
 89 | # API
 90 | 
 91 | 
 92 | class Wrapper:
 93 |     def __init__(self, c_deps, func_c_signature, func_assembly, cpp_wrapper, assembler_backend=_DefaultAssembler()):
 94 |         self._compiled_exe_path = self._compile_exe_path(c_deps, func_c_signature, func_assembly, cpp_wrapper, assembler_backend)
 95 | 
 96 |     @staticmethod
 97 |     def _compile_exe_path(c_deps, func_c_signature, func_assembly, cpp_wrapper, assembler_backend):
 98 |         return _compile_exe_path(c_deps, func_c_signature, func_assembly, cpp_wrapper, assembler_backend)
 99 | 
100 |     def __call__(self, inp, return_stdout_and_stderr=False):
101 |         executable = self._compiled_exe_path
102 | 
103 |         with _get_tmp_path(content=None, suffix='.json') as input_tmp_json_path:
104 |             output_file = ''.join(input_tmp_json_path.split(".")[:1]) + '-out.json'
105 | 
106 |             with open(input_tmp_json_path, 'w') as f:
107 |                 json.dump(inp, f)
108 | 
109 |             stdout, stderr = _run_command(f'{executable} {input_tmp_json_path} {output_file}')
110 | 
111 |             with open(output_file, 'r') as f:
112 |                 output = json.load(f)
113 |             os.remove(output_file)
114 | 
115 |         if return_stdout_and_stderr:
116 |             return output, stdout, stderr
117 | 
118 |         return output
119 | 
120 | 
121 | def diff_io(observed_output, expected_output) -> bool:
122 |     if type(observed_output) is not type(expected_output):
123 |         return False
124 |     if isinstance(observed_output, list):
125 |         if len(observed_output) != len(expected_output):
126 |             return False
127 |         for e1, e2 in zip(observed_output, expected_output):
128 |             ok = diff_io(e1, e2)
129 |             if not ok:
130 |                 return False
131 |     elif isinstance(observed_output, dict):
132 |         for key in observed_output:
133 |             if key not in expected_output:
134 |                 return False
135 |             ok = diff_io(observed_output[key], expected_output[key])
136 |             if not ok:
137 |                 return False
138 |     elif isinstance(observed_output, float):
139 |         ok = math.isclose(observed_output, expected_output)
140 |         if not ok:
141 |             return False
142 |     else:
143 |         ok = observed_output == expected_output
144 |         if not ok:
145 |             return False
146 |     return True
147 | 
148 | 
149 | def _fix_nested_dict(inp):  # hack
150 | 
151 |     if isinstance(inp, dict):
152 |         for k in inp:
153 |             inp[k] = _fix_nested_dict(inp[k])
154 |     elif isinstance(inp, list):
155 |         for idx, e in enumerate(inp):
156 |             inp[idx] = _fix_nested_dict(e)
157 |     else:
158 |         return literal_eval(inp)
159 |     return inp
160 | 
161 | 
162 | def exebench_dict_to_dict(exebench_dict):
163 |     keys = exebench_dict['var']
164 |     values = exebench_dict['value']
165 |     return _fix_nested_dict({k: v for k, v in zip(keys, values)})
166 | 
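
The comparison rules in `diff_io` have a few consequences that are easy to miss, so here is a condensed restatement with some probes. This is a sketch, not the package code: `outputs_match` is a hypothetical name, and the probe values are made up for illustration.

```python
import math


def outputs_match(observed, expected):
    """Condensed restatement of the diff_io rules: strict types, tolerant floats."""
    if type(observed) is not type(expected):
        return False
    if isinstance(observed, list):
        return len(observed) == len(expected) and all(
            outputs_match(o, e) for o, e in zip(observed, expected))
    if isinstance(observed, dict):
        # Only the keys of the observed output are checked.
        return all(key in expected and outputs_match(value, expected[key])
                   for key, value in observed.items())
    if isinstance(observed, float):
        return math.isclose(observed, expected)
    return observed == expected


checks = {
    # Floats are compared with math.isclose, so rounding noise is tolerated.
    'float_rounding': outputs_match({'ret': 0.1 + 0.2}, {'ret': 0.3}),
    # Types must match exactly: an int never equals a float here.
    'int_vs_float': outputs_match({'ret': 1}, {'ret': 1.0}),
    # Keys present only in the expected output are not inspected.
    'extra_expected_key': outputs_match({'a': 1}, {'a': 1, 'b': 2}),
    # Lists of different lengths never match.
    'list_length': outputs_match([1, 2], [1, 2, 3]),
}
print(checks)
```

The third probe is the one to keep in mind: a run that drops an output variable can still count as a match unless the key sets are compared separately.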


--------------------------------------------------------------------------------
/exebench/clib/Makefile:
--------------------------------------------------------------------------------
1 | all:
2 | 	# Use .co files to avoid clashes with the OCaml build system
3 | 	g++ -Wall synthesizer.c -c -o synthesizer.co
4 | 	g++ fft_synth/lib.c -c -o fft_synth_lib.co
5 | 


--------------------------------------------------------------------------------
/exebench/clib/fft_synth/lib.c:
--------------------------------------------------------------------------------
 1 | unsigned int reverseBits(unsigned int num, unsigned int no_bits)
 2 | {
 3 |     unsigned int reverse_num = 0, temp;
 4 | 
 5 | 	while (no_bits > 1)
 6 |     {
 7 | 		reverse_num = reverse_num << 1;
 8 | 		reverse_num = reverse_num | (num & 1);
 9 | 
10 | 		// Shrink.
11 | 		no_bits = no_bits >> 1;
12 | 		num = num >> 1;
13 |     }
14 | 
15 |     return reverse_num;
16 | }
17 | 


--------------------------------------------------------------------------------
/exebench/clib/fft_synth/lib.h:
--------------------------------------------------------------------------------
 1 | // A set of functions that are used by the synthesizer that
 2 | // I don't want to embed in OCaml.
 3 | // Not 100% clear that this is the best decision (as opposed
 4 | // to embedding implementations, since we could genericize
 5 | // the generation of those).
 6 | 
 7 | // Copied from
 8 | // https://www.geeksforgeeks.org/write-an-efficient-c-program-to-reverse-bits-of-a-number/
 9 | unsigned int reverseBits(unsigned int num, unsigned int no_bits);
10 | 
11 | // This doesn't use the post-index because it makes no sense
12 | // to --- but it does make it semantically replaceable
13 | // with the other options here.
14 | #define BIT_REVERSE(arr, postind, len) 		                 \
15 | 	for (unsigned int i = 0; i < len; i ++) {        \
16 | 		unsigned int reversed = reverseBits(i, len); \
17 | 		if (i < reversed) {                          \
18 | 			auto temp = arr[i];                      \
19 | 			arr[i] = arr[reversed];                  \
20 | 			arr[reversed] = temp;                    \
21 | 		}                                            \
22 | 	}
23 | 
24 | // Every problem is solved by another layer of indirection?
25 | 
26 | // These are for doing sub-assigns with the connectors already included.
27 | #define ARRAY_NORM_BASE(arr, postind, len) for (int i = 0; i < len; i ++) { arr[i]postind = arr[i]postind / len; }
28 | #define ARRAY_DENORM_BASE(arr, postind, len) for (int i = 0; i < len; i ++) { arr[i]postind = arr[i]postind * len; }
29 | #define ARRAY_HALF_NORM_BASE(arr, postind, len) for (int i = 0; i < len; i ++) { arr[i]postind = arr[i]postind / (len / 2); } 
30 | #define ARRAY_HALF_DENORM_BASE(arr, postind, len) for (int i = 0; i < len; i ++) { arr[i]postind = arr[i]postind * (len / 2); } 
31 | 
32 | /// These are for doing arrays with sub-lengths
33 | #define ARRAY_NORM_POSTIND(arr, postind, len) ARRAY_NORM_BASE(arr, .postind, len)
34 | #define ARRAY_DENORM_POSTIND(arr, postind, len) ARRAY_DENORM_BASE(arr, .postind, len)
35 | #define ARRAY_HALF_NORM_POSTIND(arr, postind, len) ARRAY_HALF_NORM_BASE(arr, .postind, len)
36 | #define ARRAY_HALF_DENORM_POSTIND(arr, postind, len) ARRAY_HALF_DENORM_BASE(arr, .postind, len)
37 | 
38 | // These are for doing arrays without subelements.
39 | #define ARRAY_NORM(arr, len) ARRAY_NORM_BASE(arr, , len)
40 | #define ARRAY_DENORM(arr, len) ARRAY_DENORM_BASE(arr, , len)
41 | #define ARRAY_HALF_NORM(arr, len) ARRAY_HALF_NORM_BASE(arr, , len)
42 | #define ARRAY_HALF_DENORM(arr, len) ARRAY_HALF_DENORM_BASE(arr, , len)
43 | 
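
For readers who want to check the bit-reversal logic without compiling the C sources, the following Python sketch mirrors `reverseBits` and the `BIT_REVERSE` macro. The function names are hypothetical, and the sketch assumes the array length is a power of two, as is usual for FFT inputs.

```python
def reverse_bits(num, length):
    """Reverse the low log2(length) bits of num, following the loop in reverseBits."""
    reversed_num = 0
    while length > 1:
        reversed_num = (reversed_num << 1) | (num & 1)
        length >>= 1  # one bit consumed per halving of the length
        num >>= 1
    return reversed_num


def bit_reverse_permute(values):
    """Swap each element with the one at its bit-reversed index, once per pair."""
    out = list(values)
    n = len(out)
    for i in range(n):
        j = reverse_bits(i, n)
        if i < j:  # the same guard as in BIT_REVERSE, avoids swapping twice
            out[i], out[j] = out[j], out[i]
    return out


permuted = bit_reverse_permute(range(8))
print(permuted)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Note that the second argument is the array length, not the number of bits: for a length of 8 the loop runs three times.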


--------------------------------------------------------------------------------
/exebench/clib/synthesizer.c:
--------------------------------------------------------------------------------
 1 | #include <stddef.h>
 2 | #include <stdint.h> // TODO -- this may not be enough on some platforms for uintptr?  Can we just use long long safely?
 3 | #include <stdlib.h>
 4 | #include "synthesizer.h"
 5 | 
 6 | // We need to keep track of the differences that are 
 7 | // added to each pointer to align them so we can 
 8 | // free them properly.
 9 | 
10 | // Fortunately, this is easier to write than
11 | // it might otherwise be because if an accelerator
12 | // has N malloc-able arguments, then the most we
13 | // are going to be malloc-ing is N + 1 (for the
14 | // return value).  FFTs tend to have two (in and out)
15 | // but four is not implausible (in real, out real,
16 | // in imag, out imag).  Anyway, so we'll just
17 | // support like 8 for now, and segfault if there
18 | // are more.
19 | 
20 | int facc_malloc_count = 0;
21 | typedef struct {
22 | 	uintptr_t returned;
23 | 	uintptr_t original;
24 | } pointer_map_pair;
25 | 
26 | pointer_map_pair pointer_map[8];
27 | 
28 | // An aligned malloc and free function.
29 | // Note that if an aligned malloc escapes the
30 | // accelerated function, then all frees that are
31 | // reachable by that pointer must be replaced with
32 | // facc_free.
33 | void *facc_malloc(size_t alignment, size_t size) {
34 | 	char* ptr = (char*) malloc(size + alignment);
35 | 
36 | 	uintptr_t original_pointer = (uintptr_t) ptr;
37 | 	uintptr_t offset_pointer = original_pointer;
38 | 	if (alignment != 0) {
39 | 		offset_pointer += (alignment - (original_pointer % alignment)) % alignment;
40 | 		// If the offset is zero, we don't have to use
41 | 		// the facc malloc.  Really, we should do
42 | 		// this statically in FACC (i.e. turn into a normal
43 | 		// malloc)
44 | 		pointer_map[facc_malloc_count].original = original_pointer;
45 | 		pointer_map[facc_malloc_count].returned = offset_pointer;
46 | 		facc_malloc_count += 1;
47 | 	}
48 | 
49 | 	return (void*) offset_pointer;
50 | }
51 | 
52 | // This could really be much more efficient IMO.
53 | void facc_free(void* pointer) {
54 | 	uintptr_t pointer_value = (uintptr_t) pointer;
55 | 	for (int i = 0; i < facc_malloc_count; i ++) {
56 | 		if (pointer_map[i].returned == pointer_value) {
57 | 			free((void *) pointer_map[i].original);
58 | 
59 | 			// Shuffle everything back down
60 | 			for (int j = i; j < facc_malloc_count - 1; j ++) {
61 | 				pointer_map[j] = pointer_map[j + 1];
62 | 			}
63 | 
64 | 			facc_malloc_count --;
65 | 			return;
66 | 		}
67 | 	}
68 | 
69 | 	free(pointer);
70 | }
71 | 
72 | void facc_strcopy(char *str_in, char *str_out) {
73 | 	while (*str_in != '\0') {
74 | 		*str_out = *str_in;
75 | 		str_in ++;
76 | 		str_out ++;
77 | 	}
78 | 	// Copy the null terminating character too.
79 | 	*str_out = *str_in;
80 | }
81 | 
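
An aligned allocator such as `facc_malloc` over-allocates by `alignment` bytes and then rounds the returned address up to the next multiple of `alignment`. The Python sketch below shows that round-up arithmetic on plain integers; `align_up` is a hypothetical helper and the addresses are made-up values, not output from the C code.

```python
def align_up(address, alignment):
    """Round address up to the next multiple of alignment (0 means no alignment)."""
    if alignment == 0:
        return address
    # Distance to the next boundary; the outer modulo keeps it 0 when already aligned.
    offset = (alignment - address % alignment) % alignment
    return address + offset


unaligned = 0x1005
aligned = align_up(unaligned, 16)
print(hex(aligned))  # 0x1010
```

The offset is always smaller than `alignment`, which is why reserving `size + alignment` bytes is enough to keep the aligned block inside the original allocation.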


--------------------------------------------------------------------------------
/exebench/clib/synthesizer.h:
--------------------------------------------------------------------------------
 1 | #include <stddef.h>
 2 | 
 3 | // Conditions
 4 | #define POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
 5 | #define GREATER_THAN(x, y) ((x) > (y))
 6 | #define GREATER_THAN_OR_EQUAL(x, y) ((x) >= (y))
 7 | #define LESS_THAN(x, y) ((x) < (y))
 8 | #define LESS_THAN_OR_EQUAL(x, y) ((x) <= (y))
 9 | #define PRIM_EQUAL(x, y) ((x) == (y))
10 | /* TODO --- would like to make this better.  */
11 | #define FLOAT_EQUAL(x, y) ((x < y + x / 1000.0) && (x > y - x / 1000.0))
12 | 
13 | // Operations
14 | #define Pow2(x) (1 << x)
15 | #define IntDivide(x,y) (x / y)
16 | 
17 | #define Multiply(x, y) (x * y)
18 | 
19 | // Strings are internally dimensionless, but if they need
20 | // to be allocated in C we need to decide on the amount
21 | // of space they require.  One expects this to be platform
22 | // and application dependent (and really to be changed
23 | // to an appropriate value by the programmer).
24 | #define MAX_STRING_SIZE 2048
25 | 
26 | // Builtin types
27 | typedef struct { float f32_1; float f32_2; } facc_2xf32_t;
28 | typedef struct { double f64_1; double f64_2; } facc_2xf64_t;
29 | 
30 | // FACC aligned memory functions
31 | void *facc_malloc(size_t alignment, size_t size);
32 | void facc_free(void* pointer);
33 | 
34 | // Generic utility functions
35 | void facc_strcopy(char *str_in, char *str_out);
36 | 


--------------------------------------------------------------------------------
/exebench/clib/test/Makefile:
--------------------------------------------------------------------------------
1 | fft:
2 | 	g++ -I ../fft_synth/ test_fft.c ../fft_synth/lib.c -o test_fft
3 | 	./test_fft
4 | 
5 | all: fft
6 | 


--------------------------------------------------------------------------------
/exebench/clib/test/test_fft:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jordiae/exebench/8b47d483c2e39ea40a304fbaac46e41a1a14272f/exebench/clib/test/test_fft


--------------------------------------------------------------------------------
/exebench/clib/test/test_fft.c:
--------------------------------------------------------------------------------
 1 | #include <assert.h>
 2 | #include "../fft_synth/lib.h"
 3 | #include <stdio.h>
 4 | 
 5 | int main() {
 6 | 	assert(reverseBits(1, 8) == 4);
 7 | 	assert(reverseBits(2, 8) == 2);
 8 | 
 9 | 	int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};
10 | 
11 | 	BIT_REVERSE(arr, , 8);
12 | 
13 | 	printf("Array is %d, %d, %d, %d, %d, %d, %d, %d\n", arr[0], arr[1], arr[2], arr[3], arr[4], arr[5], arr[6], arr[7]);
14 | 
15 | 	assert(arr[0] == 0);
16 | 	assert(arr[1] == 4);
17 | 	assert(arr[2] == 2);
18 | 	assert(arr[3] == 6);
19 | }
20 | 


--------------------------------------------------------------------------------
/legacy/legacy.py:
--------------------------------------------------------------------------------
  1 | from typing import Dict, Optional, List
  2 | import operator
  3 | import itertools
  4 | from dataclasses import asdict, dataclass
  5 | import sh
  6 | from dataclasses import dataclass, asdict
  7 | from abc import ABC
  8 | from koda import Ok, Err, Result
  9 | import re
 10 | import json
 11 | import math
 12 | 
 13 | 
 14 | # For the moment, don't support MASM
 15 | 
 16 | 
 17 | @dataclass
 18 | class AsmTarget:
 19 |     impl: str
 20 |     bits: int
 21 |     lang: str
 22 |     o: str
 23 | 
 24 |     def __post_init__(self):
 25 |         assert self.impl in ['gcc', 'clang']
 26 |         assert self.bits in [32, 64]
 27 |         assert self.lang in ['masm', 'gas', 'llvm']
 28 |         assert self.o in ['0', '1', '2', '3', 'fast', 'g', 's']
 29 | 
 30 |     def dict(self):
 31 |         return asdict(self)
 32 | 
 33 | 
 34 | @dataclass
 35 | class FuncAsm:
 36 |     pre_asm: str  # asm directives before, and also e.g. global variable declarations needed to compile llvm functions
 37 |     func_asm: str  # asm of function itself
 38 |     post_asm: str  # asm directives after the function itself
 39 |     target: AsmTarget
 40 | 
 41 |     def dict(self):
 42 |         return asdict(self)
 43 | 
 44 | 
 45 | class Compiler:
 46 |     def __init__(self, arch, o, lang, bits=64):
 47 |         self.arch = arch
 48 |         self.o = o
 49 |         self.bits = bits
 50 |         self.lang = lang
 51 | 
 52 |     def get_func_asm(self, all_required_c_code, fname, output_path=None) -> Result[FuncAsm, BaseException]:
 53 |         return self._get_func_asm(all_required_c_code, fname, output_path, arch=self.arch, o=self.o, bits=self.bits)
 54 | 
 55 |     def _get_func_asm(self, all_required_c_code, fname, output_path, arch, o, bits) -> Result[FuncAsm, BaseException]:
 56 |         raise NotImplementedError
 57 | 
 58 |     def _asm_replace_constants_with_literals(self, all_asm, func_asm):
 59 |         raise NotImplementedError
 60 | 
 61 |     @classmethod
 62 |     def factory(cls, impl, *args, **kwargs):
 63 |         if impl == 'gcc':
 64 |             return GCC(*args, **kwargs)
 65 |         elif impl == 'clang':
 66 |             return Clang(*args, **kwargs)
 67 |         raise NotImplementedError(f'impl = {impl}')
 68 | 
 69 | 
 70 | class GASCompiler(ABC, Compiler):
 71 | 
 72 |     def get_comment_sym(self):
 73 |         if self.lang == 'gas':
 74 |             if self.arch == 'arm':
 75 |                 return '@'
 76 |             return '#'
 77 |         elif self.lang == 'llvm':
 78 |             return ';'
 79 |         else:
 80 |             raise ValueError(f'lang = {self.lang}')
 81 | 
 82 |     def _asm_replace_constants_with_literals(self, all_asm, func_asm):
 83 |         all_asm = all_asm.decode("utf-8")
 84 |         asm_to_add = []
 85 |         for symbol in re.compile(r'\.LC[0-9]*').findall(func_asm):  # TODO: move, compile once
 86 |             for e in re.findall(rf'\.{symbol.replace(".", "")}:[\r\n]+([^\r\n]+)', all_asm):
 87 |                 asm_to_add.append(symbol + ': ' + e)
 88 |                 break
 89 |         for symbol in re.compile(r'a\.[0-9]*').findall(func_asm):  # TODO: move, compile once
 90 |             for e in re.findall(f'{symbol}:[\r\n]+([^\r\n]+)', all_asm):
 91 |                 asm_to_add.append(symbol + ': ' + e)
 92 |                 break
 93 |         return func_asm + '\n' + '\n'.join(asm_to_add) + '\n'
 94 | 
 95 |     def _gas_get_func_asm_from_all_asm(self, fname, all_asm):
 96 | 
 97 |         def strip_comments(code, comment_sym):  # only support simple commands, asm
 98 |             res = []
 99 |             for l in code.splitlines():
100 |                 without_comments = l.split(comment_sym)[0]
101 |                 if len(without_comments.split()) > 0:
102 |                     res.append(without_comments)
103 |             return '\n'.join(res)
104 | 
105 |         func = [f'.globl {fname}', f'.type {fname}, @function']
106 |         inside_func = False
107 |         after_func = False
108 |         pre_asm = []
109 |         post_asm = []
110 |         for l in all_asm.splitlines():
111 |             if l.startswith(f'{fname}:'):
112 |                 inside_func = True
113 |             if inside_func:
114 |                 func.append(l)
115 |             elif after_func:
116 |                 post_asm.append(l)
117 |             else:
118 |                 if f'.globl {fname}' not in l:
119 |                     pre_asm.append(l)
120 |             if inside_func and '.cfi_endproc' in l:
121 |                 inside_func = False
122 |                 after_func = True
123 | 
124 |         pre_asm = '\n'.join(pre_asm) + '\n'
125 |         func_asm = '\n'.join(func) + '\n'
126 |         comment_sym = self.get_comment_sym()
127 |         func_asm = strip_comments(func_asm, comment_sym=comment_sym)
128 |         post_asm = '\n'.join(post_asm) + '\n'
129 | 
130 |         return pre_asm, func_asm, post_asm
131 | 
132 | 
133 | class GCC(GASCompiler):
134 |     def __init__(self, *args, **kwargs):
135 |         super().__init__(*args, lang='gas', **kwargs)
136 |         self.arm_64 = sh.aarch64_linux_gnu_gcc  # sudo apt install aarch64-linux-gnu-gcc
137 |         self.x86_64 = sh.gcc
138 | 
139 |     def _get_func_asm(self, all_required_c_code, fname, output_path, arch, o, bits) -> Result[FuncAsm, BaseException]:
140 |         lang = 'gas'  # we don't support masm
141 |         if arch == 'arm' and bits == 64:
142 |             backend = self.arm_64
143 |         elif arch == 'x86' and bits == 64:
144 |             backend = self.x86_64
145 |         else:
146 |             raise NotImplementedError(f'arch = {arch}, bits = {bits}')
147 |         try:
148 |             out = backend('-S', f'-O{o}', '-x', 'c', '-o', '/dev/stdout', '-', _in=all_required_c_code)
149 |         except BaseException as e:
150 |             return Err(e)
151 | 
152 |         pre_asm, func_asm, post_asm = self._gas_get_func_asm_from_all_asm(all_asm=out, fname=fname)
153 |         func_asm = self._asm_replace_constants_with_literals(all_asm=out.stdout, func_asm=func_asm)
154 |         func_asm = FuncAsm(pre_asm=pre_asm, func_asm=func_asm, post_asm=post_asm, target=AsmTarget(impl='gcc',
155 |                                                                                                    bits=bits,
156 |                                                                                                    lang=lang,
157 |                                                                                                    o=o))
158 |         return Ok(func_asm)
159 | 
160 | 
161 | class Clang(GASCompiler):
162 |     def __init__(self, *args, emit_llvm=False, **kwargs):
163 |         lang = 'llvm' if emit_llvm else 'gas'
164 |         super().__init__(*args, lang=lang, **kwargs)
165 |         self.clang = sh.clang  # sudo apt install clang
166 |         self.emit_llvm = emit_llvm
167 |         self.emit_llvm_flag = '-emit-llvm' if emit_llvm else ''
168 | 
169 |     def _get_func_asm(self, all_required_c_code, fname, output_path, arch, o, bits) -> Result[FuncAsm, BaseException]:
170 |         if arch == 'x86' and bits == 64:
171 |             backend = self.clang
172 |         else:
173 |             raise NotImplementedError(f'arch = {arch}, bits = {bits}')
174 |         try:
175 |             emit_flags = ['-emit-llvm'] if self.emit_llvm else []  # avoid passing an empty '' argument
176 |             out = backend('-S', *emit_flags, f'-O{o}', '-x', 'c', '-o', '/dev/stdout', '-', _in=all_required_c_code)
177 |         except BaseException as e:
178 |             return Err(e)
179 |         try:
180 |             if self.emit_llvm:
181 |                 pre_asm, func_asm, post_asm = self._llvm_get_func_asm_from_all_asm(all_asm=out, fname=fname)
182 |             else:
183 |                 pre_asm, func_asm, post_asm = self._gas_get_func_asm_from_all_asm(all_asm=out, fname=fname)
184 |         except RuntimeError as e:
185 |             return Err(e)
186 |         func_asm = self._asm_replace_constants_with_literals(all_asm=out.stdout, func_asm=func_asm)
187 |         func_asm = FuncAsm(pre_asm=pre_asm, func_asm=func_asm, post_asm=post_asm, target=AsmTarget(impl='clang',
188 |                                                                                                    bits=bits,
189 |                                                                                                    lang='llvm' if self.emit_llvm else 'gas',
190 |                                                                                                    o=o))
191 |         return Ok(func_asm)
192 | 
193 |     @staticmethod
194 |     def _llvm_get_func_asm_from_all_asm(fname, all_asm):
195 |         # @var = common dso_local global i32 0, align 4
196 |         # ; Function Attrs: noinline nounwind optnone uwtable
197 |         # define dso_local i32 @f(i32 %0) #0 {
198 |         func = []
199 |         inside_func = False
200 |         after_func = False
201 |         pre_asm = []
202 |         post_asm = []
203 |         for l in all_asm.splitlines():
204 |             if l.startswith('define') and f'@{fname}(' in l:
205 |                 inside_func = True
206 |             if inside_func:
207 |                 func.append(l)
208 |             elif after_func:
209 |                 post_asm.append(l)
210 |             else:
211 |                 pre_asm.append(l)
212 |             # if inside_func and 'ret' in l.split():  # Todo: not always ret
213 |             if inside_func and l.startswith('}'):
214 |                 inside_func = False
215 |                 after_func = True
216 |         # The loop above already appended the closing '}' line, so nothing is added here.
217 |         if len(post_asm) == 0:
218 |             raise RuntimeError("Couldn't process assembly")
219 |         del post_asm[0]
220 | 
221 |         pre_asm = '\n'.join(pre_asm) + '\n'
222 |         func_asm = '\n'.join(func) + '\n'
223 |         post_asm = '\n'.join(post_asm) + '\n'
224 | 
225 |         return pre_asm, func_asm, post_asm
226 | 
227 | 
228 | # TODO: literals/constants, global variables, etc. in LLVM, clang, etc.
229 | 
230 | 
231 | @dataclass
232 | class IOPair:
233 |     input: Dict
234 |     output: Dict
235 |     dummy_funcs: Optional[str] = None
236 |     dummy_funcs_seed: Optional[int] = None
237 | 
238 |     def dict(self):
239 |         return asdict(self)
240 | 
241 |     @classmethod
242 |     def group_by_seed(cls, iopairs: List['IOPair']):
243 |         get_attr = operator.attrgetter('dummy_funcs_seed')
244 |         grouped_by = [list(g) for k, g in itertools.groupby(sorted(iopairs, key=get_attr), get_attr)]
245 |         return grouped_by
246 | 
247 | 
248 | @dataclass
249 | class FuncDataclass:
250 |     path: str
251 |     func_def: str
252 |     func_head: str
253 |     fname: str
254 |     signature: List[str]
255 |     doc: Optional[str] = None
256 |     angha_error: Optional[str] = None
257 |     real_error: Optional[str] = None
258 |     # TODO: fname, signature (without parameter names)
259 |     asm: Optional[Dict[str, Optional[FuncAsm]]] = None
260 |     angha_deps: Optional[str] = None
261 |     real_deps: Optional[str] = None
262 |     angha_io_pairs: Optional[List[IOPair]] = None
263 |     real_io_pairs: Optional[List[IOPair]] = None
264 |     angha_io_error: Optional[str] = None
265 |     real_io_error: Optional[str] = None
266 |     angha_exe_wrapper: Optional[str] = None
267 |     real_exe_wrapper: Optional[str] = None
268 |     angha_io_pairs_are_trivial: bool = False
269 |     real_io_pairs_are_trivial: bool = False
270 |     angha_iospec: Optional[Dict] = None
271 |     real_iospec: Optional[Dict] = None
272 |     ref: str = 'master'
273 | 
274 |     def dict(self):
275 |         return asdict(self)
276 | 
277 |     def get_fname_tmp_fix(self):
278 |         return self.func_head.split('(')[-2].split()[-1].replace('*', '')
279 | 
280 |     @classmethod
281 |     def legacy_init(cls, **kwargs):
282 |         kwargs.pop('io_pairs', None)  # drop the legacy field; tolerate rows that lack it
283 |         return cls(**kwargs)
284 | 
285 | 
286 | def run(wrapper_path, io_path, iospec, wrapper, fast_sanity_check):
287 |     gpp = GPP(print_stderr=True)
288 |     compiled_wrapper_output_path = wrapper_path.replace('.cpp', '.x')  # os.path.join('wrapper')
289 |     gpp.compile_exe_to_file(open(wrapper_path, 'r').read(), path=compiled_wrapper_output_path)
290 | 
291 |     # 5: Run IO
292 | 
293 |     io_pairs = []
294 |     for json_file in sorted(glob(os.path.join(io_path, 'io', '1', '*.json'))):
295 |         output_file = os.path.splitext(json_file)[0] + '-out.json'  # strip only the final extension
296 |         stdout, stderr = run_command(f'{compiled_wrapper_output_path} {json_file} {output_file}')
297 |         if fast_sanity_check:
298 |             if not stderr:
299 |                 output_file_check = os.path.splitext(json_file)[0] + '-out-check.json'
300 |                 stdout, stderr = run_command(f'{compiled_wrapper_output_path} {json_file} {output_file_check}')
301 |             else:
302 |                 raise IOGenerationError
303 |         with open(json_file, 'r') as f1, open(output_file, 'r') as f2:
304 |             output = json.load(f2)
305 |             if fast_sanity_check:
306 |                 with open(output_file_check, 'r') as f3:
307 |                     check = json.load(f3)
308 |                     if output and not io_check(check, output):
309 |                         raise IOGenerationError
310 |             io_pairs.append({'input': json.load(f1), 'output': output})
311 |     return io_pairs, iospec, wrapper
312 | 
313 | 
314 | def run_io(ground_truth_io_pairs, compiled_wrapper_path):
315 |     output_io_pairs = []
316 |     for io_pair in ground_truth_io_pairs:
317 |         with get_tmp_path(content=None, suffix='.json') as input_tmp_json_path:
318 |             output_file = ''.join(input_tmp_json_path.split(".")[:1]) + '-out.json'
319 |             with open(input_tmp_json_path, 'w') as f:
320 |                 json.dump(io_pair['input'], f)
321 |             stdout, stderr = run_command(f'{compiled_wrapper_path} {input_tmp_json_path} {output_file}')
322 |         with open(output_file, 'r') as f:
323 |             output_io_pairs.append({'input': io_pair['input'], 'output': json.load(f)})
324 |     return output_io_pairs
325 | 
326 | 
327 | class ExeError(Exception):
328 |     pass
329 | 
330 | 
331 | class ExeBenchFunc:
332 |     def __init__(self, deps, assem, ):
333 |         # self._row = exebench_row
334 |         self._compiled_real = self._compile_real()
335 |         self._compiled_synth = self._compile_synth()
336 | 
337 |     @classmethod
338 |     def from_exebench_row(cls, exebench_row):
339 |         return cls()
340 | 
341 |     # TODO: unfinished classmethod stub, kept as a comment so that the module parses:
342 |     #     @classmethod / def f
343 | 
344 |     def run_real_io(self, inp):
345 |         assert self._compiled_real
346 |         return self(inp, executable=self._compiled_real)
347 | 
348 |     def run_synth_io(self, inp):
349 |         assert self._compiled_synth
350 |         return self(inp, executable=self._compiled_synth)
351 | 
352 |     def run_real_all_io_in_row(self):
353 |         assert self._compiled_real
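354 |         # Run every input and collect the outputs instead of returning after the first one.
355 |         # TODO: self._row is never assigned (see __init__) and the lookup key is still empty.
356 |         return [self(inp, executable=self._compiled_real) for inp in self._row('')]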
357 | 
358 |     def __call__(self, inp, executable):
359 |         with get_tmp_path(content=None, suffix='.json') as input_tmp_json_path:
360 |             output_file = ''.join(input_tmp_json_path.split(".")[:1]) + '-out.json'
361 |             with open(input_tmp_json_path, 'w') as f:
362 |                 json.dump(inp, f)
363 |             stdout, stderr = run_command(f'{executable} {input_tmp_json_path} {output_file}')
364 |             with open(output_file, 'r') as f:
365 |                 output = json.load(f)
366 | 
367 |         return output
368 | 
369 | 
370 | class Wrapper:
371 |     def __init__(self, deps, assembly, assembler):
372 |         self._compiled_exe_path = self._compile_exe_path(deps, assembly, assembler)
373 | 
374 |     @staticmethod
375 |     def _compile_exe_path(deps, assembly, assembler):
376 |         return ''
377 | 
378 |     def __call__(self, inp):
379 |         executable = self._compiled_exe_path
380 |         with get_tmp_path(content=None, suffix='.json') as input_tmp_json_path:
381 |             output_file = ''.join(input_tmp_json_path.split(".")[:1]) + '-out.json'
382 |             with open(input_tmp_json_path, 'w') as f:
383 |                 json.dump(inp, f)
384 |             stdout, stderr = run_command(f'{executable} {input_tmp_json_path} {output_file}')
385 | 
386 |             with open(output_file, 'r') as f:
387 |                 output = json.load(f)
388 | 
389 |         return output
390 | 
391 | 
392 | class ExeBenchRowWrapper:
393 |     def __init__(self, exebench_row):
394 |         self._synth_wrapper = W  # FIXME: incomplete expression; 'W' is undefined
395 |         # self._row = exebench_row
396 |         self._compiled_real = self._compile_real()
397 |         self._compiled_synth = self._compile_synth()
398 | 
399 |     @classmethod
400 |     def from_exebench_row(cls, exebench_row):
401 |         return cls()
402 | 
403 |     # TODO: unfinished classmethod stub, kept as a comment so that the module parses:
404 |     #     @classmethod / def f
405 | 
406 |     def run_real_io(self, inp):
407 |         assert self._compiled_real
408 |         return self(inp, executable=self._compiled_real)
409 | 
410 |     def run_synth_io(self, inp):
411 |         assert self._compiled_synth
412 |         return self(inp, executable=self._compiled_synth)
413 | 
414 |     def run_real_all_io_in_row(self):
415 |         assert self._compiled_real
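416 |         # Run every input and collect the outputs instead of returning after the first one.
417 |         # TODO: self._row is never assigned (see __init__) and the lookup key is still empty.
418 |         return [self(inp, executable=self._compiled_real) for inp in self._row('')]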
419 | 
420 |     def __call__(self, inp, executable):
421 |         with get_tmp_path(content=None, suffix='.json') as input_tmp_json_path:
422 |             output_file = ''.join(input_tmp_json_path.split(".")[:1]) + '-out.json'
423 |             with open(input_tmp_json_path, 'w') as f:
424 |                 json.dump(inp, f)
425 |             stdout, stderr = run_command(f'{executable} {input_tmp_json_path} {output_file}')
426 |             with open(output_file, 'r') as f:
427 |                 output = json.load(f)
428 | 
429 |         return output
430 | 
431 | 
432 | def diff_io(observed_output, expected_output) -> bool:
433 |     if type(observed_output) is not type(expected_output):
434 |         return False
435 |     if isinstance(observed_output, list):
436 |         if len(observed_output) != len(expected_output):
437 |             return False
438 |         for e1, e2 in zip(observed_output, expected_output):
439 |             ok = diff_io(e1, e2)
440 |             if not ok:
441 |                 return False
442 |     elif isinstance(observed_output, dict):
443 |         if observed_output.keys() != expected_output.keys():  # keys must match in both directions
444 |             return False
445 |         for key in observed_output:
446 |             ok = diff_io(observed_output[key], expected_output[key])
447 |             if not ok:
448 |                 return False
449 |     elif isinstance(observed_output, float):
450 |         ok = math.isclose(observed_output, expected_output)
451 |         if not ok:
452 |             return False
453 |     else:
454 |         ok = observed_output == expected_output
455 |         if not ok:
456 |             return False
457 |     return True
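
The recursive comparison in `diff_io` can be illustrated with a standalone sketch. The name `outputs_match` and the `rel_tol`/`abs_tol` parameters are assumptions made for this example and are not part of the repository. Unlike a plain `math.isclose` call with default arguments, the sketch exposes an absolute tolerance, which matters when the expected value is exactly `0.0`. It also requires dictionary keys to match in both directions, which is a design choice of this sketch.

```python
import math


def outputs_match(observed, expected, rel_tol=1e-9, abs_tol=0.0):
    """Return True when two JSON-like values are structurally equal.

    Hypothetical helper: floats are compared with a tolerance,
    everything else with ==. Types must match exactly (1 != 1.0).
    """
    if type(observed) is not type(expected):
        return False
    if isinstance(observed, list):
        # Lists must have the same length and match element by element.
        return len(observed) == len(expected) and all(
            outputs_match(o, e, rel_tol, abs_tol)
            for o, e in zip(observed, expected))
    if isinstance(observed, dict):
        # Dicts must have the same key set and matching values.
        return observed.keys() == expected.keys() and all(
            outputs_match(observed[k], expected[k], rel_tol, abs_tol)
            for k in observed)
    if isinstance(observed, float):
        return math.isclose(observed, expected,
                            rel_tol=rel_tol, abs_tol=abs_tol)
    return observed == expected


same = outputs_match({'ret': 0.1 + 0.2, 'buf': [1, 2]},
                     {'ret': 0.3, 'buf': [1, 2]})
near_zero_strict = outputs_match(1e-12, 0.0)
near_zero_loose = outputs_match(1e-12, 0.0, abs_tol=1e-9)
```

With the default `abs_tol=0.0`, a value such as `1e-12` does not match an expected `0.0`, because a purely relative tolerance scales with the magnitude of the operands. Passing a small `abs_tol` is one way to accept such results.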
458 | 
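
`IOPair.group_by_seed` sorts by `dummy_funcs_seed`, which is declared as `Optional[int]`. In Python 3, sorting a list that mixes `None` and `int` keys raises `TypeError`. The following sketch shows one possible way to group pairs when some seeds are missing. The `IOPair` definition is copied from the listing above so that the block is self-contained. The standalone function and the tuple sort key are assumptions made for this example, not the repository's implementation.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IOPair:
    input: Dict
    output: Dict
    dummy_funcs: Optional[str] = None
    dummy_funcs_seed: Optional[int] = None


def seed_key(pair: IOPair):
    # (False, 0) for a missing seed, (True, seed) otherwise, so that
    # None is never compared with an int and unseeded pairs sort first.
    seed = pair.dummy_funcs_seed
    return (seed is not None, seed if seed is not None else 0)


def group_by_seed(iopairs: List[IOPair]) -> List[List[IOPair]]:
    # itertools.groupby only merges adjacent items, hence the sort.
    ordered = sorted(iopairs, key=seed_key)
    return [list(g) for _, g in itertools.groupby(ordered, key=seed_key)]


pairs = [
    IOPair({'x': 1}, {'ret': 1}, dummy_funcs_seed=7),
    IOPair({'x': 2}, {'ret': 2}),
    IOPair({'x': 3}, {'ret': 3}, dummy_funcs_seed=7),
    IOPair({'x': 4}, {'ret': 4}, dummy_funcs_seed=3),
]
groups = group_by_seed(pairs)
grouped_inputs = [[p.input['x'] for p in g] for g in groups]
```

Because `sorted` is stable, pairs that share a seed keep their original relative order inside each group.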


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jordiae/exebench/8b47d483c2e39ea40a304fbaac46e41a1a14272f/requirements.txt


--------------------------------------------------------------------------------
/requirements_examples.txt:
--------------------------------------------------------------------------------
1 | datasets
2 | zstandard


--------------------------------------------------------------------------------