├── .gitignore
├── LICENSE
├── README.md
├── psb2
    ├── __init__.py
    ├── format.py
    └── psb2.py
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
dist/
psb2.egg-info/
BUILD_NOTES.md
*.pyc


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Eclipse Public License - v 2.0

    THE ACCOMPANYING PROGRAM IS PROVIDED UNDER THE TERMS OF THIS ECLIPSE
    PUBLIC LICENSE ("AGREEMENT"). ANY USE, REPRODUCTION OR DISTRIBUTION
    OF THE PROGRAM CONSTITUTES RECIPIENT'S ACCEPTANCE OF THIS AGREEMENT.

1. DEFINITIONS

"Contribution" means:

  a) in the case of the initial Contributor, the initial content
     Distributed under this Agreement, and

  b) in the case of each subsequent Contributor:
     i) changes to the Program, and
     ii) additions to the Program;
  where such changes and/or additions to the Program originate from
  and are Distributed by that particular Contributor. A Contribution
  "originates" from a Contributor if it was added to the Program by
  such Contributor itself or anyone acting on such Contributor's behalf.
  Contributions do not include changes or additions to the Program that
  are not Modified Works.

"Contributor" means any person or entity that Distributes the Program.

"Licensed Patents" mean patent claims licensable by a Contributor which
are necessarily infringed by the use or sale of its Contribution alone
or when combined with the Program.

"Program" means the Contributions Distributed in accordance with this
Agreement.

"Recipient" means anyone who receives the Program under this Agreement
or any Secondary License (as applicable), including Contributors.

"Derivative Works" shall mean any work, whether in Source Code or other
form, that is based on (or derived from) the Program and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship.

"Modified Works" shall mean any work in Source Code or other form that
results from an addition to, deletion from, or modification of the
contents of the Program, including, for purposes of clarity any new file
in Source Code form that contains any contents of the Program. Modified
Works shall not include works that contain only declarations,
interfaces, types, classes, structures, or files of the Program solely
in each case in order to link to, bind by name, or subclass the Program
or Modified Works thereof.

"Distribute" means the acts of a) distributing or b) making available
in any manner that enables the transfer of a copy.

"Source Code" means the form of a Program preferred for making
modifications, including but not limited to software source code,
documentation source, and configuration files.

"Secondary License" means either the GNU General Public License,
Version 2.0, or any later versions of that license, including any
exceptions or additional permissions as identified by the initial
Contributor.

2. GRANT OF RIGHTS

  a) Subject to the terms of this Agreement, each Contributor hereby
  grants Recipient a non-exclusive, worldwide, royalty-free copyright
  license to reproduce, prepare Derivative Works of, publicly display,
  publicly perform, Distribute and sublicense the Contribution of such
  Contributor, if any, and such Derivative Works.

  b) Subject to the terms of this Agreement, each Contributor hereby
  grants Recipient a non-exclusive, worldwide, royalty-free patent
  license under Licensed Patents to make, use, sell, offer to sell,
  import and otherwise transfer the Contribution of such Contributor,
  if any, in Source Code or other form. This patent license shall
  apply to the combination of the Contribution and the Program if, at
  the time the Contribution is added by the Contributor, such addition
  of the Contribution causes such combination to be covered by the
  Licensed Patents. The patent license shall not apply to any other
  combinations which include the Contribution. No hardware per se is
  licensed hereunder.

  c) Recipient understands that although each Contributor grants the
  licenses to its Contributions set forth herein, no assurances are
  provided by any Contributor that the Program does not infringe the
  patent or other intellectual property rights of any other entity.
  Each Contributor disclaims any liability to Recipient for claims
  brought by any other entity based on infringement of intellectual
  property rights or otherwise. As a condition to exercising the
  rights and licenses granted hereunder, each Recipient hereby
  assumes sole responsibility to secure any other intellectual
  property rights needed, if any. For example, if a third party
  patent license is required to allow Recipient to Distribute the
  Program, it is Recipient's responsibility to acquire that license
  before distributing the Program.

  d) Each Contributor represents that to its knowledge it has
  sufficient copyright rights in its Contribution, if any, to grant
  the copyright license set forth in this Agreement.

  e) Notwithstanding the terms of any Secondary License, no
  Contributor makes additional grants to any Recipient (other than
  those set forth in this Agreement) as a result of such Recipient's
  receipt of the Program under the terms of a Secondary License
  (if permitted under the terms of Section 3).

3. REQUIREMENTS

3.1 If a Contributor Distributes the Program in any form, then:

  a) the Program must also be made available as Source Code, in
  accordance with section 3.2, and the Contributor must accompany
  the Program with a statement that the Source Code for the Program
  is available under this Agreement, and informs Recipients how to
  obtain it in a reasonable manner on or through a medium customarily
  used for software exchange; and

  b) the Contributor may Distribute the Program under a license
  different than this Agreement, provided that such license:
     i) effectively disclaims on behalf of all other Contributors all
     warranties and conditions, express and implied, including
     warranties or conditions of title and non-infringement, and
     implied warranties or conditions of merchantability and fitness
     for a particular purpose;

     ii) effectively excludes on behalf of all other Contributors all
     liability for damages, including direct, indirect, special,
     incidental and consequential damages, such as lost profits;

     iii) does not attempt to limit or alter the recipients' rights
     in the Source Code under section 3.2; and

     iv) requires any subsequent distribution of the Program by any
     party to be under a license that satisfies the requirements
     of this section 3.

3.2 When the Program is Distributed as Source Code:

  a) it must be made available under this Agreement, or if the
  Program (i) is combined with other material in a separate file or
  files made available under a Secondary License, and (ii) the initial
  Contributor attached to the Source Code the notice described in
  Exhibit A of this Agreement, then the Program may be made available
  under the terms of such Secondary Licenses, and

  b) a copy of this Agreement must be included with each copy of
  the Program.

3.3 Contributors may not remove or alter any copyright, patent,
trademark, attribution notices, disclaimers of warranty, or limitations
of liability ("notices") contained within the Program from any copy of
the Program which they Distribute, provided that Contributors may add
their own appropriate notices.

4. COMMERCIAL DISTRIBUTION

Commercial distributors of software may accept certain responsibilities
with respect to end users, business partners and the like. While this
license is intended to facilitate the commercial use of the Program,
the Contributor who includes the Program in a commercial product
offering should do so in a manner which does not create potential
liability for other Contributors. Therefore, if a Contributor includes
the Program in a commercial product offering, such Contributor
("Commercial Contributor") hereby agrees to defend and indemnify every
other Contributor ("Indemnified Contributor") against any losses,
damages and costs (collectively "Losses") arising from claims, lawsuits
and other legal actions brought by a third party against the Indemnified
Contributor to the extent caused by the acts or omissions of such
Commercial Contributor in connection with its distribution of the Program
in a commercial product offering.
The obligations in this section do not
apply to any claims or Losses relating to any actual or alleged
intellectual property infringement. In order to qualify, an Indemnified
Contributor must: a) promptly notify the Commercial Contributor in
writing of such claim, and b) allow the Commercial Contributor to control,
and cooperate with the Commercial Contributor in, the defense and any
related settlement negotiations. The Indemnified Contributor may
participate in any such claim at its own expense.

For example, a Contributor might include the Program in a commercial
product offering, Product X. That Contributor is then a Commercial
Contributor. If that Commercial Contributor then makes performance
claims, or offers warranties related to Product X, those performance
claims and warranties are such Commercial Contributor's responsibility
alone. Under this section, the Commercial Contributor would have to
defend claims against the other Contributors related to those performance
claims and warranties, and if a court requires any other Contributor to
pay any damages as a result, the Commercial Contributor must pay
those damages.

5. NO WARRANTY

EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, AND TO THE EXTENT
PERMITTED BY APPLICABLE LAW, THE PROGRAM IS PROVIDED ON AN "AS IS"
BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR
IMPLIED INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF
TITLE, NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
Each Recipient is solely responsible for determining the
appropriateness of using and distributing the Program and assumes all
risks associated with its exercise of rights under this Agreement,
including but not limited to the risks and costs of program errors,
compliance with applicable laws, damage to or loss of data, programs
or equipment, and unavailability or interruption of operations.

6. DISCLAIMER OF LIABILITY

EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, AND TO THE EXTENT
PERMITTED BY APPLICABLE LAW, NEITHER RECIPIENT NOR ANY CONTRIBUTORS
SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING WITHOUT LIMITATION LOST
PROFITS), HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OR DISTRIBUTION OF THE PROGRAM OR THE
EXERCISE OF ANY RIGHTS GRANTED HEREUNDER, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

7. GENERAL

If any provision of this Agreement is invalid or unenforceable under
applicable law, it shall not affect the validity or enforceability of
the remainder of the terms of this Agreement, and without further
action by the parties hereto, such provision shall be reformed to the
minimum extent necessary to make such provision valid and enforceable.

If Recipient institutes patent litigation against any entity
(including a cross-claim or counterclaim in a lawsuit) alleging that the
Program itself (excluding combinations of the Program with other software
or hardware) infringes such Recipient's patent(s), then such Recipient's
rights granted under Section 2(b) shall terminate as of the date such
litigation is filed.

All Recipient's rights under this Agreement shall terminate if it
fails to comply with any of the material terms or conditions of this
Agreement and does not cure such failure in a reasonable period of
time after becoming aware of such noncompliance. If all Recipient's
rights under this Agreement terminate, Recipient agrees to cease use
and distribution of the Program as soon as reasonably practicable.
However, Recipient's obligations under this Agreement and any licenses
granted by Recipient relating to the Program shall continue and survive.

Everyone is permitted to copy and distribute copies of this Agreement,
but in order to avoid inconsistency the Agreement is copyrighted and
may only be modified in the following manner. The Agreement Steward
reserves the right to publish new versions (including revisions) of
this Agreement from time to time. No one other than the Agreement
Steward has the right to modify this Agreement. The Eclipse Foundation
is the initial Agreement Steward. The Eclipse Foundation may assign the
responsibility to serve as the Agreement Steward to a suitable separate
entity. Each new version of the Agreement will be given a distinguishing
version number. The Program (including Contributions) may always be
Distributed subject to the version of the Agreement under which it was
received. In addition, after a new version of the Agreement is published,
Contributor may elect to Distribute the Program (including its
Contributions) under the new version.

Except as expressly stated in Sections 2(a) and 2(b) above, Recipient
receives no rights or licenses to the intellectual property of any
Contributor under this Agreement, whether expressly, by implication,
estoppel or otherwise. All rights in the Program not expressly granted
under this Agreement are reserved.
Nothing in this Agreement is intended
to be enforceable by any entity that is not a Contributor or Recipient.
No third-party beneficiary rights are created under this Agreement.

Exhibit A - Form of Secondary Licenses Notice

"This Source Code may also be made available under the following
Secondary Licenses when the conditions for such availability set forth
in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public
License as published by the Free Software Foundation, either version 2
of the License, or (at your option) any later version, with the GNU
Classpath Exception which is available at
https://www.gnu.org/software/classpath/license.html."

  Simply including a copy of this Agreement, including this Exhibit A
  is not sufficient to license the Source Code under Secondary Licenses.

  If it is not possible or desirable to put the notice in a particular
  file, then You may include the notice in a location (such as a LICENSE
  file in a relevant directory) where a recipient would be likely to
  look for such a notice.

  You may add additional accurate notices of copyright ownership.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# PSB2 - Python Sampling Library

A Python library for fetching and sampling training and test data for experimenting with the program synthesis dataset PSB2. The library will automatically download datasets to the given location, and will cache them to avoid repeated downloads.

## Installation

Easily installed using `pip`:

```text
pip install psb2
```

## Usage

The library's main interface consists of one constant and one function.
`psb2.PROBLEMS` is the list of all problems in the benchmark suite as strings:

```python
>>> import psb2
>>> psb2.PROBLEMS
['basement', 'bouncing-balls', 'bowling', 'camel-case', 'coin-sums', 'cut-vector', 'dice-game', 'find-pair', 'fizz-buzz', 'fuel-cost', 'gcd', 'indices-of-substring', 'leaders', 'luhn', 'mastermind', 'middle-character', 'paired-digits', 'shopping-list', 'snow-day', 'solve-boolean', 'spin-words', 'square-digits', 'substitution-cipher', 'twitter', 'vector-distance']
```

The `fetch_examples` function downloads (if necessary) and samples training and test data for a specific problem in PSB2:

```python
>>> import psb2
>>> (train_data, test_data) = psb2.fetch_examples("path/to/PSB2/datasets/", "fizz-buzz", 200, 2000)
>>> train_data
[{'input1': 1, 'output1': '1'},
 {'input1': 2, 'output1': '2'},
 {'input1': 3, 'output1': 'Fizz'},
 {'input1': 4, 'output1': '4'},
 ...
 {'input1': 405919, 'output1': '405919'},
 {'input1': 405789, 'output1': 'Fizz'}]
```

Or, if you'd like your test cases in a different format, you can supply an optional `format` argument:

```python
>>> import psb2
>>> (train_data, test_data) = psb2.fetch_examples("path/to/PSB2/datasets/", "fizz-buzz", 200, 2000, format='lists')
>>> train_data
[([1], ['1']),
 ([2], ['2']),
 ([3], ['Fizz']),
 ([4], ['4']),
 ...
 ([405919], ['405919']),
 ([405789], ['Fizz'])]
```

```python
>>> import psb2
>>> (train_data, test_data) = psb2.fetch_examples("path/to/PSB2/datasets/", "fizz-buzz", 200, 2000, format='competitive')
>>> train_data
[(['1'], ['1']),
 (['2'], ['2']),
 (['3'], ['Fizz']),
 (['4'], ['4']),
 ...
 (['405919'], ['405919']),
 (['405789'], ['Fizz'])]
```

In the default `psb2` format, each example in the returned `train_data` and `test_data` lists is a dictionary containing one key for each input and each output. `train_data` includes all defined edge cases for a problem, as well as enough randomly generated cases to fill the training set (200 in the example above). `test_data` is a sample of `n_test` cases (2000 in the example above) drawn from the randomly generated cases.

## Citation

If you use these datasets in a publication, please cite the paper *PSB2: The Second Program Synthesis Benchmark Suite* and include a link to this repository.

BibTeX entry for paper:

```bibtex
@InProceedings{Helmuth:2021:GECCO,
  author = "Thomas Helmuth and Peter Kelly",
  title = "{PSB2}: The Second Program Synthesis Benchmark Suite",
  booktitle = "2021 Genetic and Evolutionary Computation Conference",
  series = {GECCO '21},
  year = "2021",
  isbn13 = {978-1-4503-8350-9},
  address = {Lille, France},
  size = {10 pages},
  doi = {10.1145/3449639.3459285},
  publisher = {ACM},
  publisher_address = {New York, NY, USA},
  month = {10-14} # jul,
  doi-url = {https://doi.org/10.1145/3449639.3459285},
  URL = {https://dl.acm.org/doi/10.1145/3449639.3459285},
}
```

## License

Copyright © 2021 Thomas Helmuth

This program and the accompanying materials are made available under the
terms of the Eclipse Public License 2.0 which is available at
http://www.eclipse.org/legal/epl-2.0.
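
## Sampling sketch

The sampling rule described under Usage can be tried without downloading anything. What follows is a minimal, self-contained approximation that runs on made-up in-memory data. `sample_train_test`, `fizz_buzz`, and the toy lists are hypothetical names for illustration and are not part of the `psb2` API. The sketch uses a local `random.Random` instance, whereas the library seeds the global `random` module.

```python
import random


def fizz_buzz(n):
    """Reference fizz-buzz output, used only to build toy data."""
    if n % 15 == 0:
        return 'FizzBuzz'
    if n % 3 == 0:
        return 'Fizz'
    if n % 5 == 0:
        return 'Buzz'
    return str(n)


def sample_train_test(edge_cases, random_cases, n_train, n_test, seed=None):
    """Hypothetical helper mirroring the sampling rule; not part of psb2."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    if n_train < len(edge_cases):
        # Fewer training cases requested than edge cases: sample edge cases only
        train = rng.sample(edge_cases, n_train)
    else:
        # Keep every edge case, then top up with randomly generated cases
        train = list(edge_cases)
        train.extend(rng.sample(random_cases, n_train - len(edge_cases)))
    # Test cases come only from the randomly generated cases
    test = rng.sample(random_cases, n_test)
    return train, test


# Toy stand-ins for the downloaded edge and random datasets (made up)
edge = [{'input1': n, 'output1': fizz_buzz(n)} for n in (1, 3, 5, 15)]
rand = [{'input1': n, 'output1': fizz_buzz(n)} for n in range(100, 200)]

train, test = sample_train_test(edge, rand, n_train=10, n_test=20, seed=0)
print(len(train), len(test))  # 10 20
print(train[:4] == edge)      # True: all edge cases come first
```

Because the training top-up and the test set are drawn independently from the same pool of random cases, a random case can appear in both sets in this sketch.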

--------------------------------------------------------------------------------
/psb2/__init__.py:
--------------------------------------------------------------------------------
from .format import format_test_case

from .psb2 import (
    fetch_examples,
    get_problem_names,
    PROBLEMS
)


--------------------------------------------------------------------------------
/psb2/format.py:
--------------------------------------------------------------------------------
import itertools

def objects2lines(objects):
    """Convert objects to text lines; a list becomes a length line plus an elements line."""
    lines = []
    for obj in objects:
        if isinstance(obj, list):
            lines.append(str(len(obj)))
            lines.append(' '.join(map(str, obj)))
        else:
            lines.append(str(obj))
    return lines

def disnumerate_prefix(data, prefix):
    """Collect data[prefix + '1'], data[prefix + '2'], ... until a key is missing."""
    lines = []
    for line_number in itertools.count(start=1):
        try:
            lines.append(data[f'{prefix}{line_number}'])
        except KeyError:
            break
    return lines

def format_test_case(test_case, format):
    """
    Represents a test case in one of 3 formats: 'psb2', 'lists' or 'competitive'

    'psb2' format is a dictionary with keys 'input{i}' and 'output{i}'
    indicating ith input and output, respectively.
    Ex: {'input1': 1, 'input2': 2, 'output1': 'output'}
    'lists' format is a tuple of the form (inputs, outputs)
    where inputs and outputs are lists of objects.
    Ex: ([1, 2], ['output'])
    'competitive' format is the de facto standard format in competitive programming.
    It has the same structure as 'lists', but all inputs and outputs are strings
    corresponding to input and output lines.
    In addition, every list input or output is represented by 2 lines:
    its length, followed by its space-separated elements.
    """

    if format == 'psb2':
        return test_case
    else:
        input_objs, output_objs = (disnumerate_prefix(test_case, prefix)
                                   for prefix in ('input', 'output'))

        if format == 'lists':
            return (input_objs, output_objs)
        elif format == 'competitive':
            return (objects2lines(input_objs), objects2lines(output_objs))

    raise ValueError(f'Unknown format: {format}')


--------------------------------------------------------------------------------
/psb2/psb2.py:
--------------------------------------------------------------------------------
"""
PSB2 - The Second Program Synthesis Benchmark Suite
"""

import json
import os
import random

import requests

from .format import format_test_case

PROBLEMS = ["basement",
            "bouncing-balls",
            "bowling",
            "camel-case",
            "coin-sums",
            "cut-vector",
            "dice-game",
            "find-pair",
            "fizz-buzz",
            "fuel-cost",
            "gcd",
            "indices-of-substring",
            "leaders",
            "luhn",
            "mastermind",
            "middle-character",
            "paired-digits",
            "shopping-list",
            "snow-day",
            "solve-boolean",
            "spin-words",
            "square-digits",
            "substitution-cipher",
            "twitter",
            "vector-distance"]

def load_json_lines(filename):
    """Load examples from a file. Expects the file to have one JSON object per line."""

    data = []
    with open(filename) as f:
        for line in f:
            example = json.loads(line)
            data.append(example)

    return data

def fetch_and_possibly_cache_data(datasets_directory, problem_name, edge_or_random):
    """Helper function for fetch_examples that does the following for edge or
    random dataset:
        1. Checks if JSON file for dataset is already downloaded.
        2. If not, downloads the dataset file to the specified location.
        3. Loads and returns list of the data from the dataset file.
    """

    # Make directory path and file path
    directory_path = os.path.join(datasets_directory, "datasets", problem_name)
    file_path = os.path.join(directory_path, "{}-{}.json".format(problem_name, edge_or_random))

    # Make directories if necessary
    os.makedirs(directory_path, exist_ok=True)

    # 1. Check if JSON file already exists
    if not os.path.isfile(file_path):
        # Make URL
        problem_url = "{}/{}-{}.json".format(problem_name, problem_name, edge_or_random)
        s3_url = "https://psb2-datasets.s3.amazonaws.com/PSB2/datasets/{}".format(problem_url)

        # 2. Download dataset file
        fetched_data = requests.get(s3_url)
        # Fail on an HTTP error instead of caching the error response as a dataset
        fetched_data.raise_for_status()
        with open(file_path, 'wb') as data_file:
            data_file.write(fetched_data.content)

    # 3. Load and return dataset
    dataset = load_json_lines(file_path)
    return dataset


def fetch_examples(datasets_directory, problem_name, n_train, n_test, format='psb2', seed=None):
    """Downloads, fetches, and returns training and test data from a PSB2 problem.
    Caches downloaded datasets in `datasets_directory` to avoid multiple downloads.
    Returns a tuple of the form (training_examples, testing_examples)
    where training_examples and testing_examples are lists of training and test
    data. In the default 'psb2' format, the elements of these lists are
    dictionaries of the form:
    {'input1': first_input, 'input2': second_input, ..., 'output1': first_output, ...}
    The training examples will include all hard-coded edge cases included in the suite,
    along with enough random cases to include `n_train` cases. The test examples
    will include a random sample of the random cases.
    Note that this function downloads and loads large datasets and can
    be slow, up to 1 minute.
    Parameters:
        `datasets_directory` - Location to download the PSB2 datasets.
        `problem_name` - Name of the PSB2 problem, lowercase and separated by dashes.
            - Ex: indices-of-substring
        `n_train` - Number of training cases to return
        `n_test` - Number of test cases to return
        `format` - 'psb2', 'lists' or 'competitive'
        `seed` - Seed for random. Uses default random.seed behavior if no value is provided"""

    # Cannot sample more than 1 million examples for train or test
    assert n_train < 1000000, "Cannot sample more than 1 million examples"
    assert n_test < 1000000, "Cannot sample more than 1 million examples"

    # Load data
    edge_data = fetch_and_possibly_cache_data(datasets_directory, problem_name, "edge")
    random_data = fetch_and_possibly_cache_data(datasets_directory, problem_name, "random")

    # Seed RNG source
    random.seed(seed)

    # Make training and test sets
    if n_train < len(edge_data):
        train = random.sample(edge_data, n_train)
    else:
        train = edge_data
        train.extend(random.sample(random_data, n_train - len(edge_data)))

    test = random.sample(random_data, n_test)

    train = [format_test_case(test_case, format) for test_case in train]
    test = [format_test_case(test_case, format) for test_case in test]

    return (train, test)

def get_problem_names():
    """Returns a list of strings of the problem names in PSB2."""
    return PROBLEMS


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup

# read the contents of README file
from os import path
this_directory = path.abspath(path.dirname(__file__))
with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
    readme = f.read()

setup(
    name='psb2',
    version='1.1.1',
    description='Utilities for sampling the datasets of PSB2.',
    author='Thomas Helmuth and contributors',
    author_email='thelmuth@hamilton.edu',
    url='https://github.com/thelmuth/psb2-python',
    project_urls={
        "More information": "https://cs.hamilton.edu/~thelmuth/PSB2/PSB2.html",
        "Dataset archive": "https://zenodo.org/record/4678739",
    },
    long_description = readme,
    long_description_content_type = "text/markdown",
    license='Eclipse Public License 2.0 (EPL-2.0)',
    packages=["psb2"],
    install_requires=["requests"],
    classifiers=[
        'Programming Language :: Python :: 3',
        'Intended Audience :: Developers',
        'Intended Audience :: Information Technology',
        'Intended Audience :: Science/Research',
        'License :: OSI Approved :: Eclipse Public License 2.0 (EPL-2.0)'
    ],
)


--------------------------------------------------------------------------------