├── LICENSE
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Tim Yang
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # A Collection of Datasets for Big Code Analysis
2 |
3 | A collection of datasets (and other resources) for [big code analysis](https://ml4code.github.io/papers.html).
4 |
5 | If you want to contribute to this list, please send a pull request.
6 |
7 | ## Datasets
8 |
9 | | Name | Description | Tag | Language | Link |
10 | | ------------- | ------------------------------------------------------------ | ------------------------------------ | ----------------- | ------------------------------------------------------------ |
11 | | CodeSearchNet | Dataset and benchmarks for code retrieval using natural language | Code Retrieval, NLP | Multiple (Python) | [link](https://github.com/github/CodeSearchNet) |
12 | | PY150 | 150k Python programs and corresponding abstract syntax trees, released by OOPSLA'16 _Probabilistic Model for Code with Decision Trees_ | General | Python | [link](https://www.sri.inf.ethz.ch/py150) |
13 | | OJ-104 | Code from an online judge system, consisting of 104 classes of C programs, released by AAAI'16 _Convolutional Neural Networks over Tree Structures for Programming Language Processing_ | Code Classification, Clone Detection | C | [link](https://sites.google.com/site/treebasedcnn/), also used in [ASTNN](https://github.com/zhangj111/astnn) |
14 | | code2seq | Dataset released with _code2seq_ (ICLR'19), _code2vec_ (POPL'19), etc. | Code Completion | Java, C# | [link](https://github.com/tech-srl/code2seq#datasets) |
15 | | BigCloneBench | A clone detection benchmark of known clones in the IJaDataset source repository | Clone Detection | Java | [link](https://github.com/clonebench/BigCloneBench) |
16 | | Google Code Jam | Projects collected from the Google Code Jam competition | Clone Detection | Java | [link](https://github.com/parasol-aser/deepsim/tree/master/dataset) |
17 | | CodeChef | Program classification dataset released on Kaggle | Code Classification | Java | [link](https://www.kaggle.com/arjoonn/codechef-competitive-programming) |
18 | | OOPSLA19Li | Dataset released by OOPSLA'19 _Improving Bug Detection via Context-based Code Representation Learning and Attention-based Neural Networks_ | Bug Detection | Java | [link](https://github.com/OOPSLA-2019-BugDetection/OOPSLA-2019-BugDetection) |
19 | | Devign | Dataset released by NeurIPS'19 *Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks* | Vulnerability Identification | C | [link](https://sites.google.com/view/devign) |
20 | | Draper | Source code of 1.27 million functions mined from open source software, labelled by static analysis for potential vulnerabilities, released by ICMLA'18 _Automated Vulnerability Detection in Source Code Using Deep Representation Learning_ | Vulnerability Identification | C | [link](https://osf.io/d45bw/) |
21 | | VulDeePecker | Code Gadget Database (CGD), released by NDSS'18 _VulDeePecker: A Deep Learning-Based System for Vulnerability Detection_ | Vulnerability Detection | C/C++ | [link](https://github.com/CGCL-codes/VulDeePecker) |
22 | | SySeVR | The Semantics-based Vulnerability Candidate (SeVC) dataset released by arXiv'18 _SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities_ | Vulnerability Detection | C | [link](https://github.com/SySeVR/SySeVR) |
23 | | Seahymn | Vulnerable functions from 9 open-source software projects | Vulnerability Detection | C | [link](https://github.com/Seahymn2019/Function-level-Vulnerability-Dataset) |
24 | | Big-Vul | A C/C++ code vulnerability dataset with code changes and CVE summaries | Vulnerability Detection | C/C++ | [link](https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset) |
25 | | RAISE19Ferenc | Dataset released by RAISE'19 *Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions* | Vulnerability Detection | JavaScript | [link](http://www.inf.u-szeged.hu/~ferenc/papers/JSVulnerabilityDataSet/) |
26 | | D2A | Differential analysis dataset released by ICSE-SEIP'21 *D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis* | Vulnerability Detection | C/C++ | [link](https://github.com/IBM/D2A) |
27 | | TypeWriter | Dataset released by FSE'20 *TypeWriter: Neural Type Prediction with Search-based Validation* | Type Inference | Python | [link](http://software-lab.org/projects/TypeWriter/data.tar.gz) |
28 | | DeepTyper | Dataset released by FSE'18 *Deep Learning Type Inference* | Type Inference | JavaScript | [link](https://github.com/DeepTyper/DeepTyper/blob/master/data/repo-SHAs.txt) |
29 | | Typilus | Dataset released by PLDI'20 *Typilus: Neural Type Hints* | Type Inference | Python | [link](https://github.com/typilus/typilus/blob/master/src/data_preparation/metadata/popularLibs.txt) |
30 |
31 | ## Resources
32 | - [[CSUR'18] A Survey of Machine Learning for Big Code and Naturalness](https://ml4code.github.io/papers.html)
33 | - [[CSUR'20] Deep Learning for Source Code Modeling and Generation: Models, Applications, and Challenges](https://dl.acm.org/doi/10.1145/3383458)
34 | - [Awesome Machine Learning on Source Code](https://github.com/src-d/awesome-machine-learning-on-source-code)
35 |
--------------------------------------------------------------------------------