├── LICENSE
└── README.md


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Tim Yang
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # A Collection of Datasets for Big Code Analysis
 2 | 
 3 | A collection of datasets (and other resources) for [big code analysis](https://ml4code.github.io/papers.html).
 4 | 
 5 | If you want to contribute to this list, please send a pull request.
 6 | 
 7 | ## Datasets
 8 | 
 9 | | Name          | Description                                                  | Tag                                  | Language          | Link                                                         |
10 | | ------------- | ------------------------------------------------------------ | ------------------------------------ | ----------------- | ------------------------------------------------------------ |
11 | | CodeSearchNet | Dataset and benchmarks for code retrieval using natural language | Code Retrieval, NLP                  | Multiple (Python) | [link](https://github.com/github/CodeSearchNet)              |
12 | | PY150         | 150k Python programs and corresponding abstract syntax trees, released by OOPSLA'16 _Probabilistic Model for Code with Decision Trees_                                | General                              | Python            | [link](https://www.sri.inf.ethz.ch/py150)                    |
13 | | OJ-104            | Code from an online judge system, consisting of 104 classes of C programs, released by AAAI'16 _Convolutional Neural Networks over Tree Structures for Programming Language Processing_ | Code Classification, Clone Detection | C                 | [link](https://sites.google.com/site/treebasedcnn/), also used in [ASTNN](https://github.com/zhangj111/astnn)                   |
14 | | code2seq      | Datasets released with the papers _code2vec_ (POPL'19), _code2seq_ (ICLR'19), etc.   | Code Completion                      | Java, C#          | [link](https://github.com/tech-srl/code2seq#datasets)        |
15 | | BigCloneBench | A clone detection benchmark of known clones in the IJaDataset source repository. | Clone Detection                      | Java              | [link](https://github.com/clonebench/BigCloneBench)          |
16 | | Google Code Jam | Projects collected from the Google Code Jam competition. | Clone Detection                      | Java              | [link](https://github.com/parasol-aser/deepsim/tree/master/dataset)          |
17 | | CodeChef      | Program classification dataset released on Kaggle        | Code Classification                  | Java              | [link](https://www.kaggle.com/arjoonn/codechef-competitive-programming) |
18 | | OOPSLA19Li    | Dataset released by OOPSLA'19 _Improving Bug Detection via Context-based Code Representation Learning and Attention-based Neural Networks_ | Bug Detection                        | Java              | [link](https://github.com/OOPSLA-2019-BugDetection/OOPSLA-2019-BugDetection) |
19 | | Devign        | Dataset released by NeurIPS'19 *Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks* | Vulnerability Identification         | C              | [link](https://sites.google.com/view/devign)                 |
20 | | Draper        | The dataset consists of the source code of 1.27 million functions mined from open source software, labelled by static analysis for potential vulnerabilities. The dataset is released by ICMLA'18 _Automated Vulnerability Detection in Source Code Using Deep Representation Learning_  | Vulnerability Identification         | C               | [link](https://osf.io/d45bw/)                 |
21 | | VulDeePecker | Code Gadget Database (CGD) dataset. Dataset released by NDSS'18 _VulDeePecker: A Deep Learning-Based System for Vulnerability Detection_ | Vulnerability Detection | C/C++ | [link](https://github.com/CGCL-codes/VulDeePecker)  |
22 | | SySeVR |  The Semantics-based Vulnerability Candidate (SeVC) dataset released by arXiv'18 _SySeVR: A Framework for Using Deep Learning to Detect Vulnerabilities_ | Vulnerability Detection | C | [link](https://github.com/SySeVR/SySeVR) |
23 | | Seahymn | Vulnerable functions from 9 open-source software projects | Vulnerability Detection | C | [link](https://github.com/Seahymn2019/Function-level-Vulnerability-Dataset) |
24 | | Big-Vul | A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries | Vulnerability Detection | C/C++ | [link](https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset) |
25 | | RAISE19Ferenc | Dataset released by RAISE'19 *Challenging Machine Learning Algorithms in Predicting Vulnerable JavaScript Functions* | Vulnerability Detection | JavaScript | [link](http://www.inf.u-szeged.hu/~ferenc/papers/JSVulnerabilityDataSet/) |
26 | | D2A | Differential Analysis Dataset released by ICSE-SEIP'21 paper *D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis* | Vulnerability Detection | C/C++ | [link](https://github.com/IBM/D2A) |
27 | | TypeWriter | Dataset released by FSE'20 *TypeWriter: Neural Type Prediction with Search-based Validation* | Type Inference | Python | [link](http://software-lab.org/projects/TypeWriter/data.tar.gz) |
28 | | DeepTyper | Dataset released by FSE'18 *Deep Learning Type Inference* | Type Inference | JavaScript | [link](https://github.com/DeepTyper/DeepTyper/blob/master/data/repo-SHAs.txt) |
29 | | Typilus | Dataset released by PLDI'20 paper *Typilus: Neural Type Hints* | Type Inference | Python | [link](https://github.com/typilus/typilus/blob/master/src/data_preparation/metadata/popularLibs.txt) |
30 | 
31 | ## Resources
32 | - [[CSUR'18] A Survey of Machine Learning for Big Code and Naturalness](https://ml4code.github.io/papers.html)
33 | - [[CSUR'20] Deep Learning for Source Code Modeling and Generation: Models, Applications, and Challenges](https://dl.acm.org/doi/10.1145/3383458)
34 | - [Awesome Machine Learning on Source Code](https://github.com/src-d/awesome-machine-learning-on-source-code)
35 | 


--------------------------------------------------------------------------------