└── README.md

/README.md:
--------------------------------------------------------------------------------
# CCSD-benchmark-for-code-summarization
This repo is the benchmark for source code summarization on the C language.


# HGNN
RETRIEVAL-AUGMENTED GENERATION FOR CODE SUMMARIZATION VIA HYBRID GNN

A repository with the data for [the paper](https://openreview.net/pdf?id=zv-typ1gPxA). We share the dataset on [Google Drive](https://drive.google.com/drive/folders/1NMRfcC1VgxjGGfVPrlRUrNSx2SGdtWeW?usp=sharing).


## Pre-Processing Stage
We crawled 300+ projects, such as Linux and redis, to construct our CCSD dataset. In total, we collected 500k+ raw pairs, stored as "all_functions.csv" on Google Drive. The columns are function_name (the name of the extracted function), start_line (the start line number of this function in the *.c file), end_line (the end line number of this function in the *.c file), comment (the comment of the function), start_comment, end_comment, and file_path.

For the summary, we extract the first sentence of the comment delimited by "/**" and "*/". To keep only small functions, we set the token threshold to 150, which leaves 130k+ functions, stored as "filter_functions.csv".

We then perform de-duplication, removing functions with cosine similarity over 80% to keep the dataset as diverse as possible. To speed this up, we encode the raw functions into vectors with sklearn to compute the cosine similarity. After de-duplication, we kept 95k+ pairs, stored as "c_functions_all_data.jsonl.gz".
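The three pre-processing steps above (first-sentence summary, 150-token filter, 80% cosine-similarity de-duplication) can be sketched roughly as follows. This is a minimal illustration, not the script used to build CCSD. The tokenizer, the reading of the threshold as "at most 150 tokens", the greedy keep-first de-duplication order, and all function names are assumptions. Plain-Python bag-of-tokens vectors stand in for the sklearn encoding so that the sketch has no dependencies.

```python
import math
import re
from collections import Counter

MAX_TOKENS = 150      # token threshold from the README (assumed to be an upper bound)
SIM_THRESHOLD = 0.8   # cosine-similarity cutoff from the README


def tokenize(code):
    """Hypothetical tokenizer: identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]", code)


def first_sentence(comment):
    """Strip the /** ... */ decoration and return the first sentence."""
    body = re.sub(r"^\s*/\*+|\*+/\s*$", "", comment.strip())
    # Drop the leading '*' that usually starts each line of a block comment.
    lines = [re.sub(r"^\s*\*\s?", "", line) for line in body.splitlines()]
    text = " ".join(" ".join(lines).split())
    match = re.search(r"(.+?[.!?])(\s|$)", text)
    return match.group(1) if match else text


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def filter_and_dedup(pairs):
    """Keep short functions that are not near-duplicates of an earlier kept one.

    `pairs` is an iterable of (function_source, raw_comment) tuples.
    Returns a list of (function_source, summary) tuples.
    """
    kept, kept_vectors = [], []
    for code, comment in pairs:
        tokens = tokenize(code)
        if len(tokens) > MAX_TOKENS:
            continue  # too long: dropped by the token threshold
        vector = Counter(tokens)
        if any(cosine(vector, seen) > SIM_THRESHOLD for seen in kept_vectors):
            continue  # near-duplicate of a function that was already kept
        kept.append((code, first_sentence(comment)))
        kept_vectors.append(vector)
    return kept


if __name__ == "__main__":
    demo = [
        ("int add(int a, int b) { return a + b; }",
         "/**\n * Add two integers. Returns the sum.\n */"),
        ("int add(int a, int b) { return a + b; }", "/** Duplicate of add. */"),
        ("void reset(struct ctx *c) { c->count = 0; }", "/** Reset the context. */"),
    ]
    for code, summary in filter_and_dedup(demo):
        print(summary, "<-", code)
```

The pairwise loop is quadratic in the number of kept functions, so at the scale of 130k+ functions one would likely vectorize instead, for example with sklearn's `CountVectorizer` and `cosine_similarity` on sparse matrices.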

## Citation
If you find this dataset relevant to your work, please cite our paper:

```
@inproceedings{
liu2021retrievalaugmented,
title={Retrieval-Augmented Generation for Code Summarization via Hybrid {\{}GNN{\}}},
author={Shangqing Liu and Yu Chen and Xiaofei Xie and Jing Kai Siow and Yang Liu},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=zv-typ1gPxA}
}
```
--------------------------------------------------------------------------------