├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── CodeBERT ├── code2nl │ ├── README.md │ ├── bleu.py │ ├── model.py │ └── run.py └── codesearch │ ├── README.md │ ├── mrr.py │ ├── process_data.py │ ├── run_classifier.py │ └── utils.py ├── GraphCodeBERT ├── clonedetection │ ├── README.md │ ├── dataset.zip │ ├── evaluator │ │ ├── answers.txt │ │ ├── evaluator.py │ │ └── predictions.txt │ ├── model.py │ ├── parser │ │ ├── DFG.py │ │ ├── __init__.py │ │ ├── build.py │ │ ├── build.sh │ │ ├── my-languages.so │ │ └── utils.py │ └── run.py ├── codesearch │ ├── README.md │ ├── dataset.zip │ ├── model.py │ ├── parser │ │ ├── DFG.py │ │ ├── __init__.py │ │ ├── build.py │ │ ├── build.sh │ │ ├── my-languages.so │ │ └── utils.py │ └── run.py ├── refinement │ ├── README.md │ ├── bleu.py │ ├── data.zip │ ├── model.py │ ├── parser │ │ ├── DFG.py │ │ ├── __init__.py │ │ ├── build.py │ │ ├── build.sh │ │ ├── my-languages.so │ │ └── utils.py │ └── run.py └── translation │ ├── README.md │ ├── bleu.py │ ├── data.zip │ ├── model.py │ ├── parser │ ├── DFG.py │ ├── __init__.py │ ├── build.py │ ├── build.sh │ ├── my-languages.so │ └── utils.py │ └── run.py ├── LICENSE ├── NOTICE.md ├── README.md └── SECURITY.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Microsoft Open Source Code of Conduct 2 | 3 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 4 | 5 | Resources: 6 | 7 | - [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/) 8 | - [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 9 | - Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | This project welcomes contributions and suggestions. Most contributions require you to 4 | agree to a Contributor License Agreement (CLA) declaring that you have the right to, 5 | and actually do, grant us the rights to use your contribution. For details, visit 6 | https://cla.microsoft.com. 7 | 8 | When you submit a pull request, a CLA-bot will automatically determine whether you need 9 | to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the 10 | instructions provided by the bot. You will only need to do this once across all repositories using our CLA. 11 | 12 | This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). 13 | For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) 14 | or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. -------------------------------------------------------------------------------- /CodeBERT/code2nl/README.md: -------------------------------------------------------------------------------- 1 | # Code Documentation Generation 2 | 3 | This repo provides the code for reproducing the experiments on [CodeSearchNet](https://arxiv.org/abs/1909.09436) dataset for code document generation tasks in six programming languages. 4 | 5 | **!News: We release a new pipeline for this task. The new pipeline only needs 2 p100 GPUs and less training time for Code Documentation Generation. Please refer to the [website](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text).** 6 | 7 | ## Dependency 8 | 9 | - pip install torch==1.4.0 10 | - pip install transformers==2.5.0 11 | - pip install filelock 12 | 13 | ## Data Preprocess 14 | 15 | We clean CodeSearchNet dataset for this task by following steps: 16 | 17 | - Remove comments in the code 18 | - Remove examples that codes cannot be parsed into an abstract syntax tree. 
19 | - Remove examples that #tokens of documents is < 3 or >256 20 | - Remove examples that documents contain special tokens (e.g. or https:...) 21 | - Remove examples that documents are not English. 22 | 23 | Data statistic about the cleaned dataset for code document generation is shown in this Table. We release the cleaned dataset in this [website](https://drive.google.com/open?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h). 24 | 25 | | PL | Training | Dev | Test | 26 | | :--------- | :------: | :----: | :----: | 27 | | Python | 251,820 | 13,914 | 14,918 | 28 | | PHP | 241,241 | 12,982 | 14,014 | 29 | | Go | 167,288 | 7,325 | 8,122 | 30 | | Java | 164,923 | 5,183 | 10,955 | 31 | | JavaScript | 58,025 | 3,885 | 3,291 | 32 | | Ruby | 24,927 | 1,400 | 1,261 | 33 | 34 | 35 | 36 | ## Data Download 37 | 38 | You can download dataset from the [website](https://drive.google.com/open?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h). Or use the following command. 39 | 40 | ```shell 41 | pip install gdown 42 | mkdir data data/code2nl 43 | cd data/code2nl 44 | gdown https://drive.google.com/uc?id=1rd2Tc6oUWBo7JouwexW3ksQ0PaOhUr6h 45 | unzip Cleaned_CodeSearchNet.zip 46 | rm Cleaned_CodeSearchNet.zip 47 | cd ../.. 48 | ``` 49 | 50 | 51 | 52 | ## Fine-Tune 53 | 54 | We fine-tuned the model on 4*P40 GPUs. 55 | 56 | ```shell 57 | cd code2nl 58 | 59 | lang=php #programming language 60 | lr=5e-5 61 | batch_size=64 62 | beam_size=10 63 | source_length=256 64 | target_length=128 65 | data_dir=../data/code2nl/CodeSearchNet 66 | output_dir=model/$lang 67 | train_file=$data_dir/$lang/train.jsonl 68 | dev_file=$data_dir/$lang/valid.jsonl 69 | eval_steps=1000 #400 for ruby, 600 for javascript, 1000 for others 70 | train_steps=50000 #20000 for ruby, 30000 for javascript, 50000 for others 71 | pretrained_model=microsoft/codebert-base #Roberta: roberta-base 72 | 73 | python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --train_steps $train_steps --eval_steps $eval_steps 74 | ``` 75 | 76 | 77 | 78 | ## Inference and Evaluation 79 | 80 | After fine-tuning, inference and evaluation are as follows: 81 | 82 | ```shell 83 | lang=php #programming language 84 | beam_size=10 85 | batch_size=128 86 | source_length=256 87 | target_length=128 88 | output_dir=model/$lang 89 | data_dir=../data/code2nl/CodeSearchNet 90 | dev_file=$data_dir/$lang/valid.jsonl 91 | test_file=$data_dir/$lang/test.jsonl 92 | test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test 93 | 94 | python run.py --do_test --model_type roberta --model_name_or_path microsoft/codebert-base --load_model_path $test_model --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size 95 | ``` 96 | 97 | The results on CodeSearchNet are shown in this Table: 98 | 99 | | Model | Ruby | Javascript | Go | Python | Java | PHP | Overall | 100 | | ----------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: | 101 | | Seq2Seq | 9.64 | 10.21 | 13.98 | 15.93 | 15.09 | 21.08 | 14.32 | 102 | | Transformer | 11.18 | 11.59 | 16.38 | 15.81 | 16.26 | 22.12 | 15.56 | 103 | | RoBERTa | 11.17 | 11.90 | 17.72 | 18.14 | 
16.47 | 24.02 | 16.57 | 104 | | CodeBERT | **12.16** | **14.90** | **18.07** | **19.06** | **17.65** | **25.16** | **17.83** | 105 | 106 | 107 | -------------------------------------------------------------------------------- /CodeBERT/code2nl/bleu.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | ''' 4 | This script was adapted from the original version by hieuhoang1972 which is part of MOSES. 5 | ''' 6 | 7 | # $Id: bleu.py 1307 2007-03-14 22:22:36Z hieuhoang1972 $ 8 | 9 | '''Provides: 10 | 11 | cook_refs(refs, n=4): Transform a list of reference sentences as strings into a form usable by cook_test(). 12 | cook_test(test, refs, n=4): Transform a test sentence as a string (together with the cooked reference sentences) into a form usable by score_cooked(). 13 | score_cooked(alltest, n=4): Score a list of cooked test sentences. 14 | 15 | score_set(s, testid, refids, n=4): Interface with dataset.py; calculate BLEU score of testid against refids. 16 | 17 | The reason for breaking the BLEU computation into three phases cook_refs(), cook_test(), and score_cooked() is to allow the caller to calculate BLEU scores for multiple test sets as efficiently as possible. 18 | ''' 19 | 20 | import sys, math, re, xml.sax.saxutils 21 | import subprocess 22 | import os 23 | 24 | # Added to bypass NIST-style pre-processing of hyp and ref files -- wade 25 | nonorm = 0 26 | 27 | preserve_case = False 28 | eff_ref_len = "shortest" 29 | 30 | normalize1 = [ 31 | ('', ''), # strip "skipped" tags 32 | (r'-\n', ''), # strip end-of-line hyphenation and join lines 33 | (r'\n', ' '), # join lines 34 | # (r'(\d)\s+(?=\d)', r'\1'), # join digits 35 | ] 36 | normalize1 = [(re.compile(pattern), replace) for (pattern, replace) in normalize1] 37 | 38 | normalize2 = [ 39 | (r'([\{-\~\[-\` -\&\(-\+\:-\@\/])',r' \1 '), # tokenize punctuation. apostrophe is missing 40 | (r'([^0-9])([\.,])',r'\1 \2 '), # tokenize period and comma unless preceded by a digit 41 | (r'([\.,])([^0-9])',r' \1 \2'), # tokenize period and comma unless followed by a digit 42 | (r'([0-9])(-)',r'\1 \2 ') # tokenize dash when preceded by a digit 43 | ] 44 | normalize2 = [(re.compile(pattern), replace) for (pattern, replace) in normalize2] 45 | 46 | def normalize(s): 47 | '''Normalize and tokenize text. 
This is lifted from NIST mteval-v11a.pl.''' 48 | # Added to bypass NIST-style pre-processing of hyp and ref files -- wade 49 | if (nonorm): 50 | return s.split() 51 | if type(s) is not str: 52 | s = " ".join(s) 53 | # language-independent part: 54 | for (pattern, replace) in normalize1: 55 | s = re.sub(pattern, replace, s) 56 | s = xml.sax.saxutils.unescape(s, {'"':'"'}) 57 | # language-dependent part (assuming Western languages): 58 | s = " %s " % s 59 | if not preserve_case: 60 | s = s.lower() # this might not be identical to the original 61 | for (pattern, replace) in normalize2: 62 | s = re.sub(pattern, replace, s) 63 | return s.split() 64 | 65 | def count_ngrams(words, n=4): 66 | counts = {} 67 | for k in range(1,n+1): 68 | for i in range(len(words)-k+1): 69 | ngram = tuple(words[i:i+k]) 70 | counts[ngram] = counts.get(ngram, 0)+1 71 | return counts 72 | 73 | def cook_refs(refs, n=4): 74 | '''Takes a list of reference sentences for a single segment 75 | and returns an object that encapsulates everything that BLEU 76 | needs to know about them.''' 77 | 78 | refs = [normalize(ref) for ref in refs] 79 | maxcounts = {} 80 | for ref in refs: 81 | counts = count_ngrams(ref, n) 82 | for (ngram,count) in counts.items(): 83 | maxcounts[ngram] = max(maxcounts.get(ngram,0), count) 84 | return ([len(ref) for ref in refs], maxcounts) 85 | 86 | def cook_test(test, item, n=4): 87 | '''Takes a test sentence and returns an object that 88 | encapsulates everything that BLEU needs to know about it.''' 89 | (reflens, refmaxcounts)=item 90 | test = normalize(test) 91 | result = {} 92 | result["testlen"] = len(test) 93 | 94 | # Calculate effective reference sentence length. 95 | 96 | if eff_ref_len == "shortest": 97 | result["reflen"] = min(reflens) 98 | elif eff_ref_len == "average": 99 | result["reflen"] = float(sum(reflens))/len(reflens) 100 | elif eff_ref_len == "closest": 101 | min_diff = None 102 | for reflen in reflens: 103 | if min_diff is None or abs(reflen-len(test)) < min_diff: 104 | min_diff = abs(reflen-len(test)) 105 | result['reflen'] = reflen 106 | 107 | result["guess"] = [max(len(test)-k+1,0) for k in range(1,n+1)] 108 | 109 | result['correct'] = [0]*n 110 | counts = count_ngrams(test, n) 111 | for (ngram, count) in counts.items(): 112 | result["correct"][len(ngram)-1] += min(refmaxcounts.get(ngram,0), count) 113 | 114 | return result 115 | 116 | def score_cooked(allcomps, n=4, ground=0, smooth=1): 117 | totalcomps = {'testlen':0, 'reflen':0, 'guess':[0]*n, 'correct':[0]*n} 118 | for comps in allcomps: 119 | for key in ['testlen','reflen']: 120 | totalcomps[key] += comps[key] 121 | for key in ['guess','correct']: 122 | for k in range(n): 123 | totalcomps[key][k] += comps[key][k] 124 | logbleu = 0.0 125 | all_bleus = [] 126 | for k in range(n): 127 | correct = totalcomps['correct'][k] 128 | guess = totalcomps['guess'][k] 129 | addsmooth = 0 130 | if smooth == 1 and k > 0: 131 | addsmooth = 1 132 | logbleu += math.log(correct + addsmooth + sys.float_info.min)-math.log(guess + addsmooth+ sys.float_info.min) 133 | if guess == 0: 134 | all_bleus.append(-10000000) 135 | else: 136 | all_bleus.append(math.log(correct + sys.float_info.min)-math.log( guess )) 137 | 138 | logbleu /= float(n) 139 | all_bleus.insert(0, logbleu) 140 | 141 | brevPenalty = min(0,1-float(totalcomps['reflen'] + 1)/(totalcomps['testlen'] + 1)) 142 | for i in range(len(all_bleus)): 143 | if i ==0: 144 | all_bleus[i] += brevPenalty 145 | all_bleus[i] = math.exp(all_bleus[i]) 146 | return all_bleus 147 | 148 | def bleu(refs, 
candidate, ground=0, smooth=1): 149 | refs = cook_refs(refs) 150 | test = cook_test(candidate, refs) 151 | return score_cooked([test], ground=ground, smooth=smooth) 152 | 153 | def splitPuncts(line): 154 | return ' '.join(re.findall(r"[\w]+|[^\s\w]", line)) 155 | 156 | def computeMaps(predictions, goldfile): 157 | predictionMap = {} 158 | goldMap = {} 159 | gf = open(goldfile, 'r') 160 | 161 | for row in predictions: 162 | cols = row.strip().split('\t') 163 | if len(cols) == 1: 164 | (rid, pred) = (cols[0], '') 165 | else: 166 | (rid, pred) = (cols[0], cols[1]) 167 | predictionMap[rid] = [splitPuncts(pred.strip().lower())] 168 | 169 | for row in gf: 170 | (rid, pred) = row.split('\t') 171 | if rid in predictionMap: # Only insert if the id exists for the method 172 | if rid not in goldMap: 173 | goldMap[rid] = [] 174 | goldMap[rid].append(splitPuncts(pred.strip().lower())) 175 | 176 | sys.stderr.write('Total: ' + str(len(goldMap)) + '\n') 177 | return (goldMap, predictionMap) 178 | 179 | 180 | #m1 is the reference map 181 | #m2 is the prediction map 182 | def bleuFromMaps(m1, m2): 183 | score = [0] * 5 184 | num = 0.0 185 | 186 | for key in m1: 187 | if key in m2: 188 | bl = bleu(m1[key], m2[key][0]) 189 | score = [ score[i] + bl[i] for i in range(0, len(bl))] 190 | num += 1 191 | return [s * 100.0 / num for s in score] 192 | 193 | if __name__ == '__main__': 194 | reference_file = sys.argv[1] 195 | predictions = [] 196 | for row in sys.stdin: 197 | predictions.append(row) 198 | (goldMap, predictionMap) = computeMaps(predictions, reference_file) 199 | print (bleuFromMaps(goldMap, predictionMap)[0]) 200 | 201 | -------------------------------------------------------------------------------- /CodeBERT/code2nl/model.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 2 | # Licensed under the MIT license. 3 | 4 | import torch 5 | import torch.nn as nn 6 | import torch 7 | from torch.autograd import Variable 8 | import copy 9 | class Seq2Seq(nn.Module): 10 | """ 11 | Build Seqence-to-Sequence. 12 | 13 | Parameters: 14 | 15 | * `encoder`- encoder of seq2seq model. e.g. roberta 16 | * `decoder`- decoder of seq2seq model. e.g. transformer 17 | * `config`- configuration of encoder model. 18 | * `beam_size`- beam size for beam search. 19 | * `max_length`- max length of target for beam search. 20 | * `sos_id`- start of symbol ids in target for beam search. 21 | * `eos_id`- end of symbol ids in target for beam search. 
22 | """ 23 | def __init__(self, encoder,decoder,config,beam_size=None,max_length=None,sos_id=None,eos_id=None): 24 | super(Seq2Seq, self).__init__() 25 | self.encoder = encoder 26 | self.decoder=decoder 27 | self.config=config 28 | self.register_buffer("bias", torch.tril(torch.ones(2048, 2048))) 29 | self.dense = nn.Linear(config.hidden_size, config.hidden_size) 30 | self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False) 31 | self.lsm = nn.LogSoftmax(dim=-1) 32 | self.tie_weights() 33 | 34 | self.beam_size=beam_size 35 | self.max_length=max_length 36 | self.sos_id=sos_id 37 | self.eos_id=eos_id 38 | 39 | def _tie_or_clone_weights(self, first_module, second_module): 40 | """ Tie or clone module weights depending of weither we are using TorchScript or not 41 | """ 42 | if self.config.torchscript: 43 | first_module.weight = nn.Parameter(second_module.weight.clone()) 44 | else: 45 | first_module.weight = second_module.weight 46 | 47 | def tie_weights(self): 48 | """ Make sure we are sharing the input and output embeddings. 49 | Export to TorchScript can't handle parameter sharing so we are cloning them instead. 50 | """ 51 | self._tie_or_clone_weights(self.lm_head, 52 | self.encoder.embeddings.word_embeddings) 53 | 54 | def forward(self, source_ids=None,source_mask=None,target_ids=None,target_mask=None,args=None): 55 | outputs = self.encoder(source_ids, attention_mask=source_mask) 56 | encoder_output = outputs[0].permute([1,0,2]).contiguous() 57 | if target_ids is not None: 58 | attn_mask=-1e4 *(1-self.bias[:target_ids.shape[1],:target_ids.shape[1]]) 59 | tgt_embeddings = self.encoder.embeddings(target_ids).permute([1,0,2]).contiguous() 60 | out = self.decoder(tgt_embeddings,encoder_output,tgt_mask=attn_mask,memory_key_padding_mask=(1-source_mask).bool()) 61 | hidden_states = torch.tanh(self.dense(out)).permute([1,0,2]).contiguous() 62 | lm_logits = self.lm_head(hidden_states) 63 | # Shift so that tokens < n predict n 64 | active_loss = target_mask[..., 1:].ne(0).view(-1) == 1 65 | shift_logits = lm_logits[..., :-1, :].contiguous() 66 | shift_labels = target_ids[..., 1:].contiguous() 67 | # Flatten the tokens 68 | loss_fct = nn.CrossEntropyLoss(ignore_index=-1) 69 | loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1))[active_loss], 70 | shift_labels.view(-1)[active_loss]) 71 | 72 | outputs = loss,loss*active_loss.sum(),active_loss.sum() 73 | return outputs 74 | else: 75 | #Predict 76 | preds=[] 77 | zero=torch.cuda.LongTensor(1).fill_(0) 78 | for i in range(source_ids.shape[0]): 79 | context=encoder_output[:,i:i+1] 80 | context_mask=source_mask[i:i+1,:] 81 | beam = Beam(self.beam_size,self.sos_id,self.eos_id) 82 | input_ids=beam.getCurrentState() 83 | context=context.repeat(1, self.beam_size,1) 84 | context_mask=context_mask.repeat(self.beam_size,1) 85 | for _ in range(self.max_length): 86 | if beam.done(): 87 | break 88 | attn_mask=-1e4 *(1-self.bias[:input_ids.shape[1],:input_ids.shape[1]]) 89 | tgt_embeddings = self.encoder.embeddings(input_ids).permute([1,0,2]).contiguous() 90 | out = self.decoder(tgt_embeddings,context,tgt_mask=attn_mask,memory_key_padding_mask=(1-context_mask).bool()) 91 | out = torch.tanh(self.dense(out)) 92 | hidden_states=out.permute([1,0,2]).contiguous()[:,-1,:] 93 | out = self.lsm(self.lm_head(hidden_states)).data 94 | beam.advance(out) 95 | input_ids.data.copy_(input_ids.data.index_select(0, beam.getCurrentOrigin())) 96 | input_ids=torch.cat((input_ids,beam.getCurrentState()),-1) 97 | hyp= beam.getHyp(beam.getFinal()) 98 | 
pred=beam.buildTargetTokens(hyp)[:self.beam_size] 99 | pred=[torch.cat([x.view(-1) for x in p]+[zero]*(self.max_length-len(p))).view(1,-1) for p in pred] 100 | preds.append(torch.cat(pred,0).unsqueeze(0)) 101 | 102 | preds=torch.cat(preds,0) 103 | return preds 104 | 105 | 106 | 107 | class Beam(object): 108 | def __init__(self, size,sos,eos): 109 | self.size = size 110 | self.tt = torch.cuda 111 | # The score for each translation on the beam. 112 | self.scores = self.tt.FloatTensor(size).zero_() 113 | # The backpointers at each time-step. 114 | self.prevKs = [] 115 | # The outputs at each time-step. 116 | self.nextYs = [self.tt.LongTensor(size) 117 | .fill_(0)] 118 | self.nextYs[0][0] = sos 119 | # Has EOS topped the beam yet. 120 | self._eos = eos 121 | self.eosTop = False 122 | # Time and k pair for finished. 123 | self.finished = [] 124 | 125 | def getCurrentState(self): 126 | "Get the outputs for the current timestep." 127 | batch = self.tt.LongTensor(self.nextYs[-1]).view(-1, 1) 128 | return batch 129 | 130 | def getCurrentOrigin(self): 131 | "Get the backpointers for the current timestep." 132 | return self.prevKs[-1] 133 | 134 | def advance(self, wordLk): 135 | """ 136 | Given prob over words for every last beam `wordLk` and attention 137 | `attnOut`: Compute and update the beam search. 138 | 139 | Parameters: 140 | 141 | * `wordLk`- probs of advancing from the last step (K x words) 142 | * `attnOut`- attention at the last step 143 | 144 | Returns: True if beam search is complete. 145 | """ 146 | numWords = wordLk.size(1) 147 | 148 | # Sum the previous scores. 149 | if len(self.prevKs) > 0: 150 | beamLk = wordLk + self.scores.unsqueeze(1).expand_as(wordLk) 151 | 152 | # Don't let EOS have children. 153 | for i in range(self.nextYs[-1].size(0)): 154 | if self.nextYs[-1][i] == self._eos: 155 | beamLk[i] = -1e20 156 | else: 157 | beamLk = wordLk[0] 158 | flatBeamLk = beamLk.view(-1) 159 | bestScores, bestScoresId = flatBeamLk.topk(self.size, 0, True, True) 160 | 161 | self.scores = bestScores 162 | 163 | # bestScoresId is flattened beam x word array, so calculate which 164 | # word and beam each score came from 165 | prevK = bestScoresId // numWords 166 | self.prevKs.append(prevK) 167 | self.nextYs.append((bestScoresId - prevK * numWords)) 168 | 169 | 170 | for i in range(self.nextYs[-1].size(0)): 171 | if self.nextYs[-1][i] == self._eos: 172 | s = self.scores[i] 173 | self.finished.append((s, len(self.nextYs) - 1, i)) 174 | 175 | # End condition is when top-of-beam is EOS and no global score. 176 | if self.nextYs[-1][0] == self._eos: 177 | self.eosTop = True 178 | 179 | def done(self): 180 | return self.eosTop and len(self.finished) >=self.size 181 | 182 | def getFinal(self): 183 | if len(self.finished) == 0: 184 | self.finished.append((self.scores[0], len(self.nextYs) - 1, 0)) 185 | self.finished.sort(key=lambda a: -a[0]) 186 | if len(self.finished) != self.size: 187 | unfinished=[] 188 | for i in range(self.nextYs[-1].size(0)): 189 | if self.nextYs[-1][i] != self._eos: 190 | s = self.scores[i] 191 | unfinished.append((s, len(self.nextYs) - 1, i)) 192 | unfinished.sort(key=lambda a: -a[0]) 193 | self.finished+=unfinished[:self.size-len(self.finished)] 194 | return self.finished[:self.size] 195 | 196 | def getHyp(self, beam_res): 197 | """ 198 | Walk back to construct the full hypothesis. 
199 | """ 200 | hyps=[] 201 | for _,timestep, k in beam_res: 202 | hyp = [] 203 | for j in range(len(self.prevKs[:timestep]) - 1, -1, -1): 204 | hyp.append(self.nextYs[j+1][k]) 205 | k = self.prevKs[j][k] 206 | hyps.append(hyp[::-1]) 207 | return hyps 208 | 209 | def buildTargetTokens(self, preds): 210 | sentence=[] 211 | for pred in preds: 212 | tokens = [] 213 | for tok in pred: 214 | if tok==self._eos: 215 | break 216 | tokens.append(tok) 217 | sentence.append(tokens) 218 | return sentence 219 | 220 | -------------------------------------------------------------------------------- /CodeBERT/code2nl/run.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | """ 17 | Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa). 18 | GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned 19 | using a masked language modeling (MLM) loss. 20 | """ 21 | 22 | from __future__ import absolute_import 23 | import os 24 | import sys 25 | import bleu 26 | import pickle 27 | import torch 28 | import json 29 | import random 30 | import logging 31 | import argparse 32 | import numpy as np 33 | from io import open 34 | from itertools import cycle 35 | import torch.nn as nn 36 | from model import Seq2Seq 37 | from tqdm import tqdm, trange 38 | from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler,TensorDataset 39 | from torch.utils.data.distributed import DistributedSampler 40 | from transformers import (WEIGHTS_NAME, AdamW, get_linear_schedule_with_warmup, 41 | RobertaConfig, RobertaModel, RobertaTokenizer) 42 | MODEL_CLASSES = {'roberta': (RobertaConfig, RobertaModel, RobertaTokenizer)} 43 | 44 | logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s', 45 | datefmt = '%m/%d/%Y %H:%M:%S', 46 | level = logging.INFO) 47 | logger = logging.getLogger(__name__) 48 | 49 | class Example(object): 50 | """A single training/test example.""" 51 | def __init__(self, 52 | idx, 53 | source, 54 | target, 55 | ): 56 | self.idx = idx 57 | self.source = source 58 | self.target = target 59 | 60 | def read_examples(filename): 61 | """Read examples from filename.""" 62 | examples=[] 63 | with open(filename,encoding="utf-8") as f: 64 | for idx, line in enumerate(f): 65 | line=line.strip() 66 | js=json.loads(line) 67 | if 'idx' not in js: 68 | js['idx']=idx 69 | code=' '.join(js['code_tokens']).replace('\n',' ') 70 | code=' '.join(code.strip().split()) 71 | nl=' '.join(js['docstring_tokens']).replace('\n','') 72 | nl=' '.join(nl.strip().split()) 73 | examples.append( 74 | Example( 75 | idx = idx, 76 | source=code, 77 | target = nl, 78 | ) 79 | ) 80 | return examples 81 | 82 | 83 | class 
InputFeatures(object): 84 | """A single training/test features for a example.""" 85 | def __init__(self, 86 | example_id, 87 | source_ids, 88 | target_ids, 89 | source_mask, 90 | target_mask, 91 | 92 | ): 93 | self.example_id = example_id 94 | self.source_ids = source_ids 95 | self.target_ids = target_ids 96 | self.source_mask = source_mask 97 | self.target_mask = target_mask 98 | 99 | 100 | 101 | def convert_examples_to_features(examples, tokenizer, args,stage=None): 102 | features = [] 103 | for example_index, example in enumerate(examples): 104 | #source 105 | source_tokens = tokenizer.tokenize(example.source)[:args.max_source_length-2] 106 | source_tokens =[tokenizer.cls_token]+source_tokens+[tokenizer.sep_token] 107 | source_ids = tokenizer.convert_tokens_to_ids(source_tokens) 108 | source_mask = [1] * (len(source_tokens)) 109 | padding_length = args.max_source_length - len(source_ids) 110 | source_ids+=[tokenizer.pad_token_id]*padding_length 111 | source_mask+=[0]*padding_length 112 | 113 | #target 114 | if stage=="test": 115 | target_tokens = tokenizer.tokenize("None") 116 | else: 117 | target_tokens = tokenizer.tokenize(example.target)[:args.max_target_length-2] 118 | target_tokens = [tokenizer.cls_token]+target_tokens+[tokenizer.sep_token] 119 | target_ids = tokenizer.convert_tokens_to_ids(target_tokens) 120 | target_mask = [1] *len(target_ids) 121 | padding_length = args.max_target_length - len(target_ids) 122 | target_ids+=[tokenizer.pad_token_id]*padding_length 123 | target_mask+=[0]*padding_length 124 | 125 | if example_index < 5: 126 | if stage=='train': 127 | logger.info("*** Example ***") 128 | logger.info("idx: {}".format(example.idx)) 129 | 130 | logger.info("source_tokens: {}".format([x.replace('\u0120','_') for x in source_tokens])) 131 | logger.info("source_ids: {}".format(' '.join(map(str, source_ids)))) 132 | logger.info("source_mask: {}".format(' '.join(map(str, source_mask)))) 133 | 134 | logger.info("target_tokens: {}".format([x.replace('\u0120','_') for x in target_tokens])) 135 | logger.info("target_ids: {}".format(' '.join(map(str, target_ids)))) 136 | logger.info("target_mask: {}".format(' '.join(map(str, target_mask)))) 137 | 138 | features.append( 139 | InputFeatures( 140 | example_index, 141 | source_ids, 142 | target_ids, 143 | source_mask, 144 | target_mask, 145 | ) 146 | ) 147 | return features 148 | 149 | 150 | 151 | def set_seed(args): 152 | """set random seed.""" 153 | random.seed(args.seed) 154 | np.random.seed(args.seed) 155 | torch.manual_seed(args.seed) 156 | if args.n_gpu > 0: 157 | torch.cuda.manual_seed_all(args.seed) 158 | 159 | def main(): 160 | parser = argparse.ArgumentParser() 161 | 162 | ## Required parameters 163 | parser.add_argument("--model_type", default=None, type=str, required=True, 164 | help="Model type: e.g. roberta") 165 | parser.add_argument("--model_name_or_path", default=None, type=str, required=True, 166 | help="Path to pre-trained model: e.g. roberta-base" ) 167 | parser.add_argument("--output_dir", default=None, type=str, required=True, 168 | help="The output directory where the model predictions and checkpoints will be written.") 169 | parser.add_argument("--load_model_path", default=None, type=str, 170 | help="Path to trained model: Should contain the .bin files" ) 171 | ## Other parameters 172 | parser.add_argument("--train_filename", default=None, type=str, 173 | help="The train filename. 
Should contain the .jsonl files for this task.") 174 | parser.add_argument("--dev_filename", default=None, type=str, 175 | help="The dev filename. Should contain the .jsonl files for this task.") 176 | parser.add_argument("--test_filename", default=None, type=str, 177 | help="The test filename. Should contain the .jsonl files for this task.") 178 | 179 | parser.add_argument("--config_name", default="", type=str, 180 | help="Pretrained config name or path if not the same as model_name") 181 | parser.add_argument("--tokenizer_name", default="", type=str, 182 | help="Pretrained tokenizer name or path if not the same as model_name") 183 | parser.add_argument("--max_source_length", default=64, type=int, 184 | help="The maximum total source sequence length after tokenization. Sequences longer " 185 | "than this will be truncated, sequences shorter will be padded.") 186 | parser.add_argument("--max_target_length", default=32, type=int, 187 | help="The maximum total target sequence length after tokenization. Sequences longer " 188 | "than this will be truncated, sequences shorter will be padded.") 189 | 190 | parser.add_argument("--do_train", action='store_true', 191 | help="Whether to run training.") 192 | parser.add_argument("--do_eval", action='store_true', 193 | help="Whether to run eval on the dev set.") 194 | parser.add_argument("--do_test", action='store_true', 195 | help="Whether to run eval on the dev set.") 196 | parser.add_argument("--do_lower_case", action='store_true', 197 | help="Set this flag if you are using an uncased model.") 198 | parser.add_argument("--no_cuda", action='store_true', 199 | help="Avoid using CUDA when available") 200 | 201 | parser.add_argument("--train_batch_size", default=8, type=int, 202 | help="Batch size per GPU/CPU for training.") 203 | parser.add_argument("--eval_batch_size", default=8, type=int, 204 | help="Batch size per GPU/CPU for evaluation.") 205 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 206 | help="Number of updates steps to accumulate before performing a backward/update pass.") 207 | parser.add_argument("--learning_rate", default=5e-5, type=float, 208 | help="The initial learning rate for Adam.") 209 | parser.add_argument("--beam_size", default=10, type=int, 210 | help="beam size for beam search") 211 | parser.add_argument("--weight_decay", default=0.0, type=float, 212 | help="Weight deay if we apply some.") 213 | parser.add_argument("--adam_epsilon", default=1e-8, type=float, 214 | help="Epsilon for Adam optimizer.") 215 | parser.add_argument("--max_grad_norm", default=1.0, type=float, 216 | help="Max gradient norm.") 217 | parser.add_argument("--num_train_epochs", default=3.0, type=float, 218 | help="Total number of training epochs to perform.") 219 | parser.add_argument("--max_steps", default=-1, type=int, 220 | help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.") 221 | parser.add_argument("--eval_steps", default=-1, type=int, 222 | help="") 223 | parser.add_argument("--train_steps", default=-1, type=int, 224 | help="") 225 | parser.add_argument("--warmup_steps", default=0, type=int, 226 | help="Linear warmup over warmup_steps.") 227 | parser.add_argument("--local_rank", type=int, default=-1, 228 | help="For distributed training: local_rank") 229 | parser.add_argument('--seed', type=int, default=42, 230 | help="random seed for initialization") 231 | # print arguments 232 | args = parser.parse_args() 233 | logger.info(args) 234 | 235 | # Setup CUDA, GPU & distributed training 236 | if args.local_rank == -1 or args.no_cuda: 237 | device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu") 238 | args.n_gpu = torch.cuda.device_count() 239 | else: # Initializes the distributed backend which will take care of sychronizing nodes/GPUs 240 | torch.cuda.set_device(args.local_rank) 241 | device = torch.device("cuda", args.local_rank) 242 | torch.distributed.init_process_group(backend='nccl') 243 | args.n_gpu = 1 244 | logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s", 245 | args.local_rank, device, args.n_gpu, bool(args.local_rank != -1)) 246 | args.device = device 247 | # Set seed 248 | set_seed(args) 249 | # make dir if output_dir not exist 250 | if os.path.exists(args.output_dir) is False: 251 | os.makedirs(args.output_dir) 252 | 253 | config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type] 254 | config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path) 255 | tokenizer = tokenizer_class.from_pretrained(args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,do_lower_case=args.do_lower_case) 256 | 257 | #budild model 258 | encoder = model_class.from_pretrained(args.model_name_or_path,config=config) 259 | decoder_layer = nn.TransformerDecoderLayer(d_model=config.hidden_size, nhead=config.num_attention_heads) 260 | decoder = nn.TransformerDecoder(decoder_layer, num_layers=6) 261 | model=Seq2Seq(encoder=encoder,decoder=decoder,config=config, 262 | beam_size=args.beam_size,max_length=args.max_target_length, 263 | sos_id=tokenizer.cls_token_id,eos_id=tokenizer.sep_token_id) 264 | if args.load_model_path is not None: 265 | logger.info("reload model from {}".format(args.load_model_path)) 266 | model.load_state_dict(torch.load(args.load_model_path)) 267 | 268 | model.to(device) 269 | if args.local_rank != -1: 270 | # Distributed training 271 | try: 272 | from apex.parallel import DistributedDataParallel as DDP 273 | except ImportError: 274 | raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use distributed and fp16 training.") 275 | 276 | model = DDP(model) 277 | elif args.n_gpu > 1: 278 | # multi-gpu training 279 | model = torch.nn.DataParallel(model) 280 | 281 | 282 | 283 | 284 | if args.do_train: 285 | # Prepare training data loader 286 | train_examples = read_examples(args.train_filename) 287 | train_features = convert_examples_to_features(train_examples, tokenizer,args,stage='train') 288 | all_source_ids = torch.tensor([f.source_ids for f in train_features], dtype=torch.long) 289 | all_source_mask = torch.tensor([f.source_mask for f in train_features], dtype=torch.long) 290 | all_target_ids = torch.tensor([f.target_ids for f in train_features], dtype=torch.long) 291 | all_target_mask = torch.tensor([f.target_mask for f in train_features], dtype=torch.long) 
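        # Each of the four tensors above has one row per training example:
        # ids are right-padded with pad_token_id up to max_source_length /
        # max_target_length and the masks are padded with 0 (see
        # convert_examples_to_features), so zipping them into a TensorDataset
        # keeps every batch index-aligned across ids and masks.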
292 | train_data = TensorDataset(all_source_ids,all_source_mask,all_target_ids,all_target_mask) 293 | 294 | if args.local_rank == -1: 295 | train_sampler = RandomSampler(train_data) 296 | else: 297 | train_sampler = DistributedSampler(train_data) 298 | train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size//args.gradient_accumulation_steps) 299 | 300 | num_train_optimization_steps = args.train_steps 301 | 302 | # Prepare optimizer and schedule (linear warmup and decay) 303 | no_decay = ['bias', 'LayerNorm.weight'] 304 | optimizer_grouped_parameters = [ 305 | {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 306 | 'weight_decay': args.weight_decay}, 307 | {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 308 | ] 309 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 310 | scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, 311 | num_training_steps=num_train_optimization_steps) 312 | 313 | 314 | #Start training 315 | logger.info("***** Running training *****") 316 | logger.info(" Num examples = %d", len(train_examples)) 317 | logger.info(" Batch size = %d", args.train_batch_size) 318 | logger.info(" Num epoch = %d", num_train_optimization_steps*args.train_batch_size//len(train_examples)) 319 | 320 | 321 | model.train() 322 | dev_dataset={} 323 | nb_tr_examples, nb_tr_steps,tr_loss,global_step,best_bleu,best_loss = 0, 0,0,0,0,1e6 324 | bar = tqdm(range(num_train_optimization_steps),total=num_train_optimization_steps) 325 | train_dataloader=cycle(train_dataloader) 326 | eval_flag = True 327 | for step in bar: 328 | batch = next(train_dataloader) 329 | batch = tuple(t.to(device) for t in batch) 330 | source_ids,source_mask,target_ids,target_mask = batch 331 | loss,_,_ = model(source_ids=source_ids,source_mask=source_mask,target_ids=target_ids,target_mask=target_mask) 332 | 333 | if args.n_gpu > 1: 334 | loss = loss.mean() # mean() to average on multi-gpu. 
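            # Gradient accumulation: the DataLoader above was created with
            # batch_size = train_batch_size // gradient_accumulation_steps, the
            # loss is scaled by 1 / gradient_accumulation_steps below, and
            # optimizer.step() only fires every gradient_accumulation_steps
            # micro-batches, so each parameter update still corresponds to an
            # effective batch of train_batch_size examples.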
335 | if args.gradient_accumulation_steps > 1: 336 | loss = loss / args.gradient_accumulation_steps 337 | tr_loss += loss.item() 338 | train_loss=round(tr_loss*args.gradient_accumulation_steps/(nb_tr_steps+1),4) 339 | bar.set_description("loss {}".format(train_loss)) 340 | nb_tr_examples += source_ids.size(0) 341 | nb_tr_steps += 1 342 | loss.backward() 343 | 344 | if (nb_tr_steps + 1) % args.gradient_accumulation_steps == 0: 345 | #Update parameters 346 | optimizer.step() 347 | optimizer.zero_grad() 348 | scheduler.step() 349 | global_step += 1 350 | eval_flag = True 351 | 352 | if args.do_eval and ((global_step + 1) %args.eval_steps == 0) and eval_flag: 353 | #Eval model with dev dataset 354 | tr_loss = 0 355 | nb_tr_examples, nb_tr_steps = 0, 0 356 | eval_flag=False 357 | if 'dev_loss' in dev_dataset: 358 | eval_examples,eval_data=dev_dataset['dev_loss'] 359 | else: 360 | eval_examples = read_examples(args.dev_filename) 361 | eval_features = convert_examples_to_features(eval_examples, tokenizer, args,stage='dev') 362 | all_source_ids = torch.tensor([f.source_ids for f in eval_features], dtype=torch.long) 363 | all_source_mask = torch.tensor([f.source_mask for f in eval_features], dtype=torch.long) 364 | all_target_ids = torch.tensor([f.target_ids for f in eval_features], dtype=torch.long) 365 | all_target_mask = torch.tensor([f.target_mask for f in eval_features], dtype=torch.long) 366 | eval_data = TensorDataset(all_source_ids,all_source_mask,all_target_ids,all_target_mask) 367 | dev_dataset['dev_loss']=eval_examples,eval_data 368 | eval_sampler = SequentialSampler(eval_data) 369 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 370 | 371 | logger.info("\n***** Running evaluation *****") 372 | logger.info(" Num examples = %d", len(eval_examples)) 373 | logger.info(" Batch size = %d", args.eval_batch_size) 374 | 375 | #Start Evaling model 376 | model.eval() 377 | eval_loss,tokens_num = 0,0 378 | for batch in eval_dataloader: 379 | batch = tuple(t.to(device) for t in batch) 380 | source_ids,source_mask,target_ids,target_mask = batch 381 | 382 | with torch.no_grad(): 383 | _,loss,num = model(source_ids=source_ids,source_mask=source_mask, 384 | target_ids=target_ids,target_mask=target_mask) 385 | eval_loss += loss.sum().item() 386 | tokens_num += num.sum().item() 387 | #Pring loss of dev dataset 388 | model.train() 389 | eval_loss = eval_loss / tokens_num 390 | result = {'eval_ppl': round(np.exp(eval_loss),5), 391 | 'global_step': global_step+1, 392 | 'train_loss': round(train_loss,5)} 393 | for key in sorted(result.keys()): 394 | logger.info(" %s = %s", key, str(result[key])) 395 | logger.info(" "+"*"*20) 396 | 397 | #save last checkpoint 398 | last_output_dir = os.path.join(args.output_dir, 'checkpoint-last') 399 | if not os.path.exists(last_output_dir): 400 | os.makedirs(last_output_dir) 401 | model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self 402 | output_model_file = os.path.join(last_output_dir, "pytorch_model.bin") 403 | torch.save(model_to_save.state_dict(), output_model_file) 404 | if eval_lossbest_bleu: 461 | logger.info(" Best bleu:%s",dev_bleu) 462 | logger.info(" "+"*"*20) 463 | best_bleu=dev_bleu 464 | # Save best checkpoint for best bleu 465 | output_dir = os.path.join(args.output_dir, 'checkpoint-best-bleu') 466 | if not os.path.exists(output_dir): 467 | os.makedirs(output_dir) 468 | model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self 
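                    # DataParallel/DistributedDataParallel wrap the network and
                    # prefix every parameter name with "module."; saving the
                    # unwrapped .module keeps the state_dict loadable into a bare
                    # Seq2Seq model (as --load_model_path does above).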
469 | output_model_file = os.path.join(output_dir, "pytorch_model.bin") 470 | torch.save(model_to_save.state_dict(), output_model_file) 471 | 472 | if args.do_test: 473 | files=[] 474 | if args.dev_filename is not None: 475 | files.append(args.dev_filename) 476 | if args.test_filename is not None: 477 | files.append(args.test_filename) 478 | for idx,file in enumerate(files): 479 | logger.info("Test file: {}".format(file)) 480 | eval_examples = read_examples(file) 481 | eval_features = convert_examples_to_features(eval_examples, tokenizer, args,stage='test') 482 | all_source_ids = torch.tensor([f.source_ids for f in eval_features], dtype=torch.long) 483 | all_source_mask = torch.tensor([f.source_mask for f in eval_features], dtype=torch.long) 484 | eval_data = TensorDataset(all_source_ids,all_source_mask) 485 | 486 | # Calculate bleu 487 | eval_sampler = SequentialSampler(eval_data) 488 | eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size) 489 | 490 | model.eval() 491 | p=[] 492 | for batch in tqdm(eval_dataloader,total=len(eval_dataloader)): 493 | batch = tuple(t.to(device) for t in batch) 494 | source_ids,source_mask= batch 495 | with torch.no_grad(): 496 | preds = model(source_ids=source_ids,source_mask=source_mask) 497 | for pred in preds: 498 | t=pred[0].cpu().numpy() 499 | t=list(t) 500 | if 0 in t: 501 | t=t[:t.index(0)] 502 | text = tokenizer.decode(t,clean_up_tokenization_spaces=False) 503 | p.append(text) 504 | model.train() 505 | predictions=[] 506 | with open(os.path.join(args.output_dir,"test_{}.output".format(str(idx))),'w') as f, open(os.path.join(args.output_dir,"test_{}.gold".format(str(idx))),'w') as f1: 507 | for ref,gold in zip(p,eval_examples): 508 | predictions.append(str(gold.idx)+'\t'+ref) 509 | f.write(str(gold.idx)+'\t'+ref+'\n') 510 | f1.write(str(gold.idx)+'\t'+gold.target+'\n') 511 | 512 | (goldMap, predictionMap) = bleu.computeMaps(predictions, os.path.join(args.output_dir, "test_{}.gold".format(idx))) 513 | dev_bleu=round(bleu.bleuFromMaps(goldMap, predictionMap)[0],2) 514 | logger.info(" %s = %s "%("bleu-4",str(dev_bleu))) 515 | logger.info(" "+"*"*20) 516 | 517 | 518 | 519 | 520 | 521 | 522 | 523 | if __name__ == "__main__": 524 | main() 525 | 526 | 527 | -------------------------------------------------------------------------------- /CodeBERT/codesearch/README.md: -------------------------------------------------------------------------------- 1 | # Code Search 2 | 3 | ## Data Preprocess 4 | 5 | Both training and validation datasets are created in a way that positive and negative samples are balanced. Negative samples consist of balanced number of instances with randomly replaced NL and PL. 6 | 7 | We follow the official evaluation metric to calculate the Mean Reciprocal Rank (MRR) for each pair of test data (c, w) over a fixed set of 999 distractor codes. 8 | 9 | You can use the following command to download the preprocessed training and validation dataset and preprocess the test dataset by yourself. The preprocessed testing dataset is very large, so only the preprocessing script is provided. 10 | 11 | ```shell 12 | mkdir data data/codesearch 13 | cd data/codesearch 14 | gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo 15 | unzip codesearch_data.zip 16 | rm codesearch_data.zip 17 | cd ../../codesearch 18 | python process_data.py 19 | cd .. 20 | ``` 21 | 22 | ## Fine-Tune 23 | We fine-tuned the model on 2*P100 GPUs. 
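The command below fine-tunes CodeBERT as a binary classifier over (NL, PL) pairs; at inference time the positive-class score of each pair is what `mrr.py` ranks against the 999 distractors in every test batch.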
24 | ```shell 25 | cd codesearch 26 | 27 | lang=php #fine-tuning a language-specific model for each programming language 28 | pretrained_model=microsoft/codebert-base #Roberta: roberta-base 29 | 30 | python run_classifier.py \ 31 | --model_type roberta \ 32 | --task_name codesearch \ 33 | --do_train \ 34 | --do_eval \ 35 | --eval_all_checkpoints \ 36 | --train_file train.txt \ 37 | --dev_file valid.txt \ 38 | --max_seq_length 200 \ 39 | --per_gpu_train_batch_size 32 \ 40 | --per_gpu_eval_batch_size 32 \ 41 | --learning_rate 1e-5 \ 42 | --num_train_epochs 8 \ 43 | --gradient_accumulation_steps 1 \ 44 | --overwrite_output_dir \ 45 | --data_dir ../data/codesearch/train_valid/$lang \ 46 | --output_dir ./models/$lang \ 47 | --model_name_or_path $pretrained_model 48 | ``` 49 | ## Inference and Evaluation 50 | 51 | Inference 52 | ```shell 53 | lang=php #programming language 54 | idx=0 #test batch idx 55 | 56 | python run_classifier.py \ 57 | --model_type roberta \ 58 | --model_name_or_path microsoft/codebert-base \ 59 | --task_name codesearch \ 60 | --do_predict \ 61 | --output_dir ./models/$lang \ 62 | --data_dir ../data/codesearch/test/$lang \ 63 | --max_seq_length 200 \ 64 | --per_gpu_train_batch_size 32 \ 65 | --per_gpu_eval_batch_size 32 \ 66 | --learning_rate 1e-5 \ 67 | --num_train_epochs 8 \ 68 | --test_file batch_${idx}.txt \ 69 | --pred_model_dir ./models/$lang/checkpoint-best/ \ 70 | --test_result_dir ./results/$lang/${idx}_batch_result.txt 71 | ``` 72 | 73 | Evaluation 74 | ```shell 75 | python mrr.py 76 | ``` 77 | 78 | 79 | -------------------------------------------------------------------------------- /CodeBERT/codesearch/mrr.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # Copyright (c) Microsoft Corporation. 3 | # Licensed under the MIT license. 4 | 5 | import os 6 | import numpy as np 7 | from more_itertools import chunked 8 | import argparse 9 | 10 | 11 | def main(): 12 | parser = argparse.ArgumentParser() 13 | parser.add_argument('--test_batch_size', type=int, default=1000) 14 | args = parser.parse_args() 15 | languages = ['ruby', 'go', 'php', 'python', 'java', 'javascript'] 16 | MRR_dict = {} 17 | for language in languages: 18 | file_dir = './results/{}'.format(language) 19 | ranks = [] 20 | num_batch = 0 21 | for file in sorted(os.listdir(file_dir)): 22 | print(os.path.join(file_dir, file)) 23 | with open(os.path.join(file_dir, file), encoding='utf-8') as f: 24 | batched_data = chunked(f.readlines(), args.test_batch_size) 25 | for batch_idx, batch_data in enumerate(batched_data): 26 | num_batch += 1 27 | correct_score = float(batch_data[batch_idx].strip().split('')[-1]) 28 | scores = np.array([float(data.strip().split('')[-1]) for data in batch_data]) 29 | rank = np.sum(scores >= correct_score) 30 | ranks.append(rank) 31 | 32 | mean_mrr = np.mean(1.0 / np.array(ranks)) 33 | print("{} mrr: {}".format(language, mean_mrr)) 34 | MRR_dict[language] = mean_mrr 35 | for key, val in MRR_dict.items(): 36 | print("{} mrr: {}".format(key, val)) 37 | 38 | 39 | if __name__ == "__main__": 40 | main() 41 | -------------------------------------------------------------------------------- /CodeBERT/codesearch/process_data.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # Copyright (c) Microsoft Corporation. 3 | # Licensed under the MIT license. 
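# Test-set construction: every shuffled batch of test_batch_size (1000)
# functions is expanded into 1000 x 1000 (docstring, code) pairs, i.e. each
# natural-language query is paired with its true code plus 999 distractors
# drawn from the same batch. mrr.py later recovers the correct pair as the
# diagonal entry of each scored 1000-line chunk and ranks its score against
# the other 999.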
4 | 5 | import gzip 6 | import os 7 | import json 8 | import numpy as np 9 | from more_itertools import chunked 10 | 11 | DATA_DIR='../data/codesearch' 12 | 13 | def format_str(string): 14 | for char in ['\r\n', '\r', '\n']: 15 | string = string.replace(char, ' ') 16 | return string 17 | 18 | 19 | def preprocess_test_data(language, test_batch_size=1000): 20 | path = os.path.join(DATA_DIR, '{}_test_0.jsonl.gz'.format(language)) 21 | print(path) 22 | with gzip.open(path, 'r') as pf: 23 | data = pf.readlines() 24 | 25 | idxs = np.arange(len(data)) 26 | data = np.array(data, dtype=np.object) 27 | 28 | np.random.seed(0) # set random seed so that random things are reproducible 29 | np.random.shuffle(idxs) 30 | data = data[idxs] 31 | batched_data = chunked(data, test_batch_size) 32 | 33 | print("start processing") 34 | for batch_idx, batch_data in enumerate(batched_data): 35 | if len(batch_data) < test_batch_size: 36 | break # the last batch is smaller than the others, exclude. 37 | examples = [] 38 | for d_idx, d in enumerate(batch_data): 39 | line_a = json.loads(str(d, encoding='utf-8')) 40 | doc_token = ' '.join(line_a['docstring_tokens']) 41 | for dd in batch_data: 42 | line_b = json.loads(str(dd, encoding='utf-8')) 43 | code_token = ' '.join([format_str(token) for token in line_b['code_tokens']]) 44 | 45 | example = (str(1), line_a['url'], line_b['url'], doc_token, code_token) 46 | example = ''.join(example) 47 | examples.append(example) 48 | 49 | data_path = os.path.join(DATA_DIR, 'test/{}'.format(language)) 50 | if not os.path.exists(data_path): 51 | os.makedirs(data_path) 52 | file_path = os.path.join(data_path, 'batch_{}.txt'.format(batch_idx)) 53 | print(file_path) 54 | with open(file_path, 'w', encoding='utf-8') as f: 55 | f.writelines('\n'.join(examples)) 56 | 57 | if __name__ == '__main__': 58 | languages = ['go', 'php', 'python', 'java', 'javascript', 'ruby'] 59 | for lang in languages: 60 | preprocess_test_data(lang) 61 | -------------------------------------------------------------------------------- /CodeBERT/codesearch/utils.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 7 | # You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | """ BERT classification fine-tuning: utilities to work with GLUE tasks """ 17 | 18 | from __future__ import absolute_import, division, print_function 19 | 20 | import csv 21 | import logging 22 | import os 23 | import sys 24 | from io import open 25 | from sklearn.metrics import f1_score 26 | 27 | csv.field_size_limit(sys.maxsize) 28 | logger = logging.getLogger(__name__) 29 | 30 | 31 | class InputExample(object): 32 | """A single training/test example for simple sequence classification.""" 33 | 34 | def __init__(self, guid, text_a, text_b=None, label=None): 35 | """Constructs a InputExample. 
36 | 37 | Args: 38 | guid: Unique id for the example. 39 | text_a: string. The untokenized text of the first sequence. For single 40 | sequence tasks, only this sequence must be specified. 41 | text_b: (Optional) string. The untokenized text of the second sequence. 42 | Only must be specified for sequence pair tasks. 43 | label: (Optional) string. The label of the example. This should be 44 | specified for train and dev examples, but not for test examples. 45 | """ 46 | self.guid = guid 47 | self.text_a = text_a 48 | self.text_b = text_b 49 | self.label = label 50 | 51 | 52 | class InputFeatures(object): 53 | """A single set of features of data.""" 54 | 55 | def __init__(self, input_ids, input_mask, segment_ids, label_id): 56 | self.input_ids = input_ids 57 | self.input_mask = input_mask 58 | self.segment_ids = segment_ids 59 | self.label_id = label_id 60 | 61 | 62 | class DataProcessor(object): 63 | """Base class for data converters for sequence classification data sets.""" 64 | 65 | def get_train_examples(self, data_dir): 66 | """Gets a collection of `InputExample`s for the train set.""" 67 | raise NotImplementedError() 68 | 69 | def get_dev_examples(self, data_dir): 70 | """Gets a collection of `InputExample`s for the dev set.""" 71 | raise NotImplementedError() 72 | 73 | def get_labels(self): 74 | """Gets the list of labels for this data set.""" 75 | raise NotImplementedError() 76 | 77 | @classmethod 78 | def _read_tsv(cls, input_file, quotechar=None): 79 | """Reads a tab separated value file.""" 80 | with open(input_file, "r", encoding='utf-8') as f: 81 | lines = [] 82 | for line in f.readlines(): 83 | line = line.strip().split('') 84 | if len(line) != 5: 85 | continue 86 | lines.append(line) 87 | return lines 88 | 89 | 90 | class CodesearchProcessor(DataProcessor): 91 | """Processor for the MRPC data set (GLUE version).""" 92 | 93 | def get_train_examples(self, data_dir, train_file): 94 | """See base class.""" 95 | logger.info("LOOKING AT {}".format(os.path.join(data_dir, train_file))) 96 | return self._create_examples( 97 | self._read_tsv(os.path.join(data_dir, train_file)), "train") 98 | 99 | def get_dev_examples(self, data_dir, dev_file): 100 | """See base class.""" 101 | logger.info("LOOKING AT {}".format(os.path.join(data_dir, dev_file))) 102 | return self._create_examples( 103 | self._read_tsv(os.path.join(data_dir, dev_file)), "dev") 104 | 105 | def get_test_examples(self, data_dir, test_file): 106 | """See base class.""" 107 | logger.info("LOOKING AT {}".format(os.path.join(data_dir, test_file))) 108 | return self._create_examples( 109 | self._read_tsv(os.path.join(data_dir, test_file)), "test") 110 | 111 | def get_labels(self): 112 | """See base class.""" 113 | return ["0", "1"] 114 | 115 | def _create_examples(self, lines, set_type): 116 | """Creates examples for the training and dev sets.""" 117 | examples = [] 118 | for (i, line) in enumerate(lines): 119 | guid = "%s-%s" % (set_type, i) 120 | text_a = line[3] 121 | text_b = line[4] 122 | if (set_type == 'test'): 123 | label = self.get_labels()[0] 124 | else: 125 | label = line[0] 126 | examples.append( 127 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)) 128 | if (set_type == 'test'): 129 | return examples, lines 130 | else: 131 | return examples 132 | 133 | 134 | def convert_examples_to_features(examples, label_list, max_seq_length, 135 | tokenizer, output_mode, 136 | cls_token_at_end=False, pad_on_left=False, 137 | cls_token='[CLS]', sep_token='[SEP]', pad_token=0, 138 | sequence_a_segment_id=0, 
139 |                                  cls_token_segment_id=1, pad_token_segment_id=0,
140 |                                  mask_padding_with_zero=True):
141 |     """ Loads a data file into a list of `InputBatch`s
142 |         `cls_token_at_end` defines the location of the CLS token:
143 |             - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
144 |             - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
145 |         `cls_token_segment_id` defines the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
146 |     """
147 | 
148 |     label_map = {label: i for i, label in enumerate(label_list)}
149 | 
150 |     features = []
151 |     for (ex_index, example) in enumerate(examples):
152 |         if ex_index % 10000 == 0:
153 |             logger.info("Writing example %d of %d" % (ex_index, len(examples)))
154 | 
155 |         tokens_a = tokenizer.tokenize(example.text_a)[:50]  # NL queries are truncated to their first 50 sub-tokens
156 | 
157 |         tokens_b = None
158 |         if example.text_b:
159 |             tokens_b = tokenizer.tokenize(example.text_b)
160 |             # Modifies `tokens_a` and `tokens_b` in place so that the total
161 |             # length is less than the specified length.
162 |             # Account for [CLS], [SEP], [SEP] with "- 3"
163 |             _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
164 |         else:
165 |             # Account for [CLS] and [SEP] with "- 2"
166 |             if len(tokens_a) > max_seq_length - 2:
167 |                 tokens_a = tokens_a[:(max_seq_length - 2)]
168 | 
169 |         # The convention in BERT is:
170 |         # (a) For sequence pairs:
171 |         #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
172 |         #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
173 |         # (b) For single sequences:
174 |         #  tokens:   [CLS] the dog is hairy . [SEP]
175 |         #  type_ids:   0   0   0   0  0     0   0
176 |         #
177 |         # Where "type_ids" are used to indicate whether this is the first
178 |         # sequence or the second sequence. The embedding vectors for `type=0` and
179 |         # `type=1` were learned during pre-training and are added to the wordpiece
180 |         # embedding vector (and position vector). This is not *strictly* necessary
181 |         # since the [SEP] token unambiguously separates the sequences, but it makes
182 |         # it easier for the model to learn the concept of sequences.
183 |         #
184 |         # For classification tasks, the first vector (corresponding to [CLS]) is
185 |         # used as the "sentence vector". Note that this only makes sense because
186 |         # the entire model is fine-tuned.
187 |         tokens = tokens_a + [sep_token]
188 |         segment_ids = [sequence_a_segment_id] * len(tokens)
189 | 
190 |         if tokens_b:
191 |             tokens += tokens_b + [sep_token]
192 |             segment_ids += [sequence_b_segment_id] * (len(tokens_b) + 1)
193 | 
194 |         if cls_token_at_end:
195 |             tokens = tokens + [cls_token]
196 |             segment_ids = segment_ids + [cls_token_segment_id]
197 |         else:
198 |             tokens = [cls_token] + tokens
199 |             segment_ids = [cls_token_segment_id] + segment_ids
200 | 
201 |         input_ids = tokenizer.convert_tokens_to_ids(tokens)
202 | 
203 |         # The mask has 1 for real tokens and 0 for padding tokens. Only real
204 |         # tokens are attended to.
205 |         input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
206 | 
207 |         # Zero-pad up to the sequence length.
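        # Worked example of the padding arithmetic below: with max_seq_length=8
        # and six ids for [CLS] d1 d2 [SEP] c1 [SEP], padding_length is 2, so two
        # pad_token ids and two mask zeros are appended on the right when
        # pad_on_left is False (and prepended when it is True).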
208 | padding_length = max_seq_length - len(input_ids) 209 | if pad_on_left: 210 | input_ids = ([pad_token] * padding_length) + input_ids 211 | input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask 212 | segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids 213 | else: 214 | input_ids = input_ids + ([pad_token] * padding_length) 215 | input_mask = input_mask + ([0 if mask_padding_with_zero else 1] * padding_length) 216 | segment_ids = segment_ids + ([pad_token_segment_id] * padding_length) 217 | 218 | assert len(input_ids) == max_seq_length 219 | assert len(input_mask) == max_seq_length 220 | assert len(segment_ids) == max_seq_length 221 | 222 | if output_mode == "classification": 223 | label_id = label_map[example.label] 224 | elif output_mode == "regression": 225 | label_id = float(example.label) 226 | else: 227 | raise KeyError(output_mode) 228 | 229 | if ex_index < 5: 230 | logger.info("*** Example ***") 231 | logger.info("guid: %s" % (example.guid)) 232 | logger.info("tokens: %s" % " ".join( 233 | [str(x) for x in tokens])) 234 | logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 235 | logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 236 | logger.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 237 | logger.info("label: %s (id = %d)" % (example.label, label_id)) 238 | 239 | features.append( 240 | InputFeatures(input_ids=input_ids, 241 | input_mask=input_mask, 242 | segment_ids=segment_ids, 243 | label_id=label_id)) 244 | return features 245 | 246 | 247 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 248 | """Truncates a sequence pair in place to the maximum length.""" 249 | 250 | # This is a simple heuristic which will always truncate the longer sequence 251 | # one token at a time. This makes more sense than truncating an equal percent 252 | # of tokens from each, since if one sequence is very short then each token 253 | # that's truncated likely contains more information than a longer sequence. 254 | while True: 255 | total_length = len(tokens_a) + len(tokens_b) 256 | if total_length <= max_length: 257 | break 258 | if len(tokens_a) > len(tokens_b): 259 | tokens_a.pop() 260 | else: 261 | tokens_b.pop() 262 | 263 | 264 | def simple_accuracy(preds, labels): 265 | return (preds == labels).mean() 266 | 267 | 268 | def acc_and_f1(preds, labels): 269 | acc = simple_accuracy(preds, labels) 270 | f1 = f1_score(y_true=labels, y_pred=preds) 271 | return { 272 | "acc": acc, 273 | "f1": f1, 274 | "acc_and_f1": (acc + f1) / 2, 275 | } 276 | 277 | 278 | def compute_metrics(task_name, preds, labels): 279 | assert len(preds) == len(labels) 280 | if task_name == "codesearch": 281 | return acc_and_f1(preds, labels) 282 | else: 283 | raise KeyError(task_name) 284 | 285 | 286 | processors = { 287 | "codesearch": CodesearchProcessor, 288 | } 289 | 290 | output_modes = { 291 | "codesearch": "classification", 292 | } 293 | 294 | GLUE_TASKS_NUM_LABELS = { 295 | "codesearch": 2, 296 | } 297 | -------------------------------------------------------------------------------- /GraphCodeBERT/clonedetection/README.md: -------------------------------------------------------------------------------- 1 | # Clone Detection 2 | 3 | ## Task Definition 4 | 5 | Given two codes as the input, the task is to do binary classification (0/1), where 1 stands for semantic equivalence and 0 for others. Models are evaluated by F1 score. 
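For intuition, a positive pair (label 1) is two functions that implement the same behavior with different code, while a negative pair (label 0) merely looks similar. A toy illustration follows (the benchmark used below consists of Java methods; Python is used here only for brevity):

```python
# Label 1: semantically equivalent, syntactically different.
def sum_values(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_values_builtin(xs):
    return sum(xs)

# Paired with either function above, this one would be labeled 0:
# similar shape, different behavior.
def max_value(xs):
    return max(xs)
```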
6 | 
7 | ## Dataset
8 | 
9 | The dataset we use is [BigCloneBench](https://www.cs.usask.ca/faculty/croy/papers/2014/SvajlenkoICSME2014BigERA.pdf), filtered following the paper [Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree](https://arxiv.org/pdf/2002.08653.pdf).
10 | 
11 | ### Data Format
12 | 
13 | 1. dataset/data.jsonl is stored in jsonlines format. Each line in the uncompressed file represents one function, with the following fields:
14 | 
15 |    - **func:** the function
16 | 
17 |    - **idx:** index of the example
18 | 
19 | 2. train.txt/valid.txt/test.txt provide examples, one per line, as tab-separated triples: idx1 idx2 label
20 | 
21 | ### Data Statistics
22 | 
23 | Data statistics of the dataset are shown in the table below:
24 | 
25 | |       | #Examples |
26 | | ----- | :-------: |
27 | | Train |  901,028  |
28 | | Dev   |  415,416  |
29 | | Test  |  415,416  |
30 | 
31 | You can obtain the data with the following command:
32 | 
33 | ```
34 | unzip dataset.zip
35 | ```
36 | 
37 | ## Evaluator
38 | 
39 | We provide a script that evaluates predictions for this task and reports the F1 score.
40 | 
41 | ### Example
42 | 
43 | ```bash
44 | python evaluator/evaluator.py -a evaluator/answers.txt -p evaluator/predictions.txt
45 | ```
46 | 
47 | {'Recall': 0.25, 'Precision': 0.5, 'F1': 0.3333333333333333}
48 | 
49 | ### Input predictions
50 | 
51 | A predictions file in TXT format, such as evaluator/predictions.txt, with one whitespace-separated "idx1 idx2 label" triple per line. For example:
52 | 
53 | ```
54 | 13653451 21955002 0
55 | 1188160 8831513 1
56 | 1141235 14322332 0
57 | 16765164 17526811 1
58 | ```
59 | 
60 | ## Pipeline-GraphCodeBERT
61 | 
62 | We also provide a pipeline that fine-tunes GraphCodeBERT on this task.
63 | ### Dependency
64 | 
65 | - pip install torch
66 | - pip install transformers
67 | - pip install tree_sitter
68 | - pip install sklearn
69 | 
70 | ### Tree-sitter (optional)
71 | 
72 | If the built file "parser/my-languages.so" doesn't work for you, please rebuild it with the following commands:
73 | 
74 | ```shell
75 | cd parser
76 | bash build.sh
77 | cd ..
78 | ```
79 | 
80 | ### Fine-tune
81 | 
82 | We use 4 V100-16G GPUs for fine-tuning and 10% of the validation data for evaluation.
83 | 
84 | 
85 | ```shell
86 | mkdir saved_models
87 | python run.py \
88 |     --output_dir=saved_models \
89 |     --config_name=microsoft/graphcodebert-base \
90 |     --model_name_or_path=microsoft/graphcodebert-base \
91 |     --tokenizer_name=microsoft/graphcodebert-base \
92 |     --do_train \
93 |     --train_data_file=dataset/train.txt \
94 |     --eval_data_file=dataset/valid.txt \
95 |     --test_data_file=dataset/test.txt \
96 |     --epoch 1 \
97 |     --code_length 512 \
98 |     --data_flow_length 128 \
99 |     --train_batch_size 16 \
100 |     --eval_batch_size 32 \
101 |     --learning_rate 2e-5 \
102 |     --max_grad_norm 1.0 \
103 |     --evaluate_during_training \
104 |     --seed 123456 2>&1| tee saved_models/train.log
105 | ```
106 | 
107 | ### Inference
108 | 
109 | We use the full test data for inference.
110 | 
111 | ```shell
112 | python run.py \
113 |     --output_dir=saved_models \
114 |     --config_name=microsoft/graphcodebert-base \
115 |     --model_name_or_path=microsoft/graphcodebert-base \
116 |     --tokenizer_name=microsoft/graphcodebert-base \
117 |     --do_eval \
118 |     --do_test \
119 |     --train_data_file=dataset/train.txt \
120 |     --eval_data_file=dataset/valid.txt \
121 |     --test_data_file=dataset/test.txt \
122 |     --epoch 1 \
123 |     --code_length 512 \
124 |     --data_flow_length 128 \
125 |     --train_batch_size 16 \
126 |     --eval_batch_size 32 \
127 |     --learning_rate 2e-5 \
128 |     --max_grad_norm 1.0 \
129 |     --evaluate_during_training \
130 |     --seed 123456 2>&1| tee saved_models/test.log
131 | ```
132 | 
133 | ### Evaluation
134 | 
135 | ```shell
136 | python evaluator/evaluator.py -a dataset/test.txt -p saved_models/predictions.txt 2>&1| tee saved_models/score.log
137 | ```
138 | 
139 | ## Results
140 | 
141 | The results on the test set are shown below:
142 | 
143 | | Method        | Precision | Recall    | F1        |
144 | | ------------- | :-------: | :-------: | :-------: |
145 | | Deckard       | 0.93      | 0.02      | 0.03      |
146 | | RtvNN         | 0.95      | 0.01      | 0.01      |
147 | | CDLH          | 0.92      | 0.74      | 0.82      |
148 | | ASTNN         | 0.92      | 0.94      | 0.93      |
149 | | FA-AST-GMN    | 0.96      | 0.94      | 0.95      |
150 | | CodeBERT      | 0.964     | 0.966     | 0.965     |
151 | | GraphCodeBERT | **0.973** | **0.968** | **0.971** |
152 | 
153 | 
--------------------------------------------------------------------------------
/GraphCodeBERT/clonedetection/dataset.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/clonedetection/dataset.zip
--------------------------------------------------------------------------------
/GraphCodeBERT/clonedetection/evaluator/answers.txt:
--------------------------------------------------------------------------------
1 | 13653451 21955002 0
2 | 1188160 8831513 0
3 | 1141235 14322332 0
4 | 16765164 17526811 0
--------------------------------------------------------------------------------
/GraphCodeBERT/clonedetection/evaluator/evaluator.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Microsoft Corporation.
2 | # Licensed under the MIT license.
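# Reads whitespace-separated "idx1 idx2 label" triples from the answer and
# prediction files and reports macro-averaged Recall, Precision and F1 over
# the pairs that appear in the answer file.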
3 | import logging
4 | import sys
5 | from sklearn.metrics import recall_score,precision_score,f1_score
6 | 
7 | def read_answers(filename):
8 |     answers={}
9 |     with open(filename) as f:
10 |         for line in f:
11 |             line=line.strip()
12 |             idx1,idx2,label=line.split()
13 |             answers[(idx1,idx2)]=label
14 |     return answers
15 | 
16 | def read_predictions(filename):
17 |     predictions={}
18 |     with open(filename) as f:
19 |         for line in f:
20 |             line=line.strip()
21 |             idx1,idx2,label=line.split()
22 |             if 'txt' in line:
23 |                 idx1=idx1.split('/')[-1][:-4]
24 |                 idx2=idx2.split('/')[-1][:-4]
25 |             predictions[(idx1,idx2)]=label
26 |     return predictions
27 | 
28 | def calculate_scores(answers,predictions):
29 |     y_trues,y_preds=[],[]
30 |     for key in answers:
31 |         if key not in predictions:
32 |             logging.error("Missing prediction for ({},{}) pair.".format(key[0],key[1]))
33 |             sys.exit()
34 |         y_trues.append(answers[key])
35 |         y_preds.append(predictions[key])
36 |     scores={}
37 |     scores['Recall']=recall_score(y_trues, y_preds, average='macro')
38 |     scores['Precision']=precision_score(y_trues, y_preds, average='macro')
39 |     scores['F1']=f1_score(y_trues, y_preds, average='macro')
40 |     return scores
41 | 
42 | def main():
43 |     import argparse
44 |     parser = argparse.ArgumentParser(description='Evaluate leaderboard predictions for BigCloneBench dataset.')
45 |     parser.add_argument('--answers', '-a',help="filename of the labels, in txt format.")
46 |     parser.add_argument('--predictions', '-p',help="filename of the leaderboard predictions, in txt format.")
47 | 
48 | 
49 |     args = parser.parse_args()
50 |     answers=read_answers(args.answers)
51 |     predictions=read_predictions(args.predictions)
52 |     scores=calculate_scores(answers,predictions)
53 |     print(scores)
54 | 
55 | if __name__ == '__main__':
56 |     main()
57 | 
--------------------------------------------------------------------------------
/GraphCodeBERT/clonedetection/evaluator/predictions.txt:
--------------------------------------------------------------------------------
1 | 13653451 21955002 0
2 | 1188160 8831513 1
3 | 1141235 14322332 0
4 | 16765164 17526811 1
--------------------------------------------------------------------------------
/GraphCodeBERT/clonedetection/model.py:
--------------------------------------------------------------------------------
1 | import torch
2 | import torch.nn as nn
3 | from torch.autograd import Variable
4 | import copy
5 | import torch.nn.functional as F
6 | from torch.nn import CrossEntropyLoss, MSELoss
7 | 
8 | class RobertaClassificationHead(nn.Module):
9 |     """Head for sentence-level classification tasks."""
10 | 
11 |     def __init__(self, config):
12 |         super().__init__()
13 |         self.dense = nn.Linear(config.hidden_size*2, config.hidden_size)
14 |         self.dropout = nn.Dropout(config.hidden_dropout_prob)
15 |         self.out_proj = nn.Linear(config.hidden_size, 2)
16 | 
17 |     def forward(self, features, **kwargs):
18 |         x = features[:, 0, :]  # take the <s> token (equivalent to [CLS] in BERT)
19 |         x = x.reshape(-1, x.size(-1)*2)  # concatenate the two functions' <s> vectors
20 |         x = self.dropout(x)
21 |         x = self.dense(x)
22 |         x = torch.tanh(x)
23 |         x = self.dropout(x)
24 |         x = self.out_proj(x)
25 |         return x
26 | 
27 | class Model(nn.Module):
28 |     def __init__(self, encoder,config,tokenizer,args):
29 |         super(Model, self).__init__()
30 |         self.encoder = encoder
31 |         self.config=config
32 |         self.tokenizer=tokenizer
33 |         self.classifier=RobertaClassificationHead(config)
34 |         self.args=args
35 | 
36 | 
37 |     def forward(self, inputs_ids_1,position_idx_1,attn_mask_1,inputs_ids_2,position_idx_2,attn_mask_2,labels=None):
38 |         bs,l=inputs_ids_1.size()
39 |         inputs_ids=torch.cat((inputs_ids_1.unsqueeze(1),inputs_ids_2.unsqueeze(1)),1).view(bs*2,l)
40 |         position_idx=torch.cat((position_idx_1.unsqueeze(1),position_idx_2.unsqueeze(1)),1).view(bs*2,l)
41 |         attn_mask=torch.cat((attn_mask_1.unsqueeze(1),attn_mask_2.unsqueeze(1)),1).view(bs*2,l,l)
42 | 
43 |         #embedding
44 |         nodes_mask=position_idx.eq(0)
45 |         token_mask=position_idx.ge(2)
46 |         inputs_embeddings=self.encoder.roberta.embeddings.word_embeddings(inputs_ids)
47 |         nodes_to_token_mask=nodes_mask[:,:,None]&token_mask[:,None,:]&attn_mask
48 |         nodes_to_token_mask=nodes_to_token_mask/(nodes_to_token_mask.sum(-1)+1e-10)[:,:,None]
49 |         avg_embeddings=torch.einsum("abc,acd->abd",nodes_to_token_mask,inputs_embeddings)
50 |         inputs_embeddings=inputs_embeddings*(~nodes_mask)[:,:,None]+avg_embeddings*nodes_mask[:,:,None]
51 | 
52 |         outputs = self.encoder.roberta(inputs_embeds=inputs_embeddings,attention_mask=attn_mask,position_ids=position_idx)[0]
53 |         logits=self.classifier(outputs)
54 |         prob=F.softmax(logits, dim=-1)  # softmax over the two classes
55 |         if labels is not None:
56 |             loss_fct = CrossEntropyLoss()
57 |             loss = loss_fct(logits, labels)
58 |             return loss,prob
59 |         else:
60 |             return prob
61 | 
--------------------------------------------------------------------------------
/GraphCodeBERT/clonedetection/parser/__init__.py:
--------------------------------------------------------------------------------
1 | from .utils import (remove_comments_and_docstrings,
2 |                     tree_to_token_index,
3 |                     index_to_code_token,
4 |                     tree_to_variable_index)
5 | from .DFG import DFG_python,DFG_java,DFG_ruby,DFG_go,DFG_php,DFG_javascript,DFG_csharp
--------------------------------------------------------------------------------
/GraphCodeBERT/clonedetection/parser/build.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Microsoft Corporation.
2 | # Licensed under the MIT license.
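# Compiles the tree-sitter grammars cloned by build.sh into a single shared
# library (my-languages.so) that the Python tree_sitter bindings load at runtime.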
3 | 4 | from tree_sitter import Language, Parser 5 | 6 | Language.build_library( 7 | # Store the library in the `build` directory 8 | 'my-languages.so', 9 | 10 | # Include one or more languages 11 | [ 12 | 'tree-sitter-go', 13 | 'tree-sitter-javascript', 14 | 'tree-sitter-python', 15 | 'tree-sitter-php', 16 | 'tree-sitter-java', 17 | 'tree-sitter-ruby', 18 | 'tree-sitter-c-sharp', 19 | ] 20 | ) 21 | 22 | -------------------------------------------------------------------------------- /GraphCodeBERT/clonedetection/parser/build.sh: -------------------------------------------------------------------------------- 1 | git clone https://github.com/tree-sitter/tree-sitter-go 2 | git clone https://github.com/tree-sitter/tree-sitter-javascript 3 | git clone https://github.com/tree-sitter/tree-sitter-python 4 | git clone https://github.com/tree-sitter/tree-sitter-ruby 5 | git clone https://github.com/tree-sitter/tree-sitter-php 6 | git clone https://github.com/tree-sitter/tree-sitter-java 7 | git clone https://github.com/tree-sitter/tree-sitter-c-sharp 8 | python build.py 9 | -------------------------------------------------------------------------------- /GraphCodeBERT/clonedetection/parser/my-languages.so: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/clonedetection/parser/my-languages.so -------------------------------------------------------------------------------- /GraphCodeBERT/clonedetection/parser/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | from io import StringIO 3 | import tokenize 4 | def remove_comments_and_docstrings(source,lang): 5 | if lang in ['python']: 6 | """ 7 | Returns 'source' minus comments and docstrings. 
8 | """ 9 | io_obj = StringIO(source) 10 | out = "" 11 | prev_toktype = tokenize.INDENT 12 | last_lineno = -1 13 | last_col = 0 14 | for tok in tokenize.generate_tokens(io_obj.readline): 15 | token_type = tok[0] 16 | token_string = tok[1] 17 | start_line, start_col = tok[2] 18 | end_line, end_col = tok[3] 19 | ltext = tok[4] 20 | if start_line > last_lineno: 21 | last_col = 0 22 | if start_col > last_col: 23 | out += (" " * (start_col - last_col)) 24 | # Remove comments: 25 | if token_type == tokenize.COMMENT: 26 | pass 27 | # This series of conditionals removes docstrings: 28 | elif token_type == tokenize.STRING: 29 | if prev_toktype != tokenize.INDENT: 30 | # This is likely a docstring; double-check we're not inside an operator: 31 | if prev_toktype != tokenize.NEWLINE: 32 | if start_col > 0: 33 | out += token_string 34 | else: 35 | out += token_string 36 | prev_toktype = token_type 37 | last_col = end_col 38 | last_lineno = end_line 39 | temp=[] 40 | for x in out.split('\n'): 41 | if x.strip()!="": 42 | temp.append(x) 43 | return '\n'.join(temp) 44 | elif lang in ['ruby']: 45 | return source 46 | else: 47 | def replacer(match): 48 | s = match.group(0) 49 | if s.startswith('/'): 50 | return " " # note: a space and not an empty string 51 | else: 52 | return s 53 | pattern = re.compile( 54 | r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"', 55 | re.DOTALL | re.MULTILINE 56 | ) 57 | temp=[] 58 | for x in re.sub(pattern, replacer, source).split('\n'): 59 | if x.strip()!="": 60 | temp.append(x) 61 | return '\n'.join(temp) 62 | 63 | def tree_to_token_index(root_node): 64 | if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment': 65 | return [(root_node.start_point,root_node.end_point)] 66 | else: 67 | code_tokens=[] 68 | for child in root_node.children: 69 | code_tokens+=tree_to_token_index(child) 70 | return code_tokens 71 | 72 | def tree_to_variable_index(root_node,index_to_code): 73 | if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment': 74 | index=(root_node.start_point,root_node.end_point) 75 | _,code=index_to_code[index] 76 | if root_node.type!=code: 77 | return [(root_node.start_point,root_node.end_point)] 78 | else: 79 | return [] 80 | else: 81 | code_tokens=[] 82 | for child in root_node.children: 83 | code_tokens+=tree_to_variable_index(child,index_to_code) 84 | return code_tokens 85 | 86 | def index_to_code_token(index,code): 87 | start_point=index[0] 88 | end_point=index[1] 89 | if start_point[0]==end_point[0]: 90 | s=code[start_point[0]][start_point[1]:end_point[1]] 91 | else: 92 | s="" 93 | s+=code[start_point[0]][start_point[1]:] 94 | for i in range(start_point[0]+1,end_point[0]): 95 | s+=code[i] 96 | s+=code[end_point[0]][:end_point[1]] 97 | return s 98 | -------------------------------------------------------------------------------- /GraphCodeBERT/clonedetection/run.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | """
17 | Fine-tuning the library models for language modeling on a text file (GPT, GPT-2, BERT, RoBERTa).
18 | GPT and GPT-2 are fine-tuned using a causal language modeling (CLM) loss while BERT and RoBERTa are fine-tuned
19 | using a masked language modeling (MLM) loss.
20 | """
21 | 
22 | from __future__ import absolute_import, division, print_function
23 | 
24 | import argparse
25 | import glob
26 | import logging
27 | import os
28 | import pickle
29 | import random
30 | import re
31 | import shutil
32 | import json
33 | import numpy as np
34 | import torch
35 | from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler,TensorDataset
36 | from torch.utils.data.distributed import DistributedSampler
37 | from transformers import (WEIGHTS_NAME, AdamW, get_linear_schedule_with_warmup,
38 |                           RobertaConfig, RobertaForSequenceClassification, RobertaTokenizer)
39 | from tqdm import tqdm, trange
40 | import multiprocessing
41 | from model import Model
42 | 
43 | cpu_cont = 16
44 | logger = logging.getLogger(__name__)
45 | 
46 | from parser import DFG_python,DFG_java,DFG_ruby,DFG_go,DFG_php,DFG_javascript
47 | from parser import (remove_comments_and_docstrings,
48 |                     tree_to_token_index,
49 |                     index_to_code_token,
50 |                     tree_to_variable_index)
51 | from tree_sitter import Language, Parser
52 | dfg_function={
53 |     'python':DFG_python,
54 |     'java':DFG_java,
55 |     'ruby':DFG_ruby,
56 |     'go':DFG_go,
57 |     'php':DFG_php,
58 |     'javascript':DFG_javascript
59 | }
60 | 
61 | #load parsers
62 | parsers={}
63 | for lang in dfg_function:
64 |     LANGUAGE = Language('parser/my-languages.so', lang)
65 |     parser = Parser()
66 |     parser.set_language(LANGUAGE)
67 |     parser = [parser,dfg_function[lang]]
68 |     parsers[lang]= parser
69 | 
70 | 
71 | #remove comments, tokenize code and extract dataflow
72 | def extract_dataflow(code, parser,lang):
73 |     #remove comments
74 |     try:
75 |         code=remove_comments_and_docstrings(code,lang)
76 |     except:
77 |         pass
78 |     #obtain dataflow
79 |     if lang=="php":
80 |         code="<?php"+code+"?>"  # tree-sitter's PHP grammar needs the <?php ... ?> wrapper
81 |     try:
82 |         tree = parser[0].parse(bytes(code,'utf8'))
83 |         root_node = tree.root_node
84 |         tokens_index=tree_to_token_index(root_node)
85 |         code=code.split('\n')
86 |         code_tokens=[index_to_code_token(x,code) for x in tokens_index]
87 |         index_to_code={}
88 |         for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
89 |             index_to_code[index]=(idx,code)
90 |         try:
91 |             DFG,_=parser[1](root_node,index_to_code,{})
92 |         except:
93 |             DFG=[]
94 |         DFG=sorted(DFG,key=lambda x:x[1])
95 |         indexs=set()
96 |         for d in DFG:
97 |             if len(d[-1])!=0:
98 |                 indexs.add(d[1])
99 |             for x in d[-1]:
100 |                 indexs.add(x)
101 |         new_DFG=[]
102 |         for d in DFG:
103 |             if d[1] in indexs:
104 |                 new_DFG.append(d)
105 |         dfg=new_DFG
106 |     except:
107 |         dfg=[]
108 |     return code_tokens,dfg
109 | 
110 | class InputFeatures(object):
111 |     """A single training/test features for an example."""
112 |     def __init__(self,
113 |                  input_tokens_1,
114 |                  input_ids_1,
115 |                  position_idx_1,
116 |                  dfg_to_code_1,
117 |                  dfg_to_dfg_1,
118 |                  input_tokens_2,
119 |                  input_ids_2,
120 |                  position_idx_2,
121 | 
dfg_to_code_2, 122 | dfg_to_dfg_2, 123 | label, 124 | url1, 125 | url2 126 | 127 | ): 128 | #The first code function 129 | self.input_tokens_1 = input_tokens_1 130 | self.input_ids_1 = input_ids_1 131 | self.position_idx_1=position_idx_1 132 | self.dfg_to_code_1=dfg_to_code_1 133 | self.dfg_to_dfg_1=dfg_to_dfg_1 134 | 135 | #The second code function 136 | self.input_tokens_2 = input_tokens_2 137 | self.input_ids_2 = input_ids_2 138 | self.position_idx_2=position_idx_2 139 | self.dfg_to_code_2=dfg_to_code_2 140 | self.dfg_to_dfg_2=dfg_to_dfg_2 141 | 142 | #label 143 | self.label=label 144 | self.url1=url1 145 | self.url2=url2 146 | 147 | 148 | def convert_examples_to_features(item): 149 | #source 150 | url1,url2,label,tokenizer, args,cache,url_to_code=item 151 | parser=parsers['java'] 152 | 153 | for url in [url1,url2]: 154 | if url not in cache: 155 | func=url_to_code[url] 156 | 157 | #extract data flow 158 | code_tokens,dfg=extract_dataflow(func,parser,'java') 159 | code_tokens=[tokenizer.tokenize('@ '+x)[1:] if idx!=0 else tokenizer.tokenize(x) for idx,x in enumerate(code_tokens)] 160 | ori2cur_pos={} 161 | ori2cur_pos[-1]=(0,0) 162 | for i in range(len(code_tokens)): 163 | ori2cur_pos[i]=(ori2cur_pos[i-1][1],ori2cur_pos[i-1][1]+len(code_tokens[i])) 164 | code_tokens=[y for x in code_tokens for y in x] 165 | 166 | #truncating 167 | code_tokens=code_tokens[:args.code_length+args.data_flow_length-3-min(len(dfg),args.data_flow_length)][:512-3] 168 | source_tokens =[tokenizer.cls_token]+code_tokens+[tokenizer.sep_token] 169 | source_ids = tokenizer.convert_tokens_to_ids(source_tokens) 170 | position_idx = [i+tokenizer.pad_token_id + 1 for i in range(len(source_tokens))] 171 | dfg=dfg[:args.code_length+args.data_flow_length-len(source_tokens)] 172 | source_tokens+=[x[0] for x in dfg] 173 | position_idx+=[0 for x in dfg] 174 | source_ids+=[tokenizer.unk_token_id for x in dfg] 175 | padding_length=args.code_length+args.data_flow_length-len(source_ids) 176 | position_idx+=[tokenizer.pad_token_id]*padding_length 177 | source_ids+=[tokenizer.pad_token_id]*padding_length 178 | 179 | #reindex 180 | reverse_index={} 181 | for idx,x in enumerate(dfg): 182 | reverse_index[x[1]]=idx 183 | for idx,x in enumerate(dfg): 184 | dfg[idx]=x[:-1]+([reverse_index[i] for i in x[-1] if i in reverse_index],) 185 | dfg_to_dfg=[x[-1] for x in dfg] 186 | dfg_to_code=[ori2cur_pos[x[1]] for x in dfg] 187 | length=len([tokenizer.cls_token]) 188 | dfg_to_code=[(x[0]+length,x[1]+length) for x in dfg_to_code] 189 | cache[url]=source_tokens,source_ids,position_idx,dfg_to_code,dfg_to_dfg 190 | 191 | 192 | source_tokens_1,source_ids_1,position_idx_1,dfg_to_code_1,dfg_to_dfg_1=cache[url1] 193 | source_tokens_2,source_ids_2,position_idx_2,dfg_to_code_2,dfg_to_dfg_2=cache[url2] 194 | return InputFeatures(source_tokens_1,source_ids_1,position_idx_1,dfg_to_code_1,dfg_to_dfg_1, 195 | source_tokens_2,source_ids_2,position_idx_2,dfg_to_code_2,dfg_to_dfg_2, 196 | label,url1,url2) 197 | 198 | class TextDataset(Dataset): 199 | def __init__(self, tokenizer, args, file_path='train'): 200 | self.examples = [] 201 | self.args=args 202 | index_filename=file_path 203 | 204 | #load index 205 | logger.info("Creating features from index file at %s ", index_filename) 206 | url_to_code={} 207 | with open('/'.join(index_filename.split('/')[:-1])+'/data.jsonl') as f: 208 | for line in f: 209 | line=line.strip() 210 | js=json.loads(line) 211 | url_to_code[js['idx']]=js['func'] 212 | 213 | #load code function according to index 214 | data=[] 215 | 
cache={}  # memoize features per function idx: each function appears in many pairs but is converted only once
216 | 
217 |         with open(index_filename) as f:
218 |             for line in f:
219 |                 line=line.strip()
220 |                 url1,url2,label=line.split('\t')
221 |                 if url1 not in url_to_code or url2 not in url_to_code:
222 |                     continue
223 |                 if label=='0':
224 |                     label=0
225 |                 else:
226 |                     label=1
227 |                 data.append((url1,url2,label,tokenizer, args,cache,url_to_code))
228 | 
229 |         #only use 10% valid data to keep best model
230 |         if 'valid' in file_path:
231 |             data=random.sample(data,int(len(data)*0.1))
232 | 
233 |         #convert example to input features
234 |         self.examples=[convert_examples_to_features(x) for x in tqdm(data,total=len(data))]
235 | 
236 |         if 'train' in file_path:
237 |             for idx, example in enumerate(self.examples[:3]):
238 |                 logger.info("*** Example ***")
239 |                 logger.info("idx: {}".format(idx))
240 |                 logger.info("label: {}".format(example.label))
241 |                 logger.info("input_tokens_1: {}".format([x.replace('\u0120','_') for x in example.input_tokens_1]))
242 |                 logger.info("input_ids_1: {}".format(' '.join(map(str, example.input_ids_1))))
243 |                 logger.info("position_idx_1: {}".format(example.position_idx_1))
244 |                 logger.info("dfg_to_code_1: {}".format(' '.join(map(str, example.dfg_to_code_1))))
245 |                 logger.info("dfg_to_dfg_1: {}".format(' '.join(map(str, example.dfg_to_dfg_1))))
246 | 
247 |                 logger.info("input_tokens_2: {}".format([x.replace('\u0120','_') for x in example.input_tokens_2]))
248 |                 logger.info("input_ids_2: {}".format(' '.join(map(str, example.input_ids_2))))
249 |                 logger.info("position_idx_2: {}".format(example.position_idx_2))
250 |                 logger.info("dfg_to_code_2: {}".format(' '.join(map(str, example.dfg_to_code_2))))
251 |                 logger.info("dfg_to_dfg_2: {}".format(' '.join(map(str, example.dfg_to_dfg_2))))
252 | 
253 | 
254 |     def __len__(self):
255 |         return len(self.examples)
256 | 
257 |     def __getitem__(self, item):
258 |         #calculate graph-guided masked function
259 |         attn_mask_1= np.zeros((self.args.code_length+self.args.data_flow_length,
260 |                                self.args.code_length+self.args.data_flow_length),dtype=bool)
261 |         #calculate begin index of node and max length of input
262 |         node_index=sum([i>1 for i in self.examples[item].position_idx_1])
263 |         max_length=sum([i!=1 for i in self.examples[item].position_idx_1])
264 |         #sequence can attend to sequence
265 |         attn_mask_1[:node_index,:node_index]=True
266 |         #special tokens attend to all tokens
267 |         for idx,i in enumerate(self.examples[item].input_ids_1):
268 |             if i in [0,2]:
269 |                 attn_mask_1[idx,:max_length]=True
270 |         #nodes attend to the code tokens they are identified from
271 |         for idx,(a,b) in enumerate(self.examples[item].dfg_to_code_1):
272 |             if a<node_index and b<node_index:
273 |                 attn_mask_1[idx+node_index,a:b]=True
274 |                 attn_mask_1[a:b,idx+node_index]=True
275 |         #nodes attend to adjacent nodes
276 |         for idx,nodes in enumerate(self.examples[item].dfg_to_dfg_1):
277 |             for a in nodes:
278 |                 if a+node_index<len(self.examples[item].position_idx_1):
279 |                     attn_mask_1[idx+node_index,a+node_index]=True
280 | 
281 |         #calculate graph-guided masked function
282 |         attn_mask_2= np.zeros((self.args.code_length+self.args.data_flow_length,
283 |                                self.args.code_length+self.args.data_flow_length),dtype=bool)
284 |         #calculate begin index of node and max length of input
285 |         node_index=sum([i>1 for i in self.examples[item].position_idx_2])
286 |         max_length=sum([i!=1 for i in self.examples[item].position_idx_2])
287 |         #sequence can attend to sequence
288 |         attn_mask_2[:node_index,:node_index]=True
289 |         #special tokens attend to all tokens
290 |         for idx,i in enumerate(self.examples[item].input_ids_2):
291 |             if i in [0,2]:
292 |                 attn_mask_2[idx,:max_length]=True
293 |         #nodes attend to the code tokens they are identified from
294 |         for idx,(a,b) in enumerate(self.examples[item].dfg_to_code_2):
295 |             if a<node_index and b<node_index:
296 |                 attn_mask_2[idx+node_index,a:b]=True
297 |                 attn_mask_2[a:b,idx+node_index]=True
298 |         #nodes attend to adjacent nodes
299 |         for idx,nodes in enumerate(self.examples[item].dfg_to_dfg_2):
300 |             for a in nodes:
301 |                 if a+node_index<len(self.examples[item].position_idx_2):
302 |                     attn_mask_2[idx+node_index,a+node_index]=True
303 | 
304 |         return (torch.tensor(self.examples[item].input_ids_1),
305 |                 torch.tensor(self.examples[item].position_idx_1),
306 |                 torch.tensor(attn_mask_1),
307 |                 torch.tensor(self.examples[item].input_ids_2),
308 |                 torch.tensor(self.examples[item].position_idx_2),
309 |                 torch.tensor(attn_mask_2),
310 |                 torch.tensor(self.examples[item].label))
311 | 
312 | 
313 | def set_seed(args):
314 |     random.seed(args.seed)
315 |     np.random.seed(args.seed)
316 |     torch.manual_seed(args.seed)
317 |     if args.n_gpu > 0:
318 |         torch.cuda.manual_seed_all(args.seed)
319 | 
320 | 
321 | def train(args, train_dataset, model, tokenizer):
322 |     """ Train the model """
323 | 
324 |     #build dataloader
325 |     train_sampler = RandomSampler(train_dataset)
326 |     train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size,num_workers=4)
327 | 
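    # Sanity check on the schedule derived below (assuming the README's
    # clone-detection settings: 901,028 training pairs, batch size 16, 1 epoch):
    # len(train_dataloader) is ~56,315 batches, so max_steps ~= 56,315,
    # save_steps ~= 5,631 (roughly ten evaluations per epoch) and
    # warmup_steps ~= 11,263 (20% of all optimization steps).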
args.max_steps=args.epochs*len( train_dataloader) 329 | args.save_steps=len( train_dataloader)//10 330 | args.warmup_steps=args.max_steps//5 331 | model.to(args.device) 332 | 333 | # Prepare optimizer and schedule (linear warmup and decay) 334 | no_decay = ['bias', 'LayerNorm.weight'] 335 | optimizer_grouped_parameters = [ 336 | {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 337 | 'weight_decay': args.weight_decay}, 338 | {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0} 339 | ] 340 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 341 | scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=args.warmup_steps, 342 | num_training_steps=args.max_steps) 343 | 344 | # multi-gpu training 345 | if args.n_gpu > 1: 346 | model = torch.nn.DataParallel(model) 347 | 348 | # Train! 349 | logger.info("***** Running training *****") 350 | logger.info(" Num examples = %d", len(train_dataset)) 351 | logger.info(" Num Epochs = %d", args.epochs) 352 | logger.info(" Instantaneous batch size per GPU = %d", args.train_batch_size//args.n_gpu) 353 | logger.info(" Total train batch size = %d",args.train_batch_size*args.gradient_accumulation_steps) 354 | logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) 355 | logger.info(" Total optimization steps = %d", args.max_steps) 356 | 357 | global_step=0 358 | tr_loss, logging_loss,avg_loss,tr_nb,tr_num,train_loss = 0.0, 0.0,0.0,0,0,0 359 | best_f1=0 360 | 361 | model.zero_grad() 362 | 363 | for idx in range(args.epochs): 364 | bar = tqdm(train_dataloader,total=len(train_dataloader)) 365 | tr_num=0 366 | train_loss=0 367 | for step, batch in enumerate(bar): 368 | (inputs_ids_1,position_idx_1,attn_mask_1, 369 | inputs_ids_2,position_idx_2,attn_mask_2, 370 | labels)=[x.to(args.device) for x in batch] 371 | model.train() 372 | loss,logits = model(inputs_ids_1,position_idx_1,attn_mask_1,inputs_ids_2,position_idx_2,attn_mask_2,labels) 373 | 374 | if args.n_gpu > 1: 375 | loss = loss.mean() 376 | 377 | if args.gradient_accumulation_steps > 1: 378 | loss = loss / args.gradient_accumulation_steps 379 | 380 | loss.backward() 381 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 382 | 383 | tr_loss += loss.item() 384 | tr_num+=1 385 | train_loss+=loss.item() 386 | if avg_loss==0: 387 | avg_loss=tr_loss 388 | 389 | avg_loss=round(train_loss/tr_num,5) 390 | bar.set_description("epoch {} loss {}".format(idx,avg_loss)) 391 | 392 | if (step + 1) % args.gradient_accumulation_steps == 0: 393 | optimizer.step() 394 | optimizer.zero_grad() 395 | scheduler.step() 396 | global_step += 1 397 | output_flag=True 398 | avg_loss=round(np.exp((tr_loss - logging_loss) /(global_step- tr_nb)),4) 399 | 400 | if global_step % args.save_steps == 0: 401 | results = evaluate(args, model, tokenizer, eval_when_training=True) 402 | 403 | # Save model checkpoint 404 | if results['eval_f1']>best_f1: 405 | best_f1=results['eval_f1'] 406 | logger.info(" "+"*"*20) 407 | logger.info(" Best f1:%s",round(best_f1,4)) 408 | logger.info(" "+"*"*20) 409 | 410 | checkpoint_prefix = 'checkpoint-best-f1' 411 | output_dir = os.path.join(args.output_dir, '{}'.format(checkpoint_prefix)) 412 | if not os.path.exists(output_dir): 413 | os.makedirs(output_dir) 414 | model_to_save = model.module if hasattr(model,'module') else model 415 | output_dir = os.path.join(output_dir, '{}'.format('model.bin')) 416 | 
torch.save(model_to_save.state_dict(), output_dir) 417 | logger.info("Saving model checkpoint to %s", output_dir) 418 | 419 | def evaluate(args, model, tokenizer, eval_when_training=False): 420 | #build dataloader 421 | eval_dataset = TextDataset(tokenizer, args, file_path=args.eval_data_file) 422 | eval_sampler = SequentialSampler(eval_dataset) 423 | eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler,batch_size=args.eval_batch_size,num_workers=4) 424 | 425 | # multi-gpu evaluate 426 | if args.n_gpu > 1 and eval_when_training is False: 427 | model = torch.nn.DataParallel(model) 428 | 429 | # Eval! 430 | logger.info("***** Running evaluation *****") 431 | logger.info(" Num examples = %d", len(eval_dataset)) 432 | logger.info(" Batch size = %d", args.eval_batch_size) 433 | 434 | eval_loss = 0.0 435 | nb_eval_steps = 0 436 | model.eval() 437 | logits=[] 438 | y_trues=[] 439 | for batch in eval_dataloader: 440 | (inputs_ids_1,position_idx_1,attn_mask_1, 441 | inputs_ids_2,position_idx_2,attn_mask_2, 442 | labels)=[x.to(args.device) for x in batch] 443 | with torch.no_grad(): 444 | lm_loss,logit = model(inputs_ids_1,position_idx_1,attn_mask_1,inputs_ids_2,position_idx_2,attn_mask_2,labels) 445 | eval_loss += lm_loss.mean().item() 446 | logits.append(logit.cpu().numpy()) 447 | y_trues.append(labels.cpu().numpy()) 448 | nb_eval_steps += 1 449 | 450 | #calculate scores 451 | logits=np.concatenate(logits,0) 452 | y_trues=np.concatenate(y_trues,0) 453 | best_threshold=0.5 454 | best_f1=0 455 | 456 | y_preds=logits[:,1]>best_threshold 457 | from sklearn.metrics import recall_score 458 | recall=recall_score(y_trues, y_preds, average='macro') 459 | from sklearn.metrics import precision_score 460 | precision=precision_score(y_trues, y_preds, average='macro') 461 | from sklearn.metrics import f1_score 462 | f1=f1_score(y_trues, y_preds, average='macro') 463 | result = { 464 | "eval_recall": float(recall), 465 | "eval_precision": float(precision), 466 | "eval_f1": float(f1), 467 | "eval_threshold":best_threshold, 468 | 469 | } 470 | 471 | logger.info("***** Eval results *****") 472 | for key in sorted(result.keys()): 473 | logger.info(" %s = %s", key, str(round(result[key],4))) 474 | 475 | return result 476 | 477 | def test(args, model, tokenizer, best_threshold=0): 478 | #build dataloader 479 | eval_dataset = TextDataset(tokenizer, args, file_path=args.test_data_file) 480 | eval_sampler = SequentialSampler(eval_dataset) 481 | eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size,num_workers=4) 482 | 483 | # multi-gpu evaluate 484 | if args.n_gpu > 1: 485 | model = torch.nn.DataParallel(model) 486 | 487 | # Eval! 
488 | logger.info("***** Running Test *****") 489 | logger.info(" Num examples = %d", len(eval_dataset)) 490 | logger.info(" Batch size = %d", args.eval_batch_size) 491 | eval_loss = 0.0 492 | nb_eval_steps = 0 493 | model.eval() 494 | logits=[] 495 | y_trues=[] 496 | for batch in eval_dataloader: 497 | (inputs_ids_1,position_idx_1,attn_mask_1, 498 | inputs_ids_2,position_idx_2,attn_mask_2, 499 | labels)=[x.to(args.device) for x in batch] 500 | with torch.no_grad(): 501 | lm_loss,logit = model(inputs_ids_1,position_idx_1,attn_mask_1,inputs_ids_2,position_idx_2,attn_mask_2,labels) 502 | eval_loss += lm_loss.mean().item() 503 | logits.append(logit.cpu().numpy()) 504 | y_trues.append(labels.cpu().numpy()) 505 | nb_eval_steps += 1 506 | 507 | #output result 508 | logits=np.concatenate(logits,0) 509 | y_preds=logits[:,1]>best_threshold 510 | with open(os.path.join(args.output_dir,"predictions.txt"),'w') as f: 511 | for example,pred in zip(eval_dataset.examples,y_preds): 512 | if pred: 513 | f.write(example.url1+'\t'+example.url2+'\t'+'1'+'\n') 514 | else: 515 | f.write(example.url1+'\t'+example.url2+'\t'+'0'+'\n') 516 | 517 | def main(): 518 | parser = argparse.ArgumentParser() 519 | 520 | ## Required parameters 521 | parser.add_argument("--train_data_file", default=None, type=str, required=True, 522 | help="The input training data file (a text file).") 523 | parser.add_argument("--output_dir", default=None, type=str, required=True, 524 | help="The output directory where the model predictions and checkpoints will be written.") 525 | 526 | ## Other parameters 527 | parser.add_argument("--eval_data_file", default=None, type=str, 528 | help="An optional input evaluation data file to evaluate the perplexity on (a text file).") 529 | parser.add_argument("--test_data_file", default=None, type=str, 530 | help="An optional input evaluation data file to evaluate the perplexity on (a text file).") 531 | 532 | parser.add_argument("--model_name_or_path", default=None, type=str, 533 | help="The model checkpoint for weights initialization.") 534 | 535 | parser.add_argument("--config_name", default="", type=str, 536 | help="Optional pretrained config name or path if not the same as model_name_or_path") 537 | parser.add_argument("--tokenizer_name", default="", type=str, 538 | help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") 539 | 540 | parser.add_argument("--code_length", default=256, type=int, 541 | help="Optional Code input sequence length after tokenization.") 542 | parser.add_argument("--data_flow_length", default=64, type=int, 543 | help="Optional Data Flow input sequence length after tokenization.") 544 | parser.add_argument("--do_train", action='store_true', 545 | help="Whether to run training.") 546 | parser.add_argument("--do_eval", action='store_true', 547 | help="Whether to run eval on the dev set.") 548 | parser.add_argument("--do_test", action='store_true', 549 | help="Whether to run eval on the dev set.") 550 | parser.add_argument("--evaluate_during_training", action='store_true', 551 | help="Run evaluation during training at each logging step.") 552 | 553 | parser.add_argument("--train_batch_size", default=4, type=int, 554 | help="Batch size per GPU/CPU for training.") 555 | parser.add_argument("--eval_batch_size", default=4, type=int, 556 | help="Batch size per GPU/CPU for evaluation.") 557 | parser.add_argument('--gradient_accumulation_steps', type=int, default=1, 558 | help="Number of updates steps to accumulate before performing a backward/update 
pass.") 559 | parser.add_argument("--learning_rate", default=5e-5, type=float, 560 | help="The initial learning rate for Adam.") 561 | parser.add_argument("--weight_decay", default=0.0, type=float, 562 | help="Weight deay if we apply some.") 563 | parser.add_argument("--adam_epsilon", default=1e-8, type=float, 564 | help="Epsilon for Adam optimizer.") 565 | parser.add_argument("--max_grad_norm", default=1.0, type=float, 566 | help="Max gradient norm.") 567 | parser.add_argument("--max_steps", default=-1, type=int, 568 | help="If > 0: set total number of training steps to perform. Override num_train_epochs.") 569 | parser.add_argument("--warmup_steps", default=0, type=int, 570 | help="Linear warmup over warmup_steps.") 571 | 572 | parser.add_argument('--seed', type=int, default=42, 573 | help="random seed for initialization") 574 | parser.add_argument('--epochs', type=int, default=1, 575 | help="training epochs") 576 | 577 | args = parser.parse_args() 578 | 579 | # Setup CUDA, GPU 580 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 581 | args.n_gpu = torch.cuda.device_count() 582 | 583 | args.device = device 584 | 585 | # Setup logging 586 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',datefmt='%m/%d/%Y %H:%M:%S',level=logging.INFO) 587 | logger.warning("device: %s, n_gpu: %s",device, args.n_gpu,) 588 | 589 | 590 | # Set seed 591 | set_seed(args) 592 | config = RobertaConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path) 593 | config.num_labels=1 594 | tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name) 595 | model = RobertaForSequenceClassification.from_pretrained(args.model_name_or_path,config=config) 596 | 597 | model=Model(model,config,tokenizer,args) 598 | logger.info("Training/evaluation parameters %s", args) 599 | # Training 600 | if args.do_train: 601 | train_dataset = TextDataset(tokenizer, args, file_path=args.train_data_file) 602 | train(args, train_dataset, model, tokenizer) 603 | 604 | # Evaluation 605 | results = {} 606 | if args.do_eval: 607 | checkpoint_prefix = 'checkpoint-best-f1/model.bin' 608 | output_dir = os.path.join(args.output_dir, '{}'.format(checkpoint_prefix)) 609 | model.load_state_dict(torch.load(output_dir)) 610 | model.to(args.device) 611 | result=evaluate(args, model, tokenizer) 612 | 613 | if args.do_test: 614 | checkpoint_prefix = 'checkpoint-best-f1/model.bin' 615 | output_dir = os.path.join(args.output_dir, '{}'.format(checkpoint_prefix)) 616 | model.load_state_dict(torch.load(output_dir)) 617 | model.to(args.device) 618 | test(args, model, tokenizer,best_threshold=0.5) 619 | 620 | return results 621 | 622 | 623 | if __name__ == "__main__": 624 | main() 625 | 626 | -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Code Search 4 | 5 | ## Data Preprocess 6 | 7 | Different from the setting of [CodeSearchNet](husain2019codesearchnet), the answer of each query is retrieved from the whole development and testing code corpus instead of 1,000 candidate codes. Besides, we observe that some queries contain content unrelated to the code, such as a link ``http://..." that refers to external resources. Therefore, we filter following examples to improve the quality of the dataset. 
8 | 
9 | - Remove comments in the code
10 | 
11 | - Remove examples whose code cannot be parsed into an abstract syntax tree.
12 | 
13 | - Remove examples whose documents have fewer than 3 or more than 256 tokens.
14 | 
15 | - Remove examples whose documents contain special tokens (e.g. `<img ...>` or https:...).
16 | 
17 | - Remove examples whose documents are not in English.
18 | 
19 | Data statistics of the cleaned dataset for code search are shown in the table below.
20 | 
21 | | PL         | Training | Dev    | Test   | Candidate Codes |
22 | | :--------- | :------: | :----: | :----: | :-------------: |
23 | | Python     | 251,820  | 13,914 | 14,918 |     43,827      |
24 | | PHP        | 241,241  | 12,982 | 14,014 |     52,660      |
25 | | Go         | 167,288  | 7,325  | 8,122  |     28,120      |
26 | | Java       | 164,923  | 5,183  | 10,955 |     40,347      |
27 | | JavaScript | 58,025   | 3,885  | 3,291  |     13,981      |
28 | | Ruby       | 24,927   | 1,400  | 1,261  |     4,360       |
29 | 
30 | You can download and preprocess the data with the following commands.
31 | ```shell
32 | unzip dataset.zip
33 | cd dataset
34 | bash run.sh
35 | cd ..
36 | ```
37 | 
38 | ## Dependency
39 | 
40 | - pip install torch
41 | - pip install transformers
42 | - pip install tree_sitter
43 | 
44 | ### Tree-sitter (optional)
45 | 
46 | If the built file "parser/my-languages.so" doesn't work for you, please rebuild it with the following commands:
47 | 
48 | ```shell
49 | cd parser
50 | bash build.sh
51 | cd ..
52 | ```
53 | 
54 | ## Fine-Tune
55 | 
56 | We fine-tuned the model on 2 V100-16G GPUs.
57 | ```shell
58 | lang=ruby
59 | mkdir -p ./saved_models/$lang
60 | python run.py \
61 |     --output_dir=./saved_models/$lang \
62 |     --config_name=microsoft/graphcodebert-base \
63 |     --model_name_or_path=microsoft/graphcodebert-base \
64 |     --tokenizer_name=microsoft/graphcodebert-base \
65 |     --lang=$lang \
66 |     --do_train \
67 |     --train_data_file=dataset/$lang/train.jsonl \
68 |     --eval_data_file=dataset/$lang/valid.jsonl \
69 |     --test_data_file=dataset/$lang/test.jsonl \
70 |     --codebase_file=dataset/$lang/codebase.jsonl \
71 |     --num_train_epochs 10 \
72 |     --code_length 256 \
73 |     --data_flow_length 64 \
74 |     --nl_length 128 \
75 |     --train_batch_size 32 \
76 |     --eval_batch_size 64 \
77 |     --learning_rate 2e-5 \
78 |     --seed 123456 2>&1| tee saved_models/$lang/train.log
79 | ```
80 | ## Inference and Evaluation
81 | 
82 | ```shell
83 | lang=ruby
84 | python run.py \
85 |     --output_dir=./saved_models/$lang \
86 |     --config_name=microsoft/graphcodebert-base \
87 |     --model_name_or_path=microsoft/graphcodebert-base \
88 |     --tokenizer_name=microsoft/graphcodebert-base \
89 |     --lang=$lang \
90 |     --do_eval \
91 |     --do_test \
92 |     --train_data_file=dataset/$lang/train.jsonl \
93 |     --eval_data_file=dataset/$lang/valid.jsonl \
94 |     --test_data_file=dataset/$lang/test.jsonl \
95 |     --codebase_file=dataset/$lang/codebase.jsonl \
96 |     --num_train_epochs 10 \
97 |     --code_length 256 \
98 |     --data_flow_length 64 \
99 |     --nl_length 128 \
100 |     --train_batch_size 32 \
101 |     --eval_batch_size 64 \
102 |     --learning_rate 2e-5 \
103 |     --seed 123456 2>&1| tee saved_models/$lang/test.log
104 | ```
105 | 
106 | ## Results
107 | 
108 | The results on the filtered dataset are shown in the table below:
109 | 
110 | | Model          | Ruby      | Javascript | Go        | Python    | Java      | PHP       | Overall   |
111 | | -------------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
112 | | NBow           | 0.162     | 0.157      | 0.330     | 0.161     | 0.171     | 0.152     | 0.189     |
113 | | CNN            | 0.276     | 0.224      | 0.680     | 0.242     | 0.263     | 0.260     | 0.324     |
114 | | BiRNN          | 0.213     | 0.193      | 0.688     | 0.290     | 0.304     | 0.338     | 0.338     |
115 | | 
SelfAtt | 0.275 | 0.287 | 0.723 | 0.398 | 0.404 | 0.426 | 0.419 | 116 | | RoBERTa | 0.587 | 0.517 | 0.850 | 0.587 | 0.599 | 0.560 | 0.617 | 117 | | RoBERTa (code) | 0.628 | 0.562 | 0.859 | 0.610 | 0.620 | 0.579 | 0.643 | 118 | | CodeBERT | 0.679 | 0.620 | 0.882 | 0.672 | 0.676 | 0.628 | 0.693 | 119 | | GraphCodeBERT | **0.703** | **0.644** | **0.897** | **0.692** | **0.691** | **0.649** | **0.713** | -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/dataset.zip: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/codesearch/dataset.zip -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/model.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 2 | # Licensed under the MIT License. 3 | import torch.nn as nn 4 | import torch 5 | class Model(nn.Module): 6 | def __init__(self, encoder): 7 | super(Model, self).__init__() 8 | self.encoder = encoder 9 | 10 | def forward(self, code_inputs=None, attn_mask=None,position_idx=None, nl_inputs=None): 11 | if code_inputs is not None: 12 | nodes_mask=position_idx.eq(0) 13 | token_mask=position_idx.ge(2) 14 | inputs_embeddings=self.encoder.embeddings.word_embeddings(code_inputs) 15 | nodes_to_token_mask=nodes_mask[:,:,None]&token_mask[:,None,:]&attn_mask 16 | nodes_to_token_mask=nodes_to_token_mask/(nodes_to_token_mask.sum(-1)+1e-10)[:,:,None] 17 | avg_embeddings=torch.einsum("abc,acd->abd",nodes_to_token_mask,inputs_embeddings) 18 | inputs_embeddings=inputs_embeddings*(~nodes_mask)[:,:,None]+avg_embeddings*nodes_mask[:,:,None] 19 | return self.encoder(inputs_embeds=inputs_embeddings,attention_mask=attn_mask,position_ids=position_idx)[1] 20 | else: 21 | return self.encoder(nl_inputs,attention_mask=nl_inputs.ne(1))[1] 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/parser/__init__.py: -------------------------------------------------------------------------------- 1 | from .utils import (remove_comments_and_docstrings, 2 | tree_to_token_index, 3 | index_to_code_token, 4 | tree_to_variable_index) 5 | from .DFG import DFG_python,DFG_java,DFG_ruby,DFG_go,DFG_php,DFG_javascript,DFG_csharp -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/parser/build.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 2 | # Licensed under the MIT license. 
3 | 4 | from tree_sitter import Language, Parser 5 | 6 | Language.build_library( 7 | # Store the library in the `build` directory 8 | 'my-languages.so', 9 | 10 | # Include one or more languages 11 | [ 12 | 'tree-sitter-go', 13 | 'tree-sitter-javascript', 14 | 'tree-sitter-python', 15 | 'tree-sitter-php', 16 | 'tree-sitter-java', 17 | 'tree-sitter-ruby', 18 | 'tree-sitter-c-sharp', 19 | ] 20 | ) 21 | 22 | -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/parser/build.sh: -------------------------------------------------------------------------------- 1 | git clone https://github.com/tree-sitter/tree-sitter-go 2 | git clone https://github.com/tree-sitter/tree-sitter-javascript 3 | git clone https://github.com/tree-sitter/tree-sitter-python 4 | git clone https://github.com/tree-sitter/tree-sitter-ruby 5 | git clone https://github.com/tree-sitter/tree-sitter-php 6 | git clone https://github.com/tree-sitter/tree-sitter-java 7 | git clone https://github.com/tree-sitter/tree-sitter-c-sharp 8 | python build.py 9 | -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/parser/my-languages.so: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/codesearch/parser/my-languages.so -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/parser/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | from io import StringIO 3 | import tokenize 4 | def remove_comments_and_docstrings(source,lang): 5 | if lang in ['python']: 6 | """ 7 | Returns 'source' minus comments and docstrings. 
8 | """ 9 | io_obj = StringIO(source) 10 | out = "" 11 | prev_toktype = tokenize.INDENT 12 | last_lineno = -1 13 | last_col = 0 14 | for tok in tokenize.generate_tokens(io_obj.readline): 15 | token_type = tok[0] 16 | token_string = tok[1] 17 | start_line, start_col = tok[2] 18 | end_line, end_col = tok[3] 19 | ltext = tok[4] 20 | if start_line > last_lineno: 21 | last_col = 0 22 | if start_col > last_col: 23 | out += (" " * (start_col - last_col)) 24 | # Remove comments: 25 | if token_type == tokenize.COMMENT: 26 | pass 27 | # This series of conditionals removes docstrings: 28 | elif token_type == tokenize.STRING: 29 | if prev_toktype != tokenize.INDENT: 30 | # This is likely a docstring; double-check we're not inside an operator: 31 | if prev_toktype != tokenize.NEWLINE: 32 | if start_col > 0: 33 | out += token_string 34 | else: 35 | out += token_string 36 | prev_toktype = token_type 37 | last_col = end_col 38 | last_lineno = end_line 39 | temp=[] 40 | for x in out.split('\n'): 41 | if x.strip()!="": 42 | temp.append(x) 43 | return '\n'.join(temp) 44 | elif lang in ['ruby']: 45 | return source 46 | else: 47 | def replacer(match): 48 | s = match.group(0) 49 | if s.startswith('/'): 50 | return " " # note: a space and not an empty string 51 | else: 52 | return s 53 | pattern = re.compile( 54 | r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"', 55 | re.DOTALL | re.MULTILINE 56 | ) 57 | temp=[] 58 | for x in re.sub(pattern, replacer, source).split('\n'): 59 | if x.strip()!="": 60 | temp.append(x) 61 | return '\n'.join(temp) 62 | 63 | def tree_to_token_index(root_node): 64 | if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment': 65 | return [(root_node.start_point,root_node.end_point)] 66 | else: 67 | code_tokens=[] 68 | for child in root_node.children: 69 | code_tokens+=tree_to_token_index(child) 70 | return code_tokens 71 | 72 | def tree_to_variable_index(root_node,index_to_code): 73 | if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment': 74 | index=(root_node.start_point,root_node.end_point) 75 | _,code=index_to_code[index] 76 | if root_node.type!=code: 77 | return [(root_node.start_point,root_node.end_point)] 78 | else: 79 | return [] 80 | else: 81 | code_tokens=[] 82 | for child in root_node.children: 83 | code_tokens+=tree_to_variable_index(child,index_to_code) 84 | return code_tokens 85 | 86 | def index_to_code_token(index,code): 87 | start_point=index[0] 88 | end_point=index[1] 89 | if start_point[0]==end_point[0]: 90 | s=code[start_point[0]][start_point[1]:end_point[1]] 91 | else: 92 | s="" 93 | s+=code[start_point[0]][start_point[1]:] 94 | for i in range(start_point[0]+1,end_point[0]): 95 | s+=code[i] 96 | s+=code[end_point[0]][:end_point[1]] 97 | return s 98 | -------------------------------------------------------------------------------- /GraphCodeBERT/codesearch/run.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team. 3 | # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. 4 | # 5 | # Licensed under the Apache License, Version 2.0 (the "License"); 6 | # you may not use this file except in compliance with the License. 
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | """
17 | Fine-tuning GraphCodeBERT (a RoBERTa-style encoder) for natural language code search.
18 | A query and a code snippet (augmented with data-flow nodes) are encoded into vectors, and the model
19 | is trained with a cross-entropy loss over in-batch negatives (each query scored against all codes in the batch).
20 | """
21 | 
22 | import argparse
23 | import logging
24 | import os
25 | import pickle
26 | import random
27 | import torch
28 | import json
29 | import numpy as np
30 | from model import Model
31 | from torch.nn import CrossEntropyLoss, MSELoss
32 | from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler,TensorDataset
33 | from transformers import (WEIGHTS_NAME, AdamW, get_linear_schedule_with_warmup,
34 | RobertaConfig, RobertaModel, RobertaTokenizer)
35 | 
36 | logger = logging.getLogger(__name__)
37 | 
38 | from tqdm import tqdm, trange
39 | import multiprocessing
40 | cpu_cont = 16
41 | 
42 | from parser import DFG_python,DFG_java,DFG_ruby,DFG_go,DFG_php,DFG_javascript
43 | from parser import (remove_comments_and_docstrings,
44 | tree_to_token_index,
45 | index_to_code_token,
46 | tree_to_variable_index)
47 | from tree_sitter import Language, Parser
48 | dfg_function={
49 | 'python':DFG_python,
50 | 'java':DFG_java,
51 | 'ruby':DFG_ruby,
52 | 'go':DFG_go,
53 | 'php':DFG_php,
54 | 'javascript':DFG_javascript
55 | }
56 | 
57 | #load parsers
58 | parsers={}
59 | for lang in dfg_function:
60 | LANGUAGE = Language('parser/my-languages.so', lang)
61 | parser = Parser()
62 | parser.set_language(LANGUAGE)
63 | parser = [parser,dfg_function[lang]]
64 | parsers[lang]= parser
65 | 
66 | 
67 | #remove comments, tokenize code and extract dataflow
68 | def extract_dataflow(code, parser,lang):
69 | #remove comments
70 | try:
71 | code=remove_comments_and_docstrings(code,lang)
72 | except:
73 | pass
74 | #obtain dataflow
75 | if lang=="php":
76 | code="<?php "+code+"?>" #tree-sitter's PHP grammar only parses code inside the opening tag
77 | try:
78 | tree = parser[0].parse(bytes(code,'utf8'))
79 | root_node = tree.root_node
80 | tokens_index=tree_to_token_index(root_node)
81 | code=code.split('\n')
82 | code_tokens=[index_to_code_token(x,code) for x in tokens_index]
83 | index_to_code={}
84 | for idx,(index,code) in enumerate(zip(tokens_index,code_tokens)):
85 | index_to_code[index]=(idx,code)
86 | try:
87 | DFG,_=parser[1](root_node,index_to_code,{})
88 | except:
89 | DFG=[]
90 | DFG=sorted(DFG,key=lambda x:x[1])
91 | indexs=set()
92 | for d in DFG:
93 | if len(d[-1])!=0:
94 | indexs.add(d[1])
95 | for x in d[-1]:
96 | indexs.add(x)
97 | new_DFG=[]
98 | for d in DFG:
99 | if d[1] in indexs:
100 | new_DFG.append(d)
101 | dfg=new_DFG
102 | except:
103 | dfg=[]
104 | return code_tokens,dfg
105 | 
106 | class InputFeatures(object):
107 | """A single training/test features for an example."""
108 | def __init__(self,
109 | code_tokens,
110 | code_ids,
111 | position_idx,
112 | dfg_to_code,
113 | dfg_to_dfg,
114 | nl_tokens,
115 | nl_ids,
116 | url,
117 | 
118 | ):
119 | self.code_tokens = code_tokens
120 | self.code_ids = code_ids
121 | self.position_idx=position_idx
122 | self.dfg_to_code=dfg_to_code
123 | self.dfg_to_dfg=dfg_to_dfg
124 | self.nl_tokens
= nl_tokens 125 | self.nl_ids = nl_ids 126 | self.url=url 127 | 128 | 129 | def convert_examples_to_features(item): 130 | js,tokenizer,args=item 131 | #code 132 | parser=parsers[args.lang] 133 | #extract data flow 134 | code_tokens,dfg=extract_dataflow(js['original_string'],parser,args.lang) 135 | code_tokens=[tokenizer.tokenize('@ '+x)[1:] if idx!=0 else tokenizer.tokenize(x) for idx,x in enumerate(code_tokens)] 136 | ori2cur_pos={} 137 | ori2cur_pos[-1]=(0,0) 138 | for i in range(len(code_tokens)): 139 | ori2cur_pos[i]=(ori2cur_pos[i-1][1],ori2cur_pos[i-1][1]+len(code_tokens[i])) 140 | code_tokens=[y for x in code_tokens for y in x] 141 | #truncating 142 | code_tokens=code_tokens[:args.code_length+args.data_flow_length-2-min(len(dfg),args.data_flow_length)] 143 | code_tokens =[tokenizer.cls_token]+code_tokens+[tokenizer.sep_token] 144 | code_ids = tokenizer.convert_tokens_to_ids(code_tokens) 145 | position_idx = [i+tokenizer.pad_token_id + 1 for i in range(len(code_tokens))] 146 | dfg=dfg[:args.code_length+args.data_flow_length-len(code_tokens)] 147 | code_tokens+=[x[0] for x in dfg] 148 | position_idx+=[0 for x in dfg] 149 | code_ids+=[tokenizer.unk_token_id for x in dfg] 150 | padding_length=args.code_length+args.data_flow_length-len(code_ids) 151 | position_idx+=[tokenizer.pad_token_id]*padding_length 152 | code_ids+=[tokenizer.pad_token_id]*padding_length 153 | #reindex 154 | reverse_index={} 155 | for idx,x in enumerate(dfg): 156 | reverse_index[x[1]]=idx 157 | for idx,x in enumerate(dfg): 158 | dfg[idx]=x[:-1]+([reverse_index[i] for i in x[-1] if i in reverse_index],) 159 | dfg_to_dfg=[x[-1] for x in dfg] 160 | dfg_to_code=[ori2cur_pos[x[1]] for x in dfg] 161 | length=len([tokenizer.cls_token]) 162 | dfg_to_code=[(x[0]+length,x[1]+length) for x in dfg_to_code] 163 | #nl 164 | nl=' '.join(js['docstring_tokens']) 165 | nl_tokens=tokenizer.tokenize(nl)[:args.nl_length-2] 166 | nl_tokens =[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token] 167 | nl_ids = tokenizer.convert_tokens_to_ids(nl_tokens) 168 | padding_length = args.nl_length - len(nl_ids) 169 | nl_ids+=[tokenizer.pad_token_id]*padding_length 170 | 171 | return InputFeatures(code_tokens,code_ids,position_idx,dfg_to_code,dfg_to_dfg,nl_tokens,nl_ids,js['url']) 172 | 173 | class TextDataset(Dataset): 174 | def __init__(self, tokenizer, args, file_path=None,pool=None): 175 | self.args=args 176 | prefix=file_path.split('/')[-1][:-6] 177 | cache_file=args.output_dir+'/'+prefix+'.pkl' 178 | if os.path.exists(cache_file): 179 | self.examples=pickle.load(open(cache_file,'rb')) 180 | else: 181 | self.examples = [] 182 | data=[] 183 | with open(file_path) as f: 184 | for line in f: 185 | line=line.strip() 186 | js=json.loads(line) 187 | data.append((js,tokenizer,args)) 188 | self.examples=pool.map(convert_examples_to_features, tqdm(data,total=len(data))) 189 | pickle.dump(self.examples,open(cache_file,'wb')) 190 | 191 | if 'train' in file_path: 192 | for idx, example in enumerate(self.examples[:3]): 193 | logger.info("*** Example ***") 194 | logger.info("idx: {}".format(idx)) 195 | logger.info("code_tokens: {}".format([x.replace('\u0120','_') for x in example.code_tokens])) 196 | logger.info("code_ids: {}".format(' '.join(map(str, example.code_ids)))) 197 | logger.info("position_idx: {}".format(example.position_idx)) 198 | logger.info("dfg_to_code: {}".format(' '.join(map(str, example.dfg_to_code)))) 199 | logger.info("dfg_to_dfg: {}".format(' '.join(map(str, example.dfg_to_dfg)))) 200 | logger.info("nl_tokens: 
{}".format([x.replace('\u0120','_') for x in example.nl_tokens])) 201 | logger.info("nl_ids: {}".format(' '.join(map(str, example.nl_ids)))) 202 | 203 | def __len__(self): 204 | return len(self.examples) 205 | 206 | def __getitem__(self, item): 207 | #calculate graph-guided masked function 208 | attn_mask=np.zeros((self.args.code_length+self.args.data_flow_length, 209 | self.args.code_length+self.args.data_flow_length),dtype=np.bool) 210 | #calculate begin index of node and max length of input 211 | node_index=sum([i>1 for i in self.examples[item].position_idx]) 212 | max_length=sum([i!=1 for i in self.examples[item].position_idx]) 213 | #sequence can attend to sequence 214 | attn_mask[:node_index,:node_index]=True 215 | #special tokens attend to all tokens 216 | for idx,i in enumerate(self.examples[item].code_ids): 217 | if i in [0,2]: 218 | attn_mask[idx,:max_length]=True 219 | #nodes attend to code tokens that are identified from 220 | for idx,(a,b) in enumerate(self.examples[item].dfg_to_code): 221 | if a 1: 258 | model = torch.nn.DataParallel(model) 259 | 260 | # Train! 261 | logger.info("***** Running training *****") 262 | logger.info(" Num examples = %d", len(train_dataset)) 263 | logger.info(" Num Epochs = %d", args.num_train_epochs) 264 | logger.info(" Instantaneous batch size per GPU = %d", args.train_batch_size//args.n_gpu) 265 | logger.info(" Total train batch size = %d", args.train_batch_size) 266 | logger.info(" Total optimization steps = %d", len(train_dataloader)*args.num_train_epochs) 267 | 268 | # model.resize_token_embeddings(len(tokenizer)) 269 | model.zero_grad() 270 | 271 | model.train() 272 | tr_num,tr_loss,best_mrr=0,0,0 273 | for idx in range(args.num_train_epochs): 274 | for step,batch in enumerate(train_dataloader): 275 | #get inputs 276 | code_inputs = batch[0].to(args.device) 277 | attn_mask = batch[1].to(args.device) 278 | position_idx = batch[2].to(args.device) 279 | nl_inputs = batch[3].to(args.device) 280 | #get code and nl vectors 281 | code_vec = model(code_inputs=code_inputs,attn_mask=attn_mask,position_idx=position_idx) 282 | nl_vec = model(nl_inputs=nl_inputs) 283 | 284 | #calculate scores and loss 285 | scores=torch.einsum("ab,cb->ac",nl_vec,code_vec) 286 | loss_fct = CrossEntropyLoss() 287 | loss = loss_fct(scores, torch.arange(code_inputs.size(0), device=scores.device)) 288 | 289 | #report loss 290 | tr_loss += loss.item() 291 | tr_num+=1 292 | if (step+1)% 100==0: 293 | logger.info("epoch {} step {} loss {}".format(idx,step+1,round(tr_loss/tr_num,5))) 294 | tr_loss=0 295 | tr_num=0 296 | 297 | #backward 298 | loss.backward() 299 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 300 | optimizer.step() 301 | optimizer.zero_grad() 302 | scheduler.step() 303 | 304 | #evaluate 305 | results = evaluate(args, model, tokenizer,args.eval_data_file, pool, eval_when_training=True) 306 | for key, value in results.items(): 307 | logger.info(" %s = %s", key, round(value,4)) 308 | 309 | #save best model 310 | if results['eval_mrr']>best_mrr: 311 | best_mrr=results['eval_mrr'] 312 | logger.info(" "+"*"*20) 313 | logger.info(" Best mrr:%s",round(best_mrr,4)) 314 | logger.info(" "+"*"*20) 315 | 316 | checkpoint_prefix = 'checkpoint-best-mrr' 317 | output_dir = os.path.join(args.output_dir, '{}'.format(checkpoint_prefix)) 318 | if not os.path.exists(output_dir): 319 | os.makedirs(output_dir) 320 | model_to_save = model.module if hasattr(model,'module') else model 321 | output_dir = os.path.join(output_dir, '{}'.format('model.bin')) 322 | 
torch.save(model_to_save.state_dict(), output_dir)
323 | logger.info("Saving model checkpoint to %s", output_dir)
324 | 
325 | 
326 | def evaluate(args, model, tokenizer,file_name,pool, eval_when_training=False):
327 | query_dataset = TextDataset(tokenizer, args, file_name, pool)
328 | query_sampler = SequentialSampler(query_dataset)
329 | query_dataloader = DataLoader(query_dataset, sampler=query_sampler, batch_size=args.eval_batch_size,num_workers=4)
330 | 
331 | code_dataset = TextDataset(tokenizer, args, args.codebase_file, pool)
332 | code_sampler = SequentialSampler(code_dataset)
333 | code_dataloader = DataLoader(code_dataset, sampler=code_sampler, batch_size=args.eval_batch_size,num_workers=4)
334 | 
335 | # multi-gpu evaluate
336 | if args.n_gpu > 1 and eval_when_training is False:
337 | model = torch.nn.DataParallel(model)
338 | 
339 | # Eval!
340 | logger.info("***** Running evaluation *****")
341 | logger.info(" Num queries = %d", len(query_dataset))
342 | logger.info(" Num codes = %d", len(code_dataset))
343 | logger.info(" Batch size = %d", args.eval_batch_size)
344 | 
345 | 
346 | model.eval()
347 | code_vecs=[]
348 | nl_vecs=[]
349 | for batch in query_dataloader:
350 | nl_inputs = batch[3].to(args.device)
351 | with torch.no_grad():
352 | nl_vec = model(nl_inputs=nl_inputs)
353 | nl_vecs.append(nl_vec.cpu().numpy())
354 | 
355 | for batch in code_dataloader:
356 | code_inputs = batch[0].to(args.device)
357 | attn_mask = batch[1].to(args.device)
358 | position_idx =batch[2].to(args.device)
359 | with torch.no_grad():
360 | code_vec= model(code_inputs=code_inputs, attn_mask=attn_mask,position_idx=position_idx)
361 | code_vecs.append(code_vec.cpu().numpy())
362 | model.train()
363 | code_vecs=np.concatenate(code_vecs,0)
364 | nl_vecs=np.concatenate(nl_vecs,0)
365 | 
366 | scores=np.matmul(nl_vecs,code_vecs.T)
367 | 
368 | sort_ids=np.argsort(scores, axis=-1, kind='quicksort', order=None)[:,::-1]
369 | 
370 | nl_urls=[]
371 | code_urls=[]
372 | for example in query_dataset.examples:
373 | nl_urls.append(example.url)
374 | 
375 | for example in code_dataset.examples:
376 | code_urls.append(example.url)
377 | 
378 | ranks=[]
379 | for url, sort_id in zip(nl_urls,sort_ids):
380 | rank=0
381 | find=False
382 | for idx in sort_id[:1000]:
383 | if find is False:
384 | rank+=1
385 | if code_urls[idx]==url:
386 | find=True
387 | if find:
388 | ranks.append(1/rank)
389 | else:
390 | ranks.append(0)
391 | 
392 | result = {
393 | "eval_mrr":float(np.mean(ranks))
394 | }
395 | 
396 | return result
397 | 
398 | 
399 | 
400 | def main():
401 | parser = argparse.ArgumentParser()
402 | 
403 | ## Required parameters
404 | parser.add_argument("--train_data_file", default=None, type=str, required=True,
405 | help="The input training data file (a jsonl file).")
406 | parser.add_argument("--output_dir", default=None, type=str, required=True,
407 | help="The output directory where the model predictions and checkpoints will be written.")
408 | parser.add_argument("--eval_data_file", default=None, type=str,
409 | help="An optional input evaluation data file to evaluate the MRR (a jsonl file).")
410 | parser.add_argument("--test_data_file", default=None, type=str,
411 | help="An optional input test data file to test the MRR (a jsonl file).")
412 | parser.add_argument("--codebase_file", default=None, type=str,
413 | help="An optional input file with the code candidates to retrieve from (a jsonl file).")
414 | 
415 | parser.add_argument("--lang", default=None, type=str,
416 | help="Programming language of the dataset.")
417 | 
418 | 
parser.add_argument("--model_name_or_path", default=None, type=str, 419 | help="The model checkpoint for weights initialization.") 420 | parser.add_argument("--config_name", default="", type=str, 421 | help="Optional pretrained config name or path if not the same as model_name_or_path") 422 | parser.add_argument("--tokenizer_name", default="", type=str, 423 | help="Optional pretrained tokenizer name or path if not the same as model_name_or_path") 424 | 425 | parser.add_argument("--nl_length", default=128, type=int, 426 | help="Optional NL input sequence length after tokenization.") 427 | parser.add_argument("--code_length", default=256, type=int, 428 | help="Optional Code input sequence length after tokenization.") 429 | parser.add_argument("--data_flow_length", default=64, type=int, 430 | help="Optional Data Flow input sequence length after tokenization.") 431 | 432 | parser.add_argument("--do_train", action='store_true', 433 | help="Whether to run training.") 434 | parser.add_argument("--do_eval", action='store_true', 435 | help="Whether to run eval on the dev set.") 436 | parser.add_argument("--do_test", action='store_true', 437 | help="Whether to run eval on the test set.") 438 | 439 | 440 | parser.add_argument("--train_batch_size", default=4, type=int, 441 | help="Batch size for training.") 442 | parser.add_argument("--eval_batch_size", default=4, type=int, 443 | help="Batch size for evaluation.") 444 | parser.add_argument("--learning_rate", default=5e-5, type=float, 445 | help="The initial learning rate for Adam.") 446 | parser.add_argument("--max_grad_norm", default=1.0, type=float, 447 | help="Max gradient norm.") 448 | parser.add_argument("--num_train_epochs", default=1, type=int, 449 | help="Total number of training epochs to perform.") 450 | 451 | parser.add_argument('--seed', type=int, default=42, 452 | help="random seed for initialization") 453 | 454 | pool = multiprocessing.Pool(cpu_cont) 455 | 456 | #print arguments 457 | args = parser.parse_args() 458 | 459 | #set log 460 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', 461 | datefmt='%m/%d/%Y %H:%M:%S',level=logging.INFO ) 462 | #set device 463 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 464 | args.n_gpu = torch.cuda.device_count() 465 | args.device = device 466 | logger.info("device: %s, n_gpu: %s",device, args.n_gpu) 467 | 468 | # Set seed 469 | set_seed(args.seed) 470 | 471 | #build model 472 | config = RobertaConfig.from_pretrained(args.config_name if args.config_name else args.model_name_or_path) 473 | tokenizer = RobertaTokenizer.from_pretrained(args.tokenizer_name) 474 | model = RobertaModel.from_pretrained(args.model_name_or_path) 475 | model=Model(model) 476 | logger.info("Training/evaluation parameters %s", args) 477 | model.to(args.device) 478 | 479 | # Training 480 | if args.do_train: 481 | train(args, model, tokenizer, pool) 482 | 483 | # Evaluation 484 | results = {} 485 | if args.do_eval: 486 | checkpoint_prefix = 'checkpoint-best-mrr/model.bin' 487 | output_dir = os.path.join(args.output_dir, '{}'.format(checkpoint_prefix)) 488 | model.load_state_dict(torch.load(output_dir),strict=False) 489 | model.to(args.device) 490 | result=evaluate(args, model, tokenizer,args.eval_data_file, pool) 491 | logger.info("***** Eval results *****") 492 | for key in sorted(result.keys()): 493 | logger.info(" %s = %s", key, str(round(result[key],4))) 494 | 495 | if args.do_test: 496 | checkpoint_prefix = 'checkpoint-best-mrr/model.bin' 497 | output_dir = 
os.path.join(args.output_dir, '{}'.format(checkpoint_prefix))
498 | model.load_state_dict(torch.load(output_dir),strict=False)
499 | model.to(args.device)
500 | result=evaluate(args, model, tokenizer,args.test_data_file, pool)
501 | logger.info("***** Test results *****")
502 | for key in sorted(result.keys()):
503 | logger.info(" %s = %s", key, str(round(result[key],4)))
504 | 
505 | return results
506 | 
507 | 
508 | if __name__ == "__main__":
509 | main()
510 | 
511 | 
512 | 
--------------------------------------------------------------------------------
/GraphCodeBERT/refinement/README.md:
--------------------------------------------------------------------------------
1 | # Code Refinement
2 | 
3 | ## Task Definition
4 | 
5 | Code refinement aims to automatically fix bugs in code, which can help reduce the cost of bug-fixing for developers.
6 | In CodeXGLUE, given a piece of Java code with bugs, the task is to remove the bugs and output the refined code.
7 | Models are evaluated by BLEU score and accuracy (exact match).
8 | 
9 | ## Dataset
10 | 
11 | We use the dataset released by this paper (https://arxiv.org/pdf/1812.08693.pdf). The source side is a Java function with bugs and the target side is the refined one.
12 | All function and variable names are normalized. The dataset contains two subsets (i.e., small and medium) based on function length.
13 | 
14 | ### Data Format
15 | 
16 | The dataset is in the "data" folder. Each line of the files is a function. You can get the data using the following command:
17 | 
18 | ```
19 | unzip data.zip
20 | ```
21 | 
22 | ### Data Statistics
23 | 
24 | Data statistics of this dataset are shown in the table below:
25 | 
26 | | | #Examples | #Examples |
27 | | ------- | :-------: | :-------: |
28 | | | Small | Medium |
29 | | Train | 46,680 | 52,364 |
30 | | Valid | 5,835 | 6,545 |
31 | | Test | 5,835 | 6,545 |
32 | 
33 | ## Pipeline-GraphCodeBERT
34 | 
35 | ### Dependency
36 | 
37 | - pip install torch
38 | - pip install transformers
39 | - pip install tree_sitter
40 | 
41 | ### Tree-sitter (optional)
42 | 
43 | If the pre-built file "parser/my-languages.so" doesn't work for you, rebuild it with the following commands:
44 | 
45 | ```shell
46 | cd parser
47 | bash build.sh
48 | cd ..
49 | ```
50 | 
51 | ### Fine-tune
52 | We use 4 V100-16G GPUs to fine-tune. Taking the "small" subset as an example:
53 | 
54 | ```shell
55 | scale=small
56 | lr=1e-4
57 | batch_size=32
58 | beam_size=10
59 | source_length=320
60 | target_length=256
61 | output_dir=saved_models/$scale/
62 | train_file=data/$scale/train.buggy-fixed.buggy,data/$scale/train.buggy-fixed.fixed
63 | dev_file=data/$scale/valid.buggy-fixed.buggy,data/$scale/valid.buggy-fixed.fixed
64 | epochs=50
65 | pretrained_model=microsoft/graphcodebert-base
66 | 
67 | mkdir -p $output_dir
68 | python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --tokenizer_name microsoft/graphcodebert-base --config_name microsoft/graphcodebert-base --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs 2>&1| tee $output_dir/train.log
69 | ```
70 | 
71 | ### Inference
72 | 
73 | We use the full test data for inference.
74 | 
75 | ```shell
76 | batch_size=64
77 | dev_file=data/$scale/valid.buggy-fixed.buggy,data/$scale/valid.buggy-fixed.fixed
78 | test_file=data/$scale/test.buggy-fixed.buggy,data/$scale/test.buggy-fixed.fixed
79 | load_model_path=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test
80 | 
81 | python run.py --do_test --model_type roberta --model_name_or_path $pretrained_model --tokenizer_name microsoft/graphcodebert-base --config_name microsoft/graphcodebert-base --load_model_path $load_model_path --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size 2>&1| tee $output_dir/test.log
82 | ```
83 | 
84 | 
85 | 
86 | ## Result
87 | 
88 | The results on the test set are shown below:
89 | 
90 | Small:
91 | 
92 | | Method | BLEU | Acc (100%) |
93 | | ------------- | :-------: | :--------: |
94 | | Naive copy | 78.06 | 0.0 |
95 | | LSTM | 76.76 | 10.0 |
96 | | Transformer | 77.21 | 14.7 |
97 | | CodeBERT | 77.42 | 16.4 |
98 | | GraphCodeBERT | **80.02** | **17.3** |
99 | 
100 | Medium:
101 | 
102 | | Method | BLEU | Acc (100%) |
103 | | ------------- | :-------: | :--------: |
104 | | Naive copy | 90.91 | 0.0 |
105 | | LSTM | 72.08 | 2.5 |
106 | | Transformer | 89.25 | 3.7 |
107 | | CodeBERT | 91.07 | 5.16 |
108 | | GraphCodeBERT | **91.31** | **9.1** |
109 | 
110 | 
--------------------------------------------------------------------------------
/GraphCodeBERT/refinement/bleu.py:
--------------------------------------------------------------------------------
1 | # Copyright 2017 Google Inc. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | # ==============================================================================
15 | 
16 | """Python implementation of BLEU and smooth-BLEU.
17 | 
18 | This module provides a Python implementation of BLEU and smooth-BLEU.
19 | Smooth BLEU is computed following the method outlined in the paper:
20 | Chin-Yew Lin, Franz Josef Och. ORANGE: a method for evaluating automatic
21 | evaluation metrics for machine translation. COLING 2004.
22 | """
23 | 
24 | import collections
25 | import math
26 | 
27 | 
28 | def _get_ngrams(segment, max_order):
29 | """Extracts all n-grams up to a given maximum order from an input segment.
30 | 
31 | Args:
32 | segment: text segment from which n-grams will be extracted.
33 | max_order: maximum length in tokens of the n-grams returned by this
34 | method.
35 | 
36 | Returns:
37 | The Counter containing all n-grams up to max_order in segment
38 | with a count of how many times each n-gram occurred.
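For example, _get_ngrams(['a', 'b', 'a'], max_order=2) returns
Counter({('a',): 2, ('b',): 1, ('a', 'b'): 1, ('b', 'a'): 1}).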
39 | """ 40 | ngram_counts = collections.Counter() 41 | for order in range(1, max_order + 1): 42 | for i in range(0, len(segment) - order + 1): 43 | ngram = tuple(segment[i:i+order]) 44 | ngram_counts[ngram] += 1 45 | return ngram_counts 46 | 47 | 48 | def compute_bleu(reference_corpus, translation_corpus, max_order=4, 49 | smooth=False): 50 | """Computes BLEU score of translated segments against one or more references. 51 | 52 | Args: 53 | reference_corpus: list of lists of references for each translation. Each 54 | reference should be tokenized into a list of tokens. 55 | translation_corpus: list of translations to score. Each translation 56 | should be tokenized into a list of tokens. 57 | max_order: Maximum n-gram order to use when computing BLEU score. 58 | smooth: Whether or not to apply Lin et al. 2004 smoothing. 59 | 60 | Returns: 61 | 3-Tuple with the BLEU score, n-gram precisions, geometric mean of n-gram 62 | precisions and brevity penalty. 63 | """ 64 | matches_by_order = [0] * max_order 65 | possible_matches_by_order = [0] * max_order 66 | reference_length = 0 67 | translation_length = 0 68 | for (references, translation) in zip(reference_corpus, 69 | translation_corpus): 70 | reference_length += min(len(r) for r in references) 71 | translation_length += len(translation) 72 | 73 | merged_ref_ngram_counts = collections.Counter() 74 | for reference in references: 75 | merged_ref_ngram_counts |= _get_ngrams(reference, max_order) 76 | translation_ngram_counts = _get_ngrams(translation, max_order) 77 | overlap = translation_ngram_counts & merged_ref_ngram_counts 78 | for ngram in overlap: 79 | matches_by_order[len(ngram)-1] += overlap[ngram] 80 | for order in range(1, max_order+1): 81 | possible_matches = len(translation) - order + 1 82 | if possible_matches > 0: 83 | possible_matches_by_order[order-1] += possible_matches 84 | 85 | precisions = [0] * max_order 86 | for i in range(0, max_order): 87 | if smooth: 88 | precisions[i] = ((matches_by_order[i] + 1.) / 89 | (possible_matches_by_order[i] + 1.)) 90 | else: 91 | if possible_matches_by_order[i] > 0: 92 | precisions[i] = (float(matches_by_order[i]) / 93 | possible_matches_by_order[i]) 94 | else: 95 | precisions[i] = 0.0 96 | 97 | if min(precisions) > 0: 98 | p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions) 99 | geo_mean = math.exp(p_log_sum) 100 | else: 101 | geo_mean = 0 102 | 103 | ratio = float(translation_length) / reference_length 104 | 105 | if ratio > 1.0: 106 | bp = 1. 107 | else: 108 | bp = math.exp(1 - 1. 
/ ratio)
109 | 
110 | bleu = geo_mean * bp
111 | 
112 | return (bleu, precisions, bp, ratio, translation_length, reference_length)
113 | 
114 | 
115 | def _bleu(ref_file, trans_file, subword_option=None):
116 | max_order = 4
117 | smooth = True
118 | ref_files = [ref_file]
119 | reference_text = []
120 | for reference_filename in ref_files:
121 | with open(reference_filename) as fh:
122 | reference_text.append(fh.readlines())
123 | per_segment_references = []
124 | for references in zip(*reference_text):
125 | reference_list = []
126 | for reference in references:
127 | reference_list.append(reference.strip().split())
128 | per_segment_references.append(reference_list)
129 | translations = []
130 | with open(trans_file) as fh:
131 | for line in fh:
132 | translations.append(line.strip().split())
133 | bleu_score, _, _, _, _, _ = compute_bleu(per_segment_references, translations, max_order, smooth)
134 | return round(100 * bleu_score,2)
--------------------------------------------------------------------------------
/GraphCodeBERT/refinement/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/refinement/data.zip
--------------------------------------------------------------------------------
/GraphCodeBERT/refinement/model.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Microsoft Corporation.
2 | # Licensed under the MIT license.
3 | 
4 | import torch
5 | import torch.nn as nn
6 | import torch
7 | from torch.autograd import Variable
8 | import copy
9 | class Seq2Seq(nn.Module):
10 | """
11 | Build a Sequence-to-Sequence model.
12 | 
13 | Parameters:
14 | 
15 | * `encoder`- encoder of seq2seq model. e.g. roberta
16 | * `decoder`- decoder of seq2seq model. e.g. transformer
17 | * `config`- configuration of encoder model.
18 | * `beam_size`- beam size for beam search.
19 | * `max_length`- max length of target for beam search.
20 | * `sos_id`- start-of-sequence symbol id in target for beam search.
21 | * `eos_id`- end-of-sequence symbol id in target for beam search.
22 | """
23 | def __init__(self, encoder,decoder,config,beam_size=None,max_length=None,sos_id=None,eos_id=None):
24 | super(Seq2Seq, self).__init__()
25 | self.encoder = encoder
26 | self.decoder=decoder
27 | self.config=config
28 | self.register_buffer("bias", torch.tril(torch.ones(2048, 2048)))
29 | self.dense = nn.Linear(config.hidden_size, config.hidden_size)
30 | self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
31 | self.lsm = nn.LogSoftmax(dim=-1)
32 | self.tie_weights()
33 | 
34 | self.beam_size=beam_size
35 | self.max_length=max_length
36 | self.sos_id=sos_id
37 | self.eos_id=eos_id
38 | 
39 | def _tie_or_clone_weights(self, first_module, second_module):
40 | """ Tie or clone module weights depending on whether we are using TorchScript or not
41 | """
42 | if self.config.torchscript:
43 | first_module.weight = nn.Parameter(second_module.weight.clone())
44 | else:
45 | first_module.weight = second_module.weight
46 | 
47 | def tie_weights(self):
48 | """ Make sure we are sharing the input and output embeddings.
49 | Export to TorchScript can't handle parameter sharing so we are cloning them instead.
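As a result, lm_head's output projection reuses (or, under TorchScript, copies)
the encoder's word-embedding matrix.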
50 | """ 51 | self._tie_or_clone_weights(self.lm_head, 52 | self.encoder.embeddings.word_embeddings) 53 | 54 | def forward(self, source_ids,source_mask,position_idx,attn_mask,target_ids=None,target_mask=None,args=None): 55 | #embedding 56 | nodes_mask=position_idx.eq(0) 57 | token_mask=position_idx.ge(2) 58 | inputs_embeddings=self.encoder.embeddings.word_embeddings(source_ids) 59 | nodes_to_token_mask=nodes_mask[:,:,None]&token_mask[:,None,:]&attn_mask 60 | nodes_to_token_mask=nodes_to_token_mask/(nodes_to_token_mask.sum(-1)+1e-10)[:,:,None] 61 | avg_embeddings=torch.einsum("abc,acd->abd",nodes_to_token_mask,inputs_embeddings) 62 | inputs_embeddings=inputs_embeddings*(~nodes_mask)[:,:,None]+avg_embeddings*nodes_mask[:,:,None] 63 | 64 | outputs = self.encoder(inputs_embeds=inputs_embeddings,attention_mask=attn_mask,position_ids=position_idx) 65 | encoder_output = outputs[0].permute([1,0,2]).contiguous() 66 | #source_mask=token_mask.float() 67 | if target_ids is not None: 68 | attn_mask=-1e4 *(1-self.bias[:target_ids.shape[1],:target_ids.shape[1]]) 69 | tgt_embeddings = self.encoder.embeddings(target_ids).permute([1,0,2]).contiguous() 70 | out = self.decoder(tgt_embeddings,encoder_output,tgt_mask=attn_mask,memory_key_padding_mask=(1-source_mask).bool()) 71 | hidden_states = torch.tanh(self.dense(out)).permute([1,0,2]).contiguous() 72 | lm_logits = self.lm_head(hidden_states) 73 | # Shift so that tokens < n predict n 74 | active_loss = target_mask[..., 1:].ne(0).view(-1) == 1 75 | shift_logits = lm_logits[..., :-1, :].contiguous() 76 | shift_labels = target_ids[..., 1:].contiguous() 77 | # Flatten the tokens 78 | loss_fct = nn.CrossEntropyLoss(ignore_index=-1) 79 | loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1))[active_loss], 80 | shift_labels.view(-1)[active_loss]) 81 | 82 | outputs = loss,loss*active_loss.sum(),active_loss.sum() 83 | return outputs 84 | else: 85 | #Predict 86 | preds=[] 87 | zero=torch.cuda.LongTensor(1).fill_(0) 88 | for i in range(source_ids.shape[0]): 89 | context=encoder_output[:,i:i+1] 90 | context_mask=source_mask[i:i+1,:] 91 | beam = Beam(self.beam_size,self.sos_id,self.eos_id) 92 | input_ids=beam.getCurrentState() 93 | context=context.repeat(1, self.beam_size,1) 94 | context_mask=context_mask.repeat(self.beam_size,1) 95 | for j in range(self.max_length): 96 | if beam.done(): 97 | break 98 | attn_mask=-1e4 *(1-self.bias[:input_ids.shape[1],:input_ids.shape[1]]) 99 | tgt_embeddings = self.encoder.embeddings(input_ids).permute([1,0,2]).contiguous() 100 | out = self.decoder(tgt_embeddings,context,tgt_mask=attn_mask,memory_key_padding_mask=(1-context_mask).bool()) 101 | out = torch.tanh(self.dense(out)) 102 | hidden_states=out.permute([1,0,2]).contiguous()[:,-1,:] 103 | out = self.lsm(self.lm_head(hidden_states)).data 104 | beam.advance(out) 105 | input_ids.data.copy_(input_ids.data.index_select(0, beam.getCurrentOrigin())) 106 | input_ids=torch.cat((input_ids,beam.getCurrentState()),-1) 107 | hyp= beam.getHyp(beam.getFinal()) 108 | pred=beam.buildTargetTokens(hyp)[:self.beam_size] 109 | pred=[torch.cat([x.view(-1) for x in p]+[zero]*(self.max_length-len(p))).view(1,-1) for p in pred] 110 | preds.append(torch.cat(pred,0).unsqueeze(0)) 111 | 112 | preds=torch.cat(preds,0) 113 | return preds 114 | 115 | 116 | 117 | class Beam(object): 118 | def __init__(self, size,sos,eos): 119 | self.size = size 120 | self.tt = torch.cuda 121 | # The score for each translation on the beam. 
122 | self.scores = self.tt.FloatTensor(size).zero_() 123 | # The backpointers at each time-step. 124 | self.prevKs = [] 125 | # The outputs at each time-step. 126 | self.nextYs = [self.tt.LongTensor(size) 127 | .fill_(0)] 128 | self.nextYs[0][0] = sos 129 | # Has EOS topped the beam yet. 130 | self._eos = eos 131 | self.eosTop = False 132 | # Time and k pair for finished. 133 | self.finished = [] 134 | 135 | def getCurrentState(self): 136 | "Get the outputs for the current timestep." 137 | batch = self.tt.LongTensor(self.nextYs[-1]).view(-1, 1) 138 | return batch 139 | 140 | def getCurrentOrigin(self): 141 | "Get the backpointers for the current timestep." 142 | return self.prevKs[-1] 143 | 144 | def advance(self, wordLk): 145 | """ 146 | Given prob over words for every last beam `wordLk` and attention 147 | `attnOut`: Compute and update the beam search. 148 | 149 | Parameters: 150 | 151 | * `wordLk`- probs of advancing from the last step (K x words) 152 | * `attnOut`- attention at the last step 153 | 154 | Returns: True if beam search is complete. 155 | """ 156 | numWords = wordLk.size(1) 157 | 158 | # Sum the previous scores. 159 | if len(self.prevKs) > 0: 160 | beamLk = wordLk + self.scores.unsqueeze(1).expand_as(wordLk) 161 | 162 | # Don't let EOS have children. 163 | for i in range(self.nextYs[-1].size(0)): 164 | if self.nextYs[-1][i] == self._eos: 165 | beamLk[i] = -1e20 166 | else: 167 | beamLk = wordLk[0] 168 | flatBeamLk = beamLk.view(-1) 169 | bestScores, bestScoresId = flatBeamLk.topk(self.size, 0, True, True) 170 | 171 | self.scores = bestScores 172 | 173 | # bestScoresId is flattened beam x word array, so calculate which 174 | # word and beam each score came from 175 | prevK = bestScoresId // numWords 176 | self.prevKs.append(prevK) 177 | self.nextYs.append((bestScoresId - prevK * numWords)) 178 | 179 | 180 | for i in range(self.nextYs[-1].size(0)): 181 | if self.nextYs[-1][i] == self._eos: 182 | s = self.scores[i] 183 | self.finished.append((s, len(self.nextYs) - 1, i)) 184 | 185 | # End condition is when top-of-beam is EOS and no global score. 186 | if self.nextYs[-1][0] == self._eos: 187 | self.eosTop = True 188 | 189 | def done(self): 190 | return self.eosTop and len(self.finished) >=self.size 191 | 192 | def getFinal(self): 193 | if len(self.finished) == 0: 194 | self.finished.append((self.scores[0], len(self.nextYs) - 1, 0)) 195 | self.finished.sort(key=lambda a: -a[0]) 196 | if len(self.finished) != self.size: 197 | unfinished=[] 198 | for i in range(self.nextYs[-1].size(0)): 199 | if self.nextYs[-1][i] != self._eos: 200 | s = self.scores[i] 201 | unfinished.append((s, len(self.nextYs) - 1, i)) 202 | unfinished.sort(key=lambda a: -a[0]) 203 | self.finished+=unfinished[:self.size-len(self.finished)] 204 | return self.finished[:self.size] 205 | 206 | def getHyp(self, beam_res): 207 | """ 208 | Walk back to construct the full hypothesis. 
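Starting from the recorded (timestep, k) of a finished candidate, follow the
prevKs backpointers towards the start and reverse the collected tokens.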
209 | """ 210 | hyps=[] 211 | for _,timestep, k in beam_res: 212 | hyp = [] 213 | for j in range(len(self.prevKs[:timestep]) - 1, -1, -1): 214 | hyp.append(self.nextYs[j+1][k]) 215 | k = self.prevKs[j][k] 216 | hyps.append(hyp[::-1]) 217 | return hyps 218 | 219 | def buildTargetTokens(self, preds): 220 | sentence=[] 221 | for pred in preds: 222 | tokens = [] 223 | for tok in pred: 224 | if tok==self._eos: 225 | break 226 | tokens.append(tok) 227 | sentence.append(tokens) 228 | return sentence 229 | 230 | -------------------------------------------------------------------------------- /GraphCodeBERT/refinement/parser/__init__.py: -------------------------------------------------------------------------------- 1 | from .utils import (remove_comments_and_docstrings, 2 | tree_to_token_index, 3 | index_to_code_token, 4 | tree_to_variable_index) 5 | from .DFG import DFG_python,DFG_java,DFG_ruby,DFG_go,DFG_php,DFG_javascript,DFG_csharp -------------------------------------------------------------------------------- /GraphCodeBERT/refinement/parser/build.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Microsoft Corporation. 2 | # Licensed under the MIT license. 3 | 4 | from tree_sitter import Language, Parser 5 | 6 | Language.build_library( 7 | # Store the library in the `build` directory 8 | 'my-languages.so', 9 | 10 | # Include one or more languages 11 | [ 12 | 'tree-sitter-go', 13 | 'tree-sitter-javascript', 14 | 'tree-sitter-python', 15 | 'tree-sitter-php', 16 | 'tree-sitter-java', 17 | 'tree-sitter-ruby', 18 | 'tree-sitter-c-sharp', 19 | ] 20 | ) 21 | 22 | -------------------------------------------------------------------------------- /GraphCodeBERT/refinement/parser/build.sh: -------------------------------------------------------------------------------- 1 | git clone https://github.com/tree-sitter/tree-sitter-go 2 | git clone https://github.com/tree-sitter/tree-sitter-javascript 3 | git clone https://github.com/tree-sitter/tree-sitter-python 4 | git clone https://github.com/tree-sitter/tree-sitter-ruby 5 | git clone https://github.com/tree-sitter/tree-sitter-php 6 | git clone https://github.com/tree-sitter/tree-sitter-java 7 | git clone https://github.com/tree-sitter/tree-sitter-c-sharp 8 | python build.py 9 | -------------------------------------------------------------------------------- /GraphCodeBERT/refinement/parser/my-languages.so: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/refinement/parser/my-languages.so -------------------------------------------------------------------------------- /GraphCodeBERT/refinement/parser/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | from io import StringIO 3 | import tokenize 4 | def remove_comments_and_docstrings(source,lang): 5 | if lang in ['python']: 6 | """ 7 | Returns 'source' minus comments and docstrings. 
8 | """ 9 | io_obj = StringIO(source) 10 | out = "" 11 | prev_toktype = tokenize.INDENT 12 | last_lineno = -1 13 | last_col = 0 14 | for tok in tokenize.generate_tokens(io_obj.readline): 15 | token_type = tok[0] 16 | token_string = tok[1] 17 | start_line, start_col = tok[2] 18 | end_line, end_col = tok[3] 19 | ltext = tok[4] 20 | if start_line > last_lineno: 21 | last_col = 0 22 | if start_col > last_col: 23 | out += (" " * (start_col - last_col)) 24 | # Remove comments: 25 | if token_type == tokenize.COMMENT: 26 | pass 27 | # This series of conditionals removes docstrings: 28 | elif token_type == tokenize.STRING: 29 | if prev_toktype != tokenize.INDENT: 30 | # This is likely a docstring; double-check we're not inside an operator: 31 | if prev_toktype != tokenize.NEWLINE: 32 | if start_col > 0: 33 | out += token_string 34 | else: 35 | out += token_string 36 | prev_toktype = token_type 37 | last_col = end_col 38 | last_lineno = end_line 39 | temp=[] 40 | for x in out.split('\n'): 41 | if x.strip()!="": 42 | temp.append(x) 43 | return '\n'.join(temp) 44 | elif lang in ['ruby']: 45 | return source 46 | else: 47 | def replacer(match): 48 | s = match.group(0) 49 | if s.startswith('/'): 50 | return " " # note: a space and not an empty string 51 | else: 52 | return s 53 | pattern = re.compile( 54 | r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"', 55 | re.DOTALL | re.MULTILINE 56 | ) 57 | temp=[] 58 | for x in re.sub(pattern, replacer, source).split('\n'): 59 | if x.strip()!="": 60 | temp.append(x) 61 | return '\n'.join(temp) 62 | 63 | def tree_to_token_index(root_node): 64 | if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment': 65 | return [(root_node.start_point,root_node.end_point)] 66 | else: 67 | code_tokens=[] 68 | for child in root_node.children: 69 | code_tokens+=tree_to_token_index(child) 70 | return code_tokens 71 | 72 | def tree_to_variable_index(root_node,index_to_code): 73 | if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment': 74 | index=(root_node.start_point,root_node.end_point) 75 | _,code=index_to_code[index] 76 | if root_node.type!=code: 77 | return [(root_node.start_point,root_node.end_point)] 78 | else: 79 | return [] 80 | else: 81 | code_tokens=[] 82 | for child in root_node.children: 83 | code_tokens+=tree_to_variable_index(child,index_to_code) 84 | return code_tokens 85 | 86 | def index_to_code_token(index,code): 87 | start_point=index[0] 88 | end_point=index[1] 89 | if start_point[0]==end_point[0]: 90 | s=code[start_point[0]][start_point[1]:end_point[1]] 91 | else: 92 | s="" 93 | s+=code[start_point[0]][start_point[1]:] 94 | for i in range(start_point[0]+1,end_point[0]): 95 | s+=code[i] 96 | s+=code[end_point[0]][:end_point[1]] 97 | return s 98 | -------------------------------------------------------------------------------- /GraphCodeBERT/translation/README.md: -------------------------------------------------------------------------------- 1 | # Code Translation 2 | 3 | ## Task Definition 4 | 5 | Code translation aims to migrate legacy software from one programming language in a platform toanother. 6 | Given a piece of Java (C#) code, the task is to translate the code into C# (Java) version. 7 | Models are evaluated by BLEU scores and accuracy (exactly match). 
8 | 
9 | ## Dataset
10 | 
11 | The dataset is collected from several public repos, including Lucene (http://lucene.apache.org/), POI (http://poi.apache.org/), JGit (https://github.com/eclipse/jgit/) and Antlr (https://github.com/antlr/).
12 | 
13 | We collect both the Java and C# versions of the code and find the parallel functions. After removing duplicates and functions with an empty body, we split the whole dataset into training, validation and test sets.
14 | 
15 | ### Data Format
16 | 
17 | The dataset is in the "data" folder. Each line of the files is a function, and the suffix of the file indicates the programming language. You can get the data using the following command:
18 | 
19 | ```
20 | unzip data.zip
21 | ```
22 | 
23 | ### Data Statistics
24 | 
25 | Data statistics of the dataset are shown in the table below:
26 | 
27 | | | #Examples |
28 | | ------- | :-------: |
29 | | Train | 10,300 |
30 | | Valid | 500 |
31 | | Test | 1,000 |
32 | 
33 | ## Pipeline-GraphCodeBERT
34 | 
35 | ### Dependency
36 | 
37 | - pip install torch
38 | - pip install transformers
39 | - pip install tree_sitter
40 | 
41 | ### Tree-sitter (optional)
42 | 
43 | If the pre-built file "parser/my-languages.so" doesn't work for you, rebuild it with the following commands:
44 | 
45 | ```shell
46 | cd parser
47 | bash build.sh
48 | cd ..
49 | ```
50 | 
51 | ### Fine-tune
52 | We use 4 V100-16G GPUs to fine-tune. Taking Java-to-C# translation as an example:
53 | 
54 | ```shell
55 | source=java
56 | target=cs
57 | lr=1e-4
58 | batch_size=32
59 | beam_size=10
60 | source_length=320
61 | target_length=256
62 | output_dir=saved_models/$source-$target/
63 | train_file=data/train.java-cs.txt.$source,data/train.java-cs.txt.$target
64 | dev_file=data/valid.java-cs.txt.$source,data/valid.java-cs.txt.$target
65 | epochs=100
66 | pretrained_model=microsoft/graphcodebert-base
67 | 
68 | mkdir -p $output_dir
69 | python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --tokenizer_name microsoft/graphcodebert-base --config_name microsoft/graphcodebert-base --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs 2>&1| tee $output_dir/train.log
70 | ```
71 | 
72 | ### Inference
73 | 
74 | We use the full test data for inference.
75 | 
76 | ```shell
77 | batch_size=64
78 | dev_file=data/valid.java-cs.txt.$source,data/valid.java-cs.txt.$target
79 | test_file=data/test.java-cs.txt.$source,data/test.java-cs.txt.$target
80 | load_model_path=$output_dir/checkpoint-best-bleu/pytorch_model.bin #checkpoint for test
81 | 
82 | python run.py --do_test --model_type roberta --model_name_or_path $pretrained_model --tokenizer_name microsoft/graphcodebert-base --config_name microsoft/graphcodebert-base --load_model_path $load_model_path --dev_filename $dev_file --test_filename $test_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --eval_batch_size $batch_size 2>&1| tee $output_dir/test.log
83 | ```
84 | 
85 | 
86 | 
87 | ## Result
88 | 
89 | The results on the test set are shown below:
90 | 
91 | Java to C#:
92 | 
93 | | Method | BLEU | Acc (100%) |
94 | | ---------- | :--------: | :-------: |
95 | | Naive copy | 18.54 | 0.0 |
96 | | PBSMT | 43.53 | 12.5 |
97 | | Transformer | 55.84 | 33.0 |
98 | | RoBERTa (code) | 77.46 | 56.1 |
99 | | CodeBERT | 79.92 | 59.0 |
100 | | GraphCodeBERT | **80.58** | **59.4** |
101 | 
102 | C# to Java:
103 | 
104 | | Method | BLEU | Acc (100%) |
105 | | -------------- | :-------: | :--------: |
106 | | Naive copy | 18.69 | 0.0 |
107 | | PBSMT | 40.06 | 16.1 |
108 | | Transformer | 50.47 | 37.9 |
109 | | RoBERTa (code) | 71.99 | 57.9 |
110 | | CodeBERT | 72.14 | 58.0 |
111 | | GraphCodeBERT | **72.64** | **58.8** |
112 | 
113 | 
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/bleu.py:
--------------------------------------------------------------------------------
1 | # Copyright 2017 Google Inc. All Rights Reserved.
2 | #
3 | # Licensed under the Apache License, Version 2.0 (the "License");
4 | # you may not use this file except in compliance with the License.
5 | # You may obtain a copy of the License at
6 | #
7 | # http://www.apache.org/licenses/LICENSE-2.0
8 | #
9 | # Unless required by applicable law or agreed to in writing, software
10 | # distributed under the License is distributed on an "AS IS" BASIS,
11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 | # See the License for the specific language governing permissions and
13 | # limitations under the License.
14 | # ==============================================================================
15 | 
16 | """Python implementation of BLEU and smooth-BLEU.
17 | 
18 | This module provides a Python implementation of BLEU and smooth-BLEU.
19 | Smooth BLEU is computed following the method outlined in the paper:
20 | Chin-Yew Lin, Franz Josef Och. ORANGE: a method for evaluating automatic
21 | evaluation metrics for machine translation. COLING 2004.
22 | """
23 | 
24 | import collections
25 | import math
26 | 
27 | 
28 | def _get_ngrams(segment, max_order):
29 | """Extracts all n-grams up to a given maximum order from an input segment.
30 | 
31 | Args:
32 | segment: text segment from which n-grams will be extracted.
33 | max_order: maximum length in tokens of the n-grams returned by this
34 | method.
35 | 
36 | Returns:
37 | The Counter containing all n-grams up to max_order in segment
38 | with a count of how many times each n-gram occurred.
39 | """ 40 | ngram_counts = collections.Counter() 41 | for order in range(1, max_order + 1): 42 | for i in range(0, len(segment) - order + 1): 43 | ngram = tuple(segment[i:i+order]) 44 | ngram_counts[ngram] += 1 45 | return ngram_counts 46 | 47 | 48 | def compute_bleu(reference_corpus, translation_corpus, max_order=4, 49 | smooth=False): 50 | """Computes BLEU score of translated segments against one or more references. 51 | 52 | Args: 53 | reference_corpus: list of lists of references for each translation. Each 54 | reference should be tokenized into a list of tokens. 55 | translation_corpus: list of translations to score. Each translation 56 | should be tokenized into a list of tokens. 57 | max_order: Maximum n-gram order to use when computing BLEU score. 58 | smooth: Whether or not to apply Lin et al. 2004 smoothing. 59 | 60 | Returns: 61 | 3-Tuple with the BLEU score, n-gram precisions, geometric mean of n-gram 62 | precisions and brevity penalty. 63 | """ 64 | matches_by_order = [0] * max_order 65 | possible_matches_by_order = [0] * max_order 66 | reference_length = 0 67 | translation_length = 0 68 | for (references, translation) in zip(reference_corpus, 69 | translation_corpus): 70 | reference_length += min(len(r) for r in references) 71 | translation_length += len(translation) 72 | 73 | merged_ref_ngram_counts = collections.Counter() 74 | for reference in references: 75 | merged_ref_ngram_counts |= _get_ngrams(reference, max_order) 76 | translation_ngram_counts = _get_ngrams(translation, max_order) 77 | overlap = translation_ngram_counts & merged_ref_ngram_counts 78 | for ngram in overlap: 79 | matches_by_order[len(ngram)-1] += overlap[ngram] 80 | for order in range(1, max_order+1): 81 | possible_matches = len(translation) - order + 1 82 | if possible_matches > 0: 83 | possible_matches_by_order[order-1] += possible_matches 84 | 85 | precisions = [0] * max_order 86 | for i in range(0, max_order): 87 | if smooth: 88 | precisions[i] = ((matches_by_order[i] + 1.) / 89 | (possible_matches_by_order[i] + 1.)) 90 | else: 91 | if possible_matches_by_order[i] > 0: 92 | precisions[i] = (float(matches_by_order[i]) / 93 | possible_matches_by_order[i]) 94 | else: 95 | precisions[i] = 0.0 96 | 97 | if min(precisions) > 0: 98 | p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions) 99 | geo_mean = math.exp(p_log_sum) 100 | else: 101 | geo_mean = 0 102 | 103 | ratio = float(translation_length) / reference_length 104 | 105 | if ratio > 1.0: 106 | bp = 1. 107 | else: 108 | bp = math.exp(1 - 1. 
/ ratio)
109 | 
110 | bleu = geo_mean * bp
111 | 
112 | return (bleu, precisions, bp, ratio, translation_length, reference_length)
113 | 
114 | 
115 | def _bleu(ref_file, trans_file, subword_option=None):
116 | max_order = 4
117 | smooth = True
118 | ref_files = [ref_file]
119 | reference_text = []
120 | for reference_filename in ref_files:
121 | with open(reference_filename) as fh:
122 | reference_text.append(fh.readlines())
123 | per_segment_references = []
124 | for references in zip(*reference_text):
125 | reference_list = []
126 | for reference in references:
127 | reference_list.append(reference.strip().split())
128 | per_segment_references.append(reference_list)
129 | translations = []
130 | with open(trans_file) as fh:
131 | for line in fh:
132 | translations.append(line.strip().split())
133 | bleu_score, _, _, _, _, _ = compute_bleu(per_segment_references, translations, max_order, smooth)
134 | return round(100 * bleu_score,2)
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/data.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/translation/data.zip
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/model.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Microsoft Corporation.
2 | # Licensed under the MIT license.
3 | 
4 | import torch
5 | import torch.nn as nn
6 | import torch
7 | from torch.autograd import Variable
8 | import copy
9 | class Seq2Seq(nn.Module):
10 | """
11 | Build a Sequence-to-Sequence model.
12 | 
13 | Parameters:
14 | 
15 | * `encoder`- encoder of seq2seq model. e.g. roberta
16 | * `decoder`- decoder of seq2seq model. e.g. transformer
17 | * `config`- configuration of encoder model.
18 | * `beam_size`- beam size for beam search.
19 | * `max_length`- max length of target for beam search.
20 | * `sos_id`- start-of-sequence symbol id in target for beam search.
21 | * `eos_id`- end-of-sequence symbol id in target for beam search.
22 | """
23 | def __init__(self, encoder,decoder,config,beam_size=None,max_length=None,sos_id=None,eos_id=None):
24 | super(Seq2Seq, self).__init__()
25 | self.encoder = encoder
26 | self.decoder=decoder
27 | self.config=config
28 | self.register_buffer("bias", torch.tril(torch.ones(2048, 2048)))
29 | self.dense = nn.Linear(config.hidden_size, config.hidden_size)
30 | self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
31 | self.lsm = nn.LogSoftmax(dim=-1)
32 | self.tie_weights()
33 | 
34 | self.beam_size=beam_size
35 | self.max_length=max_length
36 | self.sos_id=sos_id
37 | self.eos_id=eos_id
38 | 
39 | def _tie_or_clone_weights(self, first_module, second_module):
40 | """ Tie or clone module weights depending on whether we are using TorchScript or not
41 | """
42 | if self.config.torchscript:
43 | first_module.weight = nn.Parameter(second_module.weight.clone())
44 | else:
45 | first_module.weight = second_module.weight
46 | 
47 | def tie_weights(self):
48 | """ Make sure we are sharing the input and output embeddings.
49 | Export to TorchScript can't handle parameter sharing so we are cloning them instead.
50 | """ 51 | self._tie_or_clone_weights(self.lm_head, 52 | self.encoder.embeddings.word_embeddings) 53 | 54 | def forward(self, source_ids,source_mask,position_idx,attn_mask,target_ids=None,target_mask=None,args=None): 55 | #embedding 56 | nodes_mask=position_idx.eq(0) 57 | token_mask=position_idx.ge(2) 58 | inputs_embeddings=self.encoder.embeddings.word_embeddings(source_ids) 59 | nodes_to_token_mask=nodes_mask[:,:,None]&token_mask[:,None,:]&attn_mask 60 | nodes_to_token_mask=nodes_to_token_mask/(nodes_to_token_mask.sum(-1)+1e-10)[:,:,None] 61 | avg_embeddings=torch.einsum("abc,acd->abd",nodes_to_token_mask,inputs_embeddings) 62 | inputs_embeddings=inputs_embeddings*(~nodes_mask)[:,:,None]+avg_embeddings*nodes_mask[:,:,None] 63 | 64 | outputs = self.encoder(inputs_embeds=inputs_embeddings,attention_mask=attn_mask,position_ids=position_idx) 65 | encoder_output = outputs[0].permute([1,0,2]).contiguous() 66 | #source_mask=token_mask.float() 67 | if target_ids is not None: 68 | attn_mask=-1e4 *(1-self.bias[:target_ids.shape[1],:target_ids.shape[1]]) 69 | tgt_embeddings = self.encoder.embeddings(target_ids).permute([1,0,2]).contiguous() 70 | out = self.decoder(tgt_embeddings,encoder_output,tgt_mask=attn_mask,memory_key_padding_mask=(1-source_mask).bool()) 71 | hidden_states = torch.tanh(self.dense(out)).permute([1,0,2]).contiguous() 72 | lm_logits = self.lm_head(hidden_states) 73 | # Shift so that tokens < n predict n 74 | active_loss = target_mask[..., 1:].ne(0).view(-1) == 1 75 | shift_logits = lm_logits[..., :-1, :].contiguous() 76 | shift_labels = target_ids[..., 1:].contiguous() 77 | # Flatten the tokens 78 | loss_fct = nn.CrossEntropyLoss(ignore_index=-1) 79 | loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1))[active_loss], 80 | shift_labels.view(-1)[active_loss]) 81 | 82 | outputs = loss,loss*active_loss.sum(),active_loss.sum() 83 | return outputs 84 | else: 85 | #Predict 86 | preds=[] 87 | zero=torch.cuda.LongTensor(1).fill_(0) 88 | for i in range(source_ids.shape[0]): 89 | context=encoder_output[:,i:i+1] 90 | context_mask=source_mask[i:i+1,:] 91 | beam = Beam(self.beam_size,self.sos_id,self.eos_id) 92 | input_ids=beam.getCurrentState() 93 | context=context.repeat(1, self.beam_size,1) 94 | context_mask=context_mask.repeat(self.beam_size,1) 95 | for _ in range(self.max_length): 96 | if beam.done(): 97 | break 98 | attn_mask=-1e4 *(1-self.bias[:input_ids.shape[1],:input_ids.shape[1]]) 99 | tgt_embeddings = self.encoder.embeddings(input_ids).permute([1,0,2]).contiguous() 100 | out = self.decoder(tgt_embeddings,context,tgt_mask=attn_mask,memory_key_padding_mask=(1-context_mask).bool()) 101 | out = torch.tanh(self.dense(out)) 102 | hidden_states=out.permute([1,0,2]).contiguous()[:,-1,:] 103 | out = self.lsm(self.lm_head(hidden_states)).data 104 | beam.advance(out) 105 | input_ids.data.copy_(input_ids.data.index_select(0, beam.getCurrentOrigin())) 106 | input_ids=torch.cat((input_ids,beam.getCurrentState()),-1) 107 | hyp= beam.getHyp(beam.getFinal()) 108 | pred=beam.buildTargetTokens(hyp)[:self.beam_size] 109 | pred=[torch.cat([x.view(-1) for x in p]+[zero]*(self.max_length-len(p))).view(1,-1) for p in pred] 110 | preds.append(torch.cat(pred,0).unsqueeze(0)) 111 | 112 | preds=torch.cat(preds,0) 113 | return preds 114 | 115 | 116 | 117 | class Beam(object): 118 | def __init__(self, size,sos,eos): 119 | self.size = size 120 | self.tt = torch.cuda 121 | # The score for each translation on the beam. 
122 |         self.scores = self.tt.FloatTensor(size).zero_()
123 |         # The backpointers at each time-step.
124 |         self.prevKs = []
125 |         # The outputs at each time-step.
126 |         self.nextYs = [self.tt.LongTensor(size)
127 |                        .fill_(0)]
128 |         self.nextYs[0][0] = sos
129 |         # Has EOS topped the beam yet.
130 |         self._eos = eos
131 |         self.eosTop = False
132 |         # Time and k pair for finished.
133 |         self.finished = []
134 |
135 |     def getCurrentState(self):
136 |         "Get the outputs for the current timestep."
137 |         batch = self.tt.LongTensor(self.nextYs[-1]).view(-1, 1)
138 |         return batch
139 |
140 |     def getCurrentOrigin(self):
141 |         "Get the backpointers for the current timestep."
142 |         return self.prevKs[-1]
143 |
144 |     def advance(self, wordLk):
145 |         """
146 |         Given the word log-probabilities for every hypothesis on the beam
147 |         (`wordLk`), compute and update the beam search state.
148 |
149 |         Parameters:
150 |
151 |         * `wordLk`- log-probs of advancing from the last step (K x words)
152 |
153 |         Advances the beam in place; use `done()` to check whether the
154 |         search is complete.
155 |         """
156 |         numWords = wordLk.size(1)
157 |
158 |         # Sum the previous scores.
159 |         if len(self.prevKs) > 0:
160 |             beamLk = wordLk + self.scores.unsqueeze(1).expand_as(wordLk)
161 |
162 |             # Don't let EOS have children.
163 |             for i in range(self.nextYs[-1].size(0)):
164 |                 if self.nextYs[-1][i] == self._eos:
165 |                     beamLk[i] = -1e20
166 |         else:
167 |             beamLk = wordLk[0]
168 |         flatBeamLk = beamLk.view(-1)
169 |         bestScores, bestScoresId = flatBeamLk.topk(self.size, 0, True, True)
170 |
171 |         self.scores = bestScores
172 |
173 |         # bestScoresId is a flattened beam x word array, so calculate which
174 |         # word and beam each score came from
175 |         prevK = bestScoresId // numWords
176 |         self.prevKs.append(prevK)
177 |         self.nextYs.append((bestScoresId - prevK * numWords))
178 |
179 |
180 |         for i in range(self.nextYs[-1].size(0)):
181 |             if self.nextYs[-1][i] == self._eos:
182 |                 s = self.scores[i]
183 |                 self.finished.append((s, len(self.nextYs) - 1, i))
184 |
185 |         # End condition is when top-of-beam is EOS and no global score.
186 |         if self.nextYs[-1][0] == self._eos:
187 |             self.eosTop = True
188 |
189 |     def done(self):
190 |         return self.eosTop and len(self.finished) >= self.size
191 |
192 |     def getFinal(self):
193 |         if len(self.finished) == 0:
194 |             self.finished.append((self.scores[0], len(self.nextYs) - 1, 0))
195 |         self.finished.sort(key=lambda a: -a[0])
196 |         if len(self.finished) != self.size:
197 |             unfinished = []
198 |             for i in range(self.nextYs[-1].size(0)):
199 |                 if self.nextYs[-1][i] != self._eos:
200 |                     s = self.scores[i]
201 |                     unfinished.append((s, len(self.nextYs) - 1, i))
202 |             unfinished.sort(key=lambda a: -a[0])
203 |             self.finished += unfinished[:self.size - len(self.finished)]
204 |         return self.finished[:self.size]
205 |
206 |     def getHyp(self, beam_res):
207 |         """
208 |         Walk back to construct the full hypothesis.
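        For each finished (score, timestep, k) triple, follow the backpointers in
        `prevKs` from that timestep down to step 0, collect the token emitted at
        each step, and return the sequences reversed into front-to-back order.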
209 |         """
210 |         hyps = []
211 |         for _, timestep, k in beam_res:
212 |             hyp = []
213 |             for j in range(len(self.prevKs[:timestep]) - 1, -1, -1):
214 |                 hyp.append(self.nextYs[j + 1][k])
215 |                 k = self.prevKs[j][k]
216 |             hyps.append(hyp[::-1])
217 |         return hyps
218 |
219 |     def buildTargetTokens(self, preds):
220 |         sentences = []
221 |         for pred in preds:
222 |             tokens = []
223 |             for tok in pred:
224 |                 if tok == self._eos:
225 |                     break
226 |                 tokens.append(tok)
227 |             sentences.append(tokens)
228 |         return sentences
229 |
230 |
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/parser/__init__.py:
--------------------------------------------------------------------------------
1 | from .utils import (remove_comments_and_docstrings,
2 |                     tree_to_token_index,
3 |                     index_to_code_token,
4 |                     tree_to_variable_index)
5 | from .DFG import DFG_python, DFG_java, DFG_ruby, DFG_go, DFG_php, DFG_javascript, DFG_csharp
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/parser/build.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) Microsoft Corporation.
2 | # Licensed under the MIT license.
3 |
4 | from tree_sitter import Language, Parser
5 |
6 | Language.build_library(
7 |     # Store the compiled library as `my-languages.so` in the current directory
8 |     'my-languages.so',
9 |
10 |     # Include one or more languages
11 |     [
12 |         'tree-sitter-go',
13 |         'tree-sitter-javascript',
14 |         'tree-sitter-python',
15 |         'tree-sitter-php',
16 |         'tree-sitter-java',
17 |         'tree-sitter-ruby',
18 |         'tree-sitter-c-sharp',
19 |     ]
20 | )
21 |
22 |
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/parser/build.sh:
--------------------------------------------------------------------------------
1 | git clone https://github.com/tree-sitter/tree-sitter-go
2 | git clone https://github.com/tree-sitter/tree-sitter-javascript
3 | git clone https://github.com/tree-sitter/tree-sitter-python
4 | git clone https://github.com/tree-sitter/tree-sitter-ruby
5 | git clone https://github.com/tree-sitter/tree-sitter-php
6 | git clone https://github.com/tree-sitter/tree-sitter-java
7 | git clone https://github.com/tree-sitter/tree-sitter-c-sharp
8 | python build.py
9 |
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/parser/my-languages.so:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/sxjscience/CodeBERT/e20547d53e4e6b7d97c2394470d2f6ef922e88ad/GraphCodeBERT/translation/parser/my-languages.so
--------------------------------------------------------------------------------
/GraphCodeBERT/translation/parser/utils.py:
--------------------------------------------------------------------------------
1 | import re
2 | from io import StringIO
3 | import tokenize
4 | def remove_comments_and_docstrings(source, lang):
5 |     if lang in ['python']:
6 |         """
7 |         Returns 'source' minus comments and docstrings.
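        Python source is re-tokenized with the standard `tokenize` module so that
        comment and docstring tokens can be dropped individually; other languages
        fall through to the regex pass below, and Ruby is returned unchanged.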
8 |         """
9 |         io_obj = StringIO(source)
10 |         out = ""
11 |         prev_toktype = tokenize.INDENT
12 |         last_lineno = -1
13 |         last_col = 0
14 |         for tok in tokenize.generate_tokens(io_obj.readline):
15 |             token_type = tok[0]
16 |             token_string = tok[1]
17 |             start_line, start_col = tok[2]
18 |             end_line, end_col = tok[3]
19 |             ltext = tok[4]
20 |             if start_line > last_lineno:
21 |                 last_col = 0
22 |             if start_col > last_col:
23 |                 out += (" " * (start_col - last_col))
24 |             # Remove comments:
25 |             if token_type == tokenize.COMMENT:
26 |                 pass
27 |             # This series of conditionals removes docstrings:
28 |             elif token_type == tokenize.STRING:
29 |                 if prev_toktype != tokenize.INDENT:
30 |                     # This is likely a docstring; double-check we're not inside an operator:
31 |                     if prev_toktype != tokenize.NEWLINE:
32 |                         if start_col > 0:
33 |                             out += token_string
34 |             else:
35 |                 out += token_string
36 |             prev_toktype = token_type
37 |             last_col = end_col
38 |             last_lineno = end_line
39 |         temp = []
40 |         for x in out.split('\n'):
41 |             if x.strip() != "":
42 |                 temp.append(x)
43 |         return '\n'.join(temp)
44 |     elif lang in ['ruby']:
45 |         return source
46 |     else:
47 |         def replacer(match):
48 |             s = match.group(0)
49 |             if s.startswith('/'):
50 |                 return " "  # note: a space and not an empty string
51 |             else:
52 |                 return s
53 |         pattern = re.compile(
54 |             r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
55 |             re.DOTALL | re.MULTILINE
56 |         )
57 |         temp = []
58 |         for x in re.sub(pattern, replacer, source).split('\n'):
59 |             if x.strip() != "":
60 |                 temp.append(x)
61 |         return '\n'.join(temp)
62 |
63 | def tree_to_token_index(root_node):
64 |     if (len(root_node.children) == 0 or root_node.type == 'string') and root_node.type != 'comment':
65 |         return [(root_node.start_point, root_node.end_point)]
66 |     else:
67 |         code_tokens = []
68 |         for child in root_node.children:
69 |             code_tokens += tree_to_token_index(child)
70 |         return code_tokens
71 |
72 | def tree_to_variable_index(root_node, index_to_code):
73 |     if (len(root_node.children) == 0 or root_node.type == 'string') and root_node.type != 'comment':
74 |         index = (root_node.start_point, root_node.end_point)
75 |         _, code = index_to_code[index]
76 |         if root_node.type != code:
77 |             return [(root_node.start_point, root_node.end_point)]
78 |         else:
79 |             return []
80 |     else:
81 |         code_tokens = []
82 |         for child in root_node.children:
83 |             code_tokens += tree_to_variable_index(child, index_to_code)
84 |         return code_tokens
85 |
86 | def index_to_code_token(index, code):
87 |     start_point = index[0]
88 |     end_point = index[1]
89 |     if start_point[0] == end_point[0]:
90 |         s = code[start_point[0]][start_point[1]:end_point[1]]
91 |     else:
92 |         s = ""
93 |         s += code[start_point[0]][start_point[1]:]
94 |         for i in range(start_point[0] + 1, end_point[0]):
95 |             s += code[i]
96 |         s += code[end_point[0]][:end_point[1]]
97 |     return s
98 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) Microsoft Corporation.
2 | 3 | MIT License 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /NOTICE.md: -------------------------------------------------------------------------------- 1 | NOTICES AND INFORMATION 2 | 3 | Do Not Translate or Localize 4 | 5 | This software incorporates material from third parties. Microsoft makes certain open source code available at http://3rdpartysource.microsoft.com, or you may send a check or money order for US $5.00, including the product name, the open source component name, and version number, to: 6 | 7 | Source Code Compliance Team Microsoft Corporation One Microsoft Way Redmond, WA 98052 USA 8 | 9 | Notwithstanding any other terms, you may reverse engineer this software to the extent required to debug changes to any libraries licensed under the GNU Lesser General Public License. 10 | 11 | =============================================================================== 12 | 13 | Component. 14 | 15 | huggingface/transformers 16 | 17 | Open Source License/Copyright Notice. 18 | 19 | ``` 20 | Apache License 21 | Version 2.0, January 2004 22 | http://www.apache.org/licenses/ 23 | 24 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 25 | 26 | 1. Definitions. 27 | 28 | "License" shall mean the terms and conditions for use, reproduction, 29 | and distribution as defined by Sections 1 through 9 of this document. 30 | 31 | "Licensor" shall mean the copyright owner or entity authorized by 32 | the copyright owner that is granting the License. 33 | 34 | "Legal Entity" shall mean the union of the acting entity and all 35 | other entities that control, are controlled by, or are under common 36 | control with that entity. For the purposes of this definition, 37 | "control" means (i) the power, direct or indirect, to cause the 38 | direction or management of such entity, whether by contract or 39 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 40 | outstanding shares, or (iii) beneficial ownership of such entity. 41 | 42 | "You" (or "Your") shall mean an individual or Legal Entity 43 | exercising permissions granted by this License. 44 | 45 | "Source" form shall mean the preferred form for making modifications, 46 | including but not limited to software source code, documentation 47 | source, and configuration files. 
48 | 49 | "Object" form shall mean any form resulting from mechanical 50 | transformation or translation of a Source form, including but 51 | not limited to compiled object code, generated documentation, 52 | and conversions to other media types. 53 | 54 | "Work" shall mean the work of authorship, whether in Source or 55 | Object form, made available under the License, as indicated by a 56 | copyright notice that is included in or attached to the work 57 | (an example is provided in the Appendix below). 58 | 59 | "Derivative Works" shall mean any work, whether in Source or Object 60 | form, that is based on (or derived from) the Work and for which the 61 | editorial revisions, annotations, elaborations, or other modifications 62 | represent, as a whole, an original work of authorship. For the purposes 63 | of this License, Derivative Works shall not include works that remain 64 | separable from, or merely link (or bind by name) to the interfaces of, 65 | the Work and Derivative Works thereof. 66 | 67 | "Contribution" shall mean any work of authorship, including 68 | the original version of the Work and any modifications or additions 69 | to that Work or Derivative Works thereof, that is intentionally 70 | submitted to Licensor for inclusion in the Work by the copyright owner 71 | or by an individual or Legal Entity authorized to submit on behalf of 72 | the copyright owner. For the purposes of this definition, "submitted" 73 | means any form of electronic, verbal, or written communication sent 74 | to the Licensor or its representatives, including but not limited to 75 | communication on electronic mailing lists, source code control systems, 76 | and issue tracking systems that are managed by, or on behalf of, the 77 | Licensor for the purpose of discussing and improving the Work, but 78 | excluding communication that is conspicuously marked or otherwise 79 | designated in writing by the copyright owner as "Not a Contribution." 80 | 81 | "Contributor" shall mean Licensor and any individual or Legal Entity 82 | on behalf of whom a Contribution has been received by Licensor and 83 | subsequently incorporated within the Work. 84 | 85 | 2. Grant of Copyright License. Subject to the terms and conditions of 86 | this License, each Contributor hereby grants to You a perpetual, 87 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 88 | copyright license to reproduce, prepare Derivative Works of, 89 | publicly display, publicly perform, sublicense, and distribute the 90 | Work and such Derivative Works in Source or Object form. 91 | 92 | 3. Grant of Patent License. Subject to the terms and conditions of 93 | this License, each Contributor hereby grants to You a perpetual, 94 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 95 | (except as stated in this section) patent license to make, have made, 96 | use, offer to sell, sell, import, and otherwise transfer the Work, 97 | where such license applies only to those patent claims licensable 98 | by such Contributor that are necessarily infringed by their 99 | Contribution(s) alone or by combination of their Contribution(s) 100 | with the Work to which such Contribution(s) was submitted. 
If You 101 | institute patent litigation against any entity (including a 102 | cross-claim or counterclaim in a lawsuit) alleging that the Work 103 | or a Contribution incorporated within the Work constitutes direct 104 | or contributory patent infringement, then any patent licenses 105 | granted to You under this License for that Work shall terminate 106 | as of the date such litigation is filed. 107 | 108 | 4. Redistribution. You may reproduce and distribute copies of the 109 | Work or Derivative Works thereof in any medium, with or without 110 | modifications, and in Source or Object form, provided that You 111 | meet the following conditions: 112 | 113 | (a) You must give any other recipients of the Work or 114 | Derivative Works a copy of this License; and 115 | 116 | (b) You must cause any modified files to carry prominent notices 117 | stating that You changed the files; and 118 | 119 | (c) You must retain, in the Source form of any Derivative Works 120 | that You distribute, all copyright, patent, trademark, and 121 | attribution notices from the Source form of the Work, 122 | excluding those notices that do not pertain to any part of 123 | the Derivative Works; and 124 | 125 | (d) If the Work includes a "NOTICE" text file as part of its 126 | distribution, then any Derivative Works that You distribute must 127 | include a readable copy of the attribution notices contained 128 | within such NOTICE file, excluding those notices that do not 129 | pertain to any part of the Derivative Works, in at least one 130 | of the following places: within a NOTICE text file distributed 131 | as part of the Derivative Works; within the Source form or 132 | documentation, if provided along with the Derivative Works; or, 133 | within a display generated by the Derivative Works, if and 134 | wherever such third-party notices normally appear. The contents 135 | of the NOTICE file are for informational purposes only and 136 | do not modify the License. You may add Your own attribution 137 | notices within Derivative Works that You distribute, alongside 138 | or as an addendum to the NOTICE text from the Work, provided 139 | that such additional attribution notices cannot be construed 140 | as modifying the License. 141 | 142 | You may add Your own copyright statement to Your modifications and 143 | may provide additional or different license terms and conditions 144 | for use, reproduction, or distribution of Your modifications, or 145 | for any such Derivative Works as a whole, provided Your use, 146 | reproduction, and distribution of the Work otherwise complies with 147 | the conditions stated in this License. 148 | 149 | 5. Submission of Contributions. Unless You explicitly state otherwise, 150 | any Contribution intentionally submitted for inclusion in the Work 151 | by You to the Licensor shall be under the terms and conditions of 152 | this License, without any additional terms or conditions. 153 | Notwithstanding the above, nothing herein shall supersede or modify 154 | the terms of any separate license agreement you may have executed 155 | with Licensor regarding such Contributions. 156 | 157 | 6. Trademarks. This License does not grant permission to use the trade 158 | names, trademarks, service marks, or product names of the Licensor, 159 | except as required for reasonable and customary use in describing the 160 | origin of the Work and reproducing the content of the NOTICE file. 161 | 162 | 7. Disclaimer of Warranty. 
Unless required by applicable law or 163 | agreed to in writing, Licensor provides the Work (and each 164 | Contributor provides its Contributions) on an "AS IS" BASIS, 165 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 166 | implied, including, without limitation, any warranties or conditions 167 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 168 | PARTICULAR PURPOSE. You are solely responsible for determining the 169 | appropriateness of using or redistributing the Work and assume any 170 | risks associated with Your exercise of permissions under this License. 171 | 172 | 8. Limitation of Liability. In no event and under no legal theory, 173 | whether in tort (including negligence), contract, or otherwise, 174 | unless required by applicable law (such as deliberate and grossly 175 | negligent acts) or agreed to in writing, shall any Contributor be 176 | liable to You for damages, including any direct, indirect, special, 177 | incidental, or consequential damages of any character arising as a 178 | result of this License or out of the use or inability to use the 179 | Work (including but not limited to damages for loss of goodwill, 180 | work stoppage, computer failure or malfunction, or any and all 181 | other commercial damages or losses), even if such Contributor 182 | has been advised of the possibility of such damages. 183 | 184 | 9. Accepting Warranty or Additional Liability. While redistributing 185 | the Work or Derivative Works thereof, You may choose to offer, 186 | and charge a fee for, acceptance of support, warranty, indemnity, 187 | or other liability obligations and/or rights consistent with this 188 | License. However, in accepting such obligations, You may act only 189 | on Your own behalf and on Your sole responsibility, not on behalf 190 | of any other Contributor, and only if You agree to indemnify, 191 | defend, and hold each Contributor harmless for any liability 192 | incurred by, or claims asserted against, such Contributor by reason 193 | of your accepting any such warranty or additional liability. 194 | 195 | END OF TERMS AND CONDITIONS 196 | 197 | APPENDIX: How to apply the Apache License to your work. 198 | 199 | To apply the Apache License to your work, attach the following 200 | boilerplate notice, with the fields enclosed by brackets "[]" 201 | replaced with your own identifying information. (Don't include 202 | the brackets!) The text should be enclosed in the appropriate 203 | comment syntax for the file format. We also recommend that a 204 | file or class name and description of purpose be included on the 205 | same "printed page" as the copyright notice for easier 206 | identification within third-party archives. 207 | 208 | Copyright [yyyy] [name of copyright owner] 209 | 210 | Licensed under the Apache License, Version 2.0 (the "License"); 211 | you may not use this file except in compliance with the License. 212 | You may obtain a copy of the License at 213 | 214 | http://www.apache.org/licenses/LICENSE-2.0 215 | 216 | Unless required by applicable law or agreed to in writing, software 217 | distributed under the License is distributed on an "AS IS" BASIS, 218 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 219 | See the License for the specific language governing permissions and 220 | limitations under the License. 
221 | ```
222 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CodeBERT
2 | This repo provides the code for reproducing the experiments in [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/pdf/2002.08155.pdf). CodeBERT is a pre-trained model for programming and natural languages, a multi-programming-lingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
3 |
4 | ### Dependency
5 |
6 | - pip install torch
7 | - pip install transformers
8 |
9 | ### Quick Tour
10 | We use the huggingface/transformers framework to train the model. You can use our model just like the pre-trained RoBERTa base model. We now give an example of how to load the model.
11 | ```python
12 | import torch
13 | from transformers import RobertaTokenizer, RobertaConfig, RobertaModel
14 |
15 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
16 | tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
17 | model = RobertaModel.from_pretrained("microsoft/codebert-base")
18 | model.to(device)
19 | ```
20 |
21 | ### NL-PL Embeddings
22 |
23 | Here, we give an example of obtaining embeddings from CodeBERT.
24 |
25 | ```python
26 | >>> from transformers import AutoTokenizer, AutoModel
27 | >>> import torch
28 | >>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
29 | >>> model = AutoModel.from_pretrained("microsoft/codebert-base")
30 | >>> nl_tokens=tokenizer.tokenize("return maximum value")
31 | ['return', 'Ġmaximum', 'Ġvalue']
32 | >>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
33 | ['def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb']
34 | >>> tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]
35 | ['<s>', 'return', 'Ġmaximum', 'Ġvalue', '</s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', 'Ġif', 'Ġa', '>', 'b', ':', 'Ġreturn', 'Ġa', 'Ġelse', 'Ġreturn', 'Ġb', '</s>']
36 | >>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
37 | [0, 30921, 4532, 923, 2, 9232, 19220, 1640, 102, 6, 428, 3256, 114, 10, 15698, 428, 35, 671, 10, 1493, 671, 741, 2]
38 | >>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
39 | torch.Size([1, 23, 768])
40 | tensor([[-0.1423, 0.3766, 0.0443, ..., -0.2513, -0.3099, 0.3183],
41 |         [-0.5739, 0.1333, 0.2314, ..., -0.1240, -0.1219, 0.2033],
42 |         [-0.1579, 0.1335, 0.0291, ..., 0.2340, -0.8801, 0.6216],
43 |         ...,
44 |         [-0.4042, 0.2284, 0.5241, ..., -0.2046, -0.2419, 0.7031],
45 |         [-0.3894, 0.4603, 0.4797, ..., -0.3335, -0.6049, 0.4730],
46 |         [-0.1433, 0.3785, 0.0450, ..., -0.2527, -0.3121, 0.3207]],
47 |        grad_fn=)
48 | ```
49 |
50 |
51 | ### Probing
52 |
53 | As stated in the paper, CodeBERT is not suitable for the mask prediction task, while CodeBERT (MLM) is.
54 |
55 |
56 | We give an example of how to use CodeBERT (MLM) for the mask prediction task.
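Note that the input string must contain the tokenizer's `<mask>` token; the fill-mask pipeline predicts candidates for exactly that position.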
57 | ```python
58 | from transformers import RobertaConfig, RobertaTokenizer, RobertaForMaskedLM, pipeline
59 |
60 | model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
61 | tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
62 |
63 | CODE = "if (x is not None) <mask> (x>1)"
64 | fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
65 |
66 | outputs = fill_mask(CODE)
67 | print(outputs)
68 |
69 | ```
70 | Results
71 | ```python
72 | 'and', 'or', 'if', 'then', 'AND'
73 | ```
74 | The detailed outputs are as follows:
75 | ```python
76 | {'sequence': '<s> if (x is not None) and (x>1)</s>', 'score': 0.6049249172210693, 'token': 8}
77 | {'sequence': '<s> if (x is not None) or (x>1)</s>', 'score': 0.30680200457572937, 'token': 50}
78 | {'sequence': '<s> if (x is not None) if (x>1)</s>', 'score': 0.02133703976869583, 'token': 114}
79 | {'sequence': '<s> if (x is not None) then (x>1)</s>', 'score': 0.018607674166560173, 'token': 172}
80 | {'sequence': '<s> if (x is not None) AND (x>1)</s>', 'score': 0.007619690150022507, 'token': 4248}
81 | ```
82 |
83 | ### Downstream Tasks
84 |
85 | For the Code Search and Code Documentation Generation tasks, please refer to the [CodeBERT](https://github.com/guoday/CodeBERT/tree/master/CodeBERT) folder.
86 |
87 |
88 |
89 | # GraphCodeBERT
90 |
91 | This repo also provides the code for reproducing the experiments in [GraphCodeBERT: Pre-training Code Representations with Data Flow](https://openreview.net/pdf?id=jLoC4ez43PZ). GraphCodeBERT is a pre-trained model for programming languages that considers the inherent structure of code, i.e. data flow. It is a multi-programming-lingual model pre-trained on NL-PL pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
92 |
93 | For downstream tasks like code search, clone detection, code refinement, and code translation, please refer to the [GraphCodeBERT](https://github.com/guoday/CodeBERT/tree/master/GraphCodeBERT) folder.
94 |
95 | ## Contact
96 |
97 | Feel free to contact Daya Guo (guody5@mail2.sysu.edu.cn) and Duyu Tang (dutang@microsoft.com) if you have any further questions.
98 |
--------------------------------------------------------------------------------
/SECURITY.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | ## Security
4 |
5 | Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
6 |
7 | If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.
8 |
9 | ## Reporting Security Issues
10 |
11 | **Please do not report security vulnerabilities through public GitHub issues.**
12 |
13 | Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
14 |
15 | If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com).
If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
16 |
17 | You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
18 |
19 | Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
20 |
21 | * Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
22 | * Full paths of source file(s) related to the manifestation of the issue
23 | * The location of the affected source code (tag/branch/commit or direct URL)
24 | * Any special configuration required to reproduce the issue
25 | * Step-by-step instructions to reproduce the issue
26 | * Proof-of-concept or exploit code (if possible)
27 | * Impact of the issue, including how an attacker might exploit the issue
28 |
29 | This information will help us triage your report more quickly.
30 |
31 | If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
32 |
33 | ## Preferred Languages
34 |
35 | We prefer all communications to be in English.
36 |
37 | ## Policy
38 |
39 | Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
40 |
41 |
--------------------------------------------------------------------------------