├── hqcm
│   ├── __init__.py
│   ├── gen.py
│   ├── xdata.py
│   └── sft.py
├── assets
│   └── codefuse.jpg
├── requirements.txt
├── LEGAL.md
├── .gitignore
├── README.md
├── LICENSE
└── dataset
    └── README.md

--------------------------------------------------------------------------------
/hqcm/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/assets/codefuse.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codefuse-ai/codefuse-hqcm/master/assets/codefuse.jpg
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | python-dotenv==1.0.1
2 | torch==2.4.1
3 | datasets==3.1.0
4 | transformers==4.46.1
5 | accelerate==1.0.1
6 | peft==0.13.2
7 | bitsandbytes==0.44.1
8 | unidiff==0.7.5
9 | tensorboardX==2.6.2.2
10 | 
--------------------------------------------------------------------------------
/LEGAL.md:
--------------------------------------------------------------------------------
1 | Legal Disclaimer
2 | 
3 | Within this source code, the comments in Chinese shall be the original, governing version. Any comments in other languages are for reference only. In the event of any conflict between the Chinese language version comments and other language version comments, the Chinese language version shall prevail.
4 | 
5 | 法律免责声明
6 | 
7 | 关于代码注释部分,中文注释为官方版本,其它语言注释仅做参考。中文注释可能与其它语言注释存在不一致,当中文注释与其它语言注释存在不一致时,请以中文注释为准。
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Gradle files
2 | .gradle/
3 | build/
4 | 
5 | # Local configuration file (sdk path, etc)
6 | local.properties
7 | 
8 | # Log/OS Files
9 | *.log
10 | 
11 | # Android Studio generated files and folders
12 | captures/
13 | .externalNativeBuild/
14 | .cxx/
15 | *.apk
16 | output.json
17 | 
18 | # IntelliJ
19 | *.iml
20 | .idea/
21 | 
22 | # Keystore files
23 | *.jks
24 | *.keystore
25 | 
26 | # Google Services (e.g. APIs or Firebase)
27 | google-services.json
28 | 
29 | # Android Profiling
30 | *.hprof
31 | 
32 | .DS_Store
33 | 
34 | # Byte-compiled / optimized / DLL files
35 | __pycache__/
36 | *.py[cod]
37 | *$py.class
38 | 
39 | # C extensions
40 | *.so
41 | 
42 | # Distribution / packaging
43 | .Python
44 | build/
45 | develop-eggs/
46 | dist/
47 | downloads/
48 | eggs/
49 | .eggs/
50 | lib/
51 | lib64/
52 | parts/
53 | sdist/
54 | var/
55 | wheels/
56 | share/python-wheels/
57 | *.egg-info/
58 | .installed.cfg
59 | *.egg
60 | MANIFEST
61 | 
62 | # PyInstaller
63 | # Usually these files are written by a python script from a template
64 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
65 | *.manifest
66 | *.spec
67 | 
68 | # Installer logs
69 | pip-log.txt
70 | pip-delete-this-directory.txt
71 | 
72 | # Unit test / coverage reports
73 | htmlcov/
74 | .tox/
75 | .nox/
76 | .coverage
77 | .coverage.*
78 | .cache
79 | nosetests.xml
80 | coverage.xml
81 | *.cover
82 | *.py,cover
83 | .hypothesis/
84 | .pytest_cache/
85 | cover/
86 | 
87 | # Translations
88 | *.mo
89 | *.pot
90 | 
91 | # Django stuff:
92 | *.log
93 | local_settings.py
94 | db.sqlite3
95 | db.sqlite3-journal
96 | 
97 | # Flask stuff:
98 | instance/
99 | .webassets-cache
100 | 
101 | # Scrapy stuff:
102 | .scrapy
103 | 
104 | # Sphinx documentation
105 | docs/_build/
106 | 
107 | # PyBuilder
108 | .pybuilder/
109 | target/
110 | 
111 | # Jupyter Notebook
112 | .ipynb_checkpoints
113 | 
114 | # IPython
115 | profile_default/
116 | ipython_config.py
117 | 
118 | # pyenv
119 | # For a library or package, you might want to ignore these files since the code is
120 | # intended to run in multiple environments; otherwise, check them in:
121 | # .python-version
122 | 
123 | # pipenv
124 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
125 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
126 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
127 | # install all needed dependencies.
128 | #Pipfile.lock
129 | 
130 | # poetry
131 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
132 | # This is especially recommended for binary packages to ensure reproducibility, and is more
133 | # commonly ignored for libraries.
134 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
135 | #poetry.lock
136 | 
137 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
138 | __pypackages__/
139 | 
140 | # Celery stuff
141 | celerybeat-schedule
142 | celerybeat.pid
143 | 
144 | # SageMath parsed files
145 | *.sage.py
146 | 
147 | # Environments
148 | .env
149 | .venv
150 | env/
151 | venv/
152 | ENV/
153 | env.bak/
154 | venv.bak/
155 | 
156 | # Spyder project settings
157 | .spyderproject
158 | .spyproject
159 | 
160 | # Rope project settings
161 | .ropeproject
162 | 
163 | # mkdocs documentation
164 | /site
165 | 
166 | # mypy
167 | .mypy_cache/
168 | .dmypy.json
169 | dmypy.json
170 | 
171 | # Pyre type checker
172 | .pyre/
173 | 
174 | # pytype static type analyzer
175 | .pytype/
176 | 
177 | # Cython debug symbols
178 | cython_debug/
179 | 
180 | # PyCharm
181 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
182 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
183 | # and can be added to the global gitignore or merged into this file. For a more nuclear
184 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
185 | .idea/
186 | 
--------------------------------------------------------------------------------
/hqcm/gen.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pprint
3 | import time
4 | 
5 | from datasets import load_dataset
6 | from dotenv import load_dotenv
7 | from transformers import AutoTokenizer, AutoModelForCausalLM
8 | 
9 | load_dotenv()
10 | 
11 | 
12 | class SFTModel:
13 | 
14 |     def __init__(
15 |             self, model_id, peft_adapter_id=None,
16 |             max_response_length=512, max_prompt_length=4096,
17 |             temperature=0.8, top_k=50, top_p=0.95
18 |     ):
19 |         self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
20 |         self.model = AutoModelForCausalLM.from_pretrained(
21 |             model_id, device_map="auto", torch_dtype="auto",
22 |             trust_remote_code=True
23 |         )
24 |         if peft_adapter_id is not None:
25 |             self.model.load_adapter(peft_adapter_id)
26 |         self.max_prompt_length = max_prompt_length  # Keep the limit so query() can refer to it
27 |         self.generation_config = {
28 |             "max_new_tokens": max_response_length,
29 |             "num_return_sequences": 1,
30 |             "num_beams": 1
31 |         }
32 |         if abs(temperature) < 1e-8:  # temperature is zero
33 |             self.generation_config["do_sample"] = False
34 |         else:
35 |             self.generation_config["do_sample"] = True
36 |             self.generation_config["temperature"] = temperature
37 |             self.generation_config["top_k"] = top_k
38 |             self.generation_config["top_p"] = top_p
39 | 
40 |     def query(self, prompt):
41 |         # Move the inputs to whichever device the (possibly sharded) model lives on
42 |         inp_tk = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
43 |         # TODO: What if len(inp_tk) > self.max_prompt_length?
44 |         oup_tk = self.model.generate(**inp_tk, **self.generation_config).to("cpu")
45 |         answer = self.tokenizer.batch_decode(
46 |             oup_tk[:, inp_tk['input_ids'].shape[-1]:],  # Don't echo the preceding prompt in the answer
47 |             skip_special_tokens=True,  # Skip special tokens like EOS and PAD in the answer
48 |             clean_up_tokenization_spaces=False)[0]
49 |         return answer
50 | 
51 | 
52 | if __name__ == '__main__':
53 |     from argparse import ArgumentParser
54 |     from pathlib import Path
55 | 
56 |     parser = ArgumentParser(
57 |         prog="gen",
58 |         description="Call a supervised fine-tuned model to generate answers for the transformed HQCM dataset"
59 |     )
60 |     parser.add_argument(
61 |         'adapter',
62 |         type=Path,
63 |         help="Path to the adapter after supervised fine-tuning, or to the plain model if -M is specified."
64 |     )
65 |     parser.add_argument(
66 |         "-d", "--dataset",
67 |         required=True, type=str,
68 |         help="Path to the dataset for inference"
69 |     )
70 |     parser.add_argument(
71 |         "-c", "--config",
72 |         default=None, type=str,
73 |         help="Config name of the dataset"
74 |     )
75 |     parser.add_argument(
76 |         "-t", "--split",
77 |         default=None, type=str,
78 |         help="Split of the dataset; each item in the dataset should have a `prompt` and an `answer` field"
79 |     )
80 |     parser.add_argument(
81 |         '-o', '--output',
82 |         required=True, type=Path,
83 |         help="Path to the JSON file saving the inference results"
84 |     )
85 |     parser.add_argument(
86 |         '-T', '--temperature',
87 |         default=0.8, type=float,
88 |         help="Temperature for the model, controlling the randomness of generation"
89 |     )
90 |     parser.add_argument(
91 |         '-M', '--plain-model',
92 |         default=False, action='store_true',
93 |         help="Load the plain model directly (i.e., the positional argument points to a model, not an adapter)"
94 |     )
95 | 
96 |     args = parser.parse_args()
97 |     pprint.pprint(vars(args))
98 | 
99 |     # TODO: Directly load the model with the adapter id
100 |     if not args.plain_model:
101 |         adapter_config_path = args.adapter / 'adapter_config.json'
102 |         assert adapter_config_path.exists(), f'Adapter config does not exist under {adapter_config_path}'
103 |         with adapter_config_path.open('r') as fin:
104 |             adapter_config = json.load(fin)
105 |         model_path = adapter_config['base_model_name_or_path']
106 | 
107 |         model = SFTModel(
108 |             model_id=model_path,
109 |             peft_adapter_id=str(args.adapter.absolute()),
110 |             max_prompt_length=512,
111 |             temperature=args.temperature
112 |         )
113 |     else:
114 |         model_path = str(args.adapter)
115 |         model = SFTModel(
116 |             model_id=model_path,
117 |             max_prompt_length=512,
118 |             temperature=args.temperature
119 |         )
120 | 
121 |     results = []
122 |     for index, item in enumerate(load_dataset(args.dataset, args.config, split=args.split)):
123 |         prompt = item['prompt']
124 |         answer = item['answer']
125 | 
126 |         start_time_ms = time.time() * 1000
127 |         try:
128 |             model_answer = model.query(prompt)
129 |         except Exception as e:
130 |             model_answer = f'Generation Failed: {e}'
131 |         end_time_ms = time.time() * 1000
132 | 
133 |         elapsed_ms = end_time_ms - start_time_ms
134 | 
135 |         results.append({
136 |             'prompt': prompt,
137 |             'expected_answer': answer,
138 |             'actual_answer': model_answer,
139 |             'elapsed_ms': elapsed_ms
140 |         })
141 | 
142 |         print(f"#{index} (elapsed: {elapsed_ms}ms) {answer} : {model_answer}")
143 | 
144 |     with args.output.open("w", encoding='utf-8') as fou:
145 |         json.dump(results, fou, ensure_ascii=False, indent=2)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | <div align="center">
2 |   <img src="./assets/codefuse.jpg" alt="CodeFuse"/>
3 | </div>
4 | 
5 | <div align="center">
6 |   <h1>CodeFuse HQCM</h1>
7 |   <i>A Small-Scale yet High-Quality Dataset for Code Change Understanding</i>
8 | </div>
9 | 
10 | ## Overview
11 | 
12 | HQCM is a smaller yet higher-quality dataset for code change understanding.
13 | It is a carefully developed subset of the Java portion of the [MCMD](https://doi.org/10.1007/s10664-022-10219-1) dataset.
14 | Each entry in HQCM has been meticulously selected, reviewed, revised, and validated by crowdsourced developers.
15 | The creation of HQCM stems from the recognition that large language models (LLMs) are not silver bullets;
16 | there are scenarios where their application may be limited, for example:
17 | 
18 | 1. **Security Constraints**: In cases where data security is paramount, commercial LLMs are often prohibited to prevent potential data leaks, especially in industrial settings.
19 | 2. **Compute Constraints**: LLMs are often difficult to deploy in resource-constrained environments, such as laptops and mobile devices at the edge.
20 | 3. **Financial Constraints**: The high cost of premium LLM APIs makes their use infeasible for many applications without a sufficient budget.
21 | 4. **Customized Tasks**: The performance of LLMs, especially non-premium ones, can vary significantly across specialized or customized tasks.
22 | 
23 | In these contexts, HQCM aims to serve as training and testing data for SLMs (small language models), or as few-shot examples for LLMs in tasks involving code-change understanding.
24 | 
25 | HQCM comprises approximately 5,000 high-quality pairs of code changes and their corresponding summaries, where each code change is presented in the unified diff format, and the accompanying summary is a concise sentence available in both English and Chinese.
26 | Each entry in HQCM is classified into one of eight popular categories: *feat* (feature), *fix*, *refactor*, *cicd* (CI/CD), *build*, *test*, *docs* (documentation), and *style*.
27 | Additional categories such as *perf* (performance) and *chore* are planned for future inclusion.
28 | The distribution of these categories reflects their natural prevalence in the real world, with *refactor* being the most common and *style* and *cicd* being the least prevalent.
29 | 
30 | ## Installation
31 | 
32 | ```shell
33 | git clone https://github.com/codefuse-ai/codefuse-hqcm hqcm && cd hqcm
34 | python3 -m venv venv && source venv/bin/activate
35 | pip install -r requirements.txt
36 | ```
37 | 
38 | ## Task Adaptation
39 | 
40 | HQCM can be adapted for three change-related tasks:
41 | - **Change Summarization** (`chsum`) summarizes a code change (represented by a code diff) into a short sentence in natural language
42 | - **Change Classification** (`chcl`) classifies each pair of code change and summary into one of the categories
43 | - **Code Refinement** (`coderef`) refines a given piece of code based on a review comment to produce the refined code, as commonly done in the code review process
44 | 
45 | The following command transforms HQCM into its `chsum` (change-summarization) variant and saves the variant into `$CHSUM_VARIANT_PATH` for supervised fine-tuning:
46 | 
47 | ```shell
48 | export CHSUM_VARIANT_PATH='./dataset/chsum'
49 | python -m hqcm.xdata --task chsum --output $CHSUM_VARIANT_PATH ./dataset/
50 | ```
51 | 
52 | ## Fine-tuning SLMs
53 | 
54 | The adapted dataset can be used for supervised fine-tuning of SLMs, or as few-shot examples for LLMs.
55 | We provide scripts to fine-tune a HuggingFace model with LoRA on the transformed dataset.
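Each transformed variant is stored as plain JSON: `hqcm.xdata` writes a `train.json` and a `test.json`, each a list of `{"prompt": ..., "answer": ...}` objects. Below is a minimal sketch for inspecting the `chsum` variant produced above (the printed contents are illustrative):

```python
import json

# Load the transformed training split written by `python -m hqcm.xdata`
with open('./dataset/chsum/train.json') as fin:
    items = json.load(fin)

# Each item pairs a diff-bearing prompt with its reference summary
print(items[0]['prompt'])  # "Please generate a commit message for the following diff: ..."
print(items[0]['answer'])  # e.g., "Corrected parameter error in onDropFromCache() function call"
```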
56 | 
57 | The following command fine-tunes Llama2-7b for change summarization and saves the result into `$CHSUM_MODEL_PATH`, using HQCM's chsum variant in `$CHSUM_VARIANT_PATH`:
58 | 
59 | ```sh
60 | export CHSUM_MODEL_PATH='/path/to/chsum_model'
61 | python -m hqcm.sft \
62 |   --seed 0 \
63 |   --learning-rate '2e-4' \
64 |   --num-epochs 5 \
65 |   --batch-size 1 \
66 |   --micro-batch-size 1 \
67 |   --lora-rank 64 \
68 |   --lora-alpha 16 \
69 |   --lora-dropout 0.1 \
70 |   --quantization '-1' \
71 |   --dataset $CHSUM_VARIANT_PATH \
72 |   --split 'train' \
73 |   --max-length 512 \
74 |   --output $CHSUM_MODEL_PATH \
75 |   '/path/to/your/llama2-7b'
76 | ```
77 | 
78 | The following command leverages the fine-tuned model in `$CHSUM_MODEL_PATH` to generate summaries for changes in the test split of `$CHSUM_VARIANT_PATH`, with the results exported to `$CHSUM_RES_PATH`:
79 | 
80 | ```shell
81 | export CHSUM_RES_PATH='/path/to/chsum_results'
82 | python -m hqcm.gen \
83 |   --dataset $CHSUM_VARIANT_PATH \
84 |   --split 'test' \
85 |   --output $CHSUM_RES_PATH \
86 |   --temperature 0 \
87 |   $CHSUM_MODEL_PATH
88 | ```
89 | 
90 | The above scripts also support HQCM's chcl and coderef variants.
91 | 
92 | ## FAQs
93 | 
94 | Q: The fine-tuning and generation scripts get stuck when connecting to HuggingFace.
95 | A: HQCM's scripts assume an offline environment. Try disabling downloads by setting:
96 | 
97 | ```shell
98 | export HF_DATASETS_OFFLINE=1         # Disable HuggingFace's online access to datasets
99 | export TRANSFORMERS_OFFLINE=1        # Disable HuggingFace's online access to models
100 | export TOKENIZERS_PARALLELISM=false  # Disable the tokenizer's parallelism
101 | ```
102 | 
103 | Q: Does HQCM support other code-change-related tasks?
104 | A: HQCM is a code-change dataset. In theory, users can adapt it to any change-related task, but we have not experimented with this; doing so requires users to understand the task and reformat the dataset for their own usage. We expect promising results and welcome such adaptations.
105 | 
106 | ## Citation
107 | 
108 | HQCM was published at [ASE '24](https://dl.acm.org/doi/10.1145/3691620.3694999).
109 | If you find it helpful, please consider citing our paper:
110 | 
111 | ```txt
112 | @inproceedings{hqcm_ase24,
113 |   author = {Li, Cong and Xu, Zhaogui and Di, Peng and Wang, Dongxia and Li, Zheng and Zheng, Qian},
114 |   title = {Understanding Code Changes Practically with Small-Scale Language Models},
115 |   year = {2024},
116 |   isbn = {9798400712487},
117 |   publisher = {Association for Computing Machinery},
118 |   address = {New York, NY, USA},
119 |   url = {https://doi.org/10.1145/3691620.3694999},
120 |   doi = {10.1145/3691620.3694999},
121 |   booktitle = {Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering},
122 |   pages = {216–228},
123 |   numpages = {13},
124 |   keywords = {code change, code review, language model, LLM, SLM},
125 |   location = {Sacramento, CA, USA},
126 |   series = {ASE '24}
127 | }
128 | ```
129 | 
--------------------------------------------------------------------------------
/hqcm/xdata.py:
--------------------------------------------------------------------------------
1 | import functools
2 | import json
3 | 
4 | from unidiff import PatchSet
5 | 
6 | CHSUM_PROMPT_TEMPLATE = """\
7 | Please generate a commit message for the following diff:
8 | 
9 | ```diff
10 | {diff}
11 | ```
12 | """
13 | 
14 | CHCL_PROMPT_TEMPLATE = """\
15 | A git commit can typically be classified into specific categories by examining its code changes and commit message.
These categories include: 16 | 17 | - "style": Changes that solely improve the code's formatting and appearance without affecting functionality (e.g., adjusting whitespace, fixing indentation, cleaning up code formatting). 18 | - "docs": Updates or improvements to documentation, which may include inline code comments, README files, or any other type of documentation associated with the project. 19 | - "test": Modifications exclusively related to test code, like the addition of new tests or the correction and improvement of existing tests. 20 | - "build": Changes that affect the build system or tools (like Gulp, Broccoli, NPM) or alterations to external dependencies (e.g., library or package updates). 21 | - "cicd": Tweaks to configuration files or scripts used in Continuous Integration/Continuous Deployment (CI/CD) systems, such as Travis CI or CircleCI configurations. 22 | - "fix": Code amendments that focus on rectifying errors, fixing bugs, or patching security vulnerabilities. 23 | - "feat": Commits that introduce new features or capabilities to the project, such as new classes, functions, or methods. 24 | - "refactor": Changes that reorganize and clean up the codebase without modifying its external behavior or outputs, improving readability and maintainability. 25 | 26 | For a given git commit, we can inspect its code difference (diff) and the associated commit message to determine its type. Below is the diff for a specific git commit: 27 | 28 | ``` 29 | {diff} 30 | ``` 31 | 32 | Accompanying this code diff is its commit message: 33 | 34 | ``` 35 | {message} 36 | ``` 37 | 38 | Given this information, the git commit can be categorized as type: """ 39 | 40 | CHCL_FEWSHOT_PROMPT_TEMPLATE = """\ 41 | A git commit can typically be classified into specific categories by examining its code changes and commit message. These categories include: 42 | 43 | - "style": Changes that solely improve the code's formatting and appearance without affecting functionality (e.g., adjusting whitespace, fixing indentation, cleaning up code formatting). 44 | - "docs": Updates or improvements to documentation, which may include inline code comments, README files, or any other type of documentation associated with the project. 45 | - "test": Modifications exclusively related to test code, like the addition of new tests or the correction and improvement of existing tests. 46 | - "build": Changes that affect the build system or tools (like Gulp, Broccoli, NPM) or alterations to external dependencies (e.g., library or package updates). 47 | - "cicd": Tweaks to configuration files or scripts used in Continuous Integration/Continuous Deployment (CI/CD) systems, such as Travis CI or CircleCI configurations. 48 | - "fix": Code amendments that focus on rectifying errors, fixing bugs, or patching security vulnerabilities. 49 | - "feat": Commits that introduce new features or capabilities to the project, such as new classes, functions, or methods. 50 | - "refactor": Changes that reorganize and clean up the codebase without modifying its external behavior or outputs, improving readability and maintainability. 51 | 52 | For a given git commit, we can inspect its code difference (diff) and the associated commit message to determine its type. 
53 | 54 | Diff: ```diff 55 | diff --git a/util/src/com/intellij/util/containers/SLRUMap.java b/util/src/com/intellij/util/containers/SLRUMap.java 56 | index 7f3d09c..635dfab 100644 57 | --- a/util/src/com/intellij/util/containers/SLRUMap.java 58 | +++ b/util/src/com/intellij/util/containers/SLRUMap.java 59 | @@ -69,12 +69,12 @@ public class SLRUMap {{ 60 | public void put(K key, V value) {{ 61 | V oldValue = myProtectedQueue.remove(key); 62 | if (oldValue != null) {{ 63 | - onDropFromCache(key, value); 64 | + onDropFromCache(key, oldValue); 65 | }} 66 | 67 | oldValue = myProbationalQueue.put(getStableKey(key), value); 68 | if (oldValue != null) {{ 69 | - onDropFromCache(key, value); 70 | + onDropFromCache(key, oldValue); 71 | }} 72 | }} 73 | ``` 74 | Message: Corrected parameter error in onDropFromCache() function call 75 | Type: fix 76 | Reason: The git commit is a "fix" commit as it rectified a parameter error where `oldValue` should be passed as the argument of `onDropFromCache` rather than `value`. 77 | 78 | Diff: ```diff 79 | {diff} 80 | ``` 81 | Message: {message} 82 | Type: """ 83 | 84 | CODEREF_PROMPT_TEMPLATE = """\ 85 | // Please optimize the given "Code (to optimize)" (a portion of some "File"s) by strictly following the given "Suggestion". 86 | // Your optimization can involve editing or removing existing code, or adding new code. 87 | // You may pay more attention to lines marked by "// !!attention". If no such marked lines, do everything by yourself. 88 | 89 | // Code (to optimize): 90 | 91 | {code_to_opt} 92 | 93 | // Suggestion: {suggestion} 94 | 95 | // Code (after optimization): 96 | 97 | """ 98 | 99 | CODEREF_FEWSHOT_PROMPT_TEMPLATE = """\ 100 | // Please optimize the given "Code (to optimize)" (a portion of some "File"s) by strictly following the given "Suggestion". 101 | // Your optimization can involve editing or removing existing code, or adding new code. 102 | // You may pay more attention to lines marked by "// !!attention". If no such marked lines, do everything by yourself. 
103 | 104 | ------- 105 | 106 | // Code (to optimize): 107 | 108 | //// File: util/src/com/intellij/util/containers/SLRUMap.java 109 | 69 | public void put(K key, V value) {{ 110 | 70 | V oldValue = myProtectedQueue.remove(key); 111 | 71 | if (oldValue != null) {{ 112 | 72 | onDropFromCache(key, value); // !!attention 113 | 73 | }} 114 | 74 | 115 | 75 | oldValue = myProbationalQueue.put(getStableKey(key), value); 116 | 76 | if (oldValue != null) {{ 117 | 77 | onDropFromCache(key, value); // !!attention 118 | 78 | }} 119 | 79 | }} 120 | 80 | 121 | 122 | // Suggestion: Corrected parameter error in onDropFromCache() function call 123 | 124 | // Code (after optimization): 125 | 126 | //// File: util/src/com/intellij/util/containers/SLRUMap.java 127 | 69 | public void put(K key, V value) {{ 128 | 70 | V oldValue = myProtectedQueue.remove(key); 129 | 71 | if (oldValue != null) {{ 130 | 72 | onDropFromCache(key, oldValue); 131 | 73 | }} 132 | 74 | 133 | 75 | oldValue = myProbationalQueue.put(getStableKey(key), value); 134 | 76 | if (oldValue != null) {{ 135 | 77 | onDropFromCache(key, oldValue); 136 | 78 | }} 137 | 79 | }} 138 | 80 | 139 | 140 | ------- 141 | 142 | // Code (to optimize): 143 | 144 | {code_to_opt} 145 | 146 | // Suggestion: {suggestion} 147 | 148 | // Code (after optimization): 149 | 150 | """ 151 | 152 | 153 | def xitem_coderef(item, fewshot=False): 154 | opt_suggestion = item['summaries']['en'] 155 | 156 | code_to_opt = [] 157 | code_after_opt = [] 158 | 159 | for patched_file in PatchSet(item['change']): 160 | file_name = patched_file.path 161 | source_text = "\n".join( 162 | # Add an attention to additionally inform the model 163 | f"{line.source_line_no} | {line.value.rstrip()} {'// !!attention' if line.is_removed else ''}" 164 | for hunk in patched_file 165 | for line in hunk.source_lines() 166 | ) 167 | target_text = "\n".join( 168 | f'{line.target_line_no} | {line.value.rstrip()}' 169 | for hunk in patched_file 170 | for line in hunk.target_lines() 171 | ) 172 | code_to_opt.append(f"""//// File: {file_name}\n{source_text}""") 173 | code_after_opt.append(f"""//// File: {file_name}\n{target_text}""") 174 | 175 | if fewshot: 176 | return { 177 | 'prompt': CODEREF_FEWSHOT_PROMPT_TEMPLATE.format( 178 | code_to_opt='\n'.join(code_to_opt), 179 | suggestion=opt_suggestion 180 | ), 181 | 'answer': '\n'.join(code_after_opt) 182 | } 183 | else: 184 | return { 185 | 'prompt': CODEREF_PROMPT_TEMPLATE.format( 186 | code_to_opt='\n'.join(code_to_opt), 187 | suggestion=opt_suggestion 188 | ), 189 | 'answer': '\n'.join(code_after_opt) 190 | } 191 | 192 | 193 | def xitem_chcl(item, fewshot=False): 194 | if fewshot: 195 | return { 196 | 'prompt': CHCL_FEWSHOT_PROMPT_TEMPLATE.format( 197 | diff=item['change'], 198 | message=item['summaries']['en'] 199 | ), 200 | 'answer': item['type'] 201 | } 202 | else: 203 | return { 204 | 'prompt': CHCL_PROMPT_TEMPLATE.format( 205 | diff=item['change'], 206 | message=item['summaries']['en'] 207 | ), 208 | 'answer': item['type'] 209 | } 210 | 211 | 212 | def xitem_chsum(item): 213 | return { 214 | 'prompt': CHSUM_PROMPT_TEMPLATE.format(diff=item['change']), 215 | 'answer': item['summaries']['en'] 216 | } 217 | 218 | 219 | def transform(in_dir, out_dir, xitem_fn): 220 | assert in_dir.is_dir(), f"Not a directory: {in_dir}" 221 | assert (in_dir / 'train.json').exists(), f"File train.json does not exist in: {in_dir}" 222 | assert (in_dir / 'test.json').exists(), f"File test.json does not exist in: {in_dir}" 223 | 224 | out_dir.mkdir(exist_ok=True) 225 | 226 | 
    for fname in ['train.json', 'test.json']:
227 |         with (in_dir / fname).open('r') as fin:
228 |             tx_data = [xitem_fn(item) for item in json.load(fin)]
229 |         with (out_dir / fname).open('w') as fou:
230 |             json.dump(tx_data, fou, ensure_ascii=False, indent=2)
231 | 
232 | 
233 | if __name__ == '__main__':
234 |     from argparse import ArgumentParser
235 |     from pathlib import Path
236 | 
237 |     parser = ArgumentParser(
238 |         prog="xdata",
239 |         description="Transform the HQCM dataset for fine-tuning of a specific code-change task"
240 |     )
241 |     parser.add_argument(
242 |         "dataset", type=Path,
243 |         help="Path to the directory saving the HQCM dataset before transformation"
244 |     )
245 |     parser.add_argument(
246 |         "-t", "--task",
247 |         required=True, choices=['chsum', 'chcl', 'coderef'],
248 |         help="Target tasks: chsum for change summarization, chcl for change classification, and coderef for code refinement"
249 |     )
250 |     parser.add_argument(
251 |         "-o", "--output",
252 |         required=True, type=Path,
253 |         help="Path to the directory to save the HQCM dataset after transforming for the task"
254 |     )
255 |     parser.add_argument(
256 |         "-F", "--few-shot",
257 |         default=False, action='store_true',
258 |         help="Transform the dataset into a variant usable for few-shot in-context learning"
259 |     )
260 |     args = parser.parse_args()
261 | 
262 |     if args.task == 'chsum':
263 |         xitem_fn = xitem_chsum
264 |     elif args.task == 'chcl':
265 |         xitem_fn = functools.partial(xitem_chcl, fewshot=args.few_shot)
266 |     elif args.task == 'coderef':
267 |         xitem_fn = functools.partial(xitem_coderef, fewshot=args.few_shot)
268 |     else:
269 |         assert False, "Unsupported code change task: " + args.task
270 | 
271 |     transform(args.dataset, args.output, xitem_fn=xitem_fn)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 | 
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 | 
7 | 1. Definitions.
8 | 
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 | 
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 | 
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 | 
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 | 
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 | 
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
-------------------------------------------------------------------------------- /hqcm/sft.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pprint 3 | import time 4 | from argparse import ArgumentParser 5 | 6 | import torch 7 | from datasets import load_dataset 8 | from dotenv import load_dotenv 9 | from peft import LoraConfig, TaskType, get_peft_model 10 | from transformers import AutoTokenizer, DataCollatorForSeq2Seq, set_seed, AutoModelForCausalLM, TrainingArguments, \ 11 | Trainer, BitsAndBytesConfig 12 | 13 | load_dotenv() 14 | 15 | parser = ArgumentParser( 16 | "sft", 17 | description="Supervised fine-tuning (sft) a HuggingFace model with bf16 precision for " 18 | "a specific code-change related task using the transformed HQCM dataset" 19 | ) 20 | parser.add_argument( 21 | "model", 22 | type=str, 23 | help="Path to the base model for sft" 24 | ) 25 | parser.add_argument( 26 | "-d", "--dataset", 27 | required=True, type=str, 28 | help="Path to the dataset for base model's sft" 29 | ) 30 | parser.add_argument( 31 | "-c", "--config", 32 | default=None, type=str, 33 | help="Config name of the dataset" 34 | ) 35 | parser.add_argument( 36 | "-t", "--split", 37 | default=None, type=str, 38 | help="Split of the dataset; each item in the dataset should have a `prompt` and an `answer` field" 39 | ) 40 | parser.add_argument( 41 | "-p", "--percentage", 42 | default=100, type=int, 43 | help="Select only a percentage of the dataset for sft (e.g., 10 for 10%)" 44 | ) 45 | parser.add_argument( 46 | "-M", "--max-length", 47 | default=512, type=int, 48 | help="Max length of the text (prompt + answer) saved in the dataset" 49 | ) 50 | parser.add_argument( 51 | "-N", "--no-lora", 52 | default=False, action="store_true", 53 | help="Fine-tuning without LoRA" 54 | ) 55 | parser.add_argument( 56 | "-R", "--lora-rank", 57 | default=64, type=int, 58 | help="LoRA's rank" 59 | ) 60 | parser.add_argument( 61 | "-A", "--lora-alpha", 62 | default=16, type=int, 63 | help="LoRA's alpha" 64 | ) 65 | parser.add_argument( 66 | "-D", "--lora-dropout", 67 | default=0.1, type=float, 68 | help="LoRA's dropout" 69 | ) 70 | parser.add_argument( 71 | "-Q", "--quantization", 72 | default=-1, type=int, choices=[-1, 4, 8], 73 | help="Enable fine-tuning with k-bit quantization" 74 | ) 75 | parser.add_argument( 76 | "-a", "--learning-rate", 77 | default=2e-4, type=float, 78 | help="Learning rate (the larger, the fast)" 79 | ) 80 | parser.add_argument( 81 | "-e", "--num-epochs", 82 | default=2, type=int, 83 | help="Number of epochs to train" 84 | ) 85 | parser.add_argument( 86 | "-B", "--batch-size", 87 | default=64, type=int, 88 | help="Number of examples to feed for gradient updates (i.e., opt_per_device_batch_size * gradient_accumulation_steps)" 89 | ) 90 | parser.add_argument( 91 | "-b", "--micro-batch-size", 92 | default=8, type=int, 93 | help="Number of examples to feed per step (i.e., per_device_batch_size)" 94 | ) 95 | parser.add_argument( 96 | "-r", "--resume", 97 | default=False, action='store_true', 98 | help="Resume pretraining from the last checkpoint in the output directory" 99 | ) 100 | parser.add_argument( 101 | "-o", "--output", 102 | required=True, type=str, 103 | help="Path to saved the model after sft" 104 | ) 105 | parser.add_argument( 106 | "-s", "--seed", 107 | default=int(time.time()), type=int, 108 | help="Seed for controlling the randomness of the sft process" 109 | ) 110 | 111 | args = parser.parse_args() 112 | pprint.pprint(vars(args)) 113 | 
114 | arg_base_model = args.model
115 | 
116 | arg_dataset = args.dataset
117 | opt_dataset_config = args.config
118 | opt_dataset_split = args.split
119 | opt_dataset_max_seq_len = args.max_length
120 | opt_dataset_percentage = args.percentage
121 | 
122 | opt_lora = not args.no_lora
123 | opt_lora_rank = args.lora_rank
124 | opt_lora_alpha = args.lora_alpha
125 | opt_lora_dropout = args.lora_dropout
126 | 
127 | opt_quant_bit = args.quantization
128 | 
129 | opt_learning_rate = args.learning_rate
130 | opt_training_epochs = args.num_epochs
131 | opt_per_device_batch_size = args.micro_batch_size
132 | opt_gradient_accumulation_steps = args.batch_size // args.micro_batch_size
133 | 
134 | opt_resume_from_checkpoint = args.resume
135 | 
136 | arg_output_dir = args.output
137 | opt_output_save_steps = 1000
138 | opt_output_save_limit = 5
139 | opt_output_logging_steps = 10
140 | 
141 | opt_seed = args.seed
142 | opt_num_proc = os.cpu_count()
143 | 
144 | # Set the seed for reproduction
145 | set_seed(opt_seed)
146 | 
147 | # Write out the command for reproduction
148 | os.makedirs(arg_output_dir, exist_ok=True)
149 | with open(os.path.join(arg_output_dir, "command.txt"), 'w') as fp:
150 |     fp.write(pprint.pformat(vars(args)))
151 | 
152 | # Load the tokenizer and set the pad token
153 | tokenizer = AutoTokenizer.from_pretrained(arg_base_model, trust_remote_code=True)
154 | if tokenizer.pad_token_id is None:
155 |     tokenizer.pad_token_id = tokenizer.unk_token_id
156 | tokenizer.padding_side = "left"
157 | 
158 | # Create a DataCollator to help us pad each sequence to the maximum length in the batch while training.
159 | # Don't use DataCollatorForLanguageModeling, as it doesn't pad the labels.
160 | data_collator = DataCollatorForSeq2Seq(
161 |     tokenizer, pad_to_multiple_of=8, return_tensors="pt"
162 | )
163 | 
164 | 
165 | # Tokenize the dataset and hide the prompt from the model
166 | def tokenize_and_preprocess(example):
167 |     prompt, answer = example['prompt'], example['answer']
168 | 
169 |     tk_example = tokenizer(
170 |         prompt + answer,
171 |         max_length=opt_dataset_max_seq_len,
172 |         # Don't pad here, since each example is about to be padded by the DataCollator
173 |         padding=False, truncation=True
174 |     )
175 | 
176 |     # In case an EOS token is missing: the tokenizer is configured for
177 |     # generation tasks, so it automatically adds a BOS token to the head
178 |     # but does not append an EOS token, leaving room for further generation.
179 |     # So let's add one at the end of our prompt-answer pair.
180 |     if (tk_example['input_ids'][-1] != tokenizer.eos_token_id and
181 |             len(tk_example['input_ids']) < opt_dataset_max_seq_len):
182 |         tk_example['input_ids'].append(tokenizer.eos_token_id)
183 |         tk_example['attention_mask'].append(1)  # This EOS token should be attended to
184 | 
185 |     # Prepare the labels; they are exactly the input_ids
186 |     tk_example['labels'] = tk_example['input_ids'].copy()
187 | 
188 |     # Hide the prompt so that training does not compute cross-entropy for prompt tokens
189 |     # and the model only focuses on learning to generate the answer.
190 |     # Since the last token of the prompt might be part of the first token of the answer, skip it.
191 |     # For example, if prompt="A " and answer="a" then tk_prompt=["_A", "_"], tk_example=["_A", "_a"].
192 |     num_hidden_tokens = len(tokenizer(
193 |         prompt, max_length=opt_dataset_max_seq_len, padding=False, truncation=True
194 |     )['input_ids']) - 1
195 |     # label_pad_token_id (-100) is the magic value that PyTorch's cross-entropy loss ignores.
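    # (Illustrative) Continuing the example above: with tk_example = ["_A", "_a"]
    # and num_hidden_tokens = len(["_A", "_"]) - 1 = 1, the labels become
    # [-100, "_a"], so cross-entropy is computed only on the answer token.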
196 |     tk_example['labels'][:num_hidden_tokens] = [data_collator.label_pad_token_id] * num_hidden_tokens
197 | 
198 |     return tk_example
199 | 
200 | 
201 | # Load the dataset for supervised fine-tuning
202 | raw_dataset = load_dataset(arg_dataset, opt_dataset_config, split=opt_dataset_split)
203 | if 0 < opt_dataset_percentage < 100:
204 |     num_selected = int(len(raw_dataset) * (opt_dataset_percentage / 100))
205 |     raw_dataset = raw_dataset.shuffle().select(range(num_selected))
206 | dataset = raw_dataset.map(
207 |     tokenize_and_preprocess,
208 |     remove_columns=raw_dataset.column_names,
209 |     num_proc=opt_num_proc,
210 |     load_from_cache_file=True
211 | )
212 | 
213 | # Enable quantization if needed
214 | if opt_quant_bit == -1:
215 |     quantization_config = None
216 | elif opt_quant_bit == 4:
217 |     quantization_config = BitsAndBytesConfig(
218 |         # According to https://huggingface.co/blog/4bit-transformers-bitsandbytes:
219 |         # A rule of thumb is: use double quant if you have problems with memory,
220 |         # use NF4 for higher precision, and use a 16-bit dtype for faster finetuning.
221 |         load_in_4bit=True,
222 |         bnb_4bit_quant_type="nf4",
223 |         bnb_4bit_use_double_quant=True,
224 |         bnb_4bit_compute_dtype=torch.bfloat16
225 |     )
226 | elif opt_quant_bit == 8:
227 |     quantization_config = BitsAndBytesConfig(
228 |         # TODO: Enable more 8bit options
229 |         load_in_8bit=True
230 |     )
231 | else:
232 |     assert False, f"Unsupported quantization bits: {opt_quant_bit}"
233 | 
234 | # Load the base model for supervised fine-tuning
235 | base_model = AutoModelForCausalLM.from_pretrained(
236 |     arg_base_model,
237 |     device_map='auto',  # Let the accelerate module decide
238 |     # Let the base model decide, i.e., use 'torch_dtype' in config.json.
239 |     # If this is not set and the default value is used, the model will be loaded as float32.
240 |     torch_dtype='auto',
241 |     # Enable quantization during supervised fine-tuning
242 |     quantization_config=quantization_config,
243 |     trust_remote_code=True
244 | )
245 | 
246 | # Create a LoRA adapter with the PEFT module
247 | if opt_lora:
248 |     lora_adapter = LoraConfig(
249 |         r=opt_lora_rank,
250 |         lora_alpha=opt_lora_alpha,
251 |         lora_dropout=opt_lora_dropout,
252 |         # APIs like base_model.modules() or base_model.named_modules() can help.
253 |         # Each module can be a full module name, a suffix of the name, or a regex.
254 |         # By default, PEFT/LoRA treats each of the following as a suffix of a module
255 |         # name and checks it against all modules in the model by creating a regex r'.*\.{target_module}$'.
256 |         # If not specified, default values are used depending on the base model:
257 |         # for ChatGLM-series models, the default is ['query_key_value'];
258 |         # for Llama-series models, the default is ['q_proj', 'k_proj', 'v_proj'].
259 |         # Check: TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING in site-packages/peft/utils/others.py.
260 |         # target_modules=['q_proj', 'k_proj', 'v_proj'],  # TODO: support customizing LoRA's target modules
261 |         inference_mode=False,
262 |         bias="none",
263 |         task_type=TaskType.CAUSAL_LM
264 |     )
265 |     # Wrap the base model only when LoRA is enabled; get_peft_model
266 |     # requires a valid PeftConfig and would fail with None for --no-lora.
267 |     base_model = get_peft_model(base_model, lora_adapter)
268 | 
269 | # Prepare the arguments for supervised fine-tuning
270 | training_args = TrainingArguments(
271 |     output_dir=arg_output_dir,
272 |     overwrite_output_dir=False,
273 | 
274 |     # logging arguments
275 |     report_to="tensorboard",
276 |     logging_steps=opt_output_logging_steps,
277 |     logging_first_step=True,
278 |     logging_dir=arg_output_dir,
279 | 
280 |     # saving arguments
281 |     save_strategy="steps",
282 |     save_steps=opt_output_save_steps,
283 |     save_total_limit=opt_output_save_limit,
284 | 
285 |     # learning arguments
286 |     learning_rate=opt_learning_rate,
287 |     num_train_epochs=opt_training_epochs,
288 |     per_device_train_batch_size=opt_per_device_batch_size,
289 |     gradient_accumulation_steps=opt_gradient_accumulation_steps,
290 |     # optim=None,
291 |     weight_decay=0.01,  # This adds an additional regularization term to AdamW to avoid overfitting
292 |     warmup_ratio=0.01,  # This is for the learning rate scheduler
293 |     # load_best_model_at_end=True,
294 | 
295 |     # model arguments
296 |     bf16=True,  # Enforce bf16 precision
297 | 
298 |     # data loading arguments
299 |     dataloader_drop_last=False,
300 |     dataloader_num_workers=opt_num_proc,
301 | )
302 | 
303 | # Define the trainer for supervised fine-tuning
304 | trainer = Trainer(
305 |     model=base_model,
306 |     args=training_args,
307 |     train_dataset=dataset,
308 |     # TODO: support on-the-fly evaluation
309 |     # eval_dataset=eval_dataset,
310 |     # compute_metrics=compute_metrics,
311 |     data_collator=data_collator
312 | )
313 | 
314 | # Disable the KV cache during training; it is only useful for generation
315 | base_model.config.use_cache = False
316 | 
317 | # Start supervised fine-tuning
318 | trainer.train(resume_from_checkpoint=True if opt_resume_from_checkpoint else None)
319 | 
320 | # Save the model
321 | trainer.save_model(arg_output_dir)
322 | 
323 | print("----------")
324 | print(f"Supervised fine-tuning finished; the model was saved to {arg_output_dir}")
--------------------------------------------------------------------------------
/dataset/README.md:
--------------------------------------------------------------------------------
1 | # README
2 | 
3 | ## Top-Level Categories (Type)
4 | 
5 | > The top-level categories follow the mainstream Conventional Commits specification, mainly referencing the Angular convention.
6 | 
7 | | Category | Description |
8 | | --------- | ----------- |
9 | | style | Changes that only affect code style, such as adding or removing whitespace, adjusting indentation, or reformatting code |
10 | | docs | Code changes related to documentation |
11 | | test | Adding new tests, or removing and correcting existing tests |
12 | | build | Code changes affecting the build system (e.g., Gulp, Broccoli, NPM) or external dependencies |
13 | | cicd | Code changes to configuration files or scripts of CI/CD systems (e.g., Travis, Circle) |
14 | | fix | Fixing or preventing code defects, vulnerabilities, etc. |
15 | | feat | Adding new functionality to the code |
16 | | refactor | Code changes related to refactoring |
17 | | ~~perf~~ | ~~Code changes affecting performance~~ |
18 | | ~~chore~~ | ~~All other code changes not covered by the types above~~ |
19 | 
20 | > Note: Following git commit best practices, each commit should be minimal, so every commit corresponds to exactly one of the top-level types above. In reality, however, developers often mix several kinds of code modifications in one commit; in that case, we recommend considering only the dominant modification in the commit and matching it against the top-level categories in the order listed above.
21 | 
22 | ## Second-Level Categories v0.1.0 (Subtype)
23 | 
24 | > The following classification mainly references the categorization of code review activities in "EMSE20/An empirical investigation of relevant changes and automation needs in modern code review".
25 | 
26 | ### feat
27 | 
28 | | Category | Description |
29 | | ---- | ----------- |
30 | | E1 | Function checks: e.g., a function call's return value needs to be checked for validity and errors |
31 | | E2 | Variable checks: e.g., a variable needs to be checked |
32 | | E3 | User-input checks: e.g., user input needs validation |
33 | | Ex | New features: other new functionality |
34 | 
35 | ### fix
36 | 
37 | | Category | Description |
38 | | ---- | ----------- |
39 | | I1 | Function calls: e.g., incorrect or missing calls to the system or a library |
40 | | I2 | Parameters: incorrect or missing parameters used in function calls or other interactions |
41 | | I3 | Comparisons: errors in comparison statements |
42 | | I4 | Computation: a computation produces incorrect results |
43 | | I5 | Wrong timing: the correct operation is performed, but too early or too late |
44 | | I6 | Algorithm/performance: an inefficient algorithm is used |
45 | | I7 | Variable initialization: a variable is used before being initialized; an uninitialized variable may contain any value, and using such a variable produces errors |
46 | | I8 | Memory management: e.g., errors in handling system memory, explicit GC invocations |
47 | | I9 | Data and resource manipulation: defects related to manipulating or using data or other resources |
48 | | I10 | Security: issues related to the security of the application/software |
49 | | I11 | Concurrency: issues related to concurrency |
50 | | I12 | Completeness: partially implemented functionality |
51 | | I13 | GUI: defects in user-interface code, involving the consistency of the UI and the options offered to the user in each situation |
52 | | I14 | Checking external code/ripple effects: application code outside the scope of the review needs to be checked, since it may contain code that becomes incorrect given the current review |
53 | | I15 | Syntax: fixing syntax errors |
54 | | I16 | Spelling: fixing spelling mistakes |
55 | | Ix | Other: other code changes related to fixing issues |
56 | 
57 | ### refactor
58 | 
59 | | Category | Description |
60 | | ---- | ----------- |
61 | | R1 | Semantic duplication: code structures with similar intent but syntactically different implementations |
62 | | R2 | Semantically dead code: code fragments that execute but serve no meaningful purpose or have no effect on the result |
63 | | R3 | Changing functions: e.g., changing a call from an old or deprecated function to another function |
64 | | R4 | Standard coding conventions: e.g., using exceptions instead of return values for error messages, constants instead of magic numbers, built-in data structures instead of custom implementations, etc. |
65 | | R5 | Strings (wording): problems with poorly composed string content |
66 | | R6 | Logging: adding or removing a method's ability to log results or errors |
67 | | R7 | Imports: problems with incorrect, missing, or unused import statements |
68 | | R8 | Moving functionality: e.g., moving classes, functions, parts of functions, or other functional elements to a different class, file, or module |
69 | | R9 | Long subroutines/complex code/simplification: splitting long and complex functions into several functions, or reorganizing or rewriting the implementation to make it easier to understand |
70 | | R10 | Duplicate/redundant/dead code: removing duplicated code, useless code, or code that is never reached and executed |
71 | | R11 | Consistency: keeping the code consistent, with similar code elements operating in similar, more or less symmetric ways; e.g., similar tasks should have similar implementations in similar classes |
72 | | R12 | Architectural changes: code review often leads to changes to the system architecture, e.g., splitting an interface into two different interfaces, introducing an abstraction, or introducing a design pattern |
73 | | R13 | Renaming: renaming classes, methods, fields, etc. |
74 | | Rx | Other: other refactoring-related code changes |
75 | 
76 | ### style
77 | 
78 | | Category | Description |
79 | | ---- | ----------- |
80 | | S1 | Brackets and braces: e.g., only one statement following a conditional branch |
81 | | S2 | Indentation: e.g., consistent code indentation, removing or adding indentation |
82 | | S3 | Blank lines: too many or too few blank lines, or incorrect line splitting |
83 | | S4 | Overlong lines: code statements that are too long, exceeding a certain number of characters |
84 | | S5 | Whitespace usage: how spaces are used in the code |
85 | | Sx | Other: other code changes related to code style |
86 | 
87 | ### test
88 | 
89 | | Category | Description |
90 | | ---- | ----------- |
91 | | T1 | Tests: test-related code changes |
92 | 
93 | ### docs
94 | 
95 | | Category | Description |
96 | | ---- | ----------- |
97 | | D1 | Naming: e.g., problems in documentation with names of software elements (methods, classes, variables, etc.) that do not comply with the project's naming policy |
98 | | D2 | Comments: comment-related changes, e.g., misplaced comments, adding new comments, modifying existing comments, removing comments, missing or incorrect Javadoc, etc. (including TODO and FIXME) |
99 | | D3 | License headers: missing or incorrect license headers in source code files |
100 | | D4 | Typos: spelling mistakes in documentation |
101 | | Dx | Other: other documentation-related code changes |
102 | 
103 | ### build
104 | 
105 | | Category | Description |
106 | | ---- | ----------- |
107 | | B1 | Build: code changes related to the build system |
108 | 
109 | ### cicd
110 | 
111 | | Category | Description |
112 | | ---- | ----------- |
113 | | C1 | CI/CD configuration: changes to configuration files for the setup of continuous integration or continuous delivery pipelines |
114 | | C2 | Automated static analysis tool configuration: configuration changes for the linters, checkers, and recommenders used in the project (e.g., Checkstyle, PMD, FindBugs) |
115 | | C3 | Language- or framework-specific: changes to files specific to the programming language in use, e.g., Java's MANIFEST file |
116 | | C4 | Runtime configuration: e.g., docker configuration, ansible playbooks, delivery configuration |
117 | | Cx | Other: other CI/CD-related code changes |
118 | 
119 | ## Second-Level Categories v0.2.0 (Subtype)
120 | 
121 | This version optimizes v0.1.0: rarely used and unreasonable second-level categories were removed, some categories were merged, some missing categories were added, and finally GPT-4 was asked to evaluate and refine the taxonomy.
122 | 
123 | > Now I want to classify the commit messages of users' git commits. My taxonomy needs to satisfy most users, i.e., I need categories that most users will use; categories needed by only a small fraction of users all go under "Other".
124 | >
125 | > To this end, I referenced the Angular project's classification scheme and divided the top level into eight categories: feat (Feature), fix, refactor, style, test, build, cicd, docs.
126 | >
127 | > I also need to generate, for each top-level category, some second-level categories that most users will use. Below are my second-level categories; please review whether this classification is reasonable, and tell me which second-level categories are rarely used and which commonly used ones I am missing:
128 | 
129 | ### feat
130 | 
131 | | Category | Description |
132 | | ---- | ----------- |
133 | | E1 | Elements: adding new software elements, such as modules, classes, methods, or functions, for new functionality |
134 | | E2 | Logging: adding the ability for a function to log its results or errors |
135 | | E3 | Performance: replacing old algorithms, data structures, etc. with more performant ones |
136 | | E4 | Security: security-related feature hardening, such as strengthening authentication, authorization, or encryption |
137 | | Ex | Other: other feat-related code changes |
138 | 
139 | ### fix
140 | 
141 | | Category | Description |
142 | | ---- | ----------- |
143 | | I1 | Syntax: fixing syntax errors in the code (e.g., adding missing brackets or semicolons) |
144 | | I2 | Spelling: fixing spelling mistakes in the code (e.g., correcting tyep to type) |
145 | | I3 | Checks: strengthening checks in the code to prevent/fix missing-check errors, e.g., strengthening user-input checks to prevent/fix XSS, adding null checks to prevent/fix null-pointer/null-reference errors, adding checks of function return values, etc. |
146 | | I4 | Initialization: fixing uninitialized variables, fields, etc. |
147 | | I5 | Concurrency: fixing concurrency-related issues such as data races |
148 | | I6 | Memory: fixing memory-related issues such as memory allocation, memory leaks, and out-of-memory errors |
149 | | Ix | Other: other fix-related code changes |
150 | 
151 | ### refactor
152 | 
153 | | Category | Description |
154 | | ---- | ----------- |
155 | | R1 | Code extraction: extracting similar, semantically duplicated code fragments (e.g., into shared classes, functions, or methods) to reduce duplication, or replacing existing similar fragments with an existing shared method |
156 | | R2 | Dead-code removal: removing ineffective code fragments (those that serve no meaningful purpose or have no effect on the result) |
157 | | R3 | Code movement: e.g., moving code (classes, functions, methods), in whole or in part, to another location (a different class, file, or module) |
158 | | R4 | Code simplification: splitting long and complex classes, functions, etc. into several functions, or reorganizing or rewriting the implementation to make it easier to understand |
159 | | R5 | Renaming: renaming classes, methods, fields, etc. |
160 | | R6 | Code consistency: modifying code names, implementations, etc. to stay consistent with other similar code |
161 | | R7 | Replacing old with new: deprecating some code (e.g., marking it @deprecated) in favor of new code |
162 | | R8 | Import changes: removing incorrect or unused import statements, or adding missing ones |
163 | | Rx | Other: other refactoring-related code changes |
164 | 
165 | ### style
166 | 
167 | | Category | Description |
168 | | ---- | ----------- |
169 | | S1 | Brackets: changing the alignment of brackets (parentheses, square brackets, braces, etc.) |
170 | | S2 | Indentation: adding or removing indentation |
171 | | S3 | Blank lines: adding or removing blank lines |
172 | | S4 | Line splitting: splitting overlong code lines or correcting how code lines are split |
173 | | S5 | Spaces: adding or removing spaces in the code |
174 | | Sx | Other: other style-related code changes |
175 | | Sx | Style: code changes that relate to code style without affecting the code's actual behavior, such as changing the alignment of brackets (parentheses, square brackets, braces, etc.), adding or removing indentation, blank lines, or spaces, and splitting overlong code lines or correcting how they are split |
176 | 
177 | ### test
178 | 
179 | | Category | Description |
180 | | ---- | ----------- |
181 | | T1 | Adding tests: adding one or more tests |
182 | | T2 | Removing tests: removing existing tests or marking certain tests as no longer used |
183 | | T3 | Fixing tests: correcting existing tests |
184 | | Tx | Other: other test-related code changes |
185 | | Tx | Tests: code changes related to test code, such as adding tests, removing existing tests or marking some as no longer used, correcting the flow or assertions of existing tests, or adjusting code-coverage tooling |
186 | 
187 | ### docs
188 | 
189 | | Category | Description |
190 | | ---- | ----------- |
191 | | D1 | Naming: correcting names of software elements (methods, classes, variables) in documentation that are inconsistent with the code |
192 | | D2 | Comments: comment-related code changes, including TODO and FIXME (e.g., misplaced comments, adding new comments, modifying existing comments, removing comments, missing or incorrect Javadoc) |
193 | | D3 | Spelling: correcting spelling mistakes in documentation |
194 | | Dx | Other: other documentation-related code changes |
195 | 
196 | ### build
197 | 
198 | | Category | Description |
199 | | ---- | ----------- |
200 | | B1 | Dependencies: changes related to dependency management (e.g., upgrading, downgrading, removing, or adding dependencies) |
201 | | Bx | Other: other code changes related to the build system |
202 | 
203 | ### cicd
204 | 
205 | | Category | Description |
206 | | ---- | ----------- |
207 | | C1 | Automated analysis: configuration changes for automated analysis tools used in CI/CD pipelines, such as linters, checkers, and analyzers (e.g., Checkstyle, PMD, FindBugs) |
208 | | C2 | Images: changes related to image building and deployment (e.g., docker, k8s) |
209 | | C3 | Deployment: code changes related to project deployment, such as adding or removing staging/production deployment steps, or modifying deployment scripts |
210 | | Cx | Other: other code changes related to continuous integration and deployment |
211 | 
--------------------------------------------------------------------------------