├── hqcm
│   ├── __init__.py
│   ├── gen.py
│   ├── xdata.py
│   └── sft.py
├── assets
│   └── codefuse.jpg
├── requirements.txt
├── LEGAL.md
├── .gitignore
├── README.md
├── LICENSE
└── dataset
    └── README.md

--------------------------------------------------------------------------------
/hqcm/__init__.py:
--------------------------------------------------------------------------------
1 | 
--------------------------------------------------------------------------------
/assets/codefuse.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/codefuse-ai/codefuse-hqcm/master/assets/codefuse.jpg
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | python-dotenv==1.0.1
2 | torch==2.4.1
3 | datasets==3.1.0
4 | transformers==4.46.1
5 | accelerate==1.0.1
6 | peft==0.13.2
7 | bitsandbytes==0.44.1
8 | unidiff==0.7.5
9 | tensorboardX==2.6.2.2
10 | 
--------------------------------------------------------------------------------
/LEGAL.md:
--------------------------------------------------------------------------------
1 | Legal Disclaimer
2 | 
3 | Within this source code, the comments in Chinese shall be the original, governing version. Any comments in other languages are for reference only. In the event of any conflict between the Chinese language version comments and other language version comments, the Chinese language version shall prevail.
4 | 
5 | 法律免责声明
6 | 
7 | 关于代码注释部分,中文注释为官方版本,其它语言注释仅做参考。中文注释可能与其它语言注释存在不一致,当中文注释与其它语言注释存在不一致时,请以中文注释为准。
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Gradle files
2 | .gradle/
3 | build/
4 | 
5 | # Local configuration file (sdk path, etc)
6 | local.properties
7 | 
8 | # Log/OS Files
9 | *.log
10 | 
11 | # Android Studio generated files and folders
12 | captures/
13 | .externalNativeBuild/
14 | .cxx/
15 | *.apk
16 | output.json
17 | 
18 | # IntelliJ
19 | *.iml
20 | .idea/
21 | 
22 | # Keystore files
23 | *.jks
24 | *.keystore
25 | 
26 | # Google Services (e.g. APIs or Firebase)
27 | google-services.json
28 | 
29 | # Android Profiling
30 | *.hprof
31 | 
32 | .DS_Store
33 | 
34 | # Byte-compiled / optimized / DLL files
35 | __pycache__/
36 | *.py[cod]
37 | *$py.class
38 | 
39 | # C extensions
40 | *.so
41 | 
42 | # Distribution / packaging
43 | .Python
44 | build/
45 | develop-eggs/
46 | dist/
47 | downloads/
48 | eggs/
49 | .eggs/
50 | lib/
51 | lib64/
52 | parts/
53 | sdist/
54 | var/
55 | wheels/
56 | share/python-wheels/
57 | *.egg-info/
58 | .installed.cfg
59 | *.egg
60 | MANIFEST
61 | 
62 | # PyInstaller
63 | # Usually these files are written by a python script from a template
64 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
65 | *.manifest
66 | *.spec
67 | 
68 | # Installer logs
69 | pip-log.txt
70 | pip-delete-this-directory.txt
71 | 
72 | # Unit test / coverage reports
73 | htmlcov/
74 | .tox/
75 | .nox/
76 | .coverage
77 | .coverage.*
78 | .cache
79 | nosetests.xml
80 | coverage.xml
81 | *.cover
82 | *.py,cover
83 | .hypothesis/
84 | .pytest_cache/
85 | cover/
86 | 
87 | # Translations
88 | *.mo
89 | *.pot
90 | 
91 | # Django stuff:
92 | *.log
93 | local_settings.py
94 | db.sqlite3
95 | db.sqlite3-journal
96 | 
97 | # Flask stuff:
98 | instance/
99 | .webassets-cache
100 | 
101 | # Scrapy stuff:
102 | .scrapy
103 | 
104 | # Sphinx documentation
105 | docs/_build/
106 | 
107 | # PyBuilder
108 | .pybuilder/
109 | target/
110 | 
111 | # Jupyter Notebook
112 | .ipynb_checkpoints
113 | 
114 | # IPython
115 | profile_default/
116 | ipython_config.py
117 | 
118 | # pyenv
119 | # For a library or package, you might want to ignore these files since the code is
120 | # intended to run in multiple environments; otherwise, check them in:
121 | # .python-version
122 | 
123 | # pipenv
124 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
125 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
126 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
127 | # install all needed dependencies.
128 | #Pipfile.lock
129 | 
130 | # poetry
131 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
132 | # This is especially recommended for binary packages to ensure reproducibility, and is more
133 | # commonly ignored for libraries.
134 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
135 | #poetry.lock
136 | 
137 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
138 | __pypackages__/
139 | 
140 | # Celery stuff
141 | celerybeat-schedule
142 | celerybeat.pid
143 | 
144 | # SageMath parsed files
145 | *.sage.py
146 | 
147 | # Environments
148 | .env
149 | .venv
150 | env/
151 | venv/
152 | ENV/
153 | env.bak/
154 | venv.bak/
155 | 
156 | # Spyder project settings
157 | .spyderproject
158 | .spyproject
159 | 
160 | # Rope project settings
161 | .ropeproject
162 | 
163 | # mkdocs documentation
164 | /site
165 | 
166 | # mypy
167 | .mypy_cache/
168 | .dmypy.json
169 | dmypy.json
170 | 
171 | # Pyre type checker
172 | .pyre/
173 | 
174 | # pytype static type analyzer
175 | .pytype/
176 | 
177 | # Cython debug symbols
178 | cython_debug/
179 | 
180 | # PyCharm
181 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
182 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
183 | # and can be added to the global gitignore or merged into this file. For a more nuclear
184 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
185 | .idea/
186 | 
--------------------------------------------------------------------------------
/hqcm/gen.py:
--------------------------------------------------------------------------------
1 | import json
2 | import pprint
3 | import time
4 | 
5 | from datasets import load_dataset
6 | from dotenv import load_dotenv
7 | from transformers import AutoTokenizer, AutoModelForCausalLM
8 | 
9 | load_dotenv()
10 | 
11 | 
12 | class SFTModel:
13 | 
14 |     def __init__(
15 |             self, model_id, peft_adapter_id=None,
16 |             max_response_length=512, max_prompt_length=4096,
17 |             temperature=0.8, top_k=50, top_p=0.95
18 |     ):
19 |         self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
20 |         self.model = AutoModelForCausalLM.from_pretrained(
21 |             model_id, device_map="auto", torch_dtype="auto",
22 |             trust_remote_code=True
23 |         )
24 |         if peft_adapter_id is not None:
25 |             self.model.load_adapter(peft_adapter_id)
26 |         self.max_prompt_length = max_prompt_length  # Keep the limit so query() can refer to it
27 |         self.generation_config = {
28 |             "max_new_tokens": max_response_length,
29 |             "num_return_sequences": 1,
30 |             "num_beams": 1
31 |         }
32 |         if abs(temperature) < 1e-8:  # temperature is zero
33 |             self.generation_config["do_sample"] = False
34 |         else:
35 |             self.generation_config["do_sample"] = True
36 |             self.generation_config["temperature"] = temperature
37 |             self.generation_config["top_k"] = top_k
38 |             self.generation_config["top_p"] = top_p
39 | 
40 |     def query(self, prompt):
41 |         # Move the inputs to whichever device the (possibly sharded) model lives on
42 |         inp_tk = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
43 |         # TODO: What if len(inp_tk) > self.max_prompt_length?
44 |         oup_tk = self.model.generate(**inp_tk, **self.generation_config).to("cpu")
45 |         answer = self.tokenizer.batch_decode(
46 |             oup_tk[:, inp_tk['input_ids'].shape[-1]:],  # Don't echo the preceding prompt in the answer
47 |             skip_special_tokens=True,  # Skip special tokens like EOS and PAD in the answer
48 |             clean_up_tokenization_spaces=False)[0]
49 |         return answer
50 | 
51 | 
52 | if __name__ == '__main__':
53 |     from argparse import ArgumentParser
54 |     from pathlib import Path
55 | 
56 |     parser = ArgumentParser(
57 |         prog="gen",
58 |         description="Call a supervised fine-tuned model to generate answers for the transformed HQCM dataset"
59 |     )
60 |     parser.add_argument(
61 |         'adapter',
62 |         type=Path,
63 |         help="Path to the adapter after supervised fine-tuning, or to the plain model if -M is specified."
64 |     )
65 |     parser.add_argument(
66 |         "-d", "--dataset",
67 |         required=True, type=str,
68 |         help="Path to the dataset for inference"
69 |     )
70 |     parser.add_argument(
71 |         "-c", "--config",
72 |         default=None, type=str,
73 |         help="Config name of the dataset"
74 |     )
75 |     parser.add_argument(
76 |         "-t", "--split",
77 |         default=None, type=str,
78 |         help="Split of the dataset; each item in the dataset should have a `prompt` and an `answer` field"
79 |     )
80 |     parser.add_argument(
81 |         '-o', '--output',
82 |         required=True, type=Path,
83 |         help="Path to the JSON file saving the inference results"
84 |     )
85 |     parser.add_argument(
86 |         '-T', '--temperature',
87 |         default=0.8, type=float,
88 |         help="Temperature for the model, controlling the randomness of generation"
89 |     )
90 |     parser.add_argument(
91 |         '-M', '--plain-model',
92 |         default=False, action='store_true',
93 |         help="Load the plain model directly (i.e., the positional argument points to a model, not an adapter)"
94 |     )
95 | 
96 |     args = parser.parse_args()
97 |     pprint.pprint(vars(args))
98 | 
99 |     # TODO: Directly load the model with the adapter id
100 |     if not args.plain_model:
101 |         adapter_config_path = args.adapter / 'adapter_config.json'
102 |         assert adapter_config_path.exists(), f'Adapter config does not exist under {adapter_config_path}'
103 |         with adapter_config_path.open('r') as fin:
104 |             adapter_config = json.load(fin)
105 |         model_path = adapter_config['base_model_name_or_path']
106 | 
107 |         model = SFTModel(
108 |             model_id=model_path,
109 |             peft_adapter_id=str(args.adapter.absolute()),
110 |             max_prompt_length=512,
111 |             temperature=args.temperature
112 |         )
113 |     else:
114 |         model_path = str(args.adapter)
115 |         model = SFTModel(
116 |             model_id=model_path,
117 |             max_prompt_length=512,
118 |             temperature=args.temperature
119 |         )
120 | 
121 |     results = []
122 |     for index, item in enumerate(load_dataset(args.dataset, args.config, split=args.split)):
123 |         prompt = item['prompt']
124 |         answer = item['answer']
125 | 
126 |         start_time_ms = time.time() * 1000
127 |         try:
128 |             model_answer = model.query(prompt)
129 |         except Exception as e:
130 |             model_answer = f'Generation Failed: {e}'
131 |         end_time_ms = time.time() * 1000
132 | 
133 |         elapsed_ms = end_time_ms - start_time_ms
134 | 
135 |         results.append({
136 |             'prompt': prompt,
137 |             'expected_answer': answer,
138 |             'actual_answer': model_answer,
139 |             'elapsed_ms': elapsed_ms
140 |         })
141 | 
142 |         print(f"#{index} (elapsed: {elapsed_ms}ms) {answer} : {model_answer}")
143 | 
144 |     with args.output.open("w", encoding='utf-8') as fou:
145 |         json.dump(results, fou, ensure_ascii=False, indent=2)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | <div align="center">
2 |   <img src="./assets/codefuse.jpg" alt="CodeFuse"/>
3 | </div>
4 | 
5 | <div align="center">
6 |   <h1>CodeFuse HQCM</h1>
7 |   <i>A Small-Scale yet High-Quality Dataset for Code Change Understanding</i>
8 | </div>
9 | 
10 | ## Overview
11 | 
12 | HQCM is a smaller yet higher-quality dataset for code change understanding.
13 | It is a carefully developed subset of the Java portion of the [MCMD](https://doi.org/10.1007/s10664-022-10219-1) dataset.
14 | Each entry in HQCM has been meticulously selected, reviewed, revised, and validated by crowdsourced developers.
15 | The creation of HQCM stems from the recognition that large language models (LLMs) are not silver bullets;
16 | there are scenarios where their application may be limited, for example:
17 | 
18 | 1. **Security Constraints**: In cases where data security is paramount, commercial LLMs are often prohibited to prevent potential data leaks, especially in industrial settings.
19 | 2. **Compute Constraints**: LLMs are often difficult to deploy in resource-constrained environments, such as laptops and mobile devices at the edge.
20 | 3. **Financial Constraints**: The high cost of premium LLM APIs makes their use infeasible for many applications without a sufficient budget.
21 | 4. **Customized Tasks**: The performance of LLMs, especially non-premium ones, can vary significantly across specialized or customized tasks.
22 | 
23 | In these contexts, HQCM aims to serve as training and testing data for SLMs (small language models), or as few-shot examples for LLMs in tasks involving code-change understanding.
24 | 
25 | HQCM comprises approximately 5,000 high-quality pairs of code changes and their corresponding summaries, where each code change is presented in the unified diff format, and the accompanying summary is a concise sentence available in both English and Chinese.
26 | Each entry in HQCM is classified into one of eight popular categories: *feat* (feature), *fix*, *refactor*, *cicd* (CI/CD), *build*, *test*, *docs* (documentation), and *style*.
27 | Additional categories such as *perf* (performance) and *chore* are planned for future inclusion.
28 | The distribution of these categories reflects their natural prevalence in the real world, with *refactor* being the most common and *style* and *cicd* being the least prevalent.
29 | 
30 | ## Installation
31 | 
32 | ```shell
33 | git clone https://github.com/codefuse-ai/codefuse-hqcm hqcm && cd hqcm
34 | python3 -m venv venv && source venv/bin/activate
35 | pip install -r requirements.txt
36 | ```
37 | 
38 | ## Task Adaptation
39 | 
40 | HQCM can be adapted for three change-related tasks:
41 | - **Change Summarization** (`chsum`) summarizes a code change (represented by a code diff) into a short sentence in natural language
42 | - **Change Classification** (`chcl`) classifies each pair of code change and summary into one of the categories
43 | - **Code Refinement** (`coderef`) refines a given piece of code based on a review comment to produce the refined code, as commonly done in the code review process
44 | 
45 | The following command transforms HQCM into its `chsum` (change-summarization) variant and saves the variant into `$CHSUM_VARIANT_PATH` for supervised fine-tuning:
46 | 
47 | ```shell
48 | export CHSUM_VARIANT_PATH='./dataset/chsum'
49 | python -m hqcm.xdata --task chsum --output $CHSUM_VARIANT_PATH ./dataset/
50 | ```
51 | 
52 | ## Fine-tuning SLMs
53 | 
54 | The adapted dataset can be used for supervised fine-tuning of SLMs, or as few-shot examples for LLMs.
55 | We provide scripts to fine-tune a HuggingFace model with LoRA on the transformed dataset.
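Each transformed variant is stored as plain JSON: `hqcm.xdata` writes a `train.json` and a `test.json`, each a list of `{"prompt": ..., "answer": ...}` objects. Below is a minimal sketch for inspecting the `chsum` variant produced above (the printed contents are illustrative):

```python
import json

# Load the transformed training split written by `python -m hqcm.xdata`
with open('./dataset/chsum/train.json') as fin:
    items = json.load(fin)

# Each item pairs a diff-bearing prompt with its reference summary
print(items[0]['prompt'])  # "Please generate a commit message for the following diff: ..."
print(items[0]['answer'])  # e.g., "Corrected parameter error in onDropFromCache() function call"
```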
56 | 
57 | The following command fine-tunes Llama2-7b for change summarization and saves the result into `$CHSUM_MODEL_PATH`, using HQCM's chsum variant in `$CHSUM_VARIANT_PATH`:
58 | 
59 | ```sh
60 | export CHSUM_MODEL_PATH='/path/to/chsum_model'
61 | python -m hqcm.sft \
62 |   --seed 0 \
63 |   --learning-rate '2e-4' \
64 |   --num-epochs 5 \
65 |   --batch-size 1 \
66 |   --micro-batch-size 1 \
67 |   --lora-rank 64 \
68 |   --lora-alpha 16 \
69 |   --lora-dropout 0.1 \
70 |   --quantization '-1' \
71 |   --dataset $CHSUM_VARIANT_PATH \
72 |   --split 'train' \
73 |   --max-length 512 \
74 |   --output $CHSUM_MODEL_PATH \
75 |   '/path/to/your/llama2-7b'
76 | ```
77 | 
78 | The following command leverages the fine-tuned model in `$CHSUM_MODEL_PATH` to generate summaries for changes in the test split of `$CHSUM_VARIANT_PATH`, with the results exported to `$CHSUM_RES_PATH`:
79 | 
80 | ```shell
81 | export CHSUM_RES_PATH='/path/to/chsum_results'
82 | python -m hqcm.gen \
83 |   --dataset $CHSUM_VARIANT_PATH \
84 |   --split 'test' \
85 |   --output $CHSUM_RES_PATH \
86 |   --temperature 0 \
87 |   $CHSUM_MODEL_PATH
88 | ```
89 | 
90 | The above scripts also support HQCM's chcl and coderef variants.
91 | 
92 | ## FAQs
93 | 
94 | Q: The fine-tuning and generation scripts get stuck when connecting to HuggingFace.
95 | A: HQCM's scripts assume an offline environment. Try disabling downloads by setting:
96 | 
97 | ```shell
98 | export HF_DATASETS_OFFLINE=1         # Disable HuggingFace's online access to datasets
99 | export TRANSFORMERS_OFFLINE=1        # Disable HuggingFace's online access to models
100 | export TOKENIZERS_PARALLELISM=false  # Disable the tokenizer's parallelism
101 | ```
102 | 
103 | Q: Does HQCM support other code-change-related tasks?
104 | A: HQCM is a code-change dataset. In theory, users can adapt it to any change-related task, but we have not experimented with this; doing so requires users to understand the task and reformat the dataset for their own usage. We expect promising results and welcome such adaptations.
105 | 
106 | ## Citation
107 | 
108 | HQCM was published at [ASE '24](https://dl.acm.org/doi/10.1145/3691620.3694999).
109 | If you find it helpful, please consider citing our paper:
110 | 
111 | ```txt
112 | @inproceedings{hqcm_ase24,
113 |   author = {Li, Cong and Xu, Zhaogui and Di, Peng and Wang, Dongxia and Li, Zheng and Zheng, Qian},
114 |   title = {Understanding Code Changes Practically with Small-Scale Language Models},
115 |   year = {2024},
116 |   isbn = {9798400712487},
117 |   publisher = {Association for Computing Machinery},
118 |   address = {New York, NY, USA},
119 |   url = {https://doi.org/10.1145/3691620.3694999},
120 |   doi = {10.1145/3691620.3694999},
121 |   booktitle = {Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering},
122 |   pages = {216–228},
123 |   numpages = {13},
124 |   keywords = {code change, code review, language model, LLM, SLM},
125 |   location = {Sacramento, CA, USA},
126 |   series = {ASE '24}
127 | }
128 | ```
129 | 
--------------------------------------------------------------------------------
/hqcm/xdata.py:
--------------------------------------------------------------------------------
1 | import functools
2 | import json
3 | 
4 | from unidiff import PatchSet
5 | 
6 | CHSUM_PROMPT_TEMPLATE = """\
7 | Please generate a commit message for the following diff:
8 | 
9 | ```diff
10 | {diff}
11 | ```
12 | """
13 | 
14 | CHCL_PROMPT_TEMPLATE = """\
15 | A git commit can typically be classified into specific categories by examining its code changes and commit message.
These categories include: 16 | 17 | - "style": Changes that solely improve the code's formatting and appearance without affecting functionality (e.g., adjusting whitespace, fixing indentation, cleaning up code formatting). 18 | - "docs": Updates or improvements to documentation, which may include inline code comments, README files, or any other type of documentation associated with the project. 19 | - "test": Modifications exclusively related to test code, like the addition of new tests or the correction and improvement of existing tests. 20 | - "build": Changes that affect the build system or tools (like Gulp, Broccoli, NPM) or alterations to external dependencies (e.g., library or package updates). 21 | - "cicd": Tweaks to configuration files or scripts used in Continuous Integration/Continuous Deployment (CI/CD) systems, such as Travis CI or CircleCI configurations. 22 | - "fix": Code amendments that focus on rectifying errors, fixing bugs, or patching security vulnerabilities. 23 | - "feat": Commits that introduce new features or capabilities to the project, such as new classes, functions, or methods. 24 | - "refactor": Changes that reorganize and clean up the codebase without modifying its external behavior or outputs, improving readability and maintainability. 25 | 26 | For a given git commit, we can inspect its code difference (diff) and the associated commit message to determine its type. Below is the diff for a specific git commit: 27 | 28 | ``` 29 | {diff} 30 | ``` 31 | 32 | Accompanying this code diff is its commit message: 33 | 34 | ``` 35 | {message} 36 | ``` 37 | 38 | Given this information, the git commit can be categorized as type: """ 39 | 40 | CHCL_FEWSHOT_PROMPT_TEMPLATE = """\ 41 | A git commit can typically be classified into specific categories by examining its code changes and commit message. These categories include: 42 | 43 | - "style": Changes that solely improve the code's formatting and appearance without affecting functionality (e.g., adjusting whitespace, fixing indentation, cleaning up code formatting). 44 | - "docs": Updates or improvements to documentation, which may include inline code comments, README files, or any other type of documentation associated with the project. 45 | - "test": Modifications exclusively related to test code, like the addition of new tests or the correction and improvement of existing tests. 46 | - "build": Changes that affect the build system or tools (like Gulp, Broccoli, NPM) or alterations to external dependencies (e.g., library or package updates). 47 | - "cicd": Tweaks to configuration files or scripts used in Continuous Integration/Continuous Deployment (CI/CD) systems, such as Travis CI or CircleCI configurations. 48 | - "fix": Code amendments that focus on rectifying errors, fixing bugs, or patching security vulnerabilities. 49 | - "feat": Commits that introduce new features or capabilities to the project, such as new classes, functions, or methods. 50 | - "refactor": Changes that reorganize and clean up the codebase without modifying its external behavior or outputs, improving readability and maintainability. 51 | 52 | For a given git commit, we can inspect its code difference (diff) and the associated commit message to determine its type. 
53 | 54 | Diff: ```diff 55 | diff --git a/util/src/com/intellij/util/containers/SLRUMap.java b/util/src/com/intellij/util/containers/SLRUMap.java 56 | index 7f3d09c..635dfab 100644 57 | --- a/util/src/com/intellij/util/containers/SLRUMap.java 58 | +++ b/util/src/com/intellij/util/containers/SLRUMap.java 59 | @@ -69,12 +69,12 @@ public class SLRUMap {{ 60 | public void put(K key, V value) {{ 61 | V oldValue = myProtectedQueue.remove(key); 62 | if (oldValue != null) {{ 63 | - onDropFromCache(key, value); 64 | + onDropFromCache(key, oldValue); 65 | }} 66 | 67 | oldValue = myProbationalQueue.put(getStableKey(key), value); 68 | if (oldValue != null) {{ 69 | - onDropFromCache(key, value); 70 | + onDropFromCache(key, oldValue); 71 | }} 72 | }} 73 | ``` 74 | Message: Corrected parameter error in onDropFromCache() function call 75 | Type: fix 76 | Reason: The git commit is a "fix" commit as it rectified a parameter error where `oldValue` should be passed as the argument of `onDropFromCache` rather than `value`. 77 | 78 | Diff: ```diff 79 | {diff} 80 | ``` 81 | Message: {message} 82 | Type: """ 83 | 84 | CODEREF_PROMPT_TEMPLATE = """\ 85 | // Please optimize the given "Code (to optimize)" (a portion of some "File"s) by strictly following the given "Suggestion". 86 | // Your optimization can involve editing or removing existing code, or adding new code. 87 | // You may pay more attention to lines marked by "// !!attention". If no such marked lines, do everything by yourself. 88 | 89 | // Code (to optimize): 90 | 91 | {code_to_opt} 92 | 93 | // Suggestion: {suggestion} 94 | 95 | // Code (after optimization): 96 | 97 | """ 98 | 99 | CODEREF_FEWSHOT_PROMPT_TEMPLATE = """\ 100 | // Please optimize the given "Code (to optimize)" (a portion of some "File"s) by strictly following the given "Suggestion". 101 | // Your optimization can involve editing or removing existing code, or adding new code. 102 | // You may pay more attention to lines marked by "// !!attention". If no such marked lines, do everything by yourself. 
103 | 104 | ------- 105 | 106 | // Code (to optimize): 107 | 108 | //// File: util/src/com/intellij/util/containers/SLRUMap.java 109 | 69 | public void put(K key, V value) {{ 110 | 70 | V oldValue = myProtectedQueue.remove(key); 111 | 71 | if (oldValue != null) {{ 112 | 72 | onDropFromCache(key, value); // !!attention 113 | 73 | }} 114 | 74 | 115 | 75 | oldValue = myProbationalQueue.put(getStableKey(key), value); 116 | 76 | if (oldValue != null) {{ 117 | 77 | onDropFromCache(key, value); // !!attention 118 | 78 | }} 119 | 79 | }} 120 | 80 | 121 | 122 | // Suggestion: Corrected parameter error in onDropFromCache() function call 123 | 124 | // Code (after optimization): 125 | 126 | //// File: util/src/com/intellij/util/containers/SLRUMap.java 127 | 69 | public void put(K key, V value) {{ 128 | 70 | V oldValue = myProtectedQueue.remove(key); 129 | 71 | if (oldValue != null) {{ 130 | 72 | onDropFromCache(key, oldValue); 131 | 73 | }} 132 | 74 | 133 | 75 | oldValue = myProbationalQueue.put(getStableKey(key), value); 134 | 76 | if (oldValue != null) {{ 135 | 77 | onDropFromCache(key, oldValue); 136 | 78 | }} 137 | 79 | }} 138 | 80 | 139 | 140 | ------- 141 | 142 | // Code (to optimize): 143 | 144 | {code_to_opt} 145 | 146 | // Suggestion: {suggestion} 147 | 148 | // Code (after optimization): 149 | 150 | """ 151 | 152 | 153 | def xitem_coderef(item, fewshot=False): 154 | opt_suggestion = item['summaries']['en'] 155 | 156 | code_to_opt = [] 157 | code_after_opt = [] 158 | 159 | for patched_file in PatchSet(item['change']): 160 | file_name = patched_file.path 161 | source_text = "\n".join( 162 | # Add an attention to additionally inform the model 163 | f"{line.source_line_no} | {line.value.rstrip()} {'// !!attention' if line.is_removed else ''}" 164 | for hunk in patched_file 165 | for line in hunk.source_lines() 166 | ) 167 | target_text = "\n".join( 168 | f'{line.target_line_no} | {line.value.rstrip()}' 169 | for hunk in patched_file 170 | for line in hunk.target_lines() 171 | ) 172 | code_to_opt.append(f"""//// File: {file_name}\n{source_text}""") 173 | code_after_opt.append(f"""//// File: {file_name}\n{target_text}""") 174 | 175 | if fewshot: 176 | return { 177 | 'prompt': CODEREF_FEWSHOT_PROMPT_TEMPLATE.format( 178 | code_to_opt='\n'.join(code_to_opt), 179 | suggestion=opt_suggestion 180 | ), 181 | 'answer': '\n'.join(code_after_opt) 182 | } 183 | else: 184 | return { 185 | 'prompt': CODEREF_PROMPT_TEMPLATE.format( 186 | code_to_opt='\n'.join(code_to_opt), 187 | suggestion=opt_suggestion 188 | ), 189 | 'answer': '\n'.join(code_after_opt) 190 | } 191 | 192 | 193 | def xitem_chcl(item, fewshot=False): 194 | if fewshot: 195 | return { 196 | 'prompt': CHCL_FEWSHOT_PROMPT_TEMPLATE.format( 197 | diff=item['change'], 198 | message=item['summaries']['en'] 199 | ), 200 | 'answer': item['type'] 201 | } 202 | else: 203 | return { 204 | 'prompt': CHCL_PROMPT_TEMPLATE.format( 205 | diff=item['change'], 206 | message=item['summaries']['en'] 207 | ), 208 | 'answer': item['type'] 209 | } 210 | 211 | 212 | def xitem_chsum(item): 213 | return { 214 | 'prompt': CHSUM_PROMPT_TEMPLATE.format(diff=item['change']), 215 | 'answer': item['summaries']['en'] 216 | } 217 | 218 | 219 | def transform(in_dir, out_dir, xitem_fn): 220 | assert in_dir.is_dir(), f"Not a directory: {in_dir}" 221 | assert (in_dir / 'train.json').exists(), f"File train.json does not exist in: {in_dir}" 222 | assert (in_dir / 'test.json').exists(), f"File test.json does not exist in: {in_dir}" 223 | 224 | out_dir.mkdir(exist_ok=True) 225 | 226 | 
    for fname in ['train.json', 'test.json']:
227 |         with (in_dir / fname).open('r') as fin:
228 |             tx_data = [xitem_fn(item) for item in json.load(fin)]
229 |         with (out_dir / fname).open('w') as fou:
230 |             json.dump(tx_data, fou, ensure_ascii=False, indent=2)
231 | 
232 | 
233 | if __name__ == '__main__':
234 |     from argparse import ArgumentParser
235 |     from pathlib import Path
236 | 
237 |     parser = ArgumentParser(
238 |         prog="xdata",
239 |         description="Transform the HQCM dataset for fine-tuning of a specific code-change task"
240 |     )
241 |     parser.add_argument(
242 |         "dataset", type=Path,
243 |         help="Path to the directory saving the HQCM dataset before transformation"
244 |     )
245 |     parser.add_argument(
246 |         "-t", "--task",
247 |         required=True, choices=['chsum', 'chcl', 'coderef'],
248 |         help="Target tasks: chsum for change summarization, chcl for change classification, and coderef for code refinement"
249 |     )
250 |     parser.add_argument(
251 |         "-o", "--output",
252 |         required=True, type=Path,
253 |         help="Path to the directory to save the HQCM dataset after transforming for the task"
254 |     )
255 |     parser.add_argument(
256 |         "-F", "--few-shot",
257 |         default=False, action='store_true',
258 |         help="Transform the dataset into a variant usable for few-shot in-context learning"
259 |     )
260 |     args = parser.parse_args()
261 | 
262 |     if args.task == 'chsum':
263 |         xitem_fn = xitem_chsum
264 |     elif args.task == 'chcl':
265 |         xitem_fn = functools.partial(xitem_chcl, fewshot=args.few_shot)
266 |     elif args.task == 'coderef':
267 |         xitem_fn = functools.partial(xitem_coderef, fewshot=args.few_shot)
268 |     else:
269 |         assert False, "Unsupported code change task: " + args.task
270 | 
271 |     transform(args.dataset, args.output, xitem_fn=xitem_fn)
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 | 
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 | 
7 | 1. Definitions.
8 | 
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 | 
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 | 
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 | 
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 | 
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 | 
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
-------------------------------------------------------------------------------- /hqcm/sft.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pprint 3 | import time 4 | from argparse import ArgumentParser 5 | 6 | import torch 7 | from datasets import load_dataset 8 | from dotenv import load_dotenv 9 | from peft import LoraConfig, TaskType, get_peft_model 10 | from transformers import AutoTokenizer, DataCollatorForSeq2Seq, set_seed, AutoModelForCausalLM, TrainingArguments, \ 11 | Trainer, BitsAndBytesConfig 12 | 13 | load_dotenv() 14 | 15 | parser = ArgumentParser( 16 | "sft", 17 | description="Supervised fine-tuning (sft) a HuggingFace model with bf16 precision for " 18 | "a specific code-change related task using the transformed HQCM dataset" 19 | ) 20 | parser.add_argument( 21 | "model", 22 | type=str, 23 | help="Path to the base model for sft" 24 | ) 25 | parser.add_argument( 26 | "-d", "--dataset", 27 | required=True, type=str, 28 | help="Path to the dataset for base model's sft" 29 | ) 30 | parser.add_argument( 31 | "-c", "--config", 32 | default=None, type=str, 33 | help="Config name of the dataset" 34 | ) 35 | parser.add_argument( 36 | "-t", "--split", 37 | default=None, type=str, 38 | help="Split of the dataset; each item in the dataset should have a `prompt` and an `answer` field" 39 | ) 40 | parser.add_argument( 41 | "-p", "--percentage", 42 | default=100, type=int, 43 | help="Select only a percentage of the dataset for sft (e.g., 10 for 10%)" 44 | ) 45 | parser.add_argument( 46 | "-M", "--max-length", 47 | default=512, type=int, 48 | help="Max length of the text (prompt + answer) saved in the dataset" 49 | ) 50 | parser.add_argument( 51 | "-N", "--no-lora", 52 | default=False, action="store_true", 53 | help="Fine-tuning without LoRA" 54 | ) 55 | parser.add_argument( 56 | "-R", "--lora-rank", 57 | default=64, type=int, 58 | help="LoRA's rank" 59 | ) 60 | parser.add_argument( 61 | "-A", "--lora-alpha", 62 | default=16, type=int, 63 | help="LoRA's alpha" 64 | ) 65 | parser.add_argument( 66 | "-D", "--lora-dropout", 67 | default=0.1, type=float, 68 | help="LoRA's dropout" 69 | ) 70 | parser.add_argument( 71 | "-Q", "--quantization", 72 | default=-1, type=int, choices=[-1, 4, 8], 73 | help="Enable fine-tuning with k-bit quantization" 74 | ) 75 | parser.add_argument( 76 | "-a", "--learning-rate", 77 | default=2e-4, type=float, 78 | help="Learning rate (the larger, the fast)" 79 | ) 80 | parser.add_argument( 81 | "-e", "--num-epochs", 82 | default=2, type=int, 83 | help="Number of epochs to train" 84 | ) 85 | parser.add_argument( 86 | "-B", "--batch-size", 87 | default=64, type=int, 88 | help="Number of examples to feed for gradient updates (i.e., opt_per_device_batch_size * gradient_accumulation_steps)" 89 | ) 90 | parser.add_argument( 91 | "-b", "--micro-batch-size", 92 | default=8, type=int, 93 | help="Number of examples to feed per step (i.e., per_device_batch_size)" 94 | ) 95 | parser.add_argument( 96 | "-r", "--resume", 97 | default=False, action='store_true', 98 | help="Resume pretraining from the last checkpoint in the output directory" 99 | ) 100 | parser.add_argument( 101 | "-o", "--output", 102 | required=True, type=str, 103 | help="Path to saved the model after sft" 104 | ) 105 | parser.add_argument( 106 | "-s", "--seed", 107 | default=int(time.time()), type=int, 108 | help="Seed for controlling the randomness of the sft process" 109 | ) 110 | 111 | args = parser.parse_args() 112 | pprint.pprint(vars(args)) 113 | 
114 | arg_base_model = args.model
115 | 
116 | arg_dataset = args.dataset
117 | opt_dataset_config = args.config
118 | opt_dataset_split = args.split
119 | opt_dataset_max_seq_len = args.max_length
120 | opt_dataset_percentage = args.percentage
121 | 
122 | opt_lora = not args.no_lora
123 | opt_lora_rank = args.lora_rank
124 | opt_lora_alpha = args.lora_alpha
125 | opt_lora_dropout = args.lora_dropout
126 | 
127 | opt_quant_bit = args.quantization
128 | 
129 | opt_learning_rate = args.learning_rate
130 | opt_training_epochs = args.num_epochs
131 | opt_per_device_batch_size = args.micro_batch_size
132 | opt_gradient_accumulation_steps = args.batch_size // args.micro_batch_size
133 | 
134 | opt_resume_from_checkpoint = args.resume
135 | 
136 | arg_output_dir = args.output
137 | opt_output_save_steps = 1000
138 | opt_output_save_limit = 5
139 | opt_output_logging_steps = 10
140 | 
141 | opt_seed = args.seed
142 | opt_num_proc = os.cpu_count()
143 | 
144 | # Set the seed for reproduction
145 | set_seed(opt_seed)
146 | 
147 | # Write out the command for reproduction
148 | os.makedirs(arg_output_dir, exist_ok=True)
149 | with open(os.path.join(arg_output_dir, "command.txt"), 'w') as fp:
150 |     fp.write(pprint.pformat(vars(args)))
151 | 
152 | # Load the tokenizer and set the pad token
153 | tokenizer = AutoTokenizer.from_pretrained(arg_base_model, trust_remote_code=True)
154 | if tokenizer.pad_token_id is None:
155 |     tokenizer.pad_token_id = tokenizer.unk_token_id
156 | tokenizer.padding_side = "left"
157 | 
158 | # Create a DataCollator to help us pad each sequence to the maximum length in the batch while training.
159 | # Don't use DataCollatorForLanguageModeling, as it doesn't pad the labels.
160 | data_collator = DataCollatorForSeq2Seq(
161 |     tokenizer, pad_to_multiple_of=8, return_tensors="pt"
162 | )
163 | 
164 | 
165 | # Tokenize the dataset and hide the prompt from the model
166 | def tokenize_and_preprocess(example):
167 |     prompt, answer = example['prompt'], example['answer']
168 | 
169 |     tk_example = tokenizer(
170 |         prompt + answer,
171 |         max_length=opt_dataset_max_seq_len,
172 |         # Don't pad here, since each example is about to be padded by the DataCollator
173 |         padding=False, truncation=True
174 |     )
175 | 
176 |     # In case an EOS token is missing: the tokenizer is configured for
177 |     # generation tasks, so it automatically adds a BOS token to the head
178 |     # but does not append an EOS token, leaving room for further generation.
179 |     # So let's add one at the end of our prompt-answer pair.
180 |     if (tk_example['input_ids'][-1] != tokenizer.eos_token_id and
181 |             len(tk_example['input_ids']) < opt_dataset_max_seq_len):
182 |         tk_example['input_ids'].append(tokenizer.eos_token_id)
183 |         tk_example['attention_mask'].append(1)  # This EOS token should be attended to
184 | 
185 |     # Prepare the labels; they are exactly the input_ids
186 |     tk_example['labels'] = tk_example['input_ids'].copy()
187 | 
188 |     # Hide the prompt so that training does not compute cross-entropy for prompt tokens
189 |     # and the model only focuses on learning to generate the answer.
190 |     # Since the last token of the prompt might be part of the first token of the answer, skip it.
191 |     # For example, if prompt="A " and answer="a" then tk_prompt=["_A", "_"], tk_example=["_A", "_a"].
192 |     num_hidden_tokens = len(tokenizer(
193 |         prompt, max_length=opt_dataset_max_seq_len, padding=False, truncation=True
194 |     )['input_ids']) - 1
195 |     # label_pad_token_id (-100) is the magic value that PyTorch's cross-entropy loss ignores.
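    # (Illustrative) Continuing the example above: with tk_example = ["_A", "_a"]
    # and num_hidden_tokens = len(["_A", "_"]) - 1 = 1, the labels become
    # [-100, "_a"], so cross-entropy is computed only on the answer token.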
196 |     tk_example['labels'][:num_hidden_tokens] = [data_collator.label_pad_token_id] * num_hidden_tokens
197 | 
198 |     return tk_example
199 | 
200 | 
201 | # Load the dataset for supervised fine-tuning
202 | raw_dataset = load_dataset(arg_dataset, opt_dataset_config, split=opt_dataset_split)
203 | if 0 < opt_dataset_percentage < 100:
204 |     num_selected = int(len(raw_dataset) * (opt_dataset_percentage / 100))
205 |     raw_dataset = raw_dataset.shuffle().select(range(num_selected))
206 | dataset = raw_dataset.map(
207 |     tokenize_and_preprocess,
208 |     remove_columns=raw_dataset.column_names,
209 |     num_proc=opt_num_proc,
210 |     load_from_cache_file=True
211 | )
212 | 
213 | # Enable quantization if needed
214 | if opt_quant_bit == -1:
215 |     quantization_config = None
216 | elif opt_quant_bit == 4:
217 |     quantization_config = BitsAndBytesConfig(
218 |         # According to https://huggingface.co/blog/4bit-transformers-bitsandbytes:
219 |         # A rule of thumb is: use double quant if you have problems with memory,
220 |         # use NF4 for higher precision, and use a 16-bit dtype for faster finetuning.
221 |         load_in_4bit=True,
222 |         bnb_4bit_quant_type="nf4",
223 |         bnb_4bit_use_double_quant=True,
224 |         bnb_4bit_compute_dtype=torch.bfloat16
225 |     )
226 | elif opt_quant_bit == 8:
227 |     quantization_config = BitsAndBytesConfig(
228 |         # TODO: Enable more 8bit options
229 |         load_in_8bit=True
230 |     )
231 | else:
232 |     assert False, f"Unsupported quantization bits: {opt_quant_bit}"
233 | 
234 | # Load the base model for supervised fine-tuning
235 | base_model = AutoModelForCausalLM.from_pretrained(
236 |     arg_base_model,
237 |     device_map='auto',  # Let the accelerate module decide
238 |     # Let the base model decide, i.e., use 'torch_dtype' in config.json.
239 |     # If this is not set and the default value is used, the model will be loaded as float32.
240 |     torch_dtype='auto',
241 |     # Enable quantization during supervised fine-tuning
242 |     quantization_config=quantization_config,
243 |     trust_remote_code=True
244 | )
245 | 
246 | # Create a LoRA adapter with the PEFT module
247 | if opt_lora:
248 |     lora_adapter = LoraConfig(
249 |         r=opt_lora_rank,
250 |         lora_alpha=opt_lora_alpha,
251 |         lora_dropout=opt_lora_dropout,
252 |         # APIs like base_model.modules() or base_model.named_modules() can help.
253 |         # Each module can be a full module name, a suffix of the name, or a regex.
254 |         # By default, PEFT/LoRA treats each of the following as a suffix of a module
255 |         # name and checks it against all modules in the model by creating a regex r'.*\.{target_module}$'.
256 |         # If not specified, default values are used depending on the base model:
257 |         # for ChatGLM-series models, the default is ['query_key_value'];
258 |         # for Llama-series models, the default is ['q_proj', 'k_proj', 'v_proj'].
259 |         # Check: TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING in site-packages/peft/utils/others.py.
260 |         # target_modules=['q_proj', 'k_proj', 'v_proj'],  # TODO: support customizing LoRA's target modules
261 |         inference_mode=False,
262 |         bias="none",
263 |         task_type=TaskType.CAUSAL_LM
264 |     )
265 |     # Wrap the base model only when LoRA is enabled; get_peft_model
266 |     # requires a valid PeftConfig and would fail with None for --no-lora.
267 |     base_model = get_peft_model(base_model, lora_adapter)
268 | 
269 | # Prepare the arguments for supervised fine-tuning
270 | training_args = TrainingArguments(
271 |     output_dir=arg_output_dir,
272 |     overwrite_output_dir=False,
273 | 
274 |     # logging arguments
275 |     report_to="tensorboard",
276 |     logging_steps=opt_output_logging_steps,
277 |     logging_first_step=True,
278 |     logging_dir=arg_output_dir,
279 | 
280 |     # saving arguments
281 |     save_strategy="steps",
282 |     save_steps=opt_output_save_steps,
283 |     save_total_limit=opt_output_save_limit,
284 | 
285 |     # learning arguments
286 |     learning_rate=opt_learning_rate,
287 |     num_train_epochs=opt_training_epochs,
288 |     per_device_train_batch_size=opt_per_device_batch_size,
289 |     gradient_accumulation_steps=opt_gradient_accumulation_steps,
290 |     # optim=None,
291 |     weight_decay=0.01,  # This adds an additional regularization term to AdamW to avoid overfitting
292 |     warmup_ratio=0.01,  # This is for the learning rate scheduler
293 |     # load_best_model_at_end=True,
294 | 
295 |     # model arguments
296 |     bf16=True,  # Enforce bf16 precision
297 | 
298 |     # data loading arguments
299 |     dataloader_drop_last=False,
300 |     dataloader_num_workers=opt_num_proc,
301 | )
302 | 
303 | # Define the trainer for supervised fine-tuning
304 | trainer = Trainer(
305 |     model=base_model,
306 |     args=training_args,
307 |     train_dataset=dataset,
308 |     # TODO: support on-the-fly evaluation
309 |     # eval_dataset=eval_dataset,
310 |     # compute_metrics=compute_metrics,
311 |     data_collator=data_collator
312 | )
313 | 
314 | # Disable the KV cache during training; it is only useful for generation
315 | base_model.config.use_cache = False
316 | 
317 | # Start supervised fine-tuning
318 | trainer.train(resume_from_checkpoint=True if opt_resume_from_checkpoint else None)
319 | 
320 | # Save the model
321 | trainer.save_model(arg_output_dir)
322 | 
323 | print("----------")
324 | print(f"Supervised fine-tuning finished; the model was saved to {arg_output_dir}")
--------------------------------------------------------------------------------
/dataset/README.md:
--------------------------------------------------------------------------------
1 | # README
2 | 
3 | ## Top-Level Categories (Type)
4 | 
5 | > The top-level categories follow the mainstream Conventional Commits specification, mainly referencing the Angular convention.
6 | 
7 | | Category | Description |
8 | | --------- | ----------- |
9 | | style | Changes that only affect code style, such as adding or removing whitespace, adjusting indentation, or reformatting code |
10 | | docs | Code changes related to documentation |
11 | | test | Adding new tests, or removing and correcting existing tests |
12 | | build | Code changes affecting the build system (e.g., Gulp, Broccoli, NPM) or external dependencies |
13 | | cicd | Code changes to configuration files or scripts of CI/CD systems (e.g., Travis, Circle) |
14 | | fix | Fixing or preventing code defects, vulnerabilities, etc. |
15 | | feat | Adding new functionality to the code |
16 | | refactor | Code changes related to refactoring |
17 | | ~~perf~~ | ~~Code changes affecting performance~~ |
18 | | ~~chore~~ | ~~All other code changes not covered by the types above~~ |
19 | 
20 | > Note: Following git commit best practices, each commit should be minimal, so every commit corresponds to exactly one of the top-level types above. In reality, however, developers often mix several kinds of code modifications in one commit; in that case, we recommend considering only the dominant modification in the commit and matching it against the top-level categories in the order listed above.
21 | 
22 | ## Second-Level Categories v0.1.0 (Subtype)
23 | 
24 | > The following classification mainly references the categorization of code review activities in "EMSE20/An empirical investigation of relevant changes and automation needs in modern code review".
25 | 
26 | ### feat
27 | 
28 | | Category | Description |
29 | | ---- | ----------- |
30 | | E1 | Function checks: e.g., a function call's return value needs to be checked for validity and errors |
31 | | E2 | Variable checks: e.g., a variable needs to be checked |
32 | | E3 | User-input checks: e.g., user input needs validation |
33 | | Ex | New features: other new functionality |
34 | 
35 | ### fix
36 | 
37 | | Category | Description |
38 | | ---- | ----------- |
39 | | I1 | Function calls: e.g., incorrect or missing calls to the system or a library |
40 | | I2 | Parameters: incorrect or missing parameters used in function calls or other interactions |
41 | | I3 | Comparisons: errors in comparison statements |
42 | | I4 | Computation: a computation produces incorrect results |
43 | | I5 | Wrong timing: the correct operation is performed, but too early or too late |
44 | | I6 | Algorithm/performance: an inefficient algorithm is used |
45 | | I7 | Variable initialization: a variable is used before being initialized; an uninitialized variable may contain any value, and using such a variable produces errors |
46 | | I8 | Memory management: e.g., errors in handling system memory, explicit GC invocations |
47 | | I9 | Data and resource manipulation: defects related to manipulating or using data or other resources |
48 | | I10 | Security: issues related to the security of the application/software |
49 | | I11 | Concurrency: issues related to concurrency |
50 | | I12 | Completeness: partially implemented functionality |
51 | | I13 | GUI: defects in user-interface code, involving the consistency of the UI and the options offered to the user in each situation |
52 | | I14 | Checking external code/ripple effects: application code outside the scope of the review needs to be checked, since it may contain code that becomes incorrect given the current review |
53 | | I15 | Syntax: fixing syntax errors |
54 | | I16 | Spelling: fixing spelling mistakes |
55 | | Ix | Other: other code changes related to fixing issues |
56 | 
57 | ### refactor
58 | 
59 | | Category | Description |
60 | | ---- | ----------- |
61 | | R1 | Semantic duplication: code structures with similar intent but syntactically different implementations |
62 | | R2 | Semantically dead code: code fragments that execute but serve no meaningful purpose or have no effect on the result |
63 | | R3 | Changing functions: e.g., changing a call from an old or deprecated function to another function |
64 | | R4 | Standard coding conventions: e.g., using exceptions instead of return values for error messages, constants instead of magic numbers, built-in data structures instead of custom implementations, etc. |
65 | | R5 | Strings (wording): problems with poorly composed string content |
66 | | R6 | Logging: adding or removing a method's ability to log results or errors |
67 | | R7 | Imports: problems with incorrect, missing, or unused import statements |
68 | | R8 | Moving functionality: e.g., moving classes, functions, parts of functions, or other functional elements to a different class, file, or module |
69 | | R9 | Long subroutines/complex code/simplification: splitting long and complex functions into several functions, or reorganizing or rewriting the implementation to make it easier to understand |
70 | | R10 | Duplicate/redundant/dead code: removing duplicated code, useless code, or code that is never reached and executed |
71 | | R11 | Consistency: keeping the code consistent, with similar code elements operating in similar, more or less symmetric ways; e.g., similar tasks should have similar implementations in similar classes |
72 | | R12 | Architectural changes: code review often leads to changes to the system architecture, e.g., splitting an interface into two different interfaces, introducing an abstraction, or introducing a design pattern |
73 | | R13 | Renaming: renaming classes, methods, fields, etc. |
74 | | Rx | Other: other refactoring-related code changes |
75 | 
76 | ### style
77 | 
78 | | Category | Description |
79 | | ---- | ----------- |
80 | | S1 | Brackets and braces: e.g., only one statement following a conditional branch |
81 | | S2 | Indentation: e.g., consistent code indentation, removing or adding indentation |
82 | | S3 | Blank lines: too many or too few blank lines, or incorrect line splitting |
83 | | S4 | Overlong lines: code statements that are too long, exceeding a certain number of characters |
84 | | S5 | Whitespace usage: how spaces are used in the code |
85 | | Sx | Other: other code changes related to code style |
86 | 
87 | ### test
88 | 
89 | | Category | Description |
90 | | ---- | ----------- |
91 | | T1 | Tests: test-related code changes |
92 | 
93 | ### docs
94 | 
95 | | Category | Description |
96 | | ---- | ----------- |
97 | | D1 | Naming: e.g., problems in documentation with names of software elements (methods, classes, variables, etc.) that do not comply with the project's naming policy |
98 | | D2 | Comments: comment-related changes, e.g., misplaced comments, adding new comments, modifying existing comments, removing comments, missing or incorrect Javadoc, etc. (including TODO and FIXME) |
99 | | D3 | License headers: missing or incorrect license headers in source code files |
100 | | D4 | Typos: spelling mistakes in documentation |
101 | | Dx | Other: other documentation-related code changes |
102 | 
103 | ### build
104 | 
105 | | Category | Description |
106 | | ---- | ----------- |
107 | | B1 | Build: code changes related to the build system |
108 | 
109 | ### cicd
110 | 
111 | | Category | Description |
112 | | ---- | ----------- |
113 | | C1 | CI/CD configuration: changes to configuration files for the setup of continuous integration or continuous delivery pipelines |
114 | | C2 | Automated static analysis tool configuration: configuration changes for the linters, checkers, and recommenders used in the project (e.g., Checkstyle, PMD, FindBugs) |
115 | | C3 | Language- or framework-specific: changes to files specific to the programming language in use, e.g., Java's MANIFEST file |
116 | | C4 | Runtime configuration: e.g., docker configuration, ansible playbooks, delivery configuration |
117 | | Cx | Other: other CI/CD-related code changes |
118 | 
119 | ## Second-Level Categories v0.2.0 (Subtype)
120 | 
121 | This version optimizes v0.1.0: rarely used and unreasonable second-level categories were removed, some categories were merged, some missing categories were added, and finally GPT-4 was asked to evaluate and refine the taxonomy.
122 | 
123 | > Now I want to classify the commit messages of users' git commits. My taxonomy needs to satisfy most users, i.e., I need categories that most users will use; categories needed by only a small fraction of users all go under "Other".
124 | >
125 | > To this end, I referenced the Angular project's classification scheme and divided the top level into eight categories: feat (Feature), fix, refactor, style, test, build, cicd, docs.
126 | >
127 | > I also need to generate, for each top-level category, some second-level categories that most users will use. Below are my second-level categories; please review whether this classification is reasonable, and tell me which second-level categories are rarely used and which commonly used ones I am missing:
128 | 
129 | ### feat
130 | 
131 | | Category | Description |
132 | | ---- | ----------- |
133 | | E1 | Elements: adding new software elements, such as modules, classes, methods, or functions, for new functionality |
134 | | E2 | Logging: adding the ability for a function to log its results or errors |
135 | | E3 | Performance: replacing old algorithms, data structures, etc. with more performant ones |
136 | | E4 | Security: security-related feature hardening, such as strengthening authentication, authorization, or encryption |
137 | | Ex | Other: other feat-related code changes |
138 | 
139 | ### fix
140 | 
141 | | Category | Description |
142 | | ---- | ----------- |
143 | | I1 | Syntax: fixing syntax errors in the code (e.g., adding missing brackets or semicolons) |
144 | | I2 | Spelling: fixing spelling mistakes in the code (e.g., correcting tyep to type) |
145 | | I3 | Checks: strengthening checks in the code to prevent/fix missing-check errors, e.g., strengthening user-input checks to prevent/fix XSS, adding null checks to prevent/fix null-pointer/null-reference errors, adding checks of function return values, etc. |
146 | | I4 | Initialization: fixing uninitialized variables, fields, etc. |
147 | | I5 | Concurrency: fixing concurrency-related issues such as data races |
148 | | I6 | Memory: fixing memory-related issues such as memory allocation, memory leaks, and out-of-memory errors |
149 | | Ix | Other: other fix-related code changes |
150 | 
151 | ### refactor
152 | 
153 | | Category | Description |
154 | | ---- | ----------- |
155 | | R1 | Code extraction: extracting similar, semantically duplicated code fragments (e.g., into shared classes, functions, or methods) to reduce duplication, or replacing existing similar fragments with an existing shared method |
156 | | R2 | Dead-code removal: removing ineffective code fragments (those that serve no meaningful purpose or have no effect on the result) |
157 | | R3 | Code movement: e.g., moving code (classes, functions, methods), in whole or in part, to another location (a different class, file, or module) |
158 | | R4 | Code simplification: splitting long and complex classes, functions, etc. into several functions, or reorganizing or rewriting the implementation to make it easier to understand |
159 | | R5 | Renaming: renaming classes, methods, fields, etc. |
160 | | R6 | Code consistency: modifying code names, implementations, etc. to stay consistent with other similar code |
161 | | R7 | Replacing old with new: deprecating some code (e.g., marking it @deprecated) in favor of new code |
162 | | R8 | Import changes: removing incorrect or unused import statements, or adding missing ones |
163 | | Rx | Other: other refactoring-related code changes |
164 | 
165 | ### style
166 | 
167 | | Category | Description |
168 | | ---- | ----------- |
169 | | S1 | Brackets: changing the alignment of brackets (parentheses, square brackets, braces, etc.) |
170 | | S2 | Indentation: adding or removing indentation |
171 | | S3 | Blank lines: adding or removing blank lines |
172 | | S4 | Line splitting: splitting overlong code lines or correcting how code lines are split |
173 | | S5 | Spaces: adding or removing spaces in the code |
174 | | Sx | Other: other style-related code changes |
175 | | Sx | Style: code changes that relate to code style without affecting the code's actual behavior, such as changing the alignment of brackets (parentheses, square brackets, braces, etc.), adding or removing indentation, blank lines, or spaces, and splitting overlong code lines or correcting how they are split |
176 | 
177 | ### test
178 | 
179 | | Category | Description |
180 | | ---- | ----------- |
181 | | T1 | Adding tests: adding one or more tests |
182 | | T2 | Removing tests: removing existing tests or marking certain tests as no longer used |
183 | | T3 | Fixing tests: correcting existing tests |
184 | | Tx | Other: other test-related code changes |
185 | | Tx | Tests: code changes related to test code, such as adding tests, removing existing tests or marking some as no longer used, correcting the flow or assertions of existing tests, or adjusting code-coverage tooling |
186 | 
187 | ### docs
188 | 
189 | | Category | Description |
190 | | ---- | ----------- |
191 | | D1 | Naming: correcting names of software elements (methods, classes, variables) in documentation that are inconsistent with the code |
192 | | D2 | Comments: comment-related code changes, including TODO and FIXME (e.g., misplaced comments, adding new comments, modifying existing comments, removing comments, missing or incorrect Javadoc) |
193 | | D3 | Spelling: correcting spelling mistakes in documentation |
194 | | Dx | Other: other documentation-related code changes |
195 | 
196 | ### build
197 | 
198 | | Category | Description |
199 | | ---- | ----------- |
200 | | B1 | Dependencies: changes related to dependency management (e.g., upgrading, downgrading, removing, or adding dependencies) |
201 | | Bx | Other: other code changes related to the build system |
202 | 
203 | ### cicd
204 | 
205 | | Category | Description |
206 | | ---- | ----------- |
207 | | C1 | Automated analysis: configuration changes for automated analysis tools used in CI/CD pipelines, such as linters, checkers, and analyzers (e.g., Checkstyle, PMD, FindBugs) |
208 | | C2 | Images: changes related to image building and deployment (e.g., docker, k8s) |
209 | | C3 | Deployment: code changes related to project deployment, such as adding or removing staging/production deployment steps, or modifying deployment scripts |
210 | | Cx | Other: other code changes related to continuous integration and deployment |
211 | 
--------------------------------------------------------------------------------