├── .gitignore
├── README.md
├── environment.yml
├── minhash.py
├── model_eval
│   ├── gen.sh
│   └── generate.py
└── model_finetune
    ├── MG-Verilog
    │   └── data_format_conversion.py
    ├── README.md
    └── run_peft.sh

/.gitignore:
--------------------------------------------------------------------------------
1 | **/results/**
2 | **.jsonl**
3 | **/CodeLlama-7b/**
4 | **/CodeLlama-7b.nemo
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # LLM4HWDesign Starting Toolkit
2 | ## Introduction
3 | This repository provides a starting toolkit for participants of the [LLM4HWDesign Contest at ICCAD 2024](https://nvlabs.github.io/LLM4HWDesign/). The toolkit includes scripts and utilities for deduplicating training data, fine-tuning models, and evaluating model performance. Participants can use this toolkit to kickstart their work and streamline their development process.
4 | 
5 | ## Base Dataset
6 | The base dataset used in the contest is the [MG-Verilog dataset](https://huggingface.co/datasets/GaTech-EIC/MG-Verilog). For your submitted data, please follow the same format as the [MG-Verilog dataset](https://huggingface.co/datasets/GaTech-EIC/MG-Verilog). Please note that you can provide either multiple levels or a single level of description for each code sequence, but we will **concatenate all descriptions at different levels into one string** for each code sequence, following the script below.
7 | 
8 | ```python
9 | instructions_dict = {
10 |     "summary": "xxx",
11 |     "detailed explanation": "yyy"
12 | }
13 | 
14 | result = ";\n".join([f"{key}: {value}" for key, value in instructions_dict.items()]) + "\n"
15 | 
16 | '''
17 | result should be
18 | summary: xxx;
19 | detailed explanation: yyy
20 | '''
21 | ```
22 | ## Toolkit Release Progress
23 | - [x] **Deduplication**: Scripts to identify and remove duplicate samples from the dataset.
24 | - [x] **Fine-tuning**: Scripts to fine-tune a pretrained language model on the MG-Verilog dataset.
25 | - [x] **Evaluation**: Tools to evaluate the performance of the fine-tuned model using standard metrics.
26 | 
27 | 
28 | ## Setup Environment
29 | 
30 | We assume CUDA 12.1. (This is only needed if you want to run fine-tuning and evaluation on your own.)
31 | 
32 | `conda env create -f environment.yml`
33 | 
34 | 
35 | ## Deduplication
36 | The toolkit includes a deduplication script, which will be used to deduplicate each participant's data against the base dataset during the evaluation of Phase I.
37 | To run the deduplication script:
38 | ```bash
39 | python minhash.py
40 | ```
41 | 
42 | 
43 | ## Evaluation
44 | 
45 | The following shows an example of how to evaluate your fine-tuned model.
46 | 
47 | **Prerequisites**:
48 | 
49 | `export HF_TOKEN=your_huggingface_token`
50 | 
51 | Prepare your fine-tuned model and tokenizer in HuggingFace format.
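
A quick, optional sanity check that the checkpoint directory is laid out as expected before running the evaluation scripts (a minimal sketch; `finetuned_model/` is a placeholder for your own checkpoint directory, matching the example invocation further below):

```python
from transformers import AutoConfig, AutoTokenizer

# The checkpoint folder is expected to contain config.json (CodeLlama-7B-Instruct's config)
# and the generated pytorch_model.bin; see the NOTE under "Evaluation Scripts" below.
config = AutoConfig.from_pretrained("finetuned_model/")

# The evaluation script loads the tokenizer from the base model, not from the checkpoint folder.
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf", use_fast=False)

print(config.model_type, config.vocab_size, len(tokenizer))
```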
52 | 
53 | Install [Iverilog](https://steveicarus.github.io/iverilog/usage/installation.html)
54 | 
55 | Install VerilogEval as the benchmark:
56 | 
57 | **Please read the WARNINGS in the [VerilogEval](https://github.com/NVlabs/verilog-eval/tree/main?tab=readme-ov-file#usage) before proceeding**
58 | 
59 | **Only version 1.0 is supported:** https://github.com/NVlabs/verilog-eval/tree/release/1.0.0
60 | 
61 | Pay attention to the "verilog-eval" vs. "verilog_eval" naming: the clone below produces a `verilog-eval` directory, while MG-Verilog's own modified VerilogEval (and some paths in this toolkit, e.g. in `model_eval/gen.sh`) use `verilog_eval`.
62 | 
63 | ```bash
64 | git clone https://github.com/NVlabs/verilog-eval.git
65 | pip install -e verilog-eval
66 | ```
67 | 
68 | **Evaluation Scripts**:
69 | 
70 | ```bash
71 | cd model_eval
72 | ./gen.sh
73 | #example: ./gen.sh "finetuned_model/" "hf-xxxxxxxxxx"
74 | ```
75 | 
76 | NOTE: The folder with your model and config should include two files: (1) the generated `pytorch_model.bin` and (2) the [model config](https://huggingface.co/meta-llama/CodeLlama-7b-Instruct-hf/blob/main/config.json) of CodeLlama-7B-Instruct from HuggingFace.
77 | 
78 | The results will be printed and logged in `./model_eval/data/gen.jsonl`
79 | 
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: iccad_challenge_2024
2 | channels:
3 |   - conda-forge
4 |   - pytorch
5 |   - nvidia
6 | dependencies:
7 |   - python=3.8
8 |   - pytorch=2.1.1
9 |   - torchvision
10 |   - pytorch-cuda=12.1
11 |   - gperf
12 |   - flex
13 |   - bison
14 |   - pip:
15 |     - jsonlines
16 |     - accelerate==0.24.1
17 |     - pandas
18 |     - fire
19 |     - transformers[torch]==4.35.0
20 |     - peft==0.6.1
21 |     - bitsandbytes==0.41.2.post2
22 |     - scipy
23 |     - sentencepiece
24 |     - transformers[dev]
25 |     - setuptools==69.5.1
26 |     - deepspeed==0.12.2
27 |     - transformers[deepspeed]
28 |     - datasketch==1.6.4
--------------------------------------------------------------------------------
/minhash.py:
--------------------------------------------------------------------------------
1 | import json
2 | import multiprocessing as mp
3 | import re
4 | from collections import defaultdict
5 | from functools import partial
6 | from typing import Dict, List, Optional, Set, Tuple, Type
7 | 
8 | from datasets import Dataset
9 | from datasketch import MinHash, MinHashLSH
10 | from tqdm import tqdm
11 | 
12 | 
13 | 
14 | from functools import partial
15 | import sys
16 | import queue
17 | import threading
18 | from typing import TypeVar, Iterator, List, Optional, Tuple
19 | 
20 | T = TypeVar('T')
21 | 
22 | class ThreadedIterator(Iterator[T]):
23 |     """An iterator object that computes its elements in a single parallel thread to be ready to be consumed.
24 |     The iterator should *not* return `None`. Elements of the original iterable will be shuffled arbitrarily."""
25 |     def __init__(self, original_iterator: Iterator[T], max_queue_size: int = 2, enabled: bool = True):
26 |         self.__is_enabled = enabled
27 |         if enabled:
28 |             self.__queue = queue.Queue(maxsize=max_queue_size)  # type: queue.Queue[Optional[T]]
29 |             self.__thread = threading.Thread(target=lambda: self.__worker(self.__queue, original_iterator), daemon=True)
30 |             self.__thread.start()
31 |         else:
32 |             self.__original_iterator = original_iterator
33 | 
34 |     @staticmethod
35 |     def __worker(queue: queue.Queue, original_iterator: Iterator[T]) -> None:
36 |         try:
37 |             for element in original_iterator:
38 |                 assert element is not None, 'By convention, Iterables wrapped in ThreadedIterator may not contain None.'
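                # Hand each element to the consumer through the bounded queue; once the loop
                # finishes, a None sentinel is enqueued so __next__ knows the iterator is exhausted.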
39 | queue.put(element, block=True) 40 | queue.put(None, block=True) 41 | except Exception as e: 42 | _, __, tb = sys.exc_info() 43 | queue.put((e, tb), block=True) 44 | 45 | def __next__(self) -> T: 46 | next_element = self.__queue.get(block=True) 47 | if next_element is None: 48 | self.__thread.join() 49 | self.__queue.put(None) # Make sure that we remember that we are done if we are called once more... 50 | raise StopIteration 51 | if isinstance(next_element, tuple) and isinstance(next_element[0], Exception): 52 | raise next_element[0].with_traceback(next_element[1]) 53 | return next_element 54 | 55 | def __iter__(self): 56 | if self.__is_enabled: 57 | return self 58 | else: 59 | return iter(self.__original_iterator) 60 | 61 | 62 | 63 | 64 | 65 | NON_ALPHA = re.compile("[^A-Za-z_0-9]") 66 | # parameters used in DuplicationIndex 67 | MIN_NUM_TOKENS = 10 68 | NUM_PERM = 256 69 | 70 | def get_min_hash(tokens: List[str]) -> Optional[MinHash]: 71 | """Compute the MinHash of a code snippet.""" 72 | if len(tokens) < MIN_NUM_TOKENS: 73 | return None 74 | min_hash = MinHash(num_perm=NUM_PERM) 75 | for token in set(tokens): 76 | min_hash.update(token.encode()) 77 | return min_hash 78 | 79 | 80 | def get_tokens(code: str) -> Set[str]: 81 | """Tokenize a code snippet.""" 82 | return {t for t in NON_ALPHA.split(code) if len(t.strip()) > 0} 83 | 84 | 85 | class DuplicationIndex: 86 | def __init__( 87 | self, 88 | *, 89 | duplication_jaccard_threshold: float = 0.85, 90 | ): 91 | self._duplication_jaccard_threshold = duplication_jaccard_threshold 92 | self._num_perm = NUM_PERM 93 | self._index = MinHashLSH(threshold=self._duplication_jaccard_threshold, num_perm=self._num_perm) 94 | 95 | self._duplicate_clusters = defaultdict(set) 96 | 97 | def add(self, code_key: Tuple, min_hash: MinHash) -> None: 98 | """Add a key to _index (MinHashLSH) 99 | the min_hash is used to query closest matches based on the jaccard_threshold. 100 | The new key is either added to a existing cluster of one close match, 101 | or a new cluster is created. The clusters created in this way, depend on the order of add. 102 | 103 | Args: 104 | code_key (Tuple of (index, repo_name, path)): 105 | Theoritically any hasbale key. Here we use a tuple to retrieve the information later. 106 | min_hash: MinHash of the code_key. 107 | """ 108 | close_duplicates = self._index.query(min_hash) 109 | if code_key in self._index.keys: 110 | print(f"Duplicate key {code_key}") 111 | return 112 | 113 | self._index.insert(code_key, min_hash) 114 | if len(close_duplicates) > 0: 115 | for base_duplicate in close_duplicates: 116 | if base_duplicate in self._duplicate_clusters: 117 | self._duplicate_clusters[base_duplicate].add(code_key) 118 | break 119 | else: 120 | self._duplicate_clusters[close_duplicates[0]].add(code_key) 121 | 122 | def get_duplicate_clusters(self) -> List[List[Dict]]: 123 | """Export the duplicate clusters. 124 | For each cluster, the first element is the base element of the cluster. 125 | The base element has an estimation jaccard similarity higher than the threshold with all the other elements. 126 | 127 | Returns: 128 | duplicate_clusters (List[List[Dict]]): 129 | List of duplicate clusters. 
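                Each cluster is a list of dicts with the keys ``base_index``, ``repo_name`` and ``path``;
                with this script, ``repo_name`` and ``path`` are the placeholders "norepo" and "nopath".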
130 | """ 131 | duplicate_clusters = [] 132 | for base, duplicates in self._duplicate_clusters.items(): 133 | cluster = [base] + list(duplicates) 134 | # reformat the cluster to be a list of dict 135 | cluster = [{"base_index": el[0], "repo_name": el[1], "path": el[2]} for el in cluster] 136 | duplicate_clusters.append(cluster) 137 | return duplicate_clusters 138 | 139 | def save(self, filepath) -> None: 140 | duplicate_clusters = self.get_duplicate_clusters() 141 | with open(filepath, "w") as f: 142 | json.dump(duplicate_clusters, f) 143 | 144 | 145 | def _compute_min_hash(element): 146 | index, data = element 147 | min_hash = get_min_hash([t for t in NON_ALPHA.split(data["code"]) if len(t.strip()) > 0]) 148 | if min_hash is not None: 149 | #can supply later and bookkept by get_duplicate_clusters 150 | return (index, "norepo", "nopath"), min_hash 151 | 152 | 153 | def minhash_iter(dataset_iterator: Type[Dataset]): 154 | with mp.Pool() as pool: 155 | for data in pool.imap_unordered( 156 | _compute_min_hash, 157 | ThreadedIterator(dataset_iterator, max_queue_size=10000), 158 | chunksize=100, 159 | ): 160 | if data is not None: 161 | yield data 162 | 163 | 164 | def make_duplicate_clusters(dataset_iterator: Type[Dataset], jaccard_threshold: float): 165 | """Find duplicate clusters in the dataset in two steps: 166 | 1. Compute MinHash for each code snippet. MinHash is a tool for fast jaccard similarity estimation. 167 | This step is computed using an asynchronous multiprocessing pool, minhash_iter 168 | 2. Find duplicate clusters. The computed MinHash is added sequentially to the DuplicationIndex. 169 | This step cannot be parallelized. So using asynchronous thread in the previous step helps to speed up the process. 170 | """ 171 | di = DuplicationIndex(duplication_jaccard_threshold=jaccard_threshold) 172 | 173 | for filename, min_hash in tqdm(ThreadedIterator(minhash_iter(enumerate(dataset_iterator)), max_queue_size=100)): 174 | di.add(filename, min_hash) 175 | 176 | # Returns a List[Cluster] where Cluster is List[str] with the filenames. 177 | return di.get_duplicate_clusters() 178 | 179 | 180 | def jaccard_similarity(code1: str, code2: str) -> float: 181 | """Compute the Jaccard similarity of two code snippets.""" 182 | tokens1 = get_tokens(code1) 183 | tokens2 = get_tokens(code2) 184 | return len(tokens1 & tokens2) / len(tokens1 | tokens2) 185 | 186 | 187 | _shared_dataset = None 188 | 189 | 190 | def _find_cluster_extremes_shared(cluster, jaccard_threshold): 191 | """Find a reduced cluster such that each code in the origin cluster is similar to at least one code in the reduced cluster. 192 | Two codes are similar if their Jaccard similarity is above the threshold. 193 | 194 | Args: 195 | cluster (List[dict]): 196 | cluster is a list of dict, each dict contains the following keys: 197 | - base_index 198 | - repo_name 199 | - path 200 | This is a typical output of DuplicationIndex.get_duplicate_clusters() 201 | jaccard_threshold (float): 202 | threshold for Jaccard similarity. 203 | Two codes are similar if their Jaccard similarity is above the threshold. 204 | 205 | Returns: 206 | extremes (List[dict]): 207 | A reduced representation of the cluster. The field copies is added to each dict. 208 | The copies field indicates the number of similar codes in the cluster for a extreme. 
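            For example, a cluster of three near-identical snippets is typically reduced to one extreme with copies == 3.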
209 | """ 210 | extremes = [] 211 | for element1 in cluster: 212 | code1 = _shared_dataset[element1["base_index"]]["code"] 213 | for element2 in extremes: 214 | code2 = _shared_dataset[element2["base_index"]]["code"] 215 | if jaccard_similarity(code1, code2) >= jaccard_threshold: 216 | element2["copies"] += 1 217 | break 218 | else: 219 | element1["copies"] = 1 220 | extremes.append(element1) 221 | return extremes 222 | 223 | 224 | def find_extremes(cluster_list, dataset, jaccard_threshold): 225 | """Call the _find_cluster_extremes_shared function in a parallel fashion. 226 | 227 | Args: 228 | cluster_list (List[List[Dict]]): 229 | each cluster is a list of dicts with the key base_index, 230 | referring to the index of the base code in the dataset. 231 | dataset (Type[Dataset]): 232 | dataset is used to access the content of the code snippets, 233 | using the base_index from the cluster_list. 234 | dataset is shared between all the processes using a glabal variable (any other way to share the dataset?), 235 | otherwise the multi processing is not speeded up. 236 | jaccard_threshold (float): 237 | the threshold for the jaccard similarity. The default value is 0.85 238 | 239 | Returns: 240 | extremes_list (List[Dict]): 241 | Each cluster is reduced to extremes. 242 | See _find_cluster_extremes_shared for the definition of extremes. 243 | """ 244 | global _shared_dataset 245 | _shared_dataset = dataset 246 | extremes_list = [] 247 | f = partial(_find_cluster_extremes_shared, jaccard_threshold=jaccard_threshold) 248 | with mp.Pool() as pool: 249 | for extremes in tqdm( 250 | pool.imap_unordered( 251 | f, 252 | cluster_list, 253 | ), 254 | total=len(cluster_list), 255 | ): 256 | extremes_list.append(extremes) 257 | return extremes_list 258 | 259 | 260 | def deduplicate_dataset( 261 | dataset: Type[Dataset], jaccard_threshold: float = 0.85 262 | ) -> Tuple[Type[Dataset], List[List[Dict]]]: 263 | """Deduplicate the dataset using minhash and jaccard similarity. 264 | This function first generate duplicate clusters, then each cluster 265 | is reduced to the extremes that are similar to the other elements in the cluster. 266 | Codes are called similar if their Jaccard similarity is greater than jaccard_threshold (0.85 default). 267 | 268 | Args: 269 | dataset (Type[Dataset]): 270 | The dataset to deduplicate. 271 | jaccard_threshold (float, default=0.85): 272 | jaccard threshold to determine if two codes are similar 273 | 274 | Returns: 275 | ds_dedup (Type[Dataset]): 276 | The deduplicated dataset. 277 | duplicate_clusters (List[List[Dict]]): 278 | The list of duplicate clusters. 279 | Each cluster is a list of dicts with the following keys: 280 | - base_index : int 281 | The index of the code in the original dataset. 282 | - repo_name : str 283 | - path : str 284 | - copies : int 285 | The number of copies of the code in the cluster. (find_cluster_extremes) 286 | - is_extreme : bool 287 | Whether the code is an extreme in the cluster. 288 | All the codes in the cluster are removed from the dataset except the extremes. 
289 | 290 | Example: 291 | >>> from datasets import load_dataset 292 | >>> from minhash_deduplication import deduplicate_dataset 293 | >>> ds = load_dataset("lvwerra/codeparrot-clean", split="train") 294 | >>> ds_dedup, duplicate_clusters = deduplicate_dataset(ds, jaccard_threshold=0.85) 295 | """ 296 | duplicate_clusters = make_duplicate_clusters(dataset, jaccard_threshold) 297 | duplicate_indices = {x["base_index"] for cluster in duplicate_clusters for x in cluster} 298 | extreme_dict = {} 299 | extremes_clusters = find_extremes(duplicate_clusters, dataset, jaccard_threshold) 300 | for extremes in extremes_clusters: 301 | for element in extremes: 302 | extreme_dict[element["base_index"]] = element 303 | remove_indices = duplicate_indices - set(extreme_dict.keys()) 304 | ds_filter = dataset.filter(lambda x, idx: idx not in remove_indices, with_indices=True) 305 | 306 | # update duplicate_clusters 307 | for cluster in duplicate_clusters: 308 | for element in cluster: 309 | element["is_extreme"] = element["base_index"] in extreme_dict 310 | if element["is_extreme"]: 311 | element["copies"] = extreme_dict[element["base_index"]]["copies"] 312 | 313 | print(f"Original dataset size: {len(dataset)}") 314 | print(f"Number of duplicate clusters: {len(duplicate_clusters)}") 315 | print(f"Files in duplicate cluster: {len(duplicate_indices)}") 316 | print(f"Unique files in duplicate cluster: {len(extreme_dict)}") 317 | print(f"Filtered dataset size: {len(ds_filter)}") 318 | 319 | return ds_filter, duplicate_clusters 320 | 321 | 322 | 323 | if __name__ == "__main__": 324 | import argparse 325 | from datasets import load_dataset 326 | parser = argparse.ArgumentParser() 327 | parser.add_argument("--dataset", type=str, default="GaTech-EIC/MG-Verilog") 328 | args = parser.parse_args() 329 | 330 | ds = load_dataset(args.dataset) 331 | #deduplicate based on the "code" key 332 | ds_test = ds["train"] 333 | ds_dedup, duplicate_clusters = deduplicate_dataset(ds_test, jaccard_threshold=0.85) 334 | # print(duplicate_clusters) 335 | print("Duplicate clusters: ", len(duplicate_clusters)) 336 | print("Original dataset size: ", len(ds_test)) 337 | print("Deduplicated dataset size: ", len(ds_dedup)) -------------------------------------------------------------------------------- /model_eval/gen.sh: -------------------------------------------------------------------------------- 1 | export CUDA_VISIBLE_DEVICES=0,1,2,3 2 | 3 | ckpt_dir=$1 4 | token=$2 5 | 6 | export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64' 7 | 8 | accelerate launch --multi_gpu generate.py \ 9 | --checkpoint_dir $ckpt_dir \ 10 | --model_type "qlora" \ 11 | --base_model "codellama/CodeLlama-7b-Instruct-hf" \ 12 | --tokenizer_type "code_llama" \ 13 | --cache_dir "./HF_cache/" \ 14 | --hf_token $token \ 15 | --max_new_tokens 1024 \ 16 | --temperature 0.7 \ 17 | --desc_file ../verilog_eval/descriptions/VerilogDescription_Machine.jsonl \ 18 | --desc_key "detail_description" \ 19 | --prompt_type "baseline" \ 20 | --eval_file ../verilog_eval/data/VerilogEval_Machine.jsonl \ 21 | --output_file ./data/gen.jsonl \ 22 | --fp16 \ 23 | --sample_k 20 \ 24 | --result_name Test \ 25 | --batch_size 1 26 | -------------------------------------------------------------------------------- /model_eval/generate.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import sys 4 | import time 5 | from pathlib import Path 6 | from typing import Tuple 7 | 8 | import fire 9 | import torch 10 | import numpy as 
np 11 | import math 12 | from datetime import timedelta 13 | 14 | 15 | from transformers import AutoTokenizer, CodeGenForCausalLM, GenerationConfig, AutoModelForCausalLM, DataCollatorForLanguageModeling 16 | import pandas as pd 17 | 18 | import jsonlines 19 | 20 | #only support https://github.com/NVlabs/verilog-eval/tree/release/1.0.0 21 | #pay attention to the "verilog-eval" or "verilog_eval" which is used in mg-verilog's own midified VerilogEval 22 | sys.path.append("../verilog-eval/verilog_eval") 23 | from evaluation import evaluate_functional_correctness 24 | from datetime import datetime 25 | from tqdm import tqdm 26 | 27 | from accelerate import PartialState, Accelerator, DeepSpeedPlugin 28 | from accelerate.utils import InitProcessGroupKwargs 29 | import torch.distributed as dist 30 | import torch.multiprocessing as mp 31 | 32 | from transformers import ( 33 | AutoTokenizer, 34 | AutoModelForCausalLM, 35 | set_seed, 36 | Seq2SeqTrainer, 37 | BitsAndBytesConfig, 38 | LlamaTokenizer 39 | 40 | ) 41 | from tokenizers import AddedToken 42 | 43 | import warnings 44 | from packaging import version 45 | from packaging.version import parse 46 | 47 | from transformers import LlamaForCausalLM, LlamaConfig 48 | from accelerate import init_empty_weights, load_checkpoint_and_dispatch 49 | 50 | 51 | # PROMPT = ( 52 | # "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions." 53 | # "\n\nImplement the Verilog module based on the following description. Assume that signals are positive clock/clk edge triggered unless otherwise stated.\n\n" 54 | # "\n\n{description}\n\nModule header:\n{module_header}\n" 55 | # ) 56 | 57 | 58 | llama2_prompt_with_memory =""" 59 | [INST] <> 60 | {system_message} 61 | <> 62 | 63 | {chat_history} {human_input} [/INST] 64 | """ 65 | 66 | llama2_prompt_without_memory =""" 67 | [INST] <> 68 | {system_message} 69 | <> 70 | 71 | {human_input} [/INST] 72 | """ 73 | 74 | llama2_prompt_without_memory_without_sys =""" 75 | [INST] {human_input} [/INST] 76 | """ 77 | 78 | llama2_pompt_with_memory_without_sys =""" 79 | [INST] {chat_history} {human_input} [/INST] 80 | """ 81 | 82 | llama2_memory_prompt ="""{human_input} [/INST] {model_reply}[INST]""" 83 | 84 | system_prompt = "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions." 85 | question_prompt = "Implement the Verilog module based on the following description. Assume that signals are positive clock/clk edge triggered unless otherwise stated." 86 | problem_description = "\n\n {description} \n\n Module header:\n\n {module_header}\n" 87 | PROMPT_BASELINE = llama2_prompt_without_memory.format(system_message = system_prompt, human_input = question_prompt + problem_description) 88 | 89 | system_prompt = "You only expand the high level summary of a Verilog code module design to block level summaries." 90 | question_prompt = "Come up with correct functional blocks for to implement the Verilog module in the high level summary, and expand the high level summary to block level summaries." 
91 | problem_description = "\n Here is the high level summary:\n\n {description} \n\n" 92 | problem_description += "\n Here is the Verilog module header:\n\n {module_header} \n\n" 93 | LLM1_PROMPT = llama2_prompt_without_memory.format(system_message = system_prompt, human_input = question_prompt + problem_description) 94 | 95 | 96 | system_prompt = "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions." 97 | question_prompt = "Implement the Verilog module based on the following block level summaries. Assume that signals are positive clock/clk edge triggered unless otherwise stated." 98 | problem_description = "\nHere are block level summaries:\n\n {description} \n\n Module header:\n\n {module_header} \n" 99 | LLM2_BLOCK_TO_CODE_PROMPT = llama2_prompt_without_memory.format(system_message = system_prompt, human_input = question_prompt + problem_description) 100 | 101 | system_prompt = "You only complete chats with syntax correct Verilog code. End the Verilog module code completion with 'endmodule'. Do not include module, input and output definitions." 102 | question_prompt = "Implement the Verilog module based on the following block level summaries. Assume that signals are positive clock/clk edge triggered unless otherwise stated." 103 | problem_description = "{description} \n\n Module header:\n\n {module_header} \n" 104 | LLM2_BLOCK_GLOBAL_TO_CODE_PROMPT = llama2_prompt_without_memory.format(system_message = system_prompt, human_input = question_prompt + problem_description) 105 | 106 | def mem(): 107 | # print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024)) 108 | # print("torch.cuda.memory_reserved: %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024)) 109 | # print("torch.cuda.max_memory_reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024)) 110 | return torch.cuda.memory_allocated(0)/1024/1024/1024 111 | 112 | def timestamp(): 113 | return f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}]" 114 | 115 | import importlib 116 | from peft import ( 117 | prepare_model_for_kbit_training, 118 | LoraConfig, 119 | get_peft_model, 120 | PeftModel 121 | ) 122 | from peft.tuners.lora import LoraLayer 123 | from os.path import exists, join, isdir 124 | 125 | 126 | def is_ipex_available(): 127 | def get_major_and_minor_from_version(full_version): 128 | return str(version.parse(full_version).major) + "." + str(version.parse(full_version).minor) 129 | 130 | _torch_version = importlib.metadata.version("torch") 131 | if importlib.util.find_spec("intel_extension_for_pytorch") is None: 132 | return False 133 | _ipex_version = "N/A" 134 | try: 135 | _ipex_version = importlib.metadata.version("intel_extension_for_pytorch") 136 | except importlib.metadata.PackageNotFoundError: 137 | return False 138 | torch_major_and_minor = get_major_and_minor_from_version(_torch_version) 139 | ipex_major_and_minor = get_major_and_minor_from_version(_ipex_version) 140 | if torch_major_and_minor != ipex_major_and_minor: 141 | warnings.warn( 142 | f"Intel Extension for PyTorch {ipex_major_and_minor} needs to work with PyTorch {ipex_major_and_minor}.*," 143 | f" but PyTorch {_torch_version} is found. Please switch to the matching version and run again." 144 | ) 145 | return False 146 | return True 147 | 148 | def smart_tokenizer_and_embedding_resize( 149 | special_tokens_dict, 150 | tokenizer, 151 | model, 152 | ): 153 | """Resize tokenizer and embedding. 
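    Adds any missing special tokens (e.g. a [PAD] token), resizes the model's token embeddings,
    and initializes the newly added embedding rows to the mean of the existing ones.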
154 | 155 | Note: This is the unoptimized version that may make your embedding size not be divisible by 64. 156 | """ 157 | if "codellama" in model.name_or_path: 158 | for key, value in special_tokens_dict.items(): 159 | special_tokens_dict[key] = AddedToken(value, lstrip=True, rstrip=True, normalized=False) 160 | 161 | num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict) 162 | model.resize_token_embeddings(len(tokenizer)) 163 | 164 | if num_new_tokens > 0: 165 | input_embeddings_data = model.get_input_embeddings().weight.data 166 | output_embeddings_data = model.get_output_embeddings().weight.data 167 | 168 | input_embeddings_avg = input_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True) 169 | output_embeddings_avg = output_embeddings_data[:-num_new_tokens].mean(dim=0, keepdim=True) 170 | 171 | input_embeddings_data[-num_new_tokens:] = input_embeddings_avg 172 | output_embeddings_data[-num_new_tokens:] = output_embeddings_avg 173 | 174 | def get_accelerate_model(model_name, tokenizer_type, checkpoint, bf16, fp16, cache_dir, hf_token=None): 175 | 176 | if torch.cuda.is_available(): 177 | n_gpus = torch.cuda.device_count() 178 | if is_ipex_available() and torch.xpu.is_available(): 179 | n_gpus = torch.xpu.device_count() 180 | 181 | max_memory = f'{24000}MB' 182 | max_memory = {i: max_memory for i in range(n_gpus)} 183 | device_map = "auto" 184 | 185 | # if we are in a distributed setting, we need to set the device map and max memory per device 186 | if os.environ.get('LOCAL_RANK') is not None: 187 | local_rank = int(os.environ.get('LOCAL_RANK', '0')) 188 | print("distributed", local_rank) 189 | device_map = {'': local_rank} 190 | max_memory = {'': max_memory[local_rank]} 191 | 192 | compute_dtype = (torch.float16 if fp16 else (torch.bfloat16 if bf16 else torch.float32)) 193 | 194 | ckpt_path = checkpoint 195 | 196 | config = LlamaConfig.from_pretrained(os.path.join(ckpt_path, 'config.json')) 197 | state_dict = torch.load(os.path.join(ckpt_path, 'pytorch_model.bin')) 198 | with init_empty_weights(): 199 | model = LlamaForCausalLM(config) 200 | model = load_checkpoint_and_dispatch( 201 | model, 202 | checkpoint=os.path.join(ckpt_path, 'pytorch_model.bin'), 203 | device_map=device_map, 204 | max_memory=max_memory, 205 | dtype=torch.float16 # Change dtype if needed 206 | ) 207 | model.eval() 208 | 209 | # model = AutoModelForCausalLM.from_pretrained( 210 | # model_name, 211 | # cache_dir=checkpoint, 212 | # load_in_4bit=False, 213 | # load_in_8bit=True, 214 | # device_map=device_map, 215 | # max_memory= max_memory, 216 | # quantization_config=BitsAndBytesConfig( 217 | # load_in_4bit=False, 218 | # load_in_8bit=True, 219 | # llm_int8_threshold=6.0, 220 | # llm_int8_has_fp16_weight=False, 221 | # bnb_4bit_compute_dtype=compute_dtype, 222 | # bnb_4bit_use_double_quant=True, 223 | # bnb_4bit_quant_type="nf4", 224 | # ), 225 | # torch_dtype=(torch.float32 if fp16 else (torch.bfloat16 if bf16 else torch.float32)), 226 | # trust_remote_code=False, 227 | # token=hf_token 228 | # ) 229 | 230 | setattr(model, 'model_parallel', True) 231 | setattr(model, 'is_parallelizable', True) 232 | 233 | model.config.torch_dtype=(torch.float32 if fp16 else (torch.bfloat16 if bf16 else torch.float32)) 234 | 235 | # Tokenizer 236 | print(model_name) 237 | tokenizer = AutoTokenizer.from_pretrained( 238 | model_name, 239 | cache_dir=cache_dir, 240 | padding_side="left", 241 | use_fast=False, # Fast tokenizer giving issues. 
242 | tokenizer_type=tokenizer_type, 243 | trust_remote_code=False, 244 | token=hf_token 245 | ) 246 | if tokenizer._pad_token is None: 247 | smart_tokenizer_and_embedding_resize( 248 | special_tokens_dict=dict(pad_token="[PAD]"), 249 | tokenizer=tokenizer, 250 | model=model, 251 | ) 252 | if "codellama" in model_name: 253 | tokenizer.add_special_tokens({ 254 | "eos_token": AddedToken(tokenizer.convert_ids_to_tokens(model.config.eos_token_id), lstrip=True, rstrip=True, normalized=False), 255 | "bos_token": AddedToken(tokenizer.convert_ids_to_tokens(model.config.bos_token_id), lstrip=True, rstrip=True, normalized=False), 256 | "unk_token": AddedToken(tokenizer.convert_ids_to_tokens( 257 | tokenizer.pad_token_id 258 | ), lstrip=False, rstrip=False, normalized=False), 259 | 260 | }) 261 | tokenizer.add_tokens( 262 | [AddedToken(tokenizer.eot_token, lstrip=True, rstrip=True, normalized=False), 263 | AddedToken(tokenizer.prefix_token, lstrip=True, rstrip=True, normalized=False), 264 | AddedToken(tokenizer.suffix_token, lstrip=True, rstrip=True, normalized=False), 265 | AddedToken(tokenizer.middle_token, lstrip=True, rstrip=True, normalized=False), 266 | AddedToken(tokenizer.pad_token, lstrip=True, rstrip=True, normalized=False), 267 | ], 268 | special_tokens=True 269 | ) 270 | elif 'llama' in model_name or isinstance(tokenizer, LlamaTokenizer): 271 | # LLaMA tokenizer may not have correct special tokens set. 272 | # Check and add them if missing to prevent them from being parsed into different tokens. 273 | # Note that these are present in the vocabulary. 274 | # Note also that `model.config.pad_token_id` is 0 which corresponds to `` token. 275 | print('Adding special tokens.') 276 | tokenizer.add_special_tokens({ 277 | "eos_token": tokenizer.convert_ids_to_tokens(model.config.eos_token_id), 278 | "bos_token": tokenizer.convert_ids_to_tokens(model.config.bos_token_id), 279 | "unk_token": tokenizer.convert_ids_to_tokens( 280 | tokenizer.pad_token_id 281 | ), 282 | }) 283 | 284 | # if not args.full_finetune: 285 | if True: 286 | # model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=args.gradient_checkpointing) 287 | model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True) 288 | 289 | # if not args.full_finetune: 290 | if True: 291 | if checkpoint is not None: 292 | pass 293 | # print("Loading adapters from checkpoint.") 294 | # model = PeftModel.from_pretrained(model, join(checkpoint, 'adapter_model'), is_trainable=True) 295 | else: 296 | raise Exception("Only loading checkpoint is allowed.") 297 | # print(f'adding LoRA modules...') 298 | # modules = find_all_linear_names(args, model) 299 | # config = LoraConfig( 300 | # r=args.lora_r, 301 | # lora_alpha=args.lora_alpha, 302 | # target_modules=modules, 303 | # lora_dropout=args.lora_dropout, 304 | # bias="none", 305 | # task_type="CAUSAL_LM", 306 | # ) 307 | # model = get_peft_model(model, config) 308 | 309 | for name, module in model.named_modules(): 310 | if isinstance(module, LoraLayer): 311 | if bf16: 312 | module = module.to(torch.bfloat16) 313 | if 'norm' in name: 314 | module = module.to(torch.float32) 315 | if 'lm_head' in name or 'embed_tokens' in name: 316 | if hasattr(module, 'weight'): 317 | if bf16 and module.weight.dtype == torch.float32: 318 | module = module.to(torch.bfloat16) 319 | return model, tokenizer 320 | 321 | def load( 322 | checkpoint_dir: str, 323 | tokenizer_type: str = "code_llama", 324 | model_type: str = "hf", 325 | base_model: str = None, 326 | bf16: bool = False, 327 | 
fp16: bool = False, 328 | cache_dir: str = "/home/user_name/HF_cache/", 329 | hf_token: str = "your_hf_token_if_you_want_to_use_it", 330 | dev: str=None 331 | ): 332 | start_time = time.time() 333 | 334 | if model_type == "hf": 335 | model = CodeGenForCausalLM.from_pretrained(checkpoint_dir, cache_dir="/home/user_name/HF_cache") 336 | 337 | if dev is not None: 338 | model = model.to(dev) 339 | 340 | tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir, use_fast=False, padding_side="left") 341 | tokenizer.pad_token_id = tokenizer.eos_token_id 342 | 343 | else: 344 | print("Loading base model", base_model, "Checkpoint", checkpoint_dir) 345 | model, tokenizer = get_accelerate_model(base_model, tokenizer_type, checkpoint_dir, bf16, fp16, cache_dir, hf_token=hf_token) 346 | 347 | 348 | print(f"Loaded in {time.time() - start_time:.2f} seconds") 349 | return model, tokenizer 350 | 351 | class VerilogDataset(torch.utils.data.Dataset): 352 | def __init__(self, descriptions, headers, tokenizer, device, prompt_type= "baseline", desc_key = "detail_description", k=20): 353 | self.device = device 354 | self.tokenizer = tokenizer 355 | self.prompt_type = prompt_type 356 | 357 | self.prompts = [] 358 | for desc in descriptions: 359 | for i in range(k): 360 | self.prompts.append({ 361 | "task_id": desc["task_id"], 362 | "description": desc[desc_key], 363 | "module_header": headers[desc["task_id"]] 364 | }) 365 | 366 | def __len__(self): 367 | return len(self.prompts) 368 | 369 | def __getitem__(self, index): 370 | if self.prompt_type == "baseline": 371 | PROMPT = PROMPT_BASELINE 372 | elif self.prompt_type == "llm1": 373 | PROMPT = LLM1_PROMPT 374 | elif self.prompt_type == "llm2_block_to_code": 375 | PROMPT = LLM2_BLOCK_TO_CODE_PROMPT 376 | elif self.prompt_type == "llm2_block_global_to_code": 377 | PROMPT = LLM2_BLOCK_GLOBAL_TO_CODE_PROMPT 378 | else: 379 | raise Exception("Invalid prompt type.") 380 | prompt = PROMPT.format_map(self.prompts[index]) 381 | return torch.tensor(self.tokenizer(prompt).input_ids).to(self.device) 382 | 383 | def main( 384 | checkpoint_dir: str, 385 | model_type: str = "hf", 386 | tokenizer_type="code_llama", 387 | base_model: str = None, 388 | bf16: bool = False, 389 | fp16: bool = False, 390 | cache_dir: str = "/home/user_name/HF_cache/", 391 | hf_token: str = "your_hf_token_if_you_want_to_use_it", 392 | max_new_tokens: int = 1024, 393 | temperature = None, 394 | top_p = None, 395 | top_k = None, 396 | repetition_penalty = None, 397 | sample_k: int = 1, 398 | desc_file: str = "../verilog_eval/descriptions/VerilogDescription_Machine.jsonl", 399 | # desc_file: str = "/home/user_name/DAC_2024/chatgpt4_auto_accel/fine_tune_dataset/auto_doc_part_dataset/hdlbits_for_llm2_eval.jsonl", 400 | desc_key: str = "detail_description", 401 | prompt_type: str = "baseline", 402 | eval_file: str = "../verilog_eval/data/VerilogEval_Machine.jsonl", 403 | output_file: str = "./data/gen.jsonl", 404 | result_name: str = None, 405 | batch_size: int = 1, 406 | skip_iverilog: bool = False 407 | ): 408 | 409 | with torch.no_grad(): 410 | # deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=2) 411 | # accel = Accelerator(deepspeed_plugin=deepspeed_plugin) 412 | kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=3600)) 413 | accel = Accelerator(kwargs_handlers=[kwargs]) 414 | accel.print("Starting at", timestamp()) 415 | 416 | accel.print("Loading model...") 417 | device = accel.device 418 | model, tokenizer = load(checkpoint_dir, tokenizer_type, model_type, 
base_model=base_model, bf16=bf16, fp16=fp16, cache_dir=cache_dir, hf_token=hf_token) 419 | model.eval() 420 | 421 | accel.print("Loading descriptions...") 422 | tasks = pd.read_json(path_or_buf=desc_file, lines=True) 423 | 424 | headers = {} 425 | with jsonlines.open(eval_file) as file: 426 | for obj in file: 427 | headers[obj["task_id"]] = obj["prompt"] 428 | 429 | 430 | # clear temp file 431 | with open(output_file, "w") as file: 432 | pass 433 | 434 | outputs = [] 435 | accel.print("Starting...") 436 | 437 | with accel.split_between_processes(tasks.to_dict(orient="records")) as prompts: 438 | # print("Distr", len(prompt), [r["task_id"] for r in prompt]) 439 | tasks = prompts 440 | 441 | # create new dataset for each process to help with prompt batch balancing 442 | task_ids = [t["task_id"] for t in tasks] 443 | dataset = VerilogDataset(tasks, headers, tokenizer, device, desc_key=desc_key, prompt_type=prompt_type, k=sample_k) 444 | collator = DataCollatorForLanguageModeling(tokenizer, mlm=False) 445 | dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, collate_fn=collator) 446 | 447 | accel.print("Starting generation...") 448 | 449 | for i, prompts in enumerate(tqdm(dataloader, total=len(dataloader), desc="Process " + str(accel.process_index))): 450 | prompts = prompts["input_ids"] 451 | output = model.generate( 452 | inputs=prompts, 453 | max_new_tokens=max_new_tokens, 454 | do_sample=True, 455 | temperature=temperature, 456 | top_p=top_p, 457 | top_k=top_k, 458 | repetition_penalty=repetition_penalty, 459 | pad_token_id=tokenizer.pad_token_id 460 | ) 461 | 462 | strings = tokenizer.batch_decode(output[:, prompts.shape[1]:], skip_special_tokens=True) 463 | full_responses = tokenizer.batch_decode(output, skip_special_tokens=True) 464 | prompt_strings = tokenizer.batch_decode(prompts[:, 30:], skip_special_tokens=True) 465 | 466 | for j, comp in enumerate(strings): 467 | tid = (i * batch_size + j) // sample_k 468 | obj = { 469 | "process": accel.process_index, 470 | "task_id": task_ids[tid], 471 | "completion": comp, 472 | "prompt": prompt_strings[j], 473 | "full_response": full_responses[j], 474 | } 475 | outputs.append(obj) 476 | 477 | with jsonlines.open(output_file, mode="a") as writer: 478 | print(timestamp(), "starting write", accel.process_index, "length", len(outputs)) 479 | writer.write_all(outputs) 480 | print(timestamp(), "done writing", accel.process_index) 481 | 482 | # evaluate 483 | accel.wait_for_everyone() 484 | 485 | if accel.is_main_process and not skip_iverilog: 486 | accel.print("running eval") 487 | res = evaluate_functional_correctness(output_file, problem_file=eval_file, k=[1,5,10]) 488 | accel.print("Eval Results:", res) 489 | with open("./data/results.txt", mode="a") as f: 490 | ts = f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}{' | ' + result_name if result_name is not None else ''}] " 491 | f.write(ts + str(res) + "temperature:{}".format(temperature) + "top_p:{}".format(top_p) + "top_k:{}".format(top_k) + "repetition_penalty:{}".format(repetition_penalty) + "\n") 492 | 493 | if __name__ == "__main__": 494 | fire.Fire(main) 495 | -------------------------------------------------------------------------------- /model_finetune/MG-Verilog/data_format_conversion.py: -------------------------------------------------------------------------------- 1 | from datasets import load_dataset 2 | import json 3 | 4 | # Load the dataset 5 | dataset = load_dataset('GaTech-EIC/MG-Verilog') 6 | 7 | # Function to rename keys and save to JSONL 8 | def 
save_to_jsonl(split, split_name):
9 |     jsonl_file_path = 'dataset.jsonl'
10 |     with open(jsonl_file_path, 'a') as jsonl_file:
11 |         for example in split:
12 |             # Flatten the description dict into a single string and rename the keys to input/output
13 |             current_sample_dict = example['description']
14 |             current_sample = ', '.join(f'{k}: {v}' for k, v in current_sample_dict.items())
15 |             new_example = {
16 |                 'input': current_sample,
17 |                 'output': example['code']
18 |             }
19 |             jsonl_file.write(json.dumps(new_example) + '\n')
20 |     print(f"Dataset split '{split_name}' saved to {jsonl_file_path}")
21 | 
22 | # Save each split
23 | for split_name in dataset:
24 |     save_to_jsonl(dataset[split_name], split_name)
25 | 
--------------------------------------------------------------------------------
/model_finetune/README.md:
--------------------------------------------------------------------------------
1 | # Fine-Tuning in NeMo
2 | 
3 | In our evaluation process for each submission, we will conduct fine-tuning with the following steps. To reproduce this process, we suggest using a machine with 4x A100 GPUs (or any NVIDIA GPUs with more than 24GB VRAM each). If you plan to use a number of GPUs other than 4, please adjust the `TP_SIZE` and `GPU_COUNT` parameters in the `run_peft.sh` script accordingly.
4 | 
5 | 
6 | ## NeMo docker setup
7 | 
8 | ```bash
9 | # In Host Machine
10 | docker pull nvcr.io/nvidia/nemo:24.05
11 | ```
12 | If you encounter a permission error, please use sudo to run the above command.
13 | 
14 | ## Run NeMo docker
15 | ```bash
16 | # In Host Machine
17 | docker run --gpus all --shm-size=2g --net=host --ulimit memlock=-1 --rm -it -v ${PWD}:/workspace -w /workspace -v ${PWD}/results:/results nvcr.io/nvidia/nemo:24.05 bash
18 | ```
19 | As with the previous step, if you encounter a permission error, please use sudo to run the above command.
20 | 
21 | ## Prepare dataset
22 | 
23 | Please prepare the dataset to be submitted in .jsonl format. Here we provide an example of converting the [MG-Verilog dataset](https://huggingface.co/datasets/GaTech-EIC/MG-Verilog) on Hugging Face into .jsonl format and using it as the fine-tuning dataset.
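
For reference, each line of the resulting `dataset.jsonl` is a single JSON object with an `input` field (the description levels joined into one string by `data_format_conversion.py`) and an `output` field (the Verilog code). An illustrative record (values shortened; the names inside `input` depend on the description levels in your dataset):

```json
{"input": "summary: An 8-bit adder..., detailed explanation: The module takes two 8-bit operands...", "output": "module adder(input [7:0] a, b, output [7:0] sum);\n  assign sum = a + b;\nendmodule"}
```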
24 | 
25 | ```bash
26 | # In Host Machine
27 | cd MG-Verilog
28 | python data_format_conversion.py
29 | ```
30 | 
31 | ## Download CodeLlama-7b weights
32 | ```bash
33 | # In Docker Container
34 | huggingface-cli login
35 | huggingface-cli download codellama/CodeLlama-7b-Instruct-hf --local-dir CodeLlama-7b
36 | ```
37 | 
38 | ## Convert the Hugging Face model to NeMo format
39 | ```bash
40 | # In Docker Container
41 | python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path=./CodeLlama-7b/ --output_path=CodeLlama-7b.nemo
42 | ```
43 | 
44 | ## Run LoRA fine-tuning
45 | ```bash
46 | # In Docker Container
47 | mkdir /workspace/results/MG-Verilog
48 | bash run_peft.sh <MAX_STEPS>  # the script requires the number of training steps; see the note after run_peft.sh below
49 | ```
50 | 
51 | ## Convert the NeMo model to Hugging Face format
52 | ```bash
53 | # In Docker Container
54 | cd /opt/NeMo/
55 | python scripts/nlp_language_modeling/merge_lora_weights/merge.py \
56 |     trainer.accelerator=cpu \
57 |     gpt_model_file=/workspace/CodeLlama-7b.nemo \
58 |     lora_model_path=/workspace/results/MG-Verilog/checkpoints/megatron_gpt_peft_lora_tuning.nemo \
59 |     merged_model_path=/workspace/results/MG-Verilog/checkpoints/CodeLlama-7b-lora-merged.nemo
60 | 
61 | python /opt/NeMo/scripts/checkpoint_converters/convert_llama_nemo_to_hf.py \
62 |     --input_name_or_path /workspace/results/MG-Verilog/checkpoints/CodeLlama-7b-lora-merged.nemo \
63 |     --output_path /workspace/results/MG-Verilog/checkpoints/CodeLlama-7b-lora-merged/pytorch_model.bin \
64 |     --cpu-only
65 | ```
66 | 
--------------------------------------------------------------------------------
/model_finetune/run_peft.sh:
--------------------------------------------------------------------------------
1 | if [ -z "$1" ]
2 | then
3 |     echo "No argument supplied for max_steps. Please provide an integer value."
4 |     exit 1
5 | fi
6 | MAX_STEPS=$1
7 | 
8 | # This is the nemo model we are finetuning
9 | MODEL="./CodeLlama-7b.nemo"
10 | 
11 | # These are the training datasets (in our case we only have one)
12 | TRAIN_DS="[MG-Verilog/dataset.jsonl]"
13 | 
14 | # These are the validation datasets (in our case we only have one)
15 | VALID_DS="[MG-Verilog/dataset.jsonl]"
16 | 
17 | # These are the test datasets (in our case we only have one)
18 | TEST_DS="[MG-Verilog/dataset.jsonl]"
19 | 
20 | # These are the names of the test datasets
21 | TEST_NAMES="[MG-Verilog]"
22 | 
23 | # This is the PEFT scheme that we will be using. Set to "ptuning" for P-Tuning instead of LoRA
24 | PEFT_SCHEME="lora"
25 | 
26 | # This is the concat sampling probability. This depends on the number of files being passed in the train set
27 | # and the sampling probability for each file. In our case, we have one training file.
28 | CONCAT_SAMPLING_PROBS="[1.0]"
29 | 
30 | 
31 | # This is the tensor parallel size (splitting tensors among GPUs horizontally)
32 | # See the NeMo documentation for the proper value for the given model size
33 | TP_SIZE=4
34 | 
35 | # This is the pipeline parallel size (splitting layers among GPUs vertically)
36 | # See the NeMo documentation for the proper value for the given model size
37 | PP_SIZE=1
38 | 
39 | # The number of nodes to run this on
40 | # See the NeMo documentation for the proper value for the given model size
41 | NODE_COUNT=1
42 | 
43 | # The number of total GPUs used
44 | GPU_COUNT=4
45 | 
46 | # Where to store the finetuned model and training artifacts
47 | OUTPUT_DIR="./results/MG-Verilog"
48 | 
49 | 
50 | # Run the PEFT command by appropriately setting the values for the parameters such as the number of steps,
51 | # model checkpoint path, batch sizes etc. For a full reference of parameter
52 | # settings refer to the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml
53 | # the value of trainer.max_steps should be scaled according to the size of the training data to ensure that the total number of epochs equals 3
54 | python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
55 |     trainer.log_every_n_steps=1 \
56 |     trainer.precision=bf16 \
57 |     trainer.devices=${GPU_COUNT} \
58 |     trainer.num_nodes=1 \
59 |     trainer.val_check_interval=20 \
60 |     trainer.max_steps=${MAX_STEPS} \
61 |     model.restore_from_path=${MODEL} \
62 |     model.peft.peft_scheme=${PEFT_SCHEME} \
63 |     model.micro_batch_size=1 \
64 |     model.global_batch_size=128 \
65 |     model.tensor_model_parallel_size=${TP_SIZE} \
66 |     model.pipeline_model_parallel_size=${PP_SIZE} \
67 |     model.megatron_amp_O2=True \
68 |     model.activations_checkpoint_granularity=selective \
69 |     model.activations_checkpoint_num_layers=null \
70 |     model.activations_checkpoint_method=uniform \
71 |     model.optim.name=fused_adam \
72 |     model.optim.lr=1e-4 \
73 |     model.answer_only_loss=True \
74 |     model.data.train_ds.file_names=${TRAIN_DS} \
75 |     model.data.validation_ds.file_names=${VALID_DS} \
76 |     model.data.test_ds.file_names=${TEST_DS} \
77 |     model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
78 |     model.data.train_ds.max_seq_length=2048 \
79 |     model.data.validation_ds.max_seq_length=2048 \
80 |     model.data.train_ds.micro_batch_size=1 \
81 |     model.data.train_ds.global_batch_size=128 \
82 |     model.data.validation_ds.micro_batch_size=1 \
83 |     model.data.validation_ds.global_batch_size=128 \
84 |     model.data.train_ds.num_workers=0 \
85 |     model.data.validation_ds.num_workers=0 \
86 |     model.data.test_ds.num_workers=0 \
87 |     model.data.validation_ds.metric.name=loss \
88 |     model.data.test_ds.metric.name=loss \
89 |     exp_manager.create_wandb_logger=False \
90 |     exp_manager.checkpoint_callback_params.mode=min \
91 |     exp_manager.explicit_log_dir=${OUTPUT_DIR} \
92 |     exp_manager.resume_if_exists=True \
93 |     exp_manager.resume_ignore_no_checkpoint=True \
94 |     exp_manager.create_checkpoint_callback=True \
95 |     exp_manager.checkpoint_callback_params.monitor=validation_loss \
96 |     ++exp_manager.checkpoint_callback_params.save_best_model=False \
97 |     exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
98 |     model.save_nemo_on_validation_end=False
99 | 
--------------------------------------------------------------------------------
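
The comment above notes that `trainer.max_steps` should correspond to roughly three epochs over the training data. A minimal sketch of one way to pick the value passed to `run_peft.sh` (illustrative only; it assumes the `MG-Verilog/dataset.jsonl` file produced earlier and the `global_batch_size=128` set in the script):

```bash
# Choose MAX_STEPS so that MAX_STEPS * global_batch_size ~= 3 * number of training samples.
NUM_SAMPLES=$(wc -l < MG-Verilog/dataset.jsonl)
GLOBAL_BATCH_SIZE=128
MAX_STEPS=$(( (3 * NUM_SAMPLES + GLOBAL_BATCH_SIZE - 1) / GLOBAL_BATCH_SIZE ))
bash run_peft.sh "${MAX_STEPS}"
```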