├── .gitignore
├── LICENSE
├── README.md
├── configs
│   ├── dataset_gcc-7.3.0_arm_32_O1_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   ├── dataset_gcc-7.3.0_arm_32_O2_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   ├── dataset_gcc-7.3.0_arm_32_O3_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   ├── dataset_gcc-7.3.0_x86_32_O1_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   ├── dataset_gcc-7.3.0_x86_32_O2_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   ├── dataset_gcc-7.3.0_x86_32_O3_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   ├── dataset_gcc-7.3.0_x86_64_O1_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   ├── dataset_gcc-7.3.0_x86_64_O2_strip
│   │   ├── O1_test10_cszx.yaml
│   │   ├── O1_test3_cszx.yaml
│   │   ├── O1_test4_cszx.yaml
│   │   └── O1_test8_cszx.yaml
│   └── dataset_gcc-7.3.0_x86_64_O3_strip
│       ├── O1_test10_cszx.yaml
│       ├── O1_test3_cszx.yaml
│       ├── O1_test4_cszx.yaml
│       └── O1_test8_cszx.yaml
├── datas
│   └── README.md
├── figs
│   ├── architecture.png
│   └── datas_google_drive.png
├── models
│   └── README.md
└── src
    ├── __init__.py
    ├── __main__.py
    ├── __pycache__
    │   ├── __init__.cpython-39.pyc
    │   ├── __main__.cpython-39.pyc
    │   ├── build_database.cpython-39.pyc
    │   ├── data.cpython-39.pyc
    │   ├── model.cpython-39.pyc
    │   ├── retrieval_validate.cpython-39.pyc
    │   ├── test.cpython-39.pyc
    │   ├── train.cpython-39.pyc
    │   ├── unittest.cpython-39.pyc
    │   └── validate.cpython-39.pyc
    ├── data.py
    ├── data
    │   └── paraphrase-en.gz
    ├── meteor-1.5.jar
    ├── model.py
    ├── test.py
    ├── train.py
    ├── unittest.py
    └── validate.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.DS_Store

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2023 tongye98

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Binary Code Summarization
Official implementation of the EMNLP 2023 main conference paper: [CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code](https://aclanthology.org/2023.emnlp-main.911.pdf).


## Abstract
Automatically generating function summaries for binaries is an extremely valuable but challenging task, since it involves translating the execution behavior and semantics of the low-level language (assembly code) into human-readable natural language. However, most current works on understanding assembly code are oriented towards generating function names, which involve numerous abbreviations that make them still confusing. To bridge this gap, we focus on generating complete summaries for binary functions, especially for stripped binary (no symbol table and debug information in reality). To fully exploit the semantics of assembly code, we present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS. CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics. We evaluate CP-BCS on 3 different binary optimization levels (O1, O2, and O3) for 3 different computer architectures (X86, X64, and ARM). The evaluation results demonstrate CP-BCS is superior and significantly improves the efficiency of reverse engineering.


## Binary Projects

The 51 binary projects, and their corresponding versions, used to construct the dataset.

| Binary Projects | Version | Binary Projects | Version |
| --- | --- | --- | --- |
| a2ps | 4.14 | binutils | 2.30 |
| bool | 0.2.2 | ccd2cue | 0.5 |
| cflow | 1.5 | coreutils | 8.29 |
| cpio | 2.12 | cppi | 1.18 |
| dap | 3.10 | datamash | 1.3 |
| direvent | 5.1 | enscript | 1.6.6 |
| findutils | 4.6.0 | gawk | 4.2.1 |
| gcal | 4.1 | gdbm | 1.15 |
| glpk | 4.65 | gmp | 6.1.2 |
| gnudos | 1.11.4 | grep | 3.1 |
| gsasl | 1.8.0 | gsl | 2.5 |
| gss | 1.0.3 | gzip | 1.9 |
| hello | 2.10 | inetutils | 1.9.4 |
| libiconv | 1.15 | libidn2 | 2.0.5 |
| libmicrohttpd | 0.9.59 | libosip2 | 5.0.0 |
| libtasn1 | 4.13 | libtool | 2.4.6 |
| libunistring | 0.9.10 | lightning | 2.1.2 |
| macchanger | 1.6.0 | nettle | 3.4 |
| patch | 2.7.6 | plotutils | 2.6 |
| readline | 7.0 | recutils | 1.7 |
| sed | 4.5 | sharutils | 4.15.2 |
| spell | 1.1 | tar | 1.30 |
| texinfo | 6.5 | time | 1.9 |
| units | 2.16 | vmlinux | 4.1.52 |
| wdiff | 1.2.2 | which | 2.21 |
| xorriso | 1.4.8 | - | - |

## Dataset
The full dataset spans three computer architectures (X86, X64, and ARM) and three optimization levels (O1, O2, and O3), for a total of nine sub-datasets.

For the dataset itself, refer to [datas](datas/README.md).

Each item (function) has the following attributes:
```
function_name: The name of the function in the source code or the non-stripped binary.
function_name_in_strip: The name of the function in the stripped binary.
comment: A natural language summary of the function, collected from the corresponding source code.
function_body: The entire function body, presented in the form of assembly code.
pseudo_code: Pseudo code for the entire function in the stripped binary.
cfg: The control flow graph of the function (BI-CFG).
node: Each assembly instruction is a node.
edge: The pair formed between adjacent nodes.
edge_index: The index of each edge.
pseudo_code_non_strip: Pseudo code for the entire function in the corresponding non-stripped binary.
pseudo_code_refined: The pseudo code refined using CodeT5.
```
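To make the record layout concrete, here is a minimal Python sketch of reading one split and printing a few fields. It is illustrative only: the field names come from the attribute list above and the `train_refined.json` path from the shipped configs, but the assumption that each split is a single JSON array of records is ours, not something this README states.

```
import json

# Minimal sketch (assumption: each *_refined.json split is one JSON array
# of per-function records carrying the attribute names listed above).
with open("datas/dataset_gcc-7.3.0_arm_32_O1_strip/train_refined.json") as f:
    functions = json.load(f)

item = functions[0]
print(item["function_name_in_strip"])  # name as it appears in the stripped binary
print(item["comment"])                 # reference natural language summary
print(item["pseudo_code"][:120])       # first 120 characters of the pseudo code (assumed string)
```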
### Process Script

The preprocessing scripts used to construct the dataset are available at this [link](https://drive.google.com/file/d/1H85d_72MjmAsyfxcki8aAImNJ1OtsP7M/view?usp=share_link).


## Environment
The code is written and tested with the following packages:

- transformers
- torch
- torch-geometric

## Instructions
All training and model parameters are set in a yaml file under [configs](configs/); choose (or adapt) one, then run the training or testing command. A sketch of inspecting such a config follows the commands below.

1. train
```
python -m src train
```
2. test
```
python -m src test
```
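Before launching a run, it can help to sanity-check which mode and data paths a config selects. A hedged sketch using PyYAML: the key names below are taken from the shipped configs, but this is not the repo's own loading code, and how `src/` actually consumes these keys is defined by its internal loader.

```
import yaml  # pip install pyyaml

# Illustrative sketch only: these keys exist in the shipped yaml files,
# but the repo's own config parsing may differ.
with open("configs/dataset_gcc-7.3.0_arm_32_O1_strip/O1_test10_cszx.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["model"]["mode"])            # e.g. assembly_cfg_pseudo_comment
print(cfg["data"]["train_data_path"])  # which split the run trains on
print(cfg["training"]["model_dir"])    # where checkpoints are written
```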
## Citation
```
@inproceedings{ye-etal-2023-cp,
    title = "{CP}-{BCS}: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code",
    author = "Ye, Tong and
      Wu, Lingfei and
      Ma, Tengfei and
      Zhang, Xuhong and
      Du, Yangkai and
      Liu, Peiyu and
      Ji, Shouling and
      Wang, Wenhai",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.911",
    doi = "10.18653/v1/2023.emnlp-main.911",
    pages = "14740--14752",
    abstract = "Automatically generating function summaries for binaries is an extremely valuable but challenging task, since it involves translating the execution behavior and semantics of the low-level language (assembly code) into human-readable natural language. However, most current works on understanding assembly code are oriented towards generating function names, which involve numerous abbreviations that make them still confusing. To bridge this gap, we focus on generating complete summaries for binary functions, especially for stripped binary (no symbol table and debug information in reality). To fully exploit the semantics of assembly code, we present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS. CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics. We evaluate CP-BCS on 3 different binary optimization levels (O1, O2, and O3) for 3 different computer architectures (X86, X64, and ARM). The evaluation results demonstrate CP-BCS is superior and significantly improves the efficiency of reverse engineering.",
}
```

--------------------------------------------------------------------------------
/configs/dataset_gcc-7.3.0_arm_32_O1_strip/O1_test10_cszx.yaml:
--------------------------------------------------------------------------------
1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O1_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else
must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O1_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O1_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: 
False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O1_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O1_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: 
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O1_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O1_strip/test8_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | 
max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O2_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O2_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/dataset_gcc-7.3.0_x86_64_O1/test4_cszx/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: 
True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O2_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O2_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 
| testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O2_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O2_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: 
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O2_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O2_strip/test8_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/dataset_gcc-7.3.0_x86_64_O1/test4_cszx/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 
| batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 
152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O3_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O3_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 
| gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O3_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O3_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O3_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O3_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/dataset_gcc-7.3.0_x86_64_O1/test4_cszx/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 
| gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_arm_32_O3_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_arm_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "arm_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_arm_32_O3_strip/test8_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 
152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O1_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O1_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 
| gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O1_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O1_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O1_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | 17 | assembly_token: 18 | vocab_min_freq: 1 19 | vocab_max_size: 50000 20 | token_max_len: 400 21 | 22 | comment: 23 | vocab_min_freq: 1 24 | vocab_max_size: 50000 25 | token_max_len: 40 26 | 27 | cfg_node: 28 | vocab_min_freq: 1 29 | vocab_max_size: 50000 30 | 31 | pseudo_token: 32 | vocab_min_freq: 1 33 | vocab_max_size: 50000 34 | token_max_len: 400 35 | 36 | training: 37 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O1_strip/test4_cszx" 38 | overwrite: False 39 | load_model: False 40 | random_seed: 980820 41 | 42 | logging_frequence: 100 43 | validation_frequence: 1 # after how many epochs 44 | store_valid_output: False 45 | log_valid_samples: [0,1,2,3,4] 46 | 47 | use_cuda: True 48 | num_workers: 4 49 | 50 | epochs: 100 51 | shuffle: True 52 | max_updates: 1000000000 53 | batch_size: 32 54 | 55 | learning_rate: 0.0001 56 | learning_rate_min: 1.0e-18 57 | # clip_grad_val: 1 58 | clip_grad_norm: 5.0 59 | optimizer: "adam" 60 | weight_decay: 0 61 | adam_betas: [0.9, 0.999] 62 | eps: 1.e-8 63 | early_stop_metric: "bleu" 64 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 65 | mode: "max" 66 | factor: 0.8 67 | patience: 2 68 | step_size: 1 69 | gamma: 0.1 70 | num_ckpts_keep: 3 71 | 72 | # load_model: "models/best.ckpt" 73 | reset_best_ckpt: False 74 | reset_scheduler: False 75 | reset_optimzer: False 76 | reset_iteration_state: False 77 | 78 | testing: 79 | batch_size: 64 80 | batch_type: "sentence" 81 | max_output_length: 40 82 | min_outptu_length: 1 83 | eval_metrics: ["bleu", "rouge-l"] 84 | n_best: 1 85 | beam_size: 4 86 | beam_alpha: -1 87 | return_attention: False 88 | return_probability: False 89 | generate_unk: False 90 | repetition_penalty: -1 91 | 92 | model: 93 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 94 | initializer: "xavier_uniform" 95 | embed_initializer: "xavier_uniform" 96 | tied_softmax: False 97 | tied_embeddings: False 98 | 99 | embeddings: 100 | embedding_dim: 512 101 | scale: False 102 | freeze: False 103 | 104 | transformer_encoder: 105 | model_dim: 512 106 | ff_dim: 2048 107 | num_layers: 6 108 | head_count: 8 109 | dropout: 0.2 110 | emb_dropout: 0.2 111 | layer_norm_position: "pre" 112 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 113 | max_src_len: 0 # for learnable, keep same with data segment 114 | freeze: False 115 | max_relative_position: 32 # only for relative position, else must be set to 0 116 | use_negative_distance: True # for relative position 117 | 118 | gnn_encoder: 119 | gnn_type: 
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 120 | aggr: "mean" # ["mean", "max", "lstm"] 121 | model_dim: 512 122 | num_layers: 2 123 | emb_dropout: 0.2 124 | residual: True 125 | 126 | pseudo_encoder: 127 | model_dim: 512 128 | ff_dim: 2048 129 | num_layers: 6 130 | head_count: 8 131 | dropout: 0.2 132 | emb_dropout: 0.2 133 | layer_norm_position: "pre" 134 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 135 | max_src_len: 0 # for learnable, keep same with data segment 136 | freeze: False 137 | max_relative_position: 32 # only for relative position, else must be set to 0 138 | use_negative_distance: True # for relative position 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O1_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O1_strip/test8_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | 
max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 
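The `training:` keys in these files (`optimizer: "adam"`, `adam_betas`, `eps`, `weight_decay`, `clip_grad_norm`, `scheduling: "ReduceLROnPlateau"`, `mode: "max"`, `factor`, `patience`, `learning_rate_min`) map naturally onto stock PyTorch objects. A sketch under that assumption; `model` below is a stand-in module, not the repository's:

```
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the real model
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.8, patience=2, min_lr=1e-18)

# per update: clip gradients to norm 5.0 (clip_grad_norm), then step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()

# per validation: step the scheduler on the early-stop metric (BLEU, maximized);
# after `patience` validations without improvement the LR is multiplied by 0.8
bleu = 0.0  # placeholder validation score
scheduler.step(bleu)
```

Under this mapping, `step_size` and `gamma` would only matter for the `"StepLR"` / `"ExponentialLR"` choices listed in the `scheduling` comment.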
152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O2_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O2_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 
| gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O2_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O2_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O2_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O2_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: 
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O2_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-bas" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O2_strip/test8_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | 
max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 
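The `gnn_encoder:` block (`gnn_type`, `aggr`, `num_layers: 2`, `residual: True`) reads like a PyTorch Geometric stack over the instruction-level CFG. A sketch under that assumption; the class below is illustrative, not the repository's implementation. Note that `aggr` is only wired to `SAGEConv` here, since GAT aggregates neighbors by attention:

```
import torch
from torch_geometric.nn import GATConv, GCNConv, SAGEConv

class GNNEncoder(torch.nn.Module):
    def __init__(self, gnn_type="GATConv", model_dim=512, num_layers=2,
                 emb_dropout=0.2, residual=True, aggr="mean"):
        super().__init__()
        def make_layer():
            if gnn_type == "GATConv":
                return GATConv(model_dim, model_dim)
            if gnn_type == "GCNConv":
                return GCNConv(model_dim, model_dim)
            return SAGEConv(model_dim, model_dim, aggr=aggr)  # "mean"/"max"/"lstm"
        self.layers = torch.nn.ModuleList(make_layer() for _ in range(num_layers))
        self.dropout = torch.nn.Dropout(emb_dropout)
        self.residual = residual

    def forward(self, x, edge_index):
        x = self.dropout(x)
        for conv in self.layers:
            h = torch.relu(conv(x, edge_index))
            x = x + h if self.residual else h  # residual keeps node identity
        return x

# toy CFG: 3 nodes with 512-dim features, edges 0->1 and 1->2
x = torch.randn(3, 512)
edge_index = torch.tensor([[0, 1], [1, 2]])  # row 0: sources, row 1: targets
print(GNNEncoder()(x, edge_index).shape)     # torch.Size([3, 512])
```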
152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O3_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | 17 | assembly_token: 18 | vocab_min_freq: 1 19 | vocab_max_size: 50000 20 | token_max_len: 400 21 | 22 | comment: 23 | vocab_min_freq: 1 24 | vocab_max_size: 50000 25 | token_max_len: 40 26 | 27 | cfg_node: 28 | vocab_min_freq: 1 29 | vocab_max_size: 50000 30 | 31 | pseudo_token: 32 | vocab_min_freq: 1 33 | vocab_max_size: 50000 34 | token_max_len: 400 35 | 36 | training: 37 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O3_strip/test10_cszx" 38 | overwrite: False 39 | load_model: False 40 | random_seed: 980820 41 | 42 | logging_frequence: 100 43 | validation_frequence: 1 # after how many epochs 44 | store_valid_output: False 45 | log_valid_samples: [0,1,2,3,4] 46 | 47 | use_cuda: True 48 | num_workers: 4 49 | 50 | epochs: 100 51 | shuffle: True 52 | max_updates: 1000000000 53 | batch_size: 32 54 | 55 | learning_rate: 0.0001 56 | learning_rate_min: 1.0e-18 57 | # clip_grad_val: 1 58 | clip_grad_norm: 5.0 59 | optimizer: "adam" 60 | weight_decay: 0 61 | adam_betas: [0.9, 0.999] 62 | eps: 1.e-8 63 | early_stop_metric: "bleu" 64 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 65 | mode: "max" 66 | factor: 0.8 67 | patience: 2 68 | step_size: 1 69 | gamma: 0.1 70 | num_ckpts_keep: 3 71 | 72 | # load_model: "models/best.ckpt" 73 | reset_best_ckpt: False 74 | reset_scheduler: False 75 | reset_optimzer: False 76 | reset_iteration_state: False 77 | 78 | testing: 79 | batch_size: 64 80 | batch_type: "sentence" 81 | max_output_length: 40 82 | min_outptu_length: 1 83 | eval_metrics: ["bleu", "rouge-l"] 84 | n_best: 1 85 | beam_size: 4 86 | beam_alpha: -1 87 | return_attention: False 88 | return_probability: False 89 | generate_unk: False 90 | repetition_penalty: -1 91 | 92 | model: 93 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 94 | initializer: "xavier_uniform" 95 | embed_initializer: "xavier_uniform" 96 | tied_softmax: False 97 | tied_embeddings: False 98 | 99 | embeddings: 100 | embedding_dim: 512 101 | scale: False 102 | freeze: False 103 | 104 | transformer_encoder: 105 | model_dim: 512 106 | ff_dim: 2048 107 | num_layers: 6 108 | head_count: 8 109 | dropout: 0.2 110 | emb_dropout: 0.2 111 | layer_norm_position: "pre" 112 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 113 | max_src_len: 0 # for learnable, keep same with data segment 114 | freeze: False 115 | max_relative_position: 32 # only for relative position, else must be set to 0 116 | use_negative_distance: True # for relative position 117 
| 118 | gnn_encoder: 119 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 120 | aggr: "mean" # ["mean", "max", "lstm"] 121 | model_dim: 512 122 | num_layers: 2 123 | emb_dropout: 0.2 124 | residual: True 125 | 126 | pseudo_encoder: 127 | model_dim: 512 128 | ff_dim: 2048 129 | num_layers: 6 130 | head_count: 8 131 | dropout: 0.2 132 | emb_dropout: 0.2 133 | layer_norm_position: "pre" 134 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 135 | max_src_len: 0 # for learnable, keep same with data segment 136 | freeze: False 137 | max_relative_position: 32 # only for relative position, else must be set to 0 138 | use_negative_distance: True # for relative position 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O3_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O3_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O3_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O3_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: 
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_32_O3_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_32_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_32" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | 17 | assembly_token: 18 | vocab_min_freq: 1 19 | vocab_max_size: 50000 20 | token_max_len: 400 21 | 22 | comment: 23 | vocab_min_freq: 1 24 | vocab_max_size: 50000 25 | token_max_len: 40 26 | 27 | cfg_node: 28 | vocab_min_freq: 1 29 | vocab_max_size: 50000 30 | 31 | pseudo_token: 32 | vocab_min_freq: 1 33 | vocab_max_size: 50000 34 | token_max_len: 400 35 | 36 | training: 37 | model_dir: "models/dataset_gcc-7.3.0_x86_32_O3_strip/test8_cszx" 38 | overwrite: False 39 | load_model: False 40 | random_seed: 980820 41 | 42 | logging_frequence: 100 43 | validation_frequence: 1 # after how many epochs 44 | store_valid_output: False 45 | log_valid_samples: [0,1,2,3,4] 46 | 47 | use_cuda: True 48 | num_workers: 4 49 | 50 | epochs: 100 51 | shuffle: True 52 | max_updates: 1000000000 53 | batch_size: 32 54 | 55 | learning_rate: 0.0001 56 | learning_rate_min: 1.0e-18 57 | # clip_grad_val: 1 58 | clip_grad_norm: 5.0 59 | optimizer: "adam" 60 | weight_decay: 0 61 | adam_betas: [0.9, 0.999] 62 | eps: 1.e-8 63 | early_stop_metric: "bleu" 64 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 65 | mode: "max" 66 | factor: 0.8 67 | patience: 2 68 | step_size: 1 69 | gamma: 0.1 70 | num_ckpts_keep: 3 71 | 72 | # load_model: "models/best.ckpt" 73 | reset_best_ckpt: False 74 | reset_scheduler: False 75 | reset_optimzer: False 76 | reset_iteration_state: False 77 | 78 | testing: 79 | batch_size: 64 80 | batch_type: "sentence" 81 | 
max_output_length: 40 82 | min_outptu_length: 1 83 | eval_metrics: ["bleu", "rouge-l"] 84 | n_best: 1 85 | beam_size: 4 86 | beam_alpha: -1 87 | return_attention: False 88 | return_probability: False 89 | generate_unk: False 90 | repetition_penalty: -1 91 | 92 | model: 93 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 94 | initializer: "xavier_uniform" 95 | embed_initializer: "xavier_uniform" 96 | tied_softmax: False 97 | tied_embeddings: False 98 | 99 | embeddings: 100 | embedding_dim: 512 101 | scale: False 102 | freeze: False 103 | 104 | transformer_encoder: 105 | model_dim: 512 106 | ff_dim: 2048 107 | num_layers: 6 108 | head_count: 8 109 | dropout: 0.2 110 | emb_dropout: 0.2 111 | layer_norm_position: "pre" 112 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 113 | max_src_len: 0 # for learnable, keep same with data segment 114 | freeze: False 115 | max_relative_position: 32 # only for relative position, else must be set to 0 116 | use_negative_distance: True # for relative position 117 | 118 | gnn_encoder: 119 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 120 | aggr: "mean" # ["mean", "max", "lstm"] 121 | model_dim: 512 122 | num_layers: 2 123 | emb_dropout: 0.2 124 | residual: True 125 | 126 | pseudo_encoder: 127 | model_dim: 512 128 | ff_dim: 2048 129 | num_layers: 6 130 | head_count: 8 131 | dropout: 0.2 132 | emb_dropout: 0.2 133 | layer_norm_position: "pre" 134 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 135 | max_src_len: 0 # for learnable, keep same with data segment 136 | freeze: False 137 | max_relative_position: 32 # only for relative position, else must be set to 0 138 | use_negative_distance: True # for relative position 139 | 140 | 141 | transformer_decoder: 142 | model_dim: 512 143 | ff_dim: 2048 144 | num_layers: 6 145 | head_count: 8 146 | dropout: 0.2 147 | emb_dropout: 0.2 148 | layer_norm_position: "pre" 149 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 150 | max_trg_len: 40 # for learnable, keep same with data segment. 151 | freeze: False 152 | max_relative_position: 0 # only for relative position, else must be set to 0. 
153 | use_negative_distance: False # for relative position 154 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O1_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O1_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 1 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 
| gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O1_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O1_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O1_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O1_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type:
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O1_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O1_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | 17 | assembly_token: 18 | vocab_min_freq: 1 19 | vocab_max_size: 50000 20 | token_max_len: 400 21 | 22 | comment: 23 | vocab_min_freq: 1 24 | vocab_max_size: 50000 25 | token_max_len: 40 26 | 27 | cfg_node: 28 | vocab_min_freq: 1 29 | vocab_max_size: 50000 30 | 31 | pseudo_token: 32 | vocab_min_freq: 1 33 | vocab_max_size: 50000 34 | token_max_len: 400 35 | 36 | training: 37 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O1_strip/test8_cszx" 38 | overwrite: False 39 | load_model: False 40 | random_seed: 980820 41 | 42 | logging_frequence: 100 43 | validation_frequence: 1 # after how many epochs 44 | store_valid_output: False 45 | log_valid_samples: [0,1,2,3,4] 46 | 47 | use_cuda: True 48 | num_workers: 4 49 | 50 | epochs: 100 51 | shuffle: True 52 | max_updates: 1000000000 53 | batch_size: 32 54 | 55 | learning_rate: 0.0001 56 | learning_rate_min: 1.0e-18 57 | # clip_grad_val: 1 58 | clip_grad_norm: 5.0 59 | optimizer: "adam" 60 | weight_decay: 0 61 | adam_betas: [0.9, 0.999] 62 | eps: 1.e-8 63 | early_stop_metric: "bleu" 64 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 65 | mode: "max" 66 | factor: 0.8 67 | patience: 2 68 | step_size: 1 69 | gamma: 0.1 70 | num_ckpts_keep: 3 71 | 72 | # load_model: "models/best.ckpt" 73 | reset_best_ckpt: False 74 | reset_scheduler: False 75 | reset_optimzer: False 76 | reset_iteration_state: False 77 | 78 | testing: 79 | batch_size: 64 80 | batch_type: "sentence" 81 | 
max_output_length: 40 82 | min_outptu_length: 1 83 | eval_metrics: ["bleu", "rouge-l"] 84 | n_best: 1 85 | beam_size: 4 86 | beam_alpha: -1 87 | return_attention: False 88 | return_probability: False 89 | generate_unk: False 90 | repetition_penalty: -1 91 | 92 | model: 93 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 94 | initializer: "xavier_uniform" 95 | embed_initializer: "xavier_uniform" 96 | tied_softmax: False 97 | tied_embeddings: False 98 | 99 | embeddings: 100 | embedding_dim: 512 101 | scale: False 102 | freeze: False 103 | 104 | transformer_encoder: 105 | model_dim: 512 106 | ff_dim: 2048 107 | num_layers: 6 108 | head_count: 8 109 | dropout: 0.2 110 | emb_dropout: 0.2 111 | layer_norm_position: "pre" 112 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 113 | max_src_len: 0 # for learnable, keep same with data segment 114 | freeze: False 115 | max_relative_position: 32 # only for relative position, else must be set to 0 116 | use_negative_distance: True # for relative position 117 | 118 | gnn_encoder: 119 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 120 | aggr: "mean" # ["mean", "max", "lstm"] 121 | model_dim: 512 122 | num_layers: 2 123 | emb_dropout: 0.2 124 | residual: True 125 | 126 | pseudo_encoder: 127 | model_dim: 512 128 | ff_dim: 2048 129 | num_layers: 6 130 | head_count: 8 131 | dropout: 0.2 132 | emb_dropout: 0.2 133 | layer_norm_position: "pre" 134 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 135 | max_src_len: 0 # for learnable, keep same with data segment 136 | freeze: False 137 | max_relative_position: 32 # only for relative position, else must be set to 0 138 | use_negative_distance: True # for relative position 139 | 140 | 141 | transformer_decoder: 142 | model_dim: 512 143 | ff_dim: 2048 144 | num_layers: 6 145 | head_count: 8 146 | dropout: 0.2 147 | emb_dropout: 0.2 148 | layer_norm_position: "pre" 149 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 150 | max_trg_len: 40 # for learnable, keep same with data segment. 151 | freeze: False 152 | max_relative_position: 0 # only for relative position, else must be set to 0. 
153 | use_negative_distance: False # for relative position 154 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O2_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O2_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 1 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 
| gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | 140 | transformer_decoder: 141 | model_dim: 512 142 | ff_dim: 2048 143 | num_layers: 6 144 | head_count: 8 145 | dropout: 0.2 146 | emb_dropout: 0.2 147 | layer_norm_position: "pre" 148 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 149 | max_trg_len: 40 # for learnable, keep same with data segment. 150 | freeze: False 151 | max_relative_position: 0 # only for relative position, else must be set to 0. 152 | use_negative_distance: False # for relative position 153 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O2_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O2_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O2_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O2_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: 
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O2_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O2_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O2_strip/test8_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | 
max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O3_strip/O1_test10_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: True 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O3_strip/test10_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 1 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 
| gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O3_strip/O1_test3_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O3_strip/test3_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | 
batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O3_strip/O1_test4_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O3_strip/test4_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_comment" # "assembly_comment", "assembly_cfg_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: 
"GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /configs/dataset_gcc-7.3.0_x86_64_O3_strip/O1_test8_cszx.yaml: -------------------------------------------------------------------------------- 1 | # assembly_cfg_comment 2 | # robertatokenizer and special 3 | # tokenizer_trg is None 4 | # assembly token = 400 5 | data: 6 | data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip" 7 | cached_dataset_path: "cached_dataset" 8 | train_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/train_refined.json" 9 | valid_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/valid_refined.json" 10 | test_data_path: "datas/dataset_gcc-7.3.0_x86_64_O3_strip/test_refined.json" 11 | use_tokenizer: "robertatokenizer" # sentencepiece or robertatokenizer 12 | robertatokenizer: "Salesforce/codet5-base" 13 | architecture: "x86_64" # 'x86_64', 'x86_32', 'arm_32' 14 | use_refined_pseudo_code: False 15 | 16 | assembly_token: 17 | vocab_min_freq: 1 18 | vocab_max_size: 50000 19 | token_max_len: 400 20 | 21 | comment: 22 | vocab_min_freq: 1 23 | vocab_max_size: 50000 24 | token_max_len: 40 25 | 26 | cfg_node: 27 | vocab_min_freq: 1 28 | vocab_max_size: 50000 29 | 30 | pseudo_token: 31 | vocab_min_freq: 1 32 | vocab_max_size: 50000 33 | token_max_len: 400 34 | 35 | training: 36 | model_dir: "models/dataset_gcc-7.3.0_x86_64_O3_strip/test8_cszx" 37 | overwrite: False 38 | load_model: False 39 | random_seed: 980820 40 | 41 | logging_frequence: 100 42 | validation_frequence: 1 # after how many epochs 43 | store_valid_output: False 44 | log_valid_samples: [0,1,2,3,4] 45 | 46 | use_cuda: True 47 | num_workers: 4 48 | 49 | epochs: 100 50 | shuffle: True 51 | max_updates: 1000000000 52 | batch_size: 32 53 | 54 | learning_rate: 0.0001 55 | learning_rate_min: 1.0e-18 56 | # clip_grad_val: 1 57 | clip_grad_norm: 5.0 58 | optimizer: "adam" 59 | weight_decay: 0 60 | adam_betas: [0.9, 0.999] 61 | eps: 1.e-8 62 | early_stop_metric: "bleu" 63 | scheduling: "ReduceLROnPlateau" # "ReduceLROnPlateau", "StepLR", "ExponentialLR", "warmup" 64 | mode: "max" 65 | factor: 0.8 66 | patience: 2 67 | step_size: 1 68 | gamma: 0.1 69 | num_ckpts_keep: 3 70 | 71 | # load_model: "models/best.ckpt" 72 | reset_best_ckpt: False 73 | reset_scheduler: False 74 | reset_optimzer: False 75 | reset_iteration_state: False 76 | 77 | testing: 78 | batch_size: 64 79 | batch_type: "sentence" 80 | 
max_output_length: 40 81 | min_outptu_length: 1 82 | eval_metrics: ["bleu", "rouge-l"] 83 | n_best: 1 84 | beam_size: 4 85 | beam_alpha: -1 86 | return_attention: False 87 | return_probability: False 88 | generate_unk: False 89 | repetition_penalty: -1 90 | 91 | model: 92 | mode: "assembly_cfg_pseudo_comment" # "assembly_comment", "assembly_cfg_comment", "assembly_cfg_pseudo_comment" 93 | initializer: "xavier_uniform" 94 | embed_initializer: "xavier_uniform" 95 | tied_softmax: False 96 | tied_embeddings: False 97 | 98 | embeddings: 99 | embedding_dim: 512 100 | scale: False 101 | freeze: False 102 | 103 | transformer_encoder: 104 | model_dim: 512 105 | ff_dim: 2048 106 | num_layers: 6 107 | head_count: 8 108 | dropout: 0.2 109 | emb_dropout: 0.2 110 | layer_norm_position: "pre" 111 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 112 | max_src_len: 0 # for learnable, keep same with data segment 113 | freeze: False 114 | max_relative_position: 32 # only for relative position, else must be set to 0 115 | use_negative_distance: True # for relative position 116 | 117 | gnn_encoder: 118 | gnn_type: "GATConv" # ["SAGEConv", "GCNConv", "GATConv"] 119 | aggr: "mean" # ["mean", "max", "lstm"] 120 | model_dim: 512 121 | num_layers: 2 122 | emb_dropout: 0.2 123 | residual: True 124 | 125 | pseudo_encoder: 126 | model_dim: 512 127 | ff_dim: 2048 128 | num_layers: 6 129 | head_count: 8 130 | dropout: 0.2 131 | emb_dropout: 0.2 132 | layer_norm_position: "pre" 133 | src_pos_emb: "relative" # ["absolute", "learnable", "relative"] 134 | max_src_len: 0 # for learnable, keep same with data segment 135 | freeze: False 136 | max_relative_position: 32 # only for relative position, else must be set to 0 137 | use_negative_distance: True # for relative position 138 | 139 | transformer_decoder: 140 | model_dim: 512 141 | ff_dim: 2048 142 | num_layers: 6 143 | head_count: 8 144 | dropout: 0.2 145 | emb_dropout: 0.2 146 | layer_norm_position: "pre" 147 | trg_pos_emb: "learnable" # ["absolute", "learnable", "relative"] 148 | max_trg_len: 40 # for learnable, keep same with data segment. 149 | freeze: False 150 | max_relative_position: 0 # only for relative position, else must be set to 0. 
151 | use_negative_distance: False # for relative position 152 | -------------------------------------------------------------------------------- /datas/README.md: -------------------------------------------------------------------------------- 1 | # BinaryCodeSummary/datas 2 | 3 | The dataset is available at [Google Drive](https://drive.google.com/drive/folders/1HsP_QqrMeEhlHcVPqdP_zyrLd6KzuVln?usp=share_link). 4 | 5 | 6 | ![datas](../figs/datas_google_drive.png) -------------------------------------------------------------------------------- /figs/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/figs/architecture.png -------------------------------------------------------------------------------- /figs/datas_google_drive.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/figs/datas_google_drive.png -------------------------------------------------------------------------------- /models/README.md: -------------------------------------------------------------------------------- 1 | # BinaryCodeSummary/models 2 | Placeholder for uploading model information. 3 | -------------------------------------------------------------------------------- /src/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = "TongYe" -------------------------------------------------------------------------------- /src/__main__.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | from src.train import train 4 | from src.test import test 5 | 6 | def main(): 7 | parser = argparse.ArgumentParser("BinaryCodeSummary") 8 | parser.add_argument("mode", type=str, default="train", choices=["train", "test", "unittest"]) 9 | parser.add_argument("--config", type=str, default="configs/dataset_gcc-7.3.0_arm_32_O1_strip/O1_test3_cszx.yaml") 10 | parser.add_argument("--gpu", type=str, default='1') 11 | parser.add_argument("--ckpt", type=str, default=None, help="Model Checkpoint") 12 | args = parser.parse_args() 13 | 14 | os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu 15 | 16 | if args.mode == "train": 17 | train(cfg_file=args.config) 18 | 19 | elif args.mode == "test": 20 | test(cfg_file=args.config, ckpt_path=args.ckpt) 21 | 22 | elif args.mode == "unittest": 23 | raise NotImplementedError("unittest mode is not implemented yet.")
24 | 25 | else: 26 | raise ValueError("Unknown mode!") 27 | 28 | if __name__ == "__main__": 29 | main() -------------------------------------------------------------------------------- /src/__pycache__/__init__.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/__init__.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/__main__.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/__main__.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/build_database.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/build_database.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/data.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/data.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/model.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/model.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/retrieval_validate.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/retrieval_validate.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/test.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/test.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/train.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/train.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/unittest.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/unittest.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/validate.cpython-39.pyc: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/__pycache__/validate.cpython-39.pyc -------------------------------------------------------------------------------- /src/data/paraphrase-en.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/data/paraphrase-en.gz -------------------------------------------------------------------------------- /src/meteor-1.5.jar: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tongye98/BinaryCodeSummary/669c66e8e23875759d50327236e8923f0fa638df/src/meteor-1.5.jar -------------------------------------------------------------------------------- /src/test.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import torch 3 | from typing import Dict, List 4 | import time 5 | import pickle 6 | import codecs 7 | from pathlib import Path 8 | from src.train import load_config, make_logger 9 | from src.data import load_data, make_data_loader 10 | from src.model import build_model 11 | from src.validate import search, eval_accuracies 12 | from tqdm import tqdm 13 | from tqdm.contrib import tzip 14 | import sentencepiece as spm 15 | 16 | logger = logging.getLogger(__name__) 17 | 18 | def resolve_ckpt_path(ckpt_path:str, load_model:str, model_dir:Path) -> Path: 19 | """ 20 | Resolve the checkpoint path. 21 | First choose ckpt_path, then load_model, 22 | then model_dir/best.ckpt, and finally fall back to model_dir/latest.ckpt. 23 | """ 24 | if ckpt_path is None: 25 | logger.warning("ckpt_path is not specified.") 26 | if load_model is False: 27 | if (model_dir / "best.ckpt").is_file(): 28 | ckpt_path = model_dir / "best.ckpt" 29 | logger.warning("use best ckpt in model dir!") 30 | else: 31 | logger.warning("No ckpt_path, no load_model, and no best checkpoint; please check!") 32 | ckpt_path = model_dir / "latest.ckpt" 33 | logger.warning("use latest ckpt in model dir!") 34 | else: 35 | logger.warning("use load_model item in config yaml.") 36 | ckpt_path = Path(load_model) 37 | return Path(ckpt_path) 38 | 39 | def load_model_checkpoint(path:Path, device:torch.device) -> Dict: 40 | """ 41 | Load a model from a saved checkpoint. 42 | """ 43 | assert path.is_file(), f"model checkpoint {path} not found!" 44 | model_checkpoint = torch.load(path.as_posix(), map_location=device) 45 | logger.info("Load model from %s.", path.resolve()) 46 | return model_checkpoint 47 | 48 | def write_model_generated_to_file(path:Path, array: List[str]) -> None: 49 | """ 50 | Write a list of strings to file, one entry per line. 51 | array: list of strings. 52 | """ 53 | with path.open("w", encoding="utf-8") as fg: 54 | for entry in array: 55 | fg.write(f"{entry}\n") 56 | 57 | 58 | def test(cfg_file: str, ckpt_path: str = None): 59 | """ 60 | Main test function. Handles loading a model from a checkpoint and generating summaries.
61 | """ 62 | cfg = load_config(Path(cfg_file)) 63 | 64 | model_dir = Path(cfg["training"].get("model_dir", None)) 65 | assert model_dir is not None 66 | 67 | load_model = cfg["training"].get("load_model", False) 68 | use_cuda = cfg["training"].get("use_cuda", False) and torch.cuda.is_available() 69 | device = torch.device("cuda" if use_cuda else "cpu") 70 | seed = cfg["training"].get("random_seed", 980820) 71 | batch_size = cfg["testing"].get("batch_size", 64) 72 | num_workers = cfg["training"].get("num_workers", 4) 73 | beam_size = cfg["testing"].get("beam_size", 4) 74 | 75 | if beam_size > 1: 76 | search_name = "beam_search" 77 | else: 78 | search_name = "greedy_search" 79 | make_logger(model_dir, mode="test_{}".format(search_name)) 80 | 81 | # load data 82 | train_dataset, valid_dataset, test_dataset, vocab_info = load_data(data_cfg=cfg["data"]) 83 | 84 | # build model 85 | model = build_model(model_cfg=cfg["model"], vocab_info=vocab_info) 86 | 87 | # when checkpoint is not specified, take latest(best) from model_dir 88 | ckpt_path = resolve_ckpt_path(ckpt_path, load_model, Path(model_dir)) 89 | logger.info("ckpt_path = {}".format(ckpt_path)) 90 | 91 | # load model checkpoint 92 | model_checkpoint = load_model_checkpoint(path=ckpt_path, device=device) 93 | 94 | # restore model and optimizer parameters 95 | model.load_state_dict(model_checkpoint["model_state"]) 96 | 97 | # model to GPU 98 | if device.type == "cuda": 99 | model.to(device) 100 | 101 | # Test 102 | dataset_to_test = {"valid": valid_dataset, "test":test_dataset} 103 | for dataset_name, dataset in dataset_to_test.items(): 104 | # if dataset_name == "valid": 105 | # continue 106 | if dataset is not None: 107 | logger.info("Starting testing on %s dataset...", dataset_name) 108 | test_start_time = time.time() 109 | test_loader = make_data_loader(dataset=dataset, sampler_seed=seed, shuffle=False, 110 | batch_size=batch_size, num_workers=num_workers, mode="test") 111 | 112 | model.eval() 113 | 114 | all_test_outputs = [] 115 | all_test_probability = [] 116 | all_test_attention = [] 117 | eval_scores = {} 118 | 119 | for batch_data in tqdm(test_loader, desc="Testing"): 120 | batch_data.to(device) 121 | stacked_output, stacked_probability, stacked_attention = search(batch_data, model, cfg) 122 | 123 | all_test_outputs.extend(stacked_output) 124 | all_test_probability.extend(stacked_probability if stacked_probability is not None else []) 125 | all_test_attention.extend(stacked_attention if stacked_attention is not None else []) 126 | 127 | text_vocab = vocab_info["comment_token_vocab"]["self"] 128 | tokenizer_trg = vocab_info["tokenizer_trg"] 129 | model_generated = text_vocab.arrays_to_sentences(arrays=all_test_outputs, cut_at_eos=True, skip_pad=True) 130 | if cfg["data"].get("use_tokenizer") == "robertatokenizer": 131 | logger.info("use robertatokenizer to decode...") 132 | if tokenizer_trg is not None: 133 | model_generated = [tokenizer_trg.convert_tokens_to_string(output) for output in model_generated] 134 | else: 135 | logger.info("uset robertatokenizer but tokenizer_trg is None") 136 | model_generated = [" ".join(output) for output in model_generated] 137 | elif cfg["data"].get("use_tokenizer") == "sentencepiece_binary_model": 138 | logger.info("use sentencepice to decode...") 139 | sp = spm.SentencePieceProcessor(model_file=cfg["data"].get("sentencepiece_binary_model")) 140 | model_generated = [sp.DecodePieces(output) for output in model_generated] 141 | else: 142 | logger.info("not use tokenizer to decode...") 143 | 
model_generated = [" ".join(output) for output in model_generated] 144 | logger.warning("model generated length = {}".format(len(model_generated))) 145 | 146 | target_truth = dataset.target_truth 147 | 148 | test_duration_time = time.time() - test_start_time 149 | 150 | bleu, rouge_l, meteor = eval_accuracies(model_generated, target_truth) 151 | eval_scores["bleu"] = bleu 152 | eval_scores["rouge_l"] = rouge_l 153 | eval_scores["meteor"] = meteor 154 | 155 | metrics_string = "Bleu={}, Rouge_L={}, Meteor={}".format(bleu, rouge_l, meteor) 156 | logger.info("Evaluation result({}) {}, Test cost time = {:.2f}[sec]".format( 157 | "Beam Search" if beam_size > 1 else "Greedy Search", metrics_string, test_duration_time)) 158 | 159 | output_file_path = Path(model_dir) / "{}.test_{}".format(dataset_name, search_name) 160 | write_model_generated_to_file(output_file_path, model_generated) 161 | 162 | 163 | if __name__ == "__main__": 164 | cfg_file = "configs/binary_summary/O1_test4_integration_cszx.yaml" 165 | test(cfg_file) --------------------------------------------------------------------------------