├── .gitattributes ├── LICENSE ├── LightGBM_Benchmark ├── lightgbm_benchmark.py └── lightgbm_config.json ├── MalConv2_Benchmark ├── LowMemConv.py ├── MalConv.py ├── binaryLoader.py └── malconv_benchmark.py ├── MalDICT_Tags ├── maldict_category_test.jsonl ├── maldict_category_train.jsonl ├── maldict_packer_test.jsonl ├── maldict_packer_train.jsonl ├── maldict_platform_test.jsonl ├── maldict_platform_train.jsonl ├── maldict_vulnerability_test.jsonl └── maldict_vulnerability_train.jsonl └── README.md /.gitattributes: -------------------------------------------------------------------------------- 1 | MalDICT_Tags/maldict_packer_train.jsonl filter=lfs diff=lfs merge=lfs -text 2 | MalDICT_Tags/maldict_platform_test.jsonl filter=lfs diff=lfs merge=lfs -text 3 | MalDICT_Tags/maldict_platform_train.jsonl filter=lfs diff=lfs merge=lfs -text 4 | MalDICT_Tags/maldict_vulnerability_test.jsonl filter=lfs diff=lfs merge=lfs -text 5 | MalDICT_Tags/maldict_vulnerability_train.jsonl filter=lfs diff=lfs merge=lfs -text 6 | MalDICT_Tags/maldict_category_test.jsonl filter=lfs diff=lfs merge=lfs -text 7 | MalDICT_Tags/maldict_category_train.jsonl filter=lfs diff=lfs merge=lfs -text 8 | MalDICT_Tags/maldict_packer_test.jsonl filter=lfs diff=lfs merge=lfs -text 9 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 
14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 
47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. 
Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /LightGBM_Benchmark/lightgbm_benchmark.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import ember 4 | import argparse 5 | import multiprocessing 6 | import numpy as np 7 | import lightgbm as lgb 8 | from sklearn.metrics import roc_auc_score 9 | from sklearn.metrics import precision_recall_fscore_support as prfs 10 | 11 | 12 | def get_vec(jsonl): 13 | """Vectorize a JSON line with EMBER raw features. 
14 | 15 | Returns: 16 | (md5, vector) 17 | """ 18 | ember_meta = json.loads(jsonl) 19 | md5 = ember_meta["md5"] 20 | extractor = ember.PEFeatureExtractor() 21 | vec = extractor.process_raw_features(ember_meta) 22 | return md5, vec 23 | 24 | 25 | def get_tags(tag_path): 26 | """Make a dict mapping MD5s to ClarAVy tags.""" 27 | md5_tags = {} 28 | with open(tag_path, "r") as f: 29 | for jsonl in f: 30 | entry = json.loads(jsonl.strip()) 31 | md5 = entry["md5"] 32 | tags = [rank[0] for rank in entry["ranking"]] 33 | md5_tags[md5] = tags 34 | return md5_tags 35 | 36 | 37 | if __name__ == "__main__": 38 | 39 | parser = argparse.ArgumentParser() 40 | parser.add_argument("ember_meta_dir", help="path to directory with raw " + 41 | "EMBER metadata .jsonl files (train and test)") 42 | parser.add_argument("maldict_train_file", help="path to MalDICT .jsonl " + 43 | "file with train hashes and tags") 44 | parser.add_argument("maldict_test_file", help="path to MalDICT .jsonl " + 45 | "file with test hashes and tags") 46 | parser.add_argument("--num-processes", type=int, default=1) 47 | args = parser.parse_args() 48 | 49 | # Validate ember_meta_path 50 | train_path = os.path.join(args.ember_meta_dir, "train_features.jsonl") 51 | test_path = os.path.join(args.ember_meta_dir, "test_features.jsonl") 52 | 53 | # Read metadata 54 | with open(train_path, "r") as f: 55 | train_meta = [line.strip() for line in f] 56 | with open(test_path, "r") as f: 57 | test_meta = [line.strip() for line in f] 58 | 59 | # Read train/test tags 60 | train_md5_tags = get_tags(args.maldict_train_file) 61 | test_md5_tags = get_tags(args.maldict_test_file) 62 | 63 | # Convert tags to numeric labels 64 | all_tags = set() 65 | for tags in train_md5_tags.values(): 66 | all_tags.update(tags) 67 | for tags in test_md5_tags.values(): 68 | all_tags.update(tags) 69 | sorted_tags = sorted(all_tags) 70 | tag_labels = {tag: i for i, tag in enumerate(sorted_tags)} 71 | 72 | # Vectorize EMBER metadata 73 | pool = 
multiprocessing.Pool(args.num_processes) 74 | train_md5_vecs = list(pool.imap(get_vec, train_meta)) 75 | test_md5_vecs = list(pool.imap(get_vec, test_meta)) 76 | 77 | # Get sizes of train and test set 78 | num_train = len(train_md5_vecs) 79 | num_test = len(test_md5_vecs) 80 | vec_dim = len(train_md5_vecs[0][1]) 81 | num_labels = len(sorted_tags) 82 | 83 | # Get X and y for train set 84 | X_train = np.zeros((num_train, vec_dim), dtype=np.float64) 85 | y_train = np.zeros((num_train, num_labels)) 86 | for i, (md5, vec) in enumerate(train_md5_vecs): 87 | labels = [tag_labels[tag] for tag in train_md5_tags[md5]] 88 | X_train[i,] = vec 89 | for j in labels: 90 | y_train[i,j] = 1 91 | 92 | # Get X and y for test set 93 | X_test = np.zeros((num_test, vec_dim), dtype=np.float64) 94 | y_test = np.zeros((num_test, num_labels), dtype=np.float64) 95 | for i, (md5, vec) in enumerate(test_md5_vecs): 96 | labels = [tag_labels[tag] for tag in test_md5_tags[md5]] 97 | X_test[i,] = vec 98 | for j in labels: 99 | y_test[i,j] = 1 100 | 101 | # Load LightGBM config 102 | with open("lightgbm_config.json", "rb") as f: 103 | params = json.load(f) 104 | params.update({"verbose": -1}) 105 | params.update({"num_iterations": 100}) 106 | 107 | # Train OvR classifier on each tag 108 | y_pred = np.zeros((num_test, num_labels)) 109 | for j, tag in enumerate(sorted_tags): 110 | 111 | print("Training classifiers on tag: {}".format(tag)) 112 | 113 | # Get train and test labels for this tag 114 | y_train_tag = y_train[:, j] 115 | y_test_tag = y_test[:, j] 116 | 117 | # Train LightGBM classifier 118 | train_dataset = lgb.Dataset(X_train, y_train_tag) 119 | test_dataset = lgb.Dataset(X_test, y_test_tag) 120 | clf = lgb.train(params, train_dataset) 121 | 122 | # Get predicted probabilities for this tag 123 | predictions = clf.predict(X_test) 124 | y_pred[:,j] = predictions 125 | 126 | # Get AUC score over all tags 127 | micro_auc = roc_auc_score(y_test, y_pred, average="micro", 128 | multi_class="ovr") 129 |
weighted_auc = roc_auc_score(y_test, y_pred, average="weighted", 130 | multi_class="ovr") 131 | 132 | # Get Precision, Recall, and F1 (threshold predictions at 0.5) 133 | y_pred = (y_pred > 0.5) 134 | p_micro, r_micro, f1_micro, _ = prfs(y_test, y_pred, average="micro") 135 | p_avg, r_avg, f1_avg, _ = prfs(y_test, y_pred, average="weighted") 136 | 137 | # Print results 138 | print("Precision\t{} (micro)\t{} (weighted)".format(p_micro, p_avg)) 139 | print("Recall\t{} (micro)\t{} (weighted)".format(r_micro, r_avg)) 140 | print("F1-Score\t{} (micro)\t{} (weighted)".format(f1_micro, f1_avg)) 141 | print("ROC AUC\t{} (micro)\t{} (weighted)".format(micro_auc, weighted_auc)) 142 | -------------------------------------------------------------------------------- /LightGBM_Benchmark/lightgbm_config.json: -------------------------------------------------------------------------------- 1 | {"objective": "binary", "task": "train", "boosting": "gbdt", "num_iterations": 500, "learning_rate": 0.1, "max_depth": -1, "num_leaves": 64, "tree_learner": "serial", "num_threads": 0, "device_type": "cpu", "seed": 0, "min_data_in_leaf": 100, "min_sum_hessian_in_leaf": 0.001, "bagging_fraction": 0.9, "bagging_freq": 1, "bagging_seed": 0, "feature_fraction": 0.9, "feature_fraction_bynode": 0.9, "feature_fraction_seed": 0, "first_metric_only": true, "max_delta_step": 0, "lambda_l1": 0, "lambda_l2": 1.0, "verbosity": 2, "is_unbalance": true, "sigmoid": 1.0, "boost_from_average": true, "metric": ["auc", "binary_logloss", "binary_error"]} 2 | -------------------------------------------------------------------------------- /MalConv2_Benchmark/LowMemConv.py: -------------------------------------------------------------------------------- 1 | # Adapted from https://github.com/NeuromorphicComputationResearchProgram/MalConv2/blob/main/LowMemConv.py 2 | import numpy as np 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | 7 | 8 | def drop_zeros_hook(module, grad_input, grad_out): 9 | 
""" 10 | Replace gradients that are all zeros with sparse tensors (in PyTorch, a 11 | None gradient would not get back-propagated at all). We use this as an 12 | approximation to sparse BP to avoid redundant and useless work. 13 | """ 14 | grads = [] 15 | with torch.no_grad(): 16 | for g in grad_input: 17 | if torch.nonzero(g).shape[0] == 0: 18 | grads.append(g.to_sparse()) 19 | else: 20 | grads.append(g) 21 | 22 | return tuple(grads) 23 | 24 | 25 | class CatMod(torch.nn.Module): 26 | def __init__(self): 27 | super(CatMod, self).__init__() 28 | 29 | def forward(self, x): 30 | return torch.cat(x, dim=2) 31 | 32 | 33 | class LowMemConvBase(nn.Module): 34 | 35 | def __init__(self, chunk_size=65536, overlap=512, min_chunk_size=1024): 36 | """ 37 | chunk_size - How many bytes at a time to process. Increasing may 38 | improve compute efficiency, but use more memory. Total 39 | memory use will be a function of chunk_size, and not of 40 | the length of the input sequence L 41 | overlap - How many bytes of overlap to use between chunks 42 | """ 43 | 44 | super(LowMemConvBase, self).__init__() 45 | self.chunk_size = chunk_size 46 | self.overlap = overlap 47 | self.min_chunk_size = min_chunk_size 48 | self.pooling = nn.AdaptiveMaxPool1d(1) 49 | self.cat = CatMod() 50 | self.cat.register_backward_hook(drop_zeros_hook) 51 | self.receptive_field = None 52 | self.dummy_tensor = torch.ones(1, dtype=torch.float32, 53 | requires_grad=True) 54 | 55 | 56 | def determinRF(self): 57 | """ 58 | Let's determine the receptive field & stride of our sub-network. 
59 | """ 60 | 61 | if self.receptive_field is not None: 62 | return self.receptive_field, self.stride, self.out_channels 63 | cur_device = next(self.embd.parameters()).device 64 | min_rf = 1 65 | max_rf = self.chunk_size 66 | with torch.no_grad(): 67 | tmp = torch.zeros((1,max_rf)).long().to(cur_device) 68 | while True: 69 | test_size = (min_rf+max_rf)//2 70 | is_valid = True 71 | try: 72 | self.processRange(tmp[:,0:test_size]) 73 | except Exception: 74 | is_valid = False 75 | 76 | if is_valid: 77 | max_rf = test_size 78 | else: 79 | min_rf = test_size+1 80 | if max_rf == min_rf: 81 | self.receptive_field = min_rf 82 | out_shape = self.processRange(tmp).shape 83 | self.stride = self.chunk_size//out_shape[2] 84 | self.out_channels = out_shape[1] 85 | break 86 | 87 | return self.receptive_field, self.stride, self.out_channels 88 | 89 | 90 | def pool_group(self, *args): 91 | x = self.cat(args) 92 | x = self.pooling(x) 93 | return x 94 | 95 | 96 | def seq2fix(self, x, pr_args={}): 97 | """ 98 | Takes in an input LongTensor of (B, L) that will be converted to a 99 | fixed length representation (B, C), where C is the number of channels 100 | provided by the base_network given at construction. 
101 | """ 102 | 103 | receptive_window, stride, out_channels = self.determinRF() 104 | if x.shape[1] < receptive_window: 105 | x = F.pad(x, (0, receptive_window-x.shape[1]), value=0) 106 | batch_size = x.shape[0] 107 | length = x.shape[1] 108 | winner_values = np.zeros((batch_size, out_channels))-1.0 109 | winner_indices = np.zeros((batch_size, out_channels), dtype=np.int64) 110 | cur_device = next(self.embd.parameters()).device 111 | step = self.chunk_size 112 | start = 0 113 | end = start+step 114 | with torch.no_grad(): 115 | while (start < end and 116 | (end-start) >= max(self.min_chunk_size, receptive_window)): 117 | x_sub = x[:,start:end] 118 | x_sub = x_sub.to(cur_device) 119 | activs = self.processRange(x_sub.long(), **pr_args) 120 | k_size = activs.shape[2] 121 | activ_win, activ_indx = F.max_pool1d(activs, 122 | kernel_size=k_size, 123 | return_indices=True) 124 | activ_win = activ_win.cpu().numpy()[:,:,0] 125 | activ_indx = activ_indx.cpu().numpy()[:,:,0] 126 | selected = winner_values < activ_win 127 | winner_indices[selected] = activ_indx[selected]*stride + start 128 | winner_values[selected] = activ_win[selected] 129 | start = end 130 | end = min(start+step, length) 131 | 132 | final_indices = [np.unique(winner_indices[b,:]) 133 | for b in range(batch_size)] 134 | chunk_list = [[x[b:b+1,max(i-receptive_window,0): 135 | min(i+receptive_window,length)] 136 | for i in final_indices[b]] for b in range(batch_size)] 137 | chunk_list = [torch.cat(c, dim=1)[0,:] for c in chunk_list] 138 | x_selected = torch.nn.utils.rnn.pad_sequence(chunk_list, 139 | batch_first=True) 140 | x_selected = x_selected.to(cur_device) 141 | x_selected = self.processRange(x_selected.long(), **pr_args) 142 | x_selected = self.pooling(x_selected) 143 | x_selected = x_selected.view(x_selected.size(0), -1) 144 | 145 | return x_selected 146 | -------------------------------------------------------------------------------- /MalConv2_Benchmark/MalConv.py: 
-------------------------------------------------------------------------------- 1 | # Adapted from https://github.com/NeuromorphicComputationResearchProgram/MalConv2/blob/main/MalConv.py 2 | import numpy as np 3 | import torch 4 | import torch.nn as nn 5 | import torch.nn.functional as F 6 | from collections import OrderedDict 7 | from LowMemConv import LowMemConvBase 8 | 9 | 10 | def getParams(): 11 | params = { 12 | 'channels': ("suggest_int", {'name':'channels', 'low':32, 13 | 'high':1024}), 14 | 'log_stride': ("suggest_int", {'name':'log2_stride', 'low':2, 15 | 'high':9}), 16 | 'window_size': ("suggest_int", {'name':'window_size', 'low':32, 17 | 'high':512}), 18 | 'embd_size': ("suggest_int", {'name':'embd_size', 'low':4, 'high':64}), 19 | } 20 | return OrderedDict(sorted(params.items(), key=lambda t: t[0])) 21 | 22 | 23 | def initModel(**kwargs): 24 | new_args = {} 25 | for x in getParams(): 26 | if x in kwargs: 27 | new_args[x] = kwargs[x] 28 | return MalConv(**new_args) 29 | 30 | 31 | class MalConv(LowMemConvBase): 32 | 33 | def __init__(self, out_size=2, channels=128, window_size=512, stride=512, 34 | embd_size=8, log_stride=None): 35 | super(MalConv, self).__init__() 36 | self.embd = nn.Embedding(257, embd_size, padding_idx=0) 37 | if not log_stride is None: 38 | stride = 2**log_stride 39 | 40 | self.conv_1 = nn.Conv1d(embd_size, channels, window_size, 41 | stride=stride, bias=True) 42 | self.conv_2 = nn.Conv1d(embd_size, channels, window_size, 43 | stride=stride, bias=True) 44 | 45 | self.fc_1 = nn.Linear(channels, channels) 46 | self.fc_2 = nn.Linear(channels, out_size) 47 | 48 | 49 | def processRange(self, x): 50 | x = self.embd(x) 51 | x = torch.transpose(x,-1,-2) 52 | cnn_value = self.conv_1(x) 53 | gating_weight = torch.sigmoid(self.conv_2(x)) 54 | x = cnn_value * gating_weight 55 | return x 56 | 57 | def forward(self, x): 58 | post_conv = x = self.seq2fix(x) 59 | penult = x = F.relu(self.fc_1(x)) 60 | x = self.fc_2(x) 61 | return x, penult, 
post_conv 62 | -------------------------------------------------------------------------------- /MalConv2_Benchmark/binaryLoader.py: -------------------------------------------------------------------------------- 1 | # File adapted from https://github.com/NeuromorphicComputationResearchProgram/MalConv2/blob/main/binaryLoader.py 2 | import os 3 | import random 4 | import numpy as np 5 | import torch 6 | import torch.nn as nn 7 | from torch.utils import data 8 | 9 | 10 | 11 | class BinaryDataset(data.Dataset): 12 | def __init__(self, mal_dir, md5_labels, num_labels, tag_labels, max_len=4000000): 13 | """Class implementing a dataset for directories of malicious files. 14 | 15 | Arguments: 16 | mal_dir -- Directory containing malicious files, or subdirectories of 17 | malicious files (to traverse recursively) 18 | md5_labels -- Dict mapping md5 hashes to labels 19 | num_labels -- Total number of labels 20 | max_len -- The maximum number of bytes to read from a file 21 | """ 22 | 23 | self.all_files = [] 24 | self.max_len = max_len 25 | self.num_labels = num_labels 26 | self.tag_labels = tag_labels 27 | for root, dirs, files in os.walk(mal_dir): 28 | for file_name in files: 29 | md5 = file_name[-32:] 30 | if md5_labels.get(md5) is None: 31 | continue 32 | file_path = os.path.join(root, file_name) 33 | labels = md5_labels[md5] 34 | self.all_files.append((file_path, labels, None)) 35 | 36 | def __len__(self): 37 | return len(self.all_files) 38 | 39 | def __getitem__(self, index): 40 | to_load, labels, _ = self.all_files[index] 41 | with open(to_load, 'rb') as f: 42 | x = f.read(self.max_len) 43 | x = np.frombuffer(x, dtype=np.uint8).astype(np.int16)+1 44 | x = torch.tensor(x) 45 | y = torch.zeros(self.num_labels) 46 | for label in labels: 47 | y[self.tag_labels[label]] = 1 48 | return x, y 49 | 50 | 51 | class RandomChunkSampler(torch.utils.data.sampler.Sampler): 52 | def __init__(self, data_source, batch_size): 53 | """ 54 | Samples random "chunks" of a dataset, so that 
items within a chunk 55 | are always loaded together. Useful to keep chunks in similar size 56 | groups to reduce runtime. 57 | 58 | data_source - The source PyTorch dataset object 59 | batch_size - The size of the chunks to keep together. Should 60 | generally be set to the desired batch size during 61 | training to minimize runtime. 62 | """ 63 | 64 | self.data_source = data_source 65 | self.batch_size = batch_size 66 | 67 | def __iter__(self): 68 | n = len(self.data_source) 69 | data = [x for x in range(n)] 70 | blocks = [data[i:i+self.batch_size] for i in range(0,len(data), 71 | self.batch_size)] 72 | random.shuffle(blocks) 73 | data[:] = [b for bs in blocks for b in bs] 74 | return iter(data) 75 | 76 | def __len__(self): 77 | return len(self.data_source) 78 | 79 | 80 | def pad_collate_func(batch): 81 | """ 82 | This should be used as the collate_fn=pad_collate_func for a PyTorch 83 | DataLoader object in order to pad out files in a batch to the length of 84 | the longest item in the batch. 
    """

    vecs = [x[0] for x in batch]
    y = torch.stack([x[1] for x in batch])
    x = torch.nn.utils.rnn.pad_sequence(vecs, batch_first=True)
    return x, y

--------------------------------------------------------------------------------
/MalConv2_Benchmark/malconv_benchmark.py:
--------------------------------------------------------------------------------
import os
import json
import argparse
import multiprocessing
import numpy as np
import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as prfs
from binaryLoader import BinaryDataset, RandomChunkSampler, pad_collate_func
from MalConv import MalConv


def get_tags(tag_path):
    """Make a dict mapping MD5s to ClarAVy tags."""
    md5_tags = {}
    with open(tag_path, "r") as f:
        for jsonl in f:
            entry = json.loads(jsonl.strip())
            md5 = entry["md5"]
            # Each ranking entry starts with the tag name
            tags = [rank[0] for rank in entry["ranking"]]
            md5_tags[md5] = tags
    return md5_tags


if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("train_dir", help="Path to directory with files " +
                        "to train on. Directory is traversed recursively.")
    parser.add_argument("test_dir", help="Path to directory with files " +
                        "to test on. Directory is traversed recursively.")
    parser.add_argument("maldict_train_file", help="Path to MalDICT .jsonl " +
                        "file with train hashes and tags")
    parser.add_argument("maldict_test_file", help="Path to MalDICT .jsonl " +
                        "file with test hashes and tags")
    parser.add_argument("--num-processes", type=int, default=1)
    args = parser.parse_args()

    # Default hyperparameters for MalConv2
    NON_NEG = False
    EMBD_SIZE = 8
    FILTER_SIZE = 512
    FILTER_STRIDE = 512
    NUM_CHANNELS = 128
    EPOCHS = 1
    MAX_FILE_LEN = 16000000
    BATCH_SIZE = 128
    RANDOM_STATE = 42

    # Read train/test tags
    train_md5_tags = get_tags(args.maldict_train_file)
    test_md5_tags = get_tags(args.maldict_test_file)

    # Convert tags to numeric labels
    all_tags = set()
    for tags in train_md5_tags.values():
        all_tags.update(tags)
    for tags in test_md5_tags.values():
        all_tags.update(tags)
    sorted_tags = sorted(all_tags)
    tag_labels = {tag: i for i, tag in enumerate(sorted_tags)}
    num_labels = len(sorted_tags)

    # Get train and test datasets of malicious files
    train_dataset = BinaryDataset(args.train_dir, train_md5_tags, num_labels, tag_labels,
                                  max_len=MAX_FILE_LEN)
    test_dataset = BinaryDataset(args.test_dir, test_md5_tags, num_labels, tag_labels,
                                 max_len=MAX_FILE_LEN)

    # Get train and test loaders
    train_sampler = RandomChunkSampler(train_dataset, BATCH_SIZE)
    test_sampler = RandomChunkSampler(test_dataset, BATCH_SIZE)
    loader_threads = max(multiprocessing.cpu_count()-4,
                         multiprocessing.cpu_count()//2+1)
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                              num_workers=loader_threads,
                              collate_fn=pad_collate_func,
                              sampler=train_sampler)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             num_workers=loader_threads,
                             collate_fn=pad_collate_func, sampler=test_sampler)

    # Initialize MalConv2 classifier
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = MalConv(out_size=num_labels, channels=NUM_CHANNELS,
                    window_size=FILTER_SIZE, stride=FILTER_STRIDE,
                    embd_size=EMBD_SIZE).to(device)
    criterion = torch.nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters())
    model.train()

    # Train classifier
    for epoch in range(EPOCHS):
        for inputs, labels in train_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            outputs, _, _ = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    # Test classifier
    y_test = []
    y_pred = []
    total = 0
    model = model.eval()
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs, _, _ = model(inputs)
            # Convert logits to per-tag probabilities
            # (torch.sigmoid replaces the deprecated F.sigmoid)
            outputs = torch.sigmoid(outputs)
            B = inputs.shape[0]
            total += B
            y_test += labels.detach().cpu().numpy().tolist()
            y_pred += outputs.detach().cpu().numpy().tolist()
    y_test = np.array(y_test)
    y_pred = np.array(y_pred)

    # Get AUC score over all tags
    micro_auc = roc_auc_score(y_test, y_pred, average="micro",
                              multi_class="ovr")
    weighted_auc = roc_auc_score(y_test, y_pred, average="weighted",
                                 multi_class="ovr")

    # Get Precision, Recall, and F1 (a tag is predicted if its probability > 0.5)
    y_pred = (y_pred > 0.5)
    p_micro, r_micro, f1_micro, _ = prfs(y_test, y_pred, average="micro")
    p_avg, r_avg, f1_avg, _ = prfs(y_test, y_pred, average="weighted")

    # Print results
    print("Precision\t{} (micro)\t{} (weighted)".format(p_micro, p_avg))
    print("Recall\t{} (micro)\t{} (weighted)".format(r_micro, r_avg))
    print("F1-Score\t{} (micro)\t{} (weighted)".format(f1_micro, f1_avg))
    print("ROC AUC\t{} (micro)\t{} (weighted)".format(micro_auc, weighted_auc))

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_category_test.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:c5501b1d74cc00a46b058a10038cbe2323195c61747dd4e39e3ff20de910949b
size 46269303

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_category_train.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:c055d26d649a5d0f18090fa75c6e69b6c15a0bc0c1b3db143cbe66853814bc23
size 313116717

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_packer_test.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:cc7dd8cc1988afa3fbd2ea8b7f1e80747b12e0fca61bad4869dfc0343bcfe9ee
size 3837469

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_packer_train.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:4dac50e6c3d398fd0531ad216fe8620a0f5eb6157d7161dd8dd9aa93951eceaf
size 15196726

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_platform_test.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:c2a89594549ade5bc72378793bf9d1d0994fcae2434fbf11d11163700f34329f
size 17092039

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_platform_train.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:670ef5a4be12a0d3a3d492856454ac356c59bead471b7c4a7ad0dd589dda4a76
size 56214317

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_vulnerability_test.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:ef0c88b54199458524e303923086848690cf546cdcd454dbb8575a92be166cf2
size 3193170

--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_vulnerability_train.jsonl:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:c75a260fec57ee3083d0e176f0af1af4e5b9ca362f3afbff2ecca61fb7cecbc4
size 11273451

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# MalDICT

MalDICT is a collection of four datasets, each supporting different malware classification tasks. These datasets can be used to train a machine learning classifier on malware behaviors, file properties, vulnerability exploitation, and packers, and then evaluate the classifier's performance. More information is provided in our paper: https://arxiv.org/abs/2310.11706

If you use MalDICT for your research, please cite us with:

```
@misc{joyce2023maldict,
      title={MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers},
      author={Robert J. Joyce and Edward Raff and Charles Nicholas and James Holt},
      year={2023},
      eprint={2310.11706},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}
```

## Dataset Contents

#### How Did We Build These Datasets?

We collected nearly 40 million VirusTotal reports for the malware in chunks 0 - 465 of the VirusShare corpus. Then, we ran [ClarAVy](https://github.com/NeuromorphicComputationResearchProgram/ClarAVy/tree/master), a tool we developed for tagging malware, on all of these VirusTotal reports. The tags output by ClarAVy can indicate a malicious file's behaviors, file properties, vulnerability exploitation, and packer. Some tags were very rare and others were applied to millions of files, resulting in a large class imbalance. We discarded any tags that were too rare and randomly down-sampled tags that were too common. The discard and down-sampling thresholds were different for each of the four datasets in MalDICT.

#### MalDICT-Behavior

MalDICT-Behavior is a dataset of malware tagged according to its category or behavior (e.g. ransomware, downloader, autorun). It includes 4,317,241 malicious files tagged according to 75 different malware categories or malicious behaviors. A file may have multiple tags if it belongs to multiple malware categories and/or exhibits more than one type of malicious behavior.

A default train/test split for MalDICT-Behavior is provided in ```MalDICT_Tags/maldict_category_train.jsonl``` and ```MalDICT_Tags/maldict_category_test.jsonl```. The training set uses malware from VirusShare chunks 0 - 315 and the test set uses malware from chunks 316 - 465. Chunks in the test set contain newer malware than those in the training set, effectively creating a temporal train/test split. To perform well, a machine learning classifier must learn to generalize to the "future" malware in the test set.


#### MalDICT-Platform

MalDICT-Platform includes 963,492 malicious files and has 43 tags for different target operating systems, file formats, and programming languages (e.g. win64, pdf, java). It uses the same temporal train/test split method as MalDICT-Behavior.
Hashes and tags for the training set are in ```MalDICT_Tags/maldict_platform_train.jsonl``` and those for the test set are in ```MalDICT_Tags/maldict_platform_test.jsonl```.


#### MalDICT-Vulnerability

The MalDICT-Vulnerability dataset has 173,886 files, which are tagged according to the vulnerability that they exploit. The dataset includes tags for 128 different vulnerabilities (e.g. cve_2017_0144, ms08_067).

Hashes and tags for MalDICT-Vulnerability are in ```MalDICT_Tags/maldict_vulnerability_train.jsonl``` and ```MalDICT_Tags/maldict_vulnerability_test.jsonl```. Unlike MalDICT-Behavior and MalDICT-Platform, this dataset uses a stratified split so that each tag is split proportionally between the training and test sets.


#### MalDICT-Packer

MalDICT-Packer contains 252,148 malicious files, tagged according to the packer used to pack the file. It includes 79 different malware packers. Train and test split files are located in ```MalDICT_Tags/maldict_packer_train.jsonl``` and ```MalDICT_Tags/maldict_packer_test.jsonl```. MalDICT-Packer also uses a stratified train/test split.


## Downloading the Datasets

#### Downloading File Hashes and Tags

File hashes and tags for all of the malware in MalDICT are provided in .jsonl files within the ```MalDICT_Tags/``` directory of this repo. Git LFS is required to download these files because of their size. On Debian-based systems, Git LFS can be installed using:

```
sudo apt-get install git-lfs
```

After installing Git LFS, you can download the hashes and tags by cloning this repository:

```
git lfs clone https://github.com/joyce8/MalDICT/
```

#### Downloading Malicious Files

We are releasing all of the Windows Portable Executable (PE) files in MalDICT-Behavior, MalDICT-Platform, and MalDICT-Packer. These files have been disarmed so that they cannot be executed.
We did this using the same method as the [SOREL](https://github.com/sophos/SOREL-20M) and [MOTIF](https://github.com/boozallen/MOTIF) datasets (by zeroing out two fields in each file's PE headers). 7zip files containing the disarmed PE files can be downloaded [here](https://drive.google.com/drive/folders/18X0GgEIEczvLEFir2GMNGPdiKAhHXxfJ?usp=sharing). The password for each 7zip file is ```infected```. The total size of the extracted files is approximately 2.1 TB.

Unfortunately, we cannot publish the non-PE malware in MalDICT at this time. However, all of the malware in MalDICT is drawn from the VirusShare corpus (chunks 0 - 465). The full VirusShare corpus is distributed by [VirusShare](https://virusshare.com/login) and by [vx-underground](https://www.vx-underground.org/#E:/root/Samples/Virusshare%20Collection/Downloadable%20Releases). The file hashes in ```MalDICT_Tags/``` can be used to select the appropriate files from VirusShare and assemble the complete MalDICT datasets.


#### Downloading EMBER Raw Metadata

We extracted EMBER (v2) raw features from all of the PE files in MalDICT-Behavior, MalDICT-Platform, and MalDICT-Packer. MalDICT-Vulnerability is excluded because most files in it are not in the PE file format. The EMBER metadata files can be downloaded [here](https://drive.google.com/drive/folders/1QhQBoZ-7RPkad3059VFjLZqLDdb2HW0H?usp=share_link).
Each line in a metadata file is a JSON object with the following fields:

| Name | Description |
|---|---------|
| md5 | MD5 hash of file |
| histogram | EMBER byte histogram |
| byteentropy | EMBER byte entropy statistics |
| strings | EMBER strings metadata |
| general | EMBER general file metadata |
| header | EMBER PE header metadata |
| section | EMBER PE section metadata |
| imports | EMBER imports metadata |
| exports | EMBER exports metadata |
| datadirectories | EMBER data directories metadata |


## Training a LightGBM Baseline Classifier

LightGBM uses an ensemble of gradient-boosted trees for classification. Here, it is trained on Windows PE malware using the EMBER feature vector format. Code for training and evaluating a LightGBM classifier is in ```LightGBM_Benchmark/```. You will need the following Python packages:

```
pip install scikit-learn
pip install lightgbm
pip install git+https://github.com/elastic/ember.git
```

You will also need the MalDICT tag files in the ```MalDICT_Tags/``` folder and the [EMBER raw metadata files](https://drive.google.com/drive/folders/1QhQBoZ-7RPkad3059VFjLZqLDdb2HW0H?usp=share_link) for training the model.
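
As a rough illustration of how these two inputs relate: both the tag files and the EMBER metadata files are JSON Lines files whose records carry an ```md5``` field, so they can be paired by hash. The sketch below is not the benchmark script's code. The function names and sample records are hypothetical, and the second element of each ```ranking``` entry is an assumed placeholder (only the first element, the tag name, is relied on).

```python
import json


def load_tags(lines):
    """Map each MD5 to its list of tags (first element of each ranking entry)."""
    md5_tags = {}
    for line in lines:
        entry = json.loads(line)
        md5_tags[entry["md5"]] = [rank[0] for rank in entry["ranking"]]
    return md5_tags


def pair_metadata_with_tags(meta_lines, md5_tags):
    """Yield (metadata, tags) pairs for metadata records that have tags."""
    for line in meta_lines:
        meta = json.loads(line)
        tags = md5_tags.get(meta["md5"])
        if tags is not None:
            yield meta, tags


# Hypothetical records, invented for illustration only.
tag_lines = [
    '{"md5": "aaaa", "ranking": [["upx", 5], ["aspack", 1]]}',
    '{"md5": "bbbb", "ranking": [["themida", 3]]}',
]
meta_lines = [
    '{"md5": "aaaa", "general": {"size": 1024}}',
    '{"md5": "cccc", "general": {"size": 2048}}',
]

pairs = list(pair_metadata_with_tags(meta_lines, load_tags(tag_lines)))
```

Only the record whose hash appears in both inputs survives the pairing; metadata without tags is skipped rather than treated as an error.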
Usage for the LightGBM benchmark script is shown below:

```
usage: lightgbm_benchmark.py [-h] [--num-processes NUM_PROCESSES] ember_meta_dir/ maldict_train_file maldict_test_file

positional arguments:
  ember_meta_dir        path to directory with raw EMBER metadata .jsonl files (train and test)
  maldict_train_file    path to MalDICT .jsonl file with train hashes and tags
  maldict_test_file     path to MalDICT .jsonl file with test hashes and tags

optional arguments:
  -h, --help            show this help message and exit
  --num-processes NUM_PROCESSES
```

The following example shows how to train a LightGBM classifier on the MalDICT-Packer dataset:

```
python lightgbm_benchmark.py /path/to/EMBER_meta/EMBER_packer/ /path/to/MalDICT_Tags/maldict_packer_train.jsonl /path/to/MalDICT_Tags/maldict_packer_test.jsonl
```


## Training a MalConv2 Baseline Classifier

MalConv2 is a deep neural network that learns from the raw bytes within files. Code for training and evaluating a MalConv2 classifier is in ```MalConv2_Benchmark/```. You will need the following Python packages:

```
pip install numpy
pip install scikit-learn
pip install torch
```

You will also need the MalDICT tag files in the ```MalDICT_Tags/``` folder as well as the [malicious files](https://drive.google.com/drive/folders/18X0GgEIEczvLEFir2GMNGPdiKAhHXxfJ?usp=sharing), separated into training and testing directories. Usage for the MalConv2 benchmark script is shown below:

```
usage: malconv_benchmark.py [-h] [--num-processes NUM_PROCESSES] train_dir/ test_dir/ maldict_train_file maldict_test_file

positional arguments:
  train_dir             Path to directory with files to train on. Directory is traversed recursively.
  test_dir              Path to directory with files to test on. Directory is traversed recursively.
  maldict_train_file    Path to MalDICT .jsonl file with train hashes and tags
  maldict_test_file     Path to MalDICT .jsonl file with test hashes and tags

optional arguments:
  -h, --help            show this help message and exit
  --num-processes NUM_PROCESSES
```

The following example shows how to train a MalConv2 classifier on the MalDICT-Packer dataset:

```
python malconv_benchmark.py /path/to/maldict_disarmed_packer_train/ /path/to/maldict_disarmed_packer_test/ /path/to/MalDICT_Tags/maldict_packer_train.jsonl /path/to/MalDICT_Tags/maldict_packer_test.jsonl
```

--------------------------------------------------------------------------------
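
The MalConv2 benchmark treats tagging as a multi-label problem: it collects every tag seen in the train and test tag files, sorts them to assign label indices, and trains with a binary cross-entropy loss over those labels. The sketch below shows one way a tag list could be turned into a multi-hot target under that indexing. It is an assumption-laden illustration rather than the repository's implementation: the helper names and sample data are hypothetical, and the actual encoding is done inside ```binaryLoader.py```, which may differ.

```python
def build_tag_index(*tag_maps):
    """Assign each distinct tag an integer index, in sorted order."""
    all_tags = set()
    for tag_map in tag_maps:
        for tags in tag_map.values():
            all_tags.update(tags)
    return {tag: i for i, tag in enumerate(sorted(all_tags))}


def multi_hot(tags, tag_index):
    """Encode a list of tags as a 0/1 vector with one slot per known tag."""
    vec = [0.0] * len(tag_index)
    for tag in tags:
        vec[tag_index[tag]] = 1.0
    return vec


# Hypothetical MD5-to-tags mappings, invented for illustration only.
train = {"aaaa": ["upx"], "bbbb": ["aspack", "upx"]}
test = {"cccc": ["themida"]}

tag_index = build_tag_index(train, test)
target = multi_hot(train["bbbb"], tag_index)
```

Building the index from both splits, as the benchmark script does, keeps label positions identical for training and evaluation even when a tag appears in only one of the two files.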