├── .gitattributes
├── LICENSE
├── LightGBM_Benchmark
    ├── lightgbm_benchmark.py
    └── lightgbm_config.json
├── MalConv2_Benchmark
    ├── LowMemConv.py
    ├── MalConv.py
    ├── binaryLoader.py
    └── malconv_benchmark.py
├── MalDICT_Tags
    ├── maldict_category_test.jsonl
    ├── maldict_category_train.jsonl
    ├── maldict_packer_test.jsonl
    ├── maldict_packer_train.jsonl
    ├── maldict_platform_test.jsonl
    ├── maldict_platform_train.jsonl
    ├── maldict_vulnerability_test.jsonl
    └── maldict_vulnerability_train.jsonl
└── README.md


/.gitattributes:
--------------------------------------------------------------------------------
1 | MalDICT_Tags/maldict_packer_train.jsonl filter=lfs diff=lfs merge=lfs -text
2 | MalDICT_Tags/maldict_platform_test.jsonl filter=lfs diff=lfs merge=lfs -text
3 | MalDICT_Tags/maldict_platform_train.jsonl filter=lfs diff=lfs merge=lfs -text
4 | MalDICT_Tags/maldict_vulnerability_test.jsonl filter=lfs diff=lfs merge=lfs -text
5 | MalDICT_Tags/maldict_vulnerability_train.jsonl filter=lfs diff=lfs merge=lfs -text
6 | MalDICT_Tags/maldict_category_test.jsonl filter=lfs diff=lfs merge=lfs -text
7 | MalDICT_Tags/maldict_category_train.jsonl filter=lfs diff=lfs merge=lfs -text
8 | MalDICT_Tags/maldict_packer_test.jsonl filter=lfs diff=lfs merge=lfs -text
9 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/LightGBM_Benchmark/lightgbm_benchmark.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import json
  3 | import ember
  4 | import argparse
  5 | import multiprocessing
  6 | import numpy as np
  7 | import lightgbm as lgb
  8 | from sklearn.metrics import roc_auc_score
  9 | from sklearn.metrics import precision_recall_fscore_support as prfs
 10 | 
 11 | 
 12 | def get_vec(jsonl):
 13 |     """Vectorize a JSON line with EMBER raw features.
 14 | 
 15 |     Returns:
 16 |     (md5, vector)
 17 |     """
 18 |     ember_meta = json.loads(jsonl)
 19 |     md5 = ember_meta["md5"]
 20 |     extractor = ember.PEFeatureExtractor()
 21 |     vec = extractor.process_raw_features(ember_meta)
 22 |     return md5, vec
 23 | 
 24 | 
 25 | def get_tags(tag_path):
 26 |     """Make a dict mapping MD5s to ClarAVy tags."""
 27 |     md5_tags = {}
 28 |     with open(tag_path, "r") as f:
 29 |         for jsonl in f:
 30 |             entry = json.loads(jsonl.strip())
 31 |             md5 = entry["md5"]
 32 |             tags = [rank[0] for rank in entry["ranking"]]
 33 |             md5_tags[md5] = tags
 34 |     return md5_tags
 35 | 
 36 | 
 37 | if __name__ == "__main__":
 38 | 
 39 |     parser = argparse.ArgumentParser()
 40 |     parser.add_argument("ember_meta_dir", help="path to directory with raw " +
 41 |                         "EMBER metadata .jsonl files (train and test)")
 42 |     parser.add_argument("maldict_train_file", help="path to MalDICT .jsonl " +
 43 |                         "file with train hashes and tags")
 44 |     parser.add_argument("maldict_test_file", help="path to MalDICT .jsonl " +
 45 |                         "file with test hashes and tags")
 46 |     parser.add_argument("--num-processes", type=int, default=1)
 47 |     args = parser.parse_args()
 48 | 
  49 |     # Build paths to the EMBER train/test metadata files
 50 |     train_path = os.path.join(args.ember_meta_dir, "train_features.jsonl")
 51 |     test_path = os.path.join(args.ember_meta_dir, "test_features.jsonl")
 52 | 
 53 |     # Read metadata
 54 |     with open(train_path, "r") as f:
 55 |         train_meta = [line.strip() for line in f]
 56 |     with open(test_path, "r") as f:
 57 |         test_meta = [line.strip() for line in f]
 58 | 
 59 |     # Read train/test tags
 60 |     train_md5_tags = get_tags(args.maldict_train_file)
 61 |     test_md5_tags = get_tags(args.maldict_test_file)
 62 | 
 63 |     # Convert tags to numeric labels
 64 |     all_tags = set()
 65 |     for tags in train_md5_tags.values():
 66 |         all_tags.update(tags)
 67 |     for tags in test_md5_tags.values():
 68 |         all_tags.update(tags)
 69 |     sorted_tags = sorted(all_tags)
 70 |     tag_labels = {tag: i for i, tag in enumerate(sorted_tags)}
 71 | 
 72 |     # Vectorize EMBER metadata
 73 |     pool = multiprocessing.Pool(args.num_processes)
 74 |     train_md5_vecs = list(pool.imap(get_vec, train_meta))
 75 |     test_md5_vecs = list(pool.imap(get_vec, test_meta))
 76 | 
 77 |     # Get sizes of train and test set
 78 |     num_train = len(train_md5_vecs)
 79 |     num_test = len(test_md5_vecs)
 80 |     vec_dim = len(train_md5_vecs[0][1])
 81 |     num_labels = len(sorted_tags)
 82 | 
 83 |     # Get X and y for train set
  84 |     X_train = np.zeros((num_train, vec_dim), dtype=np.float64)
 85 |     y_train = np.zeros((num_train, num_labels))
 86 |     for i, (md5, vec) in enumerate(train_md5_vecs):
 87 |         labels = [tag_labels[tag] for tag in train_md5_tags[md5]]
 88 |         X_train[i,] = vec
 89 |         for j in labels:
 90 |             y_train[i,j] = 1
 91 | 
 92 |     # Get X and y for test set
  93 |     X_test = np.zeros((num_test, vec_dim), dtype=np.float64)
  94 |     y_test = np.zeros((num_test, num_labels), dtype=np.float64)
 95 |     for i, (md5, vec) in enumerate(test_md5_vecs):
 96 |         labels = [tag_labels[tag] for tag in test_md5_tags[md5]]
 97 |         X_test[i,] = vec
 98 |         for j in labels:
 99 |             y_test[i,j] = 1
100 | 
101 |     # Load LightGBM config
102 |     with open("lightgbm_config.json", "rb") as f:
103 |         params = json.load(f)
104 |     params.update({"verbose": -1})
105 |     params.update({"num_iterations": 100})
106 | 
107 |     # Train OvR classifier on each tag
108 |     y_pred = np.zeros((num_test, num_labels))
109 |     for j, tag in enumerate(sorted_tags):
110 | 
111 |         print("Training classifiers on tag: {}".format(tag))
112 | 
 113 |         # Get train and test labels for this tag
114 |         y_train_tag = y_train[:, j]
115 |         y_test_tag = y_test[:, j]
116 | 
117 |         # Train LightGBM classifier
118 |         train_dataset = lgb.Dataset(X_train, y_train_tag)
119 |         test_dataset = lgb.Dataset(X_test, y_test_tag)
120 |         clf = lgb.train(params, train_dataset)
121 | 
 122 |         # Get predicted probabilities for this tag
123 |         predictions = clf.predict(X_test)
124 |         y_pred[:,j] = predictions
125 | 
126 |     # Get AUC score over all tags
127 |     micro_auc = roc_auc_score(y_test, y_pred, average="micro",
128 |                               multi_class="ovr")
129 |     weighted_auc = roc_auc_score(y_test, y_pred, average="weighted",
130 |                                  multi_class="ovr")
131 | 
 132 |     # Get Precision, Recall, and F1 (threshold predictions at 0.5)
133 |     y_pred = (y_pred > 0.5)
134 |     p_micro, r_micro, f1_micro, _ = prfs(y_test, y_pred, average="micro")
135 |     p_avg, r_avg, f1_avg, _ = prfs(y_test, y_pred, average="weighted")
136 | 
137 |     # Print results
138 |     print("Precision\t{} (micro)\t{} (weighted)".format(p_micro, p_avg))
139 |     print("Recall\t{} (micro)\t{} (weighted)".format(r_micro, r_avg))
140 |     print("F1-Score\t{} (micro)\t{} (weighted)".format(f1_micro, f1_avg))
141 |     print("ROC AUC\t{} (micro)\t{} (weighted)".format(micro_auc, weighted_auc))
142 | 
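
The label-building steps in `lightgbm_benchmark.py` (parse each MalDICT line, keep the first element of every `ranking` entry, sort the union of tags, then fill a multi-hot row per file) can be sketched on toy data. This is a minimal illustration using only the standard library. The hashes, tag names, and scores are made up, the `[tag, score]` layout of each `ranking` entry is an assumption inferred from `rank[0]` in `get_tags`, and `parse_tags` and `multi_hot` are hypothetical helpers rather than functions from the repository.

```python
import json

# Hypothetical MalDICT-style lines. The md5 values, tag names, and scores are
# invented, and the [tag, score] layout of "ranking" is an assumption.
sample_lines = [
    '{"md5": "0123456789abcdef0123456789abcdef", "ranking": [["upx", 0.9], ["aspack", 0.4]]}',
    '{"md5": "fedcba9876543210fedcba9876543210", "ranking": [["themida", 0.8]]}',
]


def parse_tags(jsonl_lines):
    """Map each MD5 to the list of tag names in its ranking."""
    md5_tags = {}
    for line in jsonl_lines:
        entry = json.loads(line.strip())
        md5_tags[entry["md5"]] = [rank[0] for rank in entry["ranking"]]
    return md5_tags


def multi_hot(tags, tag_labels):
    """Return a 0/1 list with a 1 at the index of every tag present."""
    row = [0] * len(tag_labels)
    for tag in tags:
        row[tag_labels[tag]] = 1
    return row


md5_tags = parse_tags(sample_lines)

# Sort the union of all tags so that label indices are deterministic.
sorted_tags = sorted({tag for tags in md5_tags.values() for tag in tags})
tag_labels = {tag: i for i, tag in enumerate(sorted_tags)}

# One multi-hot row per file, keyed by MD5.
rows = {md5: multi_hot(tags, tag_labels) for md5, tags in md5_tags.items()}
```

Each column of the resulting rows corresponds to one tag, which is what lets the script train one binary classifier per column.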


--------------------------------------------------------------------------------
/LightGBM_Benchmark/lightgbm_config.json:
--------------------------------------------------------------------------------
1 | {"objective": "binary", "task": "train", "boosting": "gbdt", "num_iterations": 500, "learning_rate": 0.1, "max_depth": -1, "num_leaves": 64, "tree_learner": "serial", "num_threads": 0, "device_type": "cpu", "seed": 0, "min_data_in_leaf": 100, "min_sum_hessian_in_leaf": 0.001, "bagging_fraction": 0.9, "bagging_freq": 1, "bagging_seed": 0, "feature_fraction": 0.9, "feature_fraction_bynode": 0.9, "feature_fraction_seed": 0, "first_metric_only": true, "max_delta_step": 0, "lambda_l1": 0, "lambda_l2": 1.0, "verbosity": 2, "is_unbalance": true, "sigmoid": 1.0, "boost_from_average": true, "metric": ["auc", "binary_logloss", "binary_error"]}
2 | 
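
The benchmark script loads this file and then overrides `num_iterations` (500 here) with 100 before training. A minimal sketch of that load-then-override pattern follows, using an inlined subset of the keys above so it runs without the file on disk. `load_params` is a hypothetical helper, not part of the repository.

```python
import json

# A subset of the keys from lightgbm_config.json, inlined for illustration.
config_text = (
    '{"objective": "binary", "num_iterations": 500, '
    '"learning_rate": 0.1, "num_leaves": 64}'
)


def load_params(text, overrides):
    """Parse the JSON config and apply overrides; overrides win on conflict."""
    params = json.loads(text)
    params.update(overrides)
    return params


# The script-level override replaces the value stored in the config file.
params = load_params(config_text, {"num_iterations": 100})
```

Keys that are not overridden, such as `num_leaves`, keep the values from the file.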


--------------------------------------------------------------------------------
/MalConv2_Benchmark/LowMemConv.py:
--------------------------------------------------------------------------------
  1 | # Adapted from https://github.com/NeuromorphicComputationResearchProgram/MalConv2/blob/main/LowMemConv.py
  2 | import numpy as np
  3 | import torch
  4 | import torch.nn as nn
  5 | import torch.nn.functional as F
  6 | 
  7 | 
  8 | def drop_zeros_hook(module, grad_input, grad_out):
  9 |     """
 10 |     Replace gradients that are all zeros with sparse tensors, which
 11 |     approximates sparse back-propagation and avoids redundant and useless
 12 |     work on all-zero gradients.
 13 |     """
 14 |     grads = []
 15 |     with torch.no_grad():
 16 |         for g in grad_input:
 17 |             if torch.nonzero(g).shape[0] == 0:
 18 |                 grads.append(g.to_sparse())
 19 |             else:
 20 |                 grads.append(g)
 21 |                 
 22 |     return tuple(grads)
 23 | 
 24 | 
 25 | class CatMod(torch.nn.Module):
 26 |     def __init__(self):
 27 |         super(CatMod, self).__init__()
 28 | 
 29 |     def forward(self, x):
 30 |         return torch.cat(x, dim=2)
 31 |     
 32 | 
 33 | class LowMemConvBase(nn.Module):
 34 |     
 35 |     def __init__(self, chunk_size=65536, overlap=512, min_chunk_size=1024):
 36 |         """
 37 |         chunk_size - How many bytes at a time to process. Increasing may
 38 |                      improve compute efficiency, but use more memory. Total
 39 |                      memory use will be a function of chunk_size, and not of
 40 |                      the length of the input sequence L
 41 |         overlap - How many bytes of overlap to use between chunks
 42 |         """
 43 | 
 44 |         super(LowMemConvBase, self).__init__()
 45 |         self.chunk_size = chunk_size
 46 |         self.overlap = overlap
 47 |         self.min_chunk_size = min_chunk_size
 48 |         self.pooling = nn.AdaptiveMaxPool1d(1)
 49 |         self.cat = CatMod()
 50 |         self.cat.register_backward_hook(drop_zeros_hook)
 51 |         self.receptive_field = None
 52 |         self.dummy_tensor = torch.ones(1, dtype=torch.float32,
 53 |                                        requires_grad=True)
 54 | 
 55 | 
 56 |     def determinRF(self):
 57 |         """
 58 |         Let's determine the receptive field & stride of our sub-network.
 59 |         """
 60 |         
 61 |         if self.receptive_field is not None:
 62 |             return self.receptive_field, self.stride, self.out_channels
 63 |         cur_device = next(self.embd.parameters()).device
 64 |         min_rf = 1
 65 |         max_rf = self.chunk_size
 66 |         with torch.no_grad():
 67 |             tmp = torch.zeros((1,max_rf)).long().to(cur_device)
 68 |             while True:
 69 |                 test_size = (min_rf+max_rf)//2
 70 |                 is_valid = True
 71 |                 try:
 72 |                     self.processRange(tmp[:,0:test_size])
 73 |                 except Exception:
 74 |                     is_valid = False
 75 |                 
 76 |                 if is_valid:
 77 |                     max_rf = test_size
 78 |                 else:
 79 |                     min_rf = test_size+1
 80 |                 if max_rf == min_rf:
 81 |                     self.receptive_field = min_rf
 82 |                     out_shape = self.processRange(tmp).shape
 83 |                     self.stride = self.chunk_size//out_shape[2]
 84 |                     self.out_channels = out_shape[1]
 85 |                     break
 86 | 
 87 |         return self.receptive_field, self.stride, self.out_channels
 88 | 
 89 | 
 90 |     def pool_group(self, *args):
 91 |         x = self.cat(args)
 92 |         x = self.pooling(x)
 93 |         return x
 94 | 
 95 | 
 96 |     def seq2fix(self, x, pr_args={}):
 97 |         """
 98 |         Takes in an input LongTensor of (B, L) that will be converted to a
 99 |         fixed length representation (B, C), where C is the number of channels
100 |         provided by the base_network given at construction.
101 |         """
102 | 
103 |         receptive_window, stride, out_channels = self.determinRF()
104 |         if x.shape[1] < receptive_window:
105 |             x = F.pad(x, (0, receptive_window-x.shape[1]), value=0)
106 |         batch_size = x.shape[0]
107 |         length = x.shape[1]
108 |         winner_values = np.zeros((batch_size, out_channels))-1.0
109 |         winner_indices = np.zeros((batch_size, out_channels), dtype=np.int64)
110 |         cur_device = next(self.embd.parameters()).device
111 |         step = self.chunk_size
112 |         start = 0
113 |         end = start+step
114 |         with torch.no_grad():
115 |             while (start < end and
116 |                    (end-start) >= max(self.min_chunk_size, receptive_window)):
117 |                 x_sub = x[:,start:end]
118 |                 x_sub = x_sub.to(cur_device)
119 |                 activs = self.processRange(x_sub.long(), **pr_args)
120 |                 k_size = activs.shape[2]
121 |                 activ_win, activ_indx = F.max_pool1d(activs,
122 |                                                      kernel_size=k_size,
123 |                                                      return_indices=True)
124 |                 activ_win = activ_win.cpu().numpy()[:,:,0]
125 |                 activ_indx = activ_indx.cpu().numpy()[:,:,0]
126 |                 selected = winner_values < activ_win
127 |                 winner_indices[selected] = activ_indx[selected]*stride + start 
128 |                 winner_values[selected]  = activ_win[selected]
129 |                 start = end
130 |                 end = min(start+step, length)
131 | 
132 |         final_indices = [np.unique(winner_indices[b,:])
133 |                          for b in range(batch_size)]
134 |         chunk_list = [[x[b:b+1,max(i-receptive_window,0):
135 |                          min(i+receptive_window,length)]
136 |                        for i in final_indices[b]] for b in range(batch_size)]
137 |         chunk_list = [torch.cat(c, dim=1)[0,:] for c in chunk_list]
138 |         x_selected = torch.nn.utils.rnn.pad_sequence(chunk_list,
139 |                                                      batch_first=True)
140 |         x_selected = x_selected.to(cur_device)
141 |         x_selected = self.processRange(x_selected.long(), **pr_args)
142 |         x_selected = self.pooling(x_selected)
143 |         x_selected = x_selected.view(x_selected.size(0), -1)
144 | 
145 |         return x_selected
146 | 
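
The binary search in `determinRF` can be sketched without PyTorch. The version below is a simplified stand-in under stated assumptions: the real method calls `processRange` and treats any exception as "input too short", whereas here a hypothetical `conv_output_length` helper models an unpadded 1-D convolution and raises `ValueError` when the input is shorter than the kernel. The window and stride values are the `MalConv` defaults (512 and 512), and the chunk size is the `LowMemConvBase` default (65536).

```python
def conv_output_length(length, window, stride):
    """Output positions of an unpadded 1-D convolution; raises if too short."""
    if length < window:
        raise ValueError("input shorter than the kernel")
    return (length - window) // stride + 1


def find_receptive_field(chunk_size, window, stride):
    """Binary-search the smallest input length the convolution accepts."""
    min_rf, max_rf = 1, chunk_size
    while min_rf < max_rf:
        test_size = (min_rf + max_rf) // 2
        try:
            conv_output_length(test_size, window, stride)
            max_rf = test_size      # valid: the answer is at most test_size
        except ValueError:
            min_rf = test_size + 1  # invalid: the answer must be larger
    return min_rf


chunk_size, window, stride = 65536, 512, 512
receptive_field = find_receptive_field(chunk_size, window, stride)

# Mirror of the stride estimate: chunk length divided by output positions.
out_positions = conv_output_length(chunk_size, window, stride)
effective_stride = chunk_size // out_positions
```

With these defaults the search settles on a receptive field of 512 bytes, a full chunk yields 128 output positions, and the estimated stride is 65536 // 128 = 512.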


--------------------------------------------------------------------------------
/MalConv2_Benchmark/MalConv.py:
--------------------------------------------------------------------------------
 1 | # Adapted from https://github.com/NeuromorphicComputationResearchProgram/MalConv2/blob/main/MalConv.py
 2 | import numpy as np
 3 | import torch
 4 | import torch.nn as nn
 5 | import torch.nn.functional as F
 6 | from collections import OrderedDict
 7 | from LowMemConv import LowMemConvBase
 8 | 
 9 | 
10 | def getParams():
11 |     params = {
12 |         'channels': ("suggest_int", {'name':'channels', 'low':32,
13 |                                      'high':1024}),
14 |         'log_stride': ("suggest_int", {'name':'log2_stride', 'low':2,
15 |                                        'high':9}),
16 |         'window_size': ("suggest_int", {'name':'window_size', 'low':32,
17 |                                         'high':512}),
18 |         'embd_size': ("suggest_int", {'name':'embd_size', 'low':4, 'high':64}),
19 |     }
20 |     return OrderedDict(sorted(params.items(), key=lambda t: t[0]))
21 | 
22 | 
23 | def initModel(**kwargs):
24 |     new_args = {}
25 |     for x in getParams():
26 |         if x in kwargs:
27 |             new_args[x] = kwargs[x]            
28 |     return MalConv(**new_args)
29 | 
30 | 
31 | class MalConv(LowMemConvBase):
32 |     
33 |     def __init__(self, out_size=2, channels=128, window_size=512, stride=512,
34 |                  embd_size=8, log_stride=None):
35 |         super(MalConv, self).__init__()
36 |         self.embd = nn.Embedding(257, embd_size, padding_idx=0)
37 |         if log_stride is not None:
38 |             stride = 2**log_stride
39 | 
40 |         self.conv_1 = nn.Conv1d(embd_size, channels, window_size,
41 |                                 stride=stride, bias=True)
42 |         self.conv_2 = nn.Conv1d(embd_size, channels, window_size,
43 |                                 stride=stride, bias=True)
44 | 
45 |         self.fc_1 = nn.Linear(channels, channels)
46 |         self.fc_2 = nn.Linear(channels, out_size)
47 |         
48 |     
49 |     def processRange(self, x):
50 |         x = self.embd(x)
51 |         x = torch.transpose(x,-1,-2)
52 |         cnn_value = self.conv_1(x)
53 |         gating_weight = torch.sigmoid(self.conv_2(x))        
54 |         x = cnn_value * gating_weight
55 |         return x
56 |     
57 |     def forward(self, x):
58 |         post_conv = x = self.seq2fix(x)
59 |         penult = x = F.relu(self.fc_1(x))
60 |         x = self.fc_2(x)
61 |         return x, penult, post_conv
62 | 
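
The gating step in `processRange` multiplies one convolution's output by the sigmoid of a second convolution's output, and `seq2fix` later max-pools the result over positions. A minimal pure-Python sketch of that arithmetic for a single channel is shown below. The input numbers are made up, and `gate` is a hypothetical helper that stands in for the tensor operations.

```python
import math


def sigmoid(v):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gate(cnn_values, gate_logits):
    """Elementwise cnn_value * sigmoid(gate_logit) for one channel."""
    return [c * sigmoid(g) for c, g in zip(cnn_values, gate_logits)]


# Made-up activations for three positions of a single channel.
cnn_values = [2.0, -1.0, 4.0]
gate_logits = [0.0, 0.0, 0.0]  # sigmoid(0) = 0.5, so every value is halved

gated = gate(cnn_values, gate_logits)

# Global max pooling over positions keeps the strongest gated response.
pooled = max(gated)
```

A large negative gate logit pushes the sigmoid toward 0 and suppresses that position, while a large positive one passes the convolution value through almost unchanged.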


--------------------------------------------------------------------------------
/MalConv2_Benchmark/binaryLoader.py:
--------------------------------------------------------------------------------
 1 | # File adapted from https://github.com/NeuromorphicComputationResearchProgram/MalConv2/blob/main/binaryLoader.py
 2 | import os
 3 | import random
 4 | import numpy as np
 5 | import torch
 6 | import torch.nn as nn
 7 | from torch.utils import data
 8 | 
 9 | 
10 | 
11 | class BinaryDataset(data.Dataset):
12 |     def __init__(self, mal_dir, md5_labels, num_labels, tag_labels, max_len=4000000):
13 |         """Class implementing a dataset for directories of malicious files.
14 |         
15 |         Arguments:
16 |         mal_dir -- Directory of malicious files (traversed recursively)
17 |         md5_labels -- Dict mapping md5 hashes to labels
18 |         num_labels -- Total number of labels
19 |         tag_labels -- Dict mapping each label to its index in the target
20 |         max_len -- The maximum number of bytes to read from a file
21 |         """
22 | 
23 |         self.all_files = []
24 |         self.max_len = max_len
25 |         self.num_labels = num_labels
26 |         self.tag_labels = tag_labels
27 |         for root, dirs, files in os.walk(mal_dir):
28 |             for file_name in files:
29 |                 md5 = file_name[-32:]
30 |                 if md5_labels.get(md5) is None:
31 |                     continue
32 |                 file_path = os.path.join(root, file_name)
33 |                 labels = md5_labels[md5]
34 |                 self.all_files.append((file_path, labels, None))
35 | 
36 |     def __len__(self):
37 |         return len(self.all_files)
38 | 
39 |     def __getitem__(self, index):
40 |         to_load, labels, _ = self.all_files[index]
41 |         with open(to_load, 'rb') as f:
42 |             x = f.read(self.max_len)
43 |             x = np.frombuffer(x, dtype=np.uint8).astype(np.int16)+1
44 |         x = torch.tensor(x)
45 |         y = torch.zeros(self.num_labels)
46 |         for label in labels:
47 |             y[self.tag_labels[label]] = 1
48 |         return x, y
49 |     
50 | 
51 | class RandomChunkSampler(torch.utils.data.sampler.Sampler):
52 |     def __init__(self, data_source, batch_size):
53 |         """
54 |         Samples random "chunks" of a dataset, so that items within a chunk
55 |         are always loaded together. Useful to keep chunks in similar size
56 |         groups to reduce runtime. 
57 | 
58 |         data_source - The source PyTorch dataset object
59 |         batch_size - The size of the chunks to keep together. Should 
60 |                      generally be set to the desired batch size during
61 |                      training to minimize runtime. 
62 |         """
63 | 
64 |         self.data_source = data_source
65 |         self.batch_size = batch_size
66 |         
67 |     def __iter__(self):
68 |         n = len(self.data_source)
69 |         data = [x for x in range(n)]
70 |         blocks = [data[i:i+self.batch_size] for i in range(0,len(data),
71 |                                                            self.batch_size)]
72 |         random.shuffle(blocks)
73 |         data[:] = [b for bs in blocks for b in bs]
74 |         return iter(data)
75 |         
76 |     def __len__(self):
77 |         return len(self.data_source)
78 | 
79 | 
80 | def pad_collate_func(batch):
81 |     """
82 |     This should be used as the collate_fn=pad_collate_func for a pytorch
83 |     DataLoader object in order to pad out files in a batch to the length of
84 |     the longest item in the batch. 
85 |     """
86 | 
87 |     vecs = [x[0] for x in batch]
88 |     y = torch.stack([x[1] for x in batch])
89 |     x = torch.nn.utils.rnn.pad_sequence(vecs, batch_first=True)
90 |     return x, y
91 | 


--------------------------------------------------------------------------------
/MalConv2_Benchmark/malconv_benchmark.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import json
  3 | import argparse
  4 | import multiprocessing
  5 | import numpy as np
  6 | import torch
  7 | import torch.optim as optim
  8 | from torch.utils.data import Dataset, DataLoader
  9 | from sklearn.metrics import roc_auc_score
 10 | from sklearn.metrics import precision_recall_fscore_support as prfs
 11 | from binaryLoader import BinaryDataset, RandomChunkSampler, pad_collate_func
 12 | from MalConv import MalConv
 13 | import torch.nn.functional as F
 14 | 
 15 | 
 16 | def get_tags(tag_path):
 17 |     """Make a dict mapping MD5s to ClarAVy tags."""
 18 |     md5_tags = {}
 19 |     with open(tag_path, "r") as f:
 20 |         for jsonl in f:
 21 |             entry = json.loads(jsonl.strip())
 22 |             md5 = entry["md5"]
 23 |             tags = [rank[0] for rank in entry["ranking"]]
 24 |             md5_tags[md5] = tags
 25 |     return md5_tags
 26 | 
 27 | 
 28 | if __name__ == "__main__":
 29 | 
 30 |     parser = argparse.ArgumentParser()
 31 |     parser.add_argument("train_dir", help="Path to directory with files " +
 32 |                         "to train on. Directory is traversed recursively.")
 33 |     parser.add_argument("test_dir", help="Path to directory with files " +
 34 |                         "to test on. Directory is traversed recursively.")
 35 |     parser.add_argument("maldict_train_file", help="Path to MalDICT .jsonl " +
 36 |                         "file with train hashes and tags")
 37 |     parser.add_argument("maldict_test_file", help="Path to MalDICT .jsonl " +
 38 |                         "file with test hashes and tags")
 39 |     parser.add_argument("--num-processes", type=int, default=1)
 40 |     args = parser.parse_args()
 41 | 
 42 |     # Default hyperparameters for MalConv2
 43 |     NON_NEG = False
 44 |     EMBD_SIZE = 8
 45 |     FILTER_SIZE = 512
 46 |     FILTER_STRIDE = 512
 47 |     NUM_CHANNELS = 128
 48 |     EPOCHS = 1
 49 |     MAX_FILE_LEN = 16000000
 50 |     BATCH_SIZE = 128
 51 |     RANDOM_STATE = 42
 52 | 
 53 |     # Read train/test tags
 54 |     train_md5_tags = get_tags(args.maldict_train_file)
 55 |     test_md5_tags = get_tags(args.maldict_test_file)
 56 | 
 57 |     # Convert tags to numeric labels
 58 |     all_tags = set()
 59 |     for tags in train_md5_tags.values():
 60 |         all_tags.update(tags)
 61 |     for tags in test_md5_tags.values():
 62 |         all_tags.update(tags)
 63 |     sorted_tags = sorted(all_tags)
 64 |     tag_labels = {tag: i for i, tag in enumerate(sorted_tags)}
 65 |     num_labels = len(sorted_tags)
 66 | 
 67 |     # Get train and test datasets of malicious files
 68 |     train_dataset = BinaryDataset(args.train_dir, train_md5_tags, num_labels, tag_labels,
 69 |                                   max_len=MAX_FILE_LEN)
 70 |     test_dataset = BinaryDataset(args.test_dir, test_md5_tags, num_labels, tag_labels,
 71 |                                  max_len=MAX_FILE_LEN)
 72 | 
 73 |     # Get train and test loaders
 74 |     train_sampler = RandomChunkSampler(train_dataset, BATCH_SIZE)
 75 |     test_sampler = RandomChunkSampler(test_dataset, BATCH_SIZE)
 76 |     loader_threads = max(multiprocessing.cpu_count()-4,
 77 |                          multiprocessing.cpu_count()//2+1)
 78 |     train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
 79 |                               num_workers=loader_threads,
 80 |                               collate_fn=pad_collate_func,
 81 |                               sampler=train_sampler)
 82 |     test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
 83 |                              num_workers=loader_threads,
 84 |                              collate_fn=pad_collate_func, sampler=test_sampler)
 85 | 
 86 |     # Initialize MalConv2 classifier
 87 |     device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
 88 |     model = MalConv(out_size=num_labels, channels=NUM_CHANNELS,
 89 |                     window_size=FILTER_SIZE, stride=FILTER_STRIDE,
 90 |                     embd_size=EMBD_SIZE).to(device)
 91 |     criterion = torch.nn.BCEWithLogitsLoss()
 92 |     optimizer = optim.Adam(model.parameters())
 93 |     model.train()
 94 | 
 95 |     # Train classifier
 96 |     for epoch in range(EPOCHS):
 97 |         for inputs, labels in train_loader:
 98 |             inputs = inputs.to(device)
 99 |             labels = labels.to(device)
100 |             optimizer.zero_grad()
101 |             outputs, _, _ = model(inputs)
102 |             loss = criterion(outputs, labels)
103 |             loss.backward()
104 |             optimizer.step()
105 | 
106 |     # Test classifier
107 |     y_test = []
108 |     y_pred = []
109 |     total = 0
110 |     model = model.eval()
111 |     with torch.no_grad():
112 |         for inputs, labels in test_loader:
113 |             inputs = inputs.to(device)
114 |             labels = labels.to(device)
115 |             outputs, _, _ = model(inputs)
116 |             outputs = torch.sigmoid(outputs)
117 |             B = inputs.shape[0]
118 |             total += B
119 |             y_test += labels.detach().cpu().numpy().tolist()
120 |             y_pred += outputs.detach().cpu().numpy().tolist()
121 |     y_test = np.array(y_test)
122 |     y_pred = np.array(y_pred)
123 | 
124 |     # Get AUC score over all tags
125 |     micro_auc = roc_auc_score(y_test, y_pred, average="micro",
126 |                               multi_class="ovr")
127 |     weighted_auc = roc_auc_score(y_test, y_pred, average="weighted",
128 |                                  multi_class="ovr")
129 | 
130 |     # Get Precision, Recall, and F1 (a tag is predicted when its score > 0.5)
131 |     y_pred = (y_pred > 0.5)
132 |     p_micro, r_micro, f1_micro, _ = prfs(y_test, y_pred, average="micro")
133 |     p_avg, r_avg, f1_avg, _ = prfs(y_test, y_pred, average="weighted")
134 | 
135 |     # Print results
136 |     print("Precision\t{} (micro)\t{} (weighted)".format(p_micro, p_avg))
137 |     print("Recall\t{} (micro)\t{} (weighted)".format(r_micro, r_avg))
138 |     print("F1-Score\t{} (micro)\t{} (weighted)".format(f1_micro, f1_avg))
139 |     print("ROC AUC\t{} (micro)\t{} (weighted)".format(micro_auc, weighted_auc))
140 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_category_test.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:c5501b1d74cc00a46b058a10038cbe2323195c61747dd4e39e3ff20de910949b
3 | size 46269303
4 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_category_train.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:c055d26d649a5d0f18090fa75c6e69b6c15a0bc0c1b3db143cbe66853814bc23
3 | size 313116717
4 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_packer_test.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:cc7dd8cc1988afa3fbd2ea8b7f1e80747b12e0fca61bad4869dfc0343bcfe9ee
3 | size 3837469
4 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_packer_train.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:4dac50e6c3d398fd0531ad216fe8620a0f5eb6157d7161dd8dd9aa93951eceaf
3 | size 15196726
4 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_platform_test.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:c2a89594549ade5bc72378793bf9d1d0994fcae2434fbf11d11163700f34329f
3 | size 17092039
4 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_platform_train.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:670ef5a4be12a0d3a3d492856454ac356c59bead471b7c4a7ad0dd589dda4a76
3 | size 56214317
4 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_vulnerability_test.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:ef0c88b54199458524e303923086848690cf546cdcd454dbb8575a92be166cf2
3 | size 3193170
4 | 


--------------------------------------------------------------------------------
/MalDICT_Tags/maldict_vulnerability_train.jsonl:
--------------------------------------------------------------------------------
1 | version https://git-lfs.github.com/spec/v1
2 | oid sha256:c75a260fec57ee3083d0e176f0af1af4e5b9ca362f3afbff2ecca61fb7cecbc4
3 | size 11273451
4 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # MalDICT
  2 | 
  3 | MalDICT is a collection of four datasets, each supporting different malware classification tasks. These datasets can be used to train a machine learning classifier on malware behaviors, file properties, vulnerability exploitation, and packers, and then evaluate the classifier's performance. More information is provided in our paper: https://arxiv.org/abs/2310.11706
  4 | 
  5 | If you use MalDICT for your research, please cite us with:
  6 | 
  7 | ```
  8 | @misc{joyce2023maldict,
  9 |       title={MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers},
 10 |       author={Robert J. Joyce and Edward Raff and Charles Nicholas and James Holt},
 11 |       year={2023},
 12 |       eprint={2310.11706},
 13 |       archivePrefix={arXiv},
 14 |       primaryClass={cs.CR}
 15 | }
 16 | ```
 17 | 
 18 | ## Dataset Contents
 19 | 
 20 | #### How Did We Build These Datasets?
 21 | 
 22 | We collected nearly 40 million VirusTotal reports for the malware in chunks 0 - 465 of the VirusShare corpus. Then, we ran [ClarAVy](https://github.com/NeuromorphicComputationResearchProgram/ClarAVy/tree/master), a tool we developed for tagging malware, on all of these VirusTotal reports. The tags output by ClarAVy can indicate a malicious file's behaviors, file properties, vulnerability exploitation, and packer. Some tags were very rare and others were applied to millions of files, resulting in a large class imbalance. We discarded any tags that were too rare and randomly down-sampled tags that were too common. The discard and down-sampling thresholds were different for each of the four datasets in MalDICT.
 23 | 
 24 | #### MalDICT-Behavior
 25 | 
 26 | MalDICT-Behavior is a dataset of malware tagged according to its category or behavior (e.g. ransomware, downloader, autorun). It includes 4,317,241 malicious files tagged according to 75 different malware categories or malicious behaviors. A file may have multiple tags if it belongs to multiple malware categories and/or exhibits more than one type of malicious behavior.
 27 | 
 28 | A default train/test split for MalDICT-Behavior is provided in ```MalDICT_Tags/maldict_category_train.jsonl``` and ```MalDICT_Tags/maldict_category_test.jsonl```. The training set uses malware from VirusShare chunks 0 - 315 and the test set uses malware from chunks 316 - 465. Chunks in the test set include newer malware than the training set, effectively creating a temporal train/test split. In order for a machine learning classifier to perform well, it must learn to generalize to the "future" malware in the test set.
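Because a file may carry several tags, its labels are naturally represented as a multi-hot vector with one position per tag. The sketch below is illustrative only: it mirrors the general approach of the MalConv2 benchmark script in this repository (sort the tags, assign each an index, set a 1 for every tag on the file), but it uses plain Python lists rather than tensors. The helper names `build_tag_index` and `multi_hot` and the placeholder hashes are assumptions made for the example.

```python
def build_tag_index(tag_lists):
    """Map every distinct tag to a stable integer index (alphabetical order)."""
    all_tags = set()
    for tags in tag_lists:
        all_tags.update(tags)
    return {tag: i for i, tag in enumerate(sorted(all_tags))}


def multi_hot(tags, tag_index):
    """Return a 0/1 list with a 1 at the index of each tag on the file."""
    vec = [0] * len(tag_index)
    for tag in tags:
        vec[tag_index[tag]] = 1
    return vec


# Placeholder hashes; the tag names are the examples mentioned above.
md5_tags = {
    "a" * 32: ["ransomware"],
    "b" * 32: ["downloader", "autorun"],
}
tag_index = build_tag_index(md5_tags.values())
labels = {md5: multi_hot(tags, tag_index) for md5, tags in md5_tags.items()}
print(tag_index)
print(labels["b" * 32])
```

A multi-hot target of this kind pairs with a per-tag sigmoid output and a binary cross-entropy loss, which is what the MalConv2 benchmark script uses.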
 29 | 
 30 | 
 31 | #### MalDICT-Platform
 32 | 
 33 | MalDICT-Platform includes 963,492 malicious files and has 43 tags for different target operating systems, file formats, and programming languages (e.g. win64, pdf, java). It uses the same temporal train/test split method as MalDICT-Behavior. Hashes and tags for the training set are in ```MalDICT_Tags/maldict_platform_train.jsonl``` and the test set is in ```MalDICT_Tags/maldict_platform_test.jsonl```.
 34 | 
 35 | 
 36 | #### MalDICT-Vulnerability
 37 | 
 38 | The MalDICT-Vulnerability dataset has 173,886 files which are tagged according to the vulnerability that they exploit. The dataset includes tags for 128 different vulnerabilities (e.g. cve_2017_0144, ms08_067).
 39 | 
 40 | Hashes and tags for MalDICT-Vulnerability are in ```MalDICT_Tags/maldict_vulnerability_train.jsonl``` and ```MalDICT_Tags/maldict_vulnerability_test.jsonl```. Unlike MalDICT-Behavior and MalDICT-Platform, this dataset uses a stratified split so that each tag is split proportionally between the training and test sets.
 41 | 
 42 | 
 43 | #### MalDICT-Packer
 44 | 
 45 | MalDICT-Packer contains 252,148 malicious files, tagged according to the packer used to pack the file. It includes 79 different malware packers. Train and test split files are located in ```MalDICT_Tags/maldict_packer_train.jsonl``` and ```MalDICT_Tags/maldict_packer_test.jsonl```. MalDICT-Packer also uses a stratified train-test split.
 46 | 
 47 | 
 48 | ## Downloading the Datasets
 49 | 
 50 | #### Downloading File Hashes and Tags
 51 | 
 52 | File hashes and tags for all of the malware in MalDICT are provided in .jsonl files within the ```MalDICT_Tags/``` directory of this repo. GIT-LFS is required for downloading these files due to their size. On Debian-based systems, GIT-LFS can be installed using:
 53 | 
 54 | ```
 55 | sudo apt-get install git-lfs
 56 | ```
 57 | 
 58 | After installing GIT-LFS, you can download the hashes and tags by cloning this repository:
 59 | 
 60 | ```
 61 | git lfs clone https://github.com/joyce8/MalDICT/
 62 | ```
 63 | 
 64 | #### Downloading Malicious Files
 65 | 
 66 | We are releasing all of the Windows Portable Executable (PE) files in MalDICT-Behavior, MalDICT-Platform, and MalDICT-Packer. These files have been disarmed so that they cannot be executed. We did this using the same method as the [SOREL](https://github.com/sophos/SOREL-20M) and [MOTIF](https://github.com/boozallen/MOTIF) datasets (by zeroing out two fields in each file's PE headers). 7zip files containing the disarmed PE files can be downloaded [here](https://drive.google.com/drive/folders/18X0GgEIEczvLEFir2GMNGPdiKAhHXxfJ?usp=sharing). The password to each 7zip file is ```infected```. The total size of the extracted files is approximately 2.1TB.
 67 | 
 68 | Unfortunately, we cannot publish the non-PE malware in MalDICT at this time. However, all of the malware in MalDICT is a subset of the VirusShare corpus (chunks 0 - 465). The full VirusShare corpus is distributed by [VirusShare](https://virusshare.com/login) and by [vx-underground](https://www.vx-underground.org/#E:/root/Samples/Virusshare%20Collection/Downloadable%20Releases). The file hashes in ```MalDICT_Tags/``` can be used to select the appropriate files from VirusShare and assemble the complete MalDICT datasets.
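Selecting the right files from a local copy of VirusShare could look roughly like the sketch below. It assumes each line of a tag file is a JSON object with an `md5` field (as read by the benchmark scripts in this repo) and that each corpus file's name ends in its 32-character MD5, for example `VirusShare_<md5>`. The helper names `load_md5s` and `select_files`, the hashes, and the `ranking` value in the demo are made up for illustration.

```python
import json
import os
import tempfile


def load_md5s(tag_path):
    """Collect the MD5 hashes listed in a MalDICT .jsonl tag file."""
    md5s = set()
    with open(tag_path, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                md5s.add(json.loads(line)["md5"].lower())
    return md5s


def select_files(corpus_dir, md5s):
    """Yield paths under corpus_dir whose file name ends in a wanted MD5."""
    for root, _, files in os.walk(corpus_dir):
        for name in files:
            if name[-32:].lower() in md5s:
                yield os.path.join(root, name)


# Tiny demo with made-up hashes and empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    wanted, other = "0" * 32, "f" * 32
    tag_path = os.path.join(tmp, "tags.jsonl")
    with open(tag_path, "w") as f:
        # The "ranking" contents here are placeholders, not real tag data.
        f.write(json.dumps({"md5": wanted, "ranking": [["upx", 1]]}) + "\n")
    corpus = os.path.join(tmp, "corpus")
    os.makedirs(corpus)
    for md5 in (wanted, other):
        open(os.path.join(corpus, "VirusShare_" + md5), "wb").close()
    selected = [os.path.basename(p)
                for p in select_files(corpus, load_md5s(tag_path))]
    print(selected)
```

For the real datasets, `tag_path` would point at one of the files in ```MalDICT_Tags/``` and `corpus_dir` at the extracted VirusShare chunks. The selected paths can then be copied or symlinked into separate training and test directories.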
 69 | 
 70 | 
 71 | #### Downloading EMBER Raw Metadata
 72 | 
 73 | We extracted EMBER (v2) raw features from all of the PE files in MalDICT-Behavior, MalDICT-Platform, and MalDICT-Packer. MalDICT-Vulnerability is excluded because most files in it are not in the PE file format. The EMBER metadata files can be downloaded [here](https://drive.google.com/drive/folders/1QhQBoZ-7RPkad3059VFjLZqLDdb2HW0H?usp=share_link). Each line in one of the metadata files is a JSON object with the following fields:
 74 | 
 75 | | Name | Description |
 76 | |---|---------|
 77 | | md5 | MD5 hash of file |
 78 | | histogram | EMBER byte histogram |
 79 | | byteentropy | EMBER byte entropy statistics |
 80 | | strings | EMBER strings metadata |
 81 | | general | EMBER general file metadata |
 82 | | header | EMBER PE header metadata |
 83 | | section | EMBER PE section metadata |
 84 | | imports | EMBER imports metadata |
 85 | | exports | EMBER exports metadata |
 86 | | datadirectories | EMBER data directories metadata |
 87 | 
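As a rough illustration of working with these files, the sketch below indexes metadata records by MD5 and checks that each record carries the fields listed in the table. The function names `parse_metadata` and `missing_fields` are hypothetical, and the sample record uses placeholder values rather than real EMBER output.

```python
import json

# Field names taken from the table above.
EXPECTED_FIELDS = [
    "md5", "histogram", "byteentropy", "strings", "general",
    "header", "section", "imports", "exports", "datadirectories",
]


def parse_metadata(lines):
    """Index EMBER metadata records by MD5, skipping blank lines."""
    records = {}
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            records[record["md5"]] = record
    return records


def missing_fields(record):
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Placeholder record: the MD5 and all values are made up for illustration.
sample = {name: None for name in EXPECTED_FIELDS}
sample["md5"] = "0" * 32
records = parse_metadata([json.dumps(sample), ""])
print(missing_fields(records["0" * 32]))
```

With a real metadata file, an open file handle can be passed directly to `parse_metadata`. Since the metadata files may be large, processing records one at a time instead of building a dictionary may be preferable.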
 88 | 
 89 | ## Training a LightGBM Baseline Classifier
 90 | 
 91 | LightGBM uses an ensemble of gradient-boosted trees for classification. In this benchmark, it is trained on Windows PE malware represented in the EMBER feature vector format. Code for training and evaluating a LightGBM classifier is in ```LightGBM_Benchmark/```. You will need the following Python packages:
 92 | 
 93 | ```
 94 | pip install scikit-learn
 95 | pip install lightgbm
 96 | pip install git+https://github.com/elastic/ember.git
 97 | ```
 98 | 
 99 | You will also need the MalDICT tag files in the ```MalDICT_Tags/``` folder and the [EMBER raw metadata files](https://drive.google.com/drive/folders/1QhQBoZ-7RPkad3059VFjLZqLDdb2HW0H?usp=share_link) for training the model. Usage for the LightGBM benchmark script is shown below:
100 | 
101 | ```
102 | usage: lightgbm_benchmark.py [-h] [--num-processes NUM_PROCESSES] ember_meta_dir/ maldict_train_file maldict_test_file
103 | 
104 | positional arguments:
105 |   ember_meta_dir        path to directory with raw EMBER metadata .jsonl files (train and test)
106 |   maldict_train_file    path to MalDICT .jsonl file with train hashes and tags
107 |   maldict_test_file     path to MalDICT .jsonl file with test hashes and tags
108 | 
109 | optional arguments:
110 |   -h, --help            show this help message and exit
111 |   --num-processes NUM_PROCESSES
112 | ```
113 | 
114 | The following example shows how to train a LightGBM classifier on the MalDICT-Packer dataset:
115 | 
116 | ```
117 | python lightgbm_benchmark.py /path/to/EMBER_meta/EMBER_packer/ /path/to/MalDICT_Tags/maldict_packer_train.jsonl /path/to/MalDICT_Tags/maldict_packer_test.jsonl
118 | ```
119 | 
120 | 
121 | ## Training a MalConv2 Baseline Classifier
122 | 
123 | MalConv2 is a deep neural network which learns from the raw bytes within files. Code for training and evaluating a MalConv2 classifier is in ```MalConv2_Benchmark/```. You will need the following Python packages:
124 | 
125 | ```
126 | pip install numpy
127 | pip install scikit-learn
128 | pip install torch
129 | ```
130 | 
131 | You will also need the MalDICT tag files in the ```MalDICT_Tags/``` folder as well as the [malicious files](https://drive.google.com/drive/folders/18X0GgEIEczvLEFir2GMNGPdiKAhHXxfJ?usp=sharing), separated into training and testing directories. Usage for the MalConv2 benchmark script is shown below:
132 | 
133 | ```
134 | usage: malconv_benchmark.py [-h] [--num-processes NUM_PROCESSES] train_dir/ test_dir/ maldict_train_file maldict_test_file
135 | 
136 | positional arguments:
137 |   train_dir             Path to directory with files to train on. Directory is traversed recursively.
138 |   test_dir              Path to directory with files to test on. Directory is traversed recursively.
139 |   maldict_train_file    Path to MalDICT .jsonl file with train hashes and tags
140 |   maldict_test_file     Path to MalDICT .jsonl file with test hashes and tags
141 | 
142 | 
143 | optional arguments:
144 |   -h, --help            show this help message and exit
145 |   --num-processes NUM_PROCESSES
146 | ```
147 | 
148 | The following example shows how to train a MalConv2 classifier on the MalDICT-Packer dataset:
149 | 
150 | ```
151 | python malconv_benchmark.py /path/to/maldict_disarmed_packer_train/ /path/to/maldict_disarmed_packer_test/ /path/to/MalDICT_Tags/maldict_packer_train.jsonl /path/to/MalDICT_Tags/maldict_packer_test.jsonl
152 | ```
153 | 


--------------------------------------------------------------------------------