├── .gitignore
├── MANIFEST.in
├── README.md
├── annotatation_examples
│   ├── blastKoala_annotations.tsv.gz
│   ├── dram_annotations.tsv.gz
│   ├── ghostKoala_annotations.tsv.gz
│   └── kofamscan_annotations.tsv.gz
└── package
    ├── build
    │   └── lib
    │       └── metapathpredict
    │           └── cmdline_models.py
    ├── setup.py
    └── src
        ├── .DS_Store
        ├── metapathpredict.egg-info
        │   ├── PKG-INFO
        │   ├── SOURCES.txt
        │   ├── dependency_links.txt
        │   ├── entry_points.txt
        │   ├── requires.txt
        │   └── top_level.txt
        └── metapathpredict
            ├── .DS_Store
            ├── MetaPathPredict.py
            ├── __init__.py
            ├── data
            │   ├── __init__.py
            │   ├── labels.pkl
            │   ├── metapathmodules.pkl
            │   └── requiredCols.pkl
            ├── download_models.py
            └── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | package/build
2 | metapathpredict.log
3 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | recursive-include src/metapathpredict/ *.py
2 | recursive-include src/metapathpredict/data *.pkl
3 | recursive-include src/metapathpredict/data *.keras
4 | recursive-include src/metapathpredict/data *.py
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MetaPathPredict
2 |
3 | The MetaPathPredict Python module utilizes deep learning models to predict the presence or absence of KEGG metabolic modules in bacterial genomes recovered from environmental sequencing efforts.
4 |
5 | ## Installation
6 |
7 | To run MetaPathPredict, download this repository and install it as a Python module (see download and installation instructions below):
8 |
9 |
10 | ### GitHub install:
11 |
12 | NOTE: [Conda](https://docs.conda.io/en/latest/) is required for this installation.
13 |
14 | 1. Open a Terminal/Command Prompt window and run the following command to download the
15 | GitHub repository to the desired location (note: first change your current working directory
16 | to the desired download location, e.g., `~/Downloads` on macOS):
17 | `git clone https://github.com/d-mcgrath/MetaPathPredict.git`
18 | 
19 |    NOTE: You can also download the repository zip file from GitHub instead of cloning it.
20 |
21 | 2. In a Terminal/Command Prompt window, run the following commands from the parent directory into which the MetaPathPredict repository was cloned:
22 | ```bash
23 | conda create -n MetaPathPredict python=3.10.6 scikit-learn=1.3.0 tensorflow=2.10.0 numpy=1.23.4 pandas=1.5.2 keras=2.10.0 git=2.40.1
24 | ```
25 | NOTE: You will be prompted (y/n) to confirm creating this conda environment. Now activate it:
26 |
27 | ```bash
28 | conda activate MetaPathPredict
29 | ```
30 |
31 | 3. Install the `huggingface_hub` library:
32 | ```bash
33 | pip install --upgrade huggingface_hub
34 | ```
35 |
36 | 4. Once complete, pip install MetaPathPredict:
37 | ```bash
38 | pip install MetaPathPredict/package
39 | ```
40 |
41 | 5. Download MetaPathPredict's models by running the following command:
42 | ```bash
43 | DownloadModels
44 | ```
45 |
46 | Note: MetaPathPredict is now installed in the `MetaPathPredict` conda environment. Activate the conda environment prior to any use of MetaPathPredict.
47 |
48 | ### pip install:
49 | [not available yet]
50 |
51 |
52 |
53 | ## Functions
54 |
55 | The following functions are available on the command line after installing MetaPathPredict:
56 |
57 | - `MetaPathPredict` parses one or more input KEGG Ortholog gene annotation datasets (currently only bacterial genome data is supported) and predicts the presence or absence of [KEGG Modules](https://www.genome.jp/kegg/module.html). This function takes as input the .tsv output files from the [KofamScan](https://github.com/takaram/kofam_scan) and [DRAM](https://github.com/WrightonLabCSU/DRAM) gene annotation tools as well as the KEGG KOALA online annotation platforms [blastKOALA](https://www.kegg.jp/blastkoala/), [ghostKOALA](https://www.kegg.jp/ghostkoala/), and [kofamKOALA](https://www.genome.jp/tools/kofamkoala/). Run any of these tools first and then use one or more of their output .tsv files as input to MetaPathPredict.
58 |   - A single file or multiple space-separated files can be specified with the `--input` parameter, or use a wildcard (e.g., /results/*.tsv). Include full or relative paths to the input file(s). A sample of each annotation file format that MetaPathPredict can process is included in this repository in the [annotatation_examples](annotatation_examples) folder; these sample files can optionally be used as input to test the installation (see the snippet following this list for a quick way to inspect them).
59 | - The format of the gene annotation files (kofamscan, kofamkoala, dram, or koala) that is used as input must be specified with the `--annotation-format` parameter. Currently, only one input type can be specified at a time.
60 |   - The full or relative path to the desired destination for MetaPathPredict's output .tsv file, along with a file name, must be specified using the `--output` parameter. MetaPathPredict does not create an output directory or assign a default output file name.
61 | - To specify a specific KEGG module or modules to reconstruct and predict, include the module identifier (e.g., M00001) or identifiers as a space-separated list to the argument `--kegg-modules`.
62 |
63 | - To view which KEGG modules MetaPathPredict can reconstruct and make predictions for, run the following on the command line: `MetaPathModules`.
64 |
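The bundled example annotation files can also be inspected directly before running MetaPathPredict. The snippet below is a minimal sketch (it assumes Python is launched from the repository root in the activated conda environment, and that the gzipped example files are present):

```python
import gzip

# Print the first few lines of one of the gzipped example annotation tables
# shipped in the annotatation_examples folder.
with gzip.open("annotatation_examples/kofamscan_annotations.tsv.gz", "rt") as fh:
    for _ in range(5):
        print(fh.readline().rstrip())
```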
65 |
66 |
67 | ## Basic usage
68 |
69 | ```
70 | # predict method for making KEGG module presence/absence predictions on input gene annotations
71 |
72 | usage: MetaPathPredict [-h] --input INPUT [INPUT ...] --annotation-format ANNOTATION_FORMAT
73 | [--kegg-modules KEGG_MODULES [KEGG_MODULES ...]] --output OUTPUT
74 |
75 | options:
76 | -h, --help show this help message and exit
77 | --input INPUT [INPUT ...], -i INPUT [INPUT ...]
78 | input file path(s) and name(s) [required]
79 | --annotation-format ANNOTATION_FORMAT, -a ANNOTATION_FORMAT
80 | annotation format (kofamscan, kofamkoala, dram, or koala) [default:
81 | kofamscan]
82 | --kegg-modules KEGG_MODULES [KEGG_MODULES ...], -k KEGG_MODULES [KEGG_MODULES ...]
83 | KEGG modules to predict [default: MetaPathPredict KEGG modules]
84 | --output OUTPUT, -o OUTPUT
85 | output file path and name [required]
86 | ```
87 |
88 |
89 |
90 | ## Examples with sample datasets
91 |
92 | ```
93 | # One KofamScan gene annotation dataset
94 | MetaPathPredict -i /path/to/kofamscan_annotations_1.tsv -a kofamscan -o /results/predictions.tsv
95 |
96 | # Three KofamScan gene annotation datasets, with predictions for modules M00001 and M00003
97 | MetaPathPredict \
98 | -i kofamscan_annotations_1.tsv kofamscan_annotations_2.tsv kofamscan_annotations_3.tsv \
99 | -a kofamscan \
100 | -k M00001 M00003 \
101 | -o /results/predictions.tsv
102 |
103 | # Multiple KofamScan datasets in a directory
104 | MetaPathPredict -i annotations/*.tsv -a kofamscan -o /results/predictions.tsv
105 |
106 | # One DRAM gene annotation dataset
107 | MetaPathPredict -i dram_annotation.tsv -a dram -o /results/predictions.tsv
108 |
109 | # Multiple DRAM datasets in a directory
110 | MetaPathPredict -i annotations/*.tsv -a dram -o /results/predictions.tsv
111 | ```
112 |
113 |
114 |
115 | ## Understanding the output
116 |
117 | The output of running `MetaPathPredict` is a table. The first column, `file`, displays the full file name of each input gene annotation file. The remaining columns give the class predictions (module present = 1; module absent = 0) of KEGG modules. Each KEGG module occupies a single column in the table and is labelled by its module identifier. See a sample output below of four KEGG module predictions for three input annotation files:
118 |
119 | | file | M00001 | M00002 | M00003 | M00004 |
120 | |--------------------------------------|--------|--------|--------|--------|
121 | | /path/to/kofamscan_annotations_1.tsv | 1 | 1 | 0 | 1 |
122 | | /path/to/kofamscan_annotations_2.tsv | 0 | 1 | 0 | 0 |
123 | | /path/to/kofamscan_annotations_3.tsv | 1 | 0 | 0 | 0 |
124 |
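To work with this table programmatically, the sketch below (assuming pandas is installed in the same environment, and using the placeholder file name `predictions.tsv`) loads the predictions and lists the modules predicted present for each input file:

```python
import pandas as pd

# Load a MetaPathPredict output table (tab-separated; one row per input annotation file).
preds = pd.read_csv("predictions.tsv", sep="\t")

# For each input file, list the KEGG modules predicted present (value 1).
for _, row in preds.iterrows():
    present = [module for module in preds.columns[1:] if row[module] == 1]
    print(row["file"], "->", ", ".join(present) if present else "none")
```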
125 |
126 |
127 | ## Developer usage
128 |
129 | ```
130 | # training method for MetaPathPredict's internal models
131 |
132 | usage: MetaPathTrain [-h] --train-targets TRAIN_TARGETS --train-features TRAIN_FEATURES
133 | [--num-epochs NUM_EPOCHS] --model-out MODEL_OUT [--use-gpu]
134 | [--num-cores NUM_CORES] [--num-hidden-layers NUM_HIDDEN_LAYERS]
135 | [--hidden-nodes-per-layer HIDDEN_NODES_PER_LAYER]
136 | [--num-features NUM_FEATURES] [--threshold THRESHOLD]
137 |
138 | options:
139 | -h, --help show this help message and exit
140 | --train-targets TRAIN_TARGETS
141 | training targets file
142 | --train-features TRAIN_FEATURES
143 | training features
144 | --num-epochs NUM_EPOCHS
145 | number of epochs
146 | --model-out MODEL_OUT, -m MODEL_OUT
147 | model file name output
148 | --use-gpu use GPU if available
149 | --num-cores NUM_CORES
150 | Number of cores for parallel processing
151 |
152 | Neural Net parameters:
153 | --num-hidden-layers NUM_HIDDEN_LAYERS
154 | number of hidden layers
155 | --hidden-nodes-per-layer HIDDEN_NODES_PER_LAYER
156 | number of nodes in each hidden layer
157 | --num-features NUM_FEATURES
158 | number of features to retain from training data
159 | --threshold THRESHOLD
160 | threshold for SelectKBest feature selection
161 | ```
162 |
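For reference, `MetaPathTrain` reads its `--train-features` and `--train-targets` inputs as gzip-compressed, tab-separated tables (via `pandas.read_table(..., compression="gzip")`). The sketch below, using made-up KEGG Ortholog and module columns purely for illustration, shows one way such files could be prepared:

```python
import pandas as pd

# Illustrative matrices: rows are genomes; feature columns are KEGG Ortholog counts,
# target columns are KEGG module presence/absence labels (0/1).
features = pd.DataFrame({"K00001": [1, 0, 2], "K00002": [0, 1, 1]})
targets = pd.DataFrame({"M00001": [1, 0, 1], "M00002": [0, 1, 0]})

# Write gzip-compressed TSV files of the kind MetaPathTrain expects.
features.to_csv("train_features.tsv.gz", sep="\t", index=False, compression="gzip")
targets.to_csv("train_targets.tsv.gz", sep="\t", index=False, compression="gzip")
```

The file names above are placeholders; pass whichever paths you use to `--train-features` and `--train-targets`.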
--------------------------------------------------------------------------------
/annotatation_examples/blastKoala_annotations.tsv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/blastKoala_annotations.tsv.gz
--------------------------------------------------------------------------------
/annotatation_examples/dram_annotations.tsv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/dram_annotations.tsv.gz
--------------------------------------------------------------------------------
/annotatation_examples/ghostKoala_annotations.tsv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/ghostKoala_annotations.tsv.gz
--------------------------------------------------------------------------------
/annotatation_examples/kofamscan_annotations.tsv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/kofamscan_annotations.tsv.gz
--------------------------------------------------------------------------------
/package/build/lib/metapathpredict/cmdline_models.py:
--------------------------------------------------------------------------------
1 | """
2 | Command Line Interface for MetaPathPredict Tools:
3 | ====================================
4 |
5 | .. currentmodule:: metapathpredict
6 |
7 | class methods:
8 | MetaPathPredict methods
9 | """
10 |
11 | import logging
12 | import argparse
13 | import datetime
14 | import pickle
15 | import os
16 | import sys
17 | import re
18 | import math
19 | import importlib.resources  # needed for importlib.resources.files(...)
20 | from typing import Iterable, List, Dict, Set, Optional, Sequence
21 | from itertools import chain
22 |
23 | # disable tensorflow info messages
24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
25 |
26 | import sklearn
27 | import numpy as np
28 | import pandas as pd
29 | import keras
30 | from torchvision import transforms
31 | import torch.optim as optim
32 | from torch.utils.data import Dataset, DataLoader, TensorDataset
33 | from sklearn.model_selection import train_test_split
34 | from sklearn.preprocessing import StandardScaler
35 | from sklearn.feature_selection import SelectKBest, f_classif
36 | from sklearn.metrics import classification_report
37 | import torch
38 | import torch.nn as nn
39 |
40 | import warnings
41 | from sklearn.exceptions import InconsistentVersionWarning
42 | warnings.filterwarnings(action='ignore', category=InconsistentVersionWarning)
43 |
44 | from metapathpredict.utils import InputData
45 | from metapathpredict.utils import AnnotationList
46 |
47 |
48 | # CUDA for PyTorch
49 | use_cuda = torch.cuda.is_available()
50 | device = torch.device("cuda:0" if use_cuda else "cpu")
51 | # device = "cpu"
52 |
53 | torch.backends.cudnn.benchmark = True
54 |
55 | # Parameters
56 | params = {"batch_size": 64, "shuffle": True, "num_workers": 6}
57 |
58 | #Configure the logging system
59 | logging.basicConfig(
60 | filename='HISTORYlistener.log',
61 | level=logging.DEBUG,
62 | format='%(asctime)s %(levelname)s %(module)s - %(message)s',
63 | datefmt='%Y-%m-%d %H:%M:%S')
64 |
65 | root = logging.getLogger()
66 | root.setLevel(logging.DEBUG)
67 |
68 | handler = logging.StreamHandler(sys.stdout)
69 | handler.setLevel(logging.DEBUG)
70 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
71 | handler.setFormatter(formatter)
72 | root.addHandler(handler)
73 |
74 |
75 |
76 | class CustomDataset(Dataset):
77 | def __init__(self, data, targets, transform=None):
78 | print("type", type(data), data.shape)
79 | self.data = torch.tensor(data, dtype=torch.float32)
80 | self.targets = torch.tensor(targets, dtype=torch.float32)
81 | self.transform = transform
82 |
83 | def __len__(self):
84 | return len(self.data)
85 |
86 | def __getitem__(self, idx):
87 | features, target = self.data[idx], self.targets[idx]
88 |
89 |         if self.transform:
90 |             features = self.transform(features)
91 |
92 | return features, target
93 |
94 |
95 |
96 | class CustomModel(nn.Module):
97 | def __init__(self, num_hidden_nodes_per_layer=1024, num_hidden_layers=5):
98 | super(CustomModel, self).__init__()
99 | NUM_HIDDEN_NODES = num_hidden_nodes_per_layer
100 | self.NUM_HIDDEN_LAYERS = num_hidden_layers
101 |
102 | self.fc1 = nn.Linear(2000, NUM_HIDDEN_NODES)
103 | self.relu = nn.ReLU()
104 | self.dropout = nn.Dropout(0.1)
105 |
106 |         # hidden layers, wrapped in nn.ModuleList so their parameters are registered
107 |         self.fcs = nn.ModuleList(
108 |             [nn.Linear(NUM_HIDDEN_NODES, NUM_HIDDEN_NODES)
109 |              for i in range(num_hidden_layers)]
110 |         )
111 |
112 | self.output_layer = nn.Linear(NUM_HIDDEN_NODES, 94)
113 | self.sigmoid = nn.Sigmoid()
114 |
115 | def forward(self, x):
116 | x = self.fc1(x)
117 | x = self.relu(x)
118 | x = self.dropout(x)
119 |
120 | for i in range(self.NUM_HIDDEN_LAYERS - 1):
121 | x = self.fcs[i](x)
122 | x = self.relu(x)
123 | x = self.dropout(x)
124 |
125 | x = self.fcs[self.NUM_HIDDEN_LAYERS - 1](x)
126 | x = self.relu(x)
127 |
128 | x = self.output_layer(x)
129 | x = self.sigmoid(x)
130 | return x
131 |
132 |
133 |
134 | class Models:
135 |
136 | """Platform-agnostic command line functions available in MetaPathPredict tools."""
137 |
138 | @classmethod
139 | def train(cls, args: Iterable[str] = None) -> int:
140 |         """Train a model from the input data.
141 | 
142 |         Writes out a trained DNN model file via torch.save.
143 |
144 | Parameters
145 | ----------
146 | args : Iterable[str], optional
147 | value of None, when passed to `parser.parse_args` causes the parser to
148 | read `sys.argv`
149 |
150 | Returns
151 | -------
152 | return_call : 0
153 | return call if the program completes successfully
154 |
155 | """
156 | parser = argparse.ArgumentParser()
157 |
158 | parser.add_argument(
159 | "--train-targets",
160 | dest="train_targets",
161 | required=True,
162 | help="training targets file",
163 | )
164 | parser.add_argument(
165 | "--train-features",
166 | dest="train_features",
167 | required=True,
168 | help="training features",
169 | )
170 | parser.add_argument(
171 | "--num-epochs",
172 | dest="num_epochs",
173 | required=False,
174 | default=100,
175 | type=int,
176 | help="number of epochs",
177 | )
178 | parser.add_argument(
179 | "--model-out",
180 | "-m",
181 | dest="model_out",
182 | required=True,
183 | help="model file name output",
184 | )
185 | parser.add_argument(
186 | "--use-gpu",
187 | dest="use_gpu",
188 | required=False,
189 | action="store_true",
190 | help="use GPU if available",
191 | )
192 | parser.add_argument(
193 | "--num-cores",
194 | dest="num_cores",
195 | required=False,
196 | default=10,
197 | type=int,
198 | help="Number of cores for parallel processing",
199 | )
200 | neural_net_params = parser.add_argument_group("Neural Net parameters")
201 | neural_net_params.add_argument(
202 | "--num-hidden-layers",
203 | default=5,
204 | required=False,
205 | type=int,
206 | help="number of hidden layers",
207 | )
208 | neural_net_params.add_argument(
209 | "--hidden-nodes-per-layer",
210 | type=int,
211 | required=False,
212 | default=1024,
213 | help="number of nodes in each hidden layer",
214 | )
215 | neural_net_params.add_argument(
216 | "--num-features",
217 | dest="num_features",
218 | default=2000,
219 | required=False,
220 | type=int,
221 | help="number of features to retain from training data",
222 | )
223 | neural_net_params.add_argument(
224 | "--threshold",
225 | dest="threshold",
226 | default=6432,
227 | required=False,
228 | type=float,
229 | help="threshold for SelectKBest feature selection",
230 | )
231 |
232 |
233 |         args = parser.parse_args(args)
234 |
235 | # CUDA for PyTorch
236 | device = "cpu"
237 | if args.use_gpu:
238 | use_cuda = torch.cuda.is_available()
239 | device = torch.device("cuda:0" if use_cuda else "cpu")
240 |
241 | logging.info(f"Using device: {device}")
242 |
243 | # read in features
244 | features = pd.read_table(args.train_features, compression="gzip")
245 | logging.info(f"reading input features of shape: {features.shape[0]} x {features.shape[1]}")
246 |
247 | # read in labels
248 | targets = pd.read_table(args.train_targets, compression="gzip")
249 | logging.info(f"reading input labels of shape: {targets.shape[0]} x {targets.shape[1]}")
250 |
251 | # split the data into training and test sets
252 | test_size = 0.25
253 | x, x_test, y, y_test = train_test_split(
254 | features,
255 | targets,
256 | stratify=targets,
257 | shuffle=True,
258 | test_size= test_size,
259 | random_state=111,
260 | )
261 |         logging.info(f"creating test split with test fraction: {test_size}")
262 |
263 | # Split the remaining data to train and validation
264 | x_train, x_val, y_train, y_val = train_test_split(
265 | x, y, stratify=y, test_size=0.2, shuffle=True, random_state=111
266 | )
267 |
268 | print("features size", features.shape)
269 | print("targets size", targets.shape)
270 |
271 | print("x_test", x_test.shape, " y_test ", y_test.shape)
272 | print("x", x.shape, " y ", y.shape)
273 |
274 | print("x_train", x_train.shape, " y_train ", y_train.shape)
275 | print("x_val", x_val.shape, " y_val ", y_val.shape)
276 | print("x_test", x_test.shape, " y_test ", y_test.shape)
277 |
278 |
279 |
280 | # Initialize the StandardScaler
281 | scaler = StandardScaler()
282 |
283 | # Fit the scaler to training data and transform it
284 | # and then transform val and test data w/ the fitted scaler object
285 | # (std. dev., variance, etc. are based on training data columns)
286 | scaled_features = scaler.fit_transform(x_train)
287 | x_train = pd.DataFrame(scaled_features, index = x_train.index, columns = x_train.columns)
288 | x_val = pd.DataFrame(scaler.transform(x_val), index = x_val.index, columns = x_val.columns)
289 | x_test = pd.DataFrame(scaler.transform(x_test), index = x_test.index, columns = x_test.columns)
290 | logging.info(f"normalizing the training input features")
291 |
292 |
293 |
294 | # feature selection based only on the training data
295 | # Select features according to the k highest F-values
296 | # from running ANOVA on y_train and x_train
297 | selected_features = []
298 | for label in y_train:
299 | selector = SelectKBest(f_classif, k = 'all')
300 | selector.fit(x_train, y_train[label])
301 | selected_features.append(list(selector.scores_))
302 |
303 | # select threshold that retains 2000 features
304 | threshold = args.threshold
305 |
306 | # # MeanCS
307 | logging.info(f"total number of features in input: {x_train.shape[1]}")
308 | selected_features2 = np.mean(selected_features, axis = 0) > threshold
309 | logging.info(f"number of features selected for training: {sum(selected_features2)}")
310 |
311 | # create new training, validation, and test datasets retaining only the 2000 top features
312 | # determined from the training data
313 | x_train2 = x_train.loc[:, selected_features2]
314 | x_val2 = x_val.loc[:, selected_features2]
315 | x_test2 = x_test.loc[:, selected_features2]
316 | features_used = x_train2.columns.values
317 | labels_used = y_val.columns.values
318 |
319 | logging.info(f"Using features : {str(features_used)}")
320 | logging.info(f"Using labels : {str(labels_used)}")
321 |
322 | # Initialize the StandardScaler
323 | #scaler = StandardScaler()
324 |
325 | # Fit the scaler to your data and transform it
326 | #x_train2 = scaler.fit_transform(x_train2)
327 | #x_val2 = scaler.fit_transform(x_val2)
328 | #logging.info(f"normalizing the training input features")
329 |
330 | y_train = np.asarray(y_train.values)
331 | y_val = np.asarray(y_val.values)
332 |
333 | print()
334 | print("x_train2", x_train2.shape)
335 | print("x_val2", x_val2.shape)
336 | print("x_test2", x_test2.shape)
337 |
338 |         # outline the neural network architecture - multilabel classifier
339 |         # 1 input layer, 5 hidden layers, 1 output layer
340 |         # include dropout for all hidden layers
341 | model = CustomModel(
342 | num_hidden_nodes_per_layer=args.hidden_nodes_per_layer,
343 | num_hidden_layers=args.num_hidden_layers,
344 | ).to(device)
345 |
346 | # Define loss function and optimizer
347 | criterion = nn.BCELoss()
348 | optimizer = optim.Adam(model.parameters(), lr=0.001)
349 | logging.info(f"optimizer Adam with learning rate: 0.001")
350 |
351 |         # Reduce the learning rate when validation loss plateaus (an LR scheduler, not true early stopping)
352 | early_stopping = torch.optim.lr_scheduler.ReduceLROnPlateau(
353 | optimizer, "min", patience=10
354 | )
355 |
356 | # Create an empty transform
357 | no_transform = transforms.Compose([])
358 |
359 | # dataset DataLoader
360 | x_train2 = np.asarray(x_train2)
361 | x_val2 = np.asarray(x_val2)
362 | print("xtrain2", x_train2.shape, y_train.shape)
363 |
364 | logging.info(f"loading training dataset into dataloader")
365 | dataset = CustomDataset(data=x_train2, targets=y_train, transform=None)
366 |
367 | batch_size = 10000
368 | train_data_loader = DataLoader(
369 | dataset, batch_size=batch_size, num_workers=args.num_cores, shuffle=True
370 | )
371 |
372 | logging.info(f"loading testing dataset into dataloader")
373 | val_dataset = CustomDataset(data=x_val2, targets=y_val, transform=None)
374 | val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
375 |
376 | # Train the model
377 | num_epochs = args.num_epochs
378 | logging.info(f"number of epochs for training: {num_epochs}")
379 | for epoch in range(num_epochs):
380 | model.train()
381 | train_loss = 0.0
382 |
383 | for inputs, targets in train_data_loader:
384 | inputs, targets = inputs.to(device), targets.to(device)
385 | optimizer.zero_grad()
386 | outputs = model(inputs)
387 | loss = criterion(outputs, targets)
388 |
389 | loss.backward()
390 | optimizer.step()
391 | train_loss += loss.item()
392 |
393 | model.eval()
394 | val_loss = 0.0
395 | with torch.no_grad():
396 | for inputs, targets in val_data_loader:
397 | inputs, targets = inputs.to(device), targets.to(device)
398 | outputs = model(inputs)
399 | loss = criterion(outputs, targets)
400 | val_loss += loss.item()
401 |
402 |             # Adjust the learning rate based on validation loss
403 | early_stopping.step(val_loss)
404 |
405 | logging.info(
406 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}"
407 | )
408 |
409 | print(
410 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}"
411 | )
412 |
413 | # assess the model on test data
414 | x_test2 = np.asarray(x_test2)
415 | x_test2 = torch.tensor(x_test2, dtype=torch.float32)
416 | logging.info(f"converting test inputs to torch.tensor")
417 |
418 | predictions_test = model(x_test2)
419 |
420 | # round predictions
421 | roundedTestPreds = np.round(predictions_test.detach().numpy())
422 |
423 | # print out performance metrics
424 | print(classification_report(y_test.values, roundedTestPreds))
425 |
426 | logging.info(f"Training finished successfully!")
427 |
428 | model_file = {}
429 | model_file["description"] = "neural net trained for predicting multilabels"
430 | model_file["features"] = features_used
431 | model_file["labels"] = labels_used
432 | model_file["model"] = model
433 | torch.save(model_file, args.model_out)
434 | logging.info(f"writing model file: {args.model_out}")
435 |
436 |
437 |
438 | @classmethod
439 | def predict(cls, args: Iterable[str] = None) -> int:
440 | """Predict the presence or absence of select KEGG modules on bacterial
441 | annotation data.
442 |
443 | Parameters
444 | ----------
445 | args : Iterable[str], optional
446 | value of None, when passed to `parser.parse_args` causes the parser to
447 | read `sys.argv`
448 |
449 | Returns
450 | -------
451 | return_call : 0
452 | return call if the program completes successfully
453 |
454 | """
455 |
456 | # disable tensorflow info messages
457 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
458 |
459 | parser = argparse.ArgumentParser()
460 |
461 | parser.add_argument(
462 | "--input",
463 | "-i",
464 | action = "extend",
465 | nargs = "+",
466 | dest="input",
467 | required=True,
468 | help="input file path(s) and name(s) [required]",
469 | )
470 | parser.add_argument(
471 | "--annotation-format",
472 | "-a",
473 | dest="annotation_format",
474 | required=True,
475 |             help="annotation format (kofamscan, kofamkoala, dram, or koala) [default: kofamscan]",
476 | )
477 | parser.add_argument(
478 | "--kegg-modules",
479 | "-k",
480 | dest="kegg_modules",
481 | required=False,
482 | default=None,
483 | action="extend",
484 | nargs="+",
485 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]",
486 | )
487 | parser.add_argument(
488 | "--output",
489 | "-o",
490 | dest="output",
491 | required=True,
492 | help="output file path and name [required]",
493 | )
494 |
495 |         args = parser.parse_args(args)
496 |
497 | module_dir = importlib.resources.files('metapathpredict')
498 | data_dir = module_dir.joinpath("data/")
499 |
500 | scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl")
501 | scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl")
502 |
503 | model_0_path = module_dir.joinpath("data/model_0.keras")
504 | model_1_path = module_dir.joinpath("data/model_1.keras")
505 |
506 | labels_path = module_dir.joinpath("data/labels.pkl")
507 |
508 | with open(scaler_0_path, "rb") as f:
509 | model_0_scaler = pickle.load(f)
510 |
511 | with open(scaler_1_path, "rb") as f:
512 | model_1_scaler = pickle.load(f)
513 |
514 | with open(labels_path, "rb") as f:
515 | labels = pickle.load(f)
516 |
517 | #models = [torch.load(model_0_path), torch.load(model_1_path)]
518 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)]
519 |
520 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}")
521 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}")
522 |
523 | # logging.info(f"Reading model files from directory: {data_dir}")
524 | # logging.info(f"Reading scaler files from directory: {data_dir}")
525 |
526 |
527 | # load the input features
528 | files_list = InputData(files = args.input)
529 |
530 | if args.annotation_format == "kofamscan":
531 | files_list.read_kofamscan_detailed_tsv()
532 |
533 | elif args.annotation_format == "kofamkoala":
534 | files_list.read_kofamkoala()
535 |
536 | elif args.annotation_format == "dram":
537 | files_list.read_dram_annotation_tsv()
538 |
539 | elif args.annotation_format == "koala":
540 | files_list.read_koala_tsv()
541 |
542 | else:
543 |             logging.error('Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala"')
544 |             sys.exit(1)
545 |
546 | logging.info(f"Reading input files with format: {args.annotation_format}")
547 |
548 | model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_)
549 | model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_)
550 | reqColsAll = list(set(model_0_cols).union(set(model_1_cols)))
551 |
552 | input_features = AnnotationList(
553 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2
554 | requiredColumnsModel0 = model_0_scaler.feature_names_in_, # add list of all required columns for model #1
555 | requiredColumnsModel1 = model_1_scaler.feature_names_in_, # add list of all required columns for model #2
556 | annotations = files_list.annotations)
557 |
558 | input_features.create_feature_df()
559 | input_features.check_feature_columns()
560 | input_features.select_model_features()
561 | input_features.transform_model_features(model_0_scaler, model_1_scaler)
562 |
563 | logging.info("Making KEGG module presence/absence predictions")
564 |
565 | predictions_list = []
566 | for x in range(2):
567 |
568 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32)
569 |
570 | # predict
571 | #predictions = models[x]['model'](features)
572 | logging.info(f"Model {x} is making predictions")
573 | predictions = models[x].predict(input_features.feature_df[x])
574 |
575 | # round predictions
576 | #roundedPreds = np.round(predictions.detach().numpy())
577 | roundedPreds = np.round(predictions)
578 |
579 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int)
580 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int)
581 |
582 | predictions_list.append(predsDf)
583 |
584 | logging.info(f"Model {x} completed predictions")
585 |
586 | logging.info("All done.")
587 |
588 | out_df = pd.concat(predictions_list, axis = 1)
589 |
590 | if args.kegg_modules is not None:
591 | if all(modules in out_df.columns for modules in args.kegg_modules):
592 | out_df = out_df[args.kegg_modules]
593 | else:
594 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""")
595 |
596 | out_df.insert(loc = 0, column = 'file', value = args.input)
597 |
598 | logging.info(f"Writing output to file: {args.output}")
599 | out_df.to_csv(args.output, sep='\t', index=None)
600 |
601 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}")
602 |
--------------------------------------------------------------------------------
/package/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import Extension, setup, find_packages
2 | import os
3 |
4 | CLASSIFIERS = [
5 | "Development Status :: 4 - Beta",
6 | "Natural Language :: English",
7 | "License :: OSI Approved :: BSD License",
8 | "Operating System :: Linux, MacOS, Windows",
9 | "Programming Language :: Python :: 3.10.6+"
10 | ]
11 |
12 | setup(
13 | name="metapathpredict",
14 | description="Tool for predicting the presence or absence of KEGG modules in bacterial genomes",
15 | author="D. Geller-McGrath, K.M. Konwar, V.P. Edgcomb, M. Pachiadaki, J.W. Roddy, T.J. Wheeler, J.E. McDermott",
16 | author_email="dgellermcgrath@gmail.com, kishori82@gmail.com",
17 | package_dir={"": "src"},
18 | packages=["metapathpredict"],
19 | package_data={"metapathpredict": ["data/*.*"]},
20 | install_requires=[
21 | "scikit-learn>=1.1.3",
22 | "tensorflow>=2.10.0",
23 | "numpy>=1.23.4",
24 | "pandas>=1.5.2",
25 | "keras>=2.10.0",
26 | "torchvision>=0.15.2",
27 | "torch>=2.0.1",
28 | ],
29 | entry_points={
30 | "console_scripts": [
31 | "MetaPathTrain = metapathpredict.MetaPathPredict:Models.train",
32 | "MetaPathPredict = metapathpredict.MetaPathPredict:Models.predict",
33 | "MetaPathModules = metapathpredict.MetaPathPredict:Models.show_available_modules",
34 | "DownloadModels = metapathpredict.download_models:Download.download_models",
35 | "PredictFromTable = metapathpredict.MetaPathPredict:Models.predict_from_feature_table",
36 | "PredictFromTableFs = metapathpredict.MetaPathPredict:Models.predict_from_feature_table_fs_models"
37 | ]
38 | },
39 | classifiers=CLASSIFIERS,
40 | include_package_data=True,
41 | #ext_modules=cythonize("src/metapathpredict/cpp_mods.pyx")
42 | )
43 |
--------------------------------------------------------------------------------
/package/src/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/.DS_Store
--------------------------------------------------------------------------------
/package/src/metapathpredict.egg-info/PKG-INFO:
--------------------------------------------------------------------------------
1 | Metadata-Version: 2.1
2 | Name: metapathpredict
3 | Version: 0.0.0
4 | Summary: Tool for predicting the presence or absence of KEGG modules in bacterial genomes
5 | Author: D. Geller-McGrath, K.M. Konwar, V.P. Edgcomb, M. Pachiadaki, J.W. Roddy, T.J. Wheeler, J.E. McDermott
6 | Author-email: dgellermcgrath@gmail.com, kishori82@gmail.com
7 | Classifier: Development Status :: 4 - Beta
8 | Classifier: Natural Language :: English
9 | Classifier: License :: OSI Approved :: BSD License
10 | Classifier: Operating System :: Linux, MacOS, Windows
11 | Classifier: Programming Language :: Python :: 3.10.6+
12 |
--------------------------------------------------------------------------------
/package/src/metapathpredict.egg-info/SOURCES.txt:
--------------------------------------------------------------------------------
1 | MANIFEST.in
2 | setup.py
3 | src/metapathpredict/MetaPathPredict.py
4 | src/metapathpredict/__init__.py
5 | src/metapathpredict/download_models.py
6 | src/metapathpredict/utils.py
7 | src/metapathpredict.egg-info/PKG-INFO
8 | src/metapathpredict.egg-info/SOURCES.txt
9 | src/metapathpredict.egg-info/dependency_links.txt
10 | src/metapathpredict.egg-info/entry_points.txt
11 | src/metapathpredict.egg-info/requires.txt
12 | src/metapathpredict.egg-info/top_level.txt
13 | src/metapathpredict/data/__init__.py
14 | src/metapathpredict/data/labels.pkl
15 | src/metapathpredict/data/metapathmodules.pkl
16 | src/metapathpredict/data/model_0.keras
17 | src/metapathpredict/data/model_1.keras
18 | src/metapathpredict/data/requiredCols.pkl
--------------------------------------------------------------------------------
/package/src/metapathpredict.egg-info/dependency_links.txt:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/package/src/metapathpredict.egg-info/entry_points.txt:
--------------------------------------------------------------------------------
1 | [console_scripts]
2 | DownloadModels = metapathpredict.download_models:Download.download_models
3 | MetaPathModules = metapathpredict.MetaPathPredict:Models.show_available_modules
4 | MetaPathPredict = metapathpredict.MetaPathPredict:Models.predict
5 | MetaPathTrain = metapathpredict.MetaPathPredict:Models.train
6 | PredictFromTable = metapathpredict.MetaPathPredict:Models.predict_from_feature_table
7 | PredictFromTableFs = metapathpredict.MetaPathPredict:Models.predict_from_feature_table_fs_models
8 |
--------------------------------------------------------------------------------
/package/src/metapathpredict.egg-info/requires.txt:
--------------------------------------------------------------------------------
1 | scikit-learn>=1.1.3
2 | tensorflow>=2.10.0
3 | numpy>=1.23.4
4 | pandas>=1.5.2
5 | keras>=2.10.0
6 | torchvision>=0.15.2
7 | torch>=2.0.1
8 |
--------------------------------------------------------------------------------
/package/src/metapathpredict.egg-info/top_level.txt:
--------------------------------------------------------------------------------
1 | metapathpredict
2 |
--------------------------------------------------------------------------------
/package/src/metapathpredict/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/.DS_Store
--------------------------------------------------------------------------------
/package/src/metapathpredict/MetaPathPredict.py:
--------------------------------------------------------------------------------
1 | """
2 | Command Line Interface for MetaPathPredict Tools:
3 | ====================================
4 |
5 | .. currentmodule:: metapathpredict
6 |
7 | class methods:
8 | MetaPathPredict methods
9 | """
10 |
11 | import logging
12 | import argparse
13 | import datetime
14 | import pickle
15 | import os
16 | import sys
17 | import re
18 | import math
19 | import importlib.resources  # needed for importlib.resources.files(...)
20 | from typing import Iterable, List, Dict, Set, Optional, Sequence
21 | from itertools import chain
22 |
23 | # disable tensorflow info messages
24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
25 |
26 | import sklearn
27 | import numpy as np
28 | import pandas as pd
29 | import keras
30 | from torchvision import transforms
31 | import torch.optim as optim
32 | from torch.utils.data import Dataset, DataLoader, TensorDataset
33 | from sklearn.model_selection import train_test_split
34 | from sklearn.preprocessing import StandardScaler
35 | from sklearn.feature_selection import SelectKBest, f_classif
36 | from sklearn.metrics import classification_report
37 | import torch
38 | import torch.nn as nn
39 |
40 | import warnings
41 | from sklearn.exceptions import InconsistentVersionWarning
42 | warnings.filterwarnings(action='ignore', category=InconsistentVersionWarning)
43 |
44 | from metapathpredict.utils import InputData
45 | from metapathpredict.utils import AnnotationList
46 |
47 |
48 | # CUDA for PyTorch
49 | use_cuda = torch.cuda.is_available()
50 | device = torch.device("cuda:0" if use_cuda else "cpu")
51 | # device = "cpu"
52 |
53 | torch.backends.cudnn.benchmark = True
54 |
55 | # Parameters
56 | params = {"batch_size": 64, "shuffle": True, "num_workers": 6}
57 |
58 | #Configure the logging system
59 | logging.basicConfig(
60 | filename='metapathpredict.log',
61 | level=logging.INFO,
62 | format="%(asctime)s %(levelname)s %(module)s - %(message)s",
63 | datefmt="%Y-%m-%d %H:%M:%S")
64 |
65 | root = logging.getLogger()
66 | root.setLevel(logging.INFO)
67 |
68 | handler = logging.StreamHandler(sys.stdout)
69 | handler.setLevel(logging.INFO)
70 | formatter = logging.Formatter("%(asctime)s %(levelname)s %(module)s - %(message)s",
71 | "%Y-%m-%d %H:%M:%S")
72 | handler.setFormatter(formatter)
73 | root.addHandler(handler)
74 |
75 |
76 |
77 | class CustomDataset(Dataset):
78 | def __init__(self, data, targets, transform=None):
79 | print("type", type(data), data.shape)
80 | self.data = torch.tensor(data, dtype=torch.float32)
81 | self.targets = torch.tensor(targets, dtype=torch.float32)
82 | self.transform = transform
83 |
84 | def __len__(self):
85 | return len(self.data)
86 |
87 | def __getitem__(self, idx):
88 | features, target = self.data[idx], self.targets[idx]
89 |
90 |         if self.transform:
91 |             features = self.transform(features)
92 |
93 | return features, target
94 |
95 |
96 |
97 | class CustomModel(nn.Module):
98 | def __init__(self, num_hidden_nodes_per_layer=1024, num_hidden_layers=5):
99 | super(CustomModel, self).__init__()
100 | NUM_HIDDEN_NODES = num_hidden_nodes_per_layer
101 | self.NUM_HIDDEN_LAYERS = num_hidden_layers
102 |
103 | self.fc1 = nn.Linear(2000, NUM_HIDDEN_NODES)
104 | self.relu = nn.ReLU()
105 | self.dropout = nn.Dropout(0.1)
106 |
107 |         # hidden layers, wrapped in nn.ModuleList so their parameters are registered
108 |         self.fcs = nn.ModuleList(
109 |             [nn.Linear(NUM_HIDDEN_NODES, NUM_HIDDEN_NODES)
110 |              for i in range(num_hidden_layers)]
111 |         )
112 |
113 | self.output_layer = nn.Linear(NUM_HIDDEN_NODES, 94)
114 | self.sigmoid = nn.Sigmoid()
115 |
116 | def forward(self, x):
117 | x = self.fc1(x)
118 | x = self.relu(x)
119 | x = self.dropout(x)
120 |
121 | for i in range(self.NUM_HIDDEN_LAYERS - 1):
122 | x = self.fcs[i](x)
123 | x = self.relu(x)
124 | x = self.dropout(x)
125 |
126 | x = self.fcs[self.NUM_HIDDEN_LAYERS - 1](x)
127 | x = self.relu(x)
128 |
129 | x = self.output_layer(x)
130 | x = self.sigmoid(x)
131 | return x
132 |
133 |
134 |
135 | class Models:
136 |
137 | """Platform-agnostic command line functions available in MetaPathPredict tools."""
138 |
139 | @classmethod
140 | def train(cls, args: Iterable[str] = None) -> int:
141 |         """Train a model from the input data.
142 | 
143 |         Writes out a trained DNN model file via torch.save.
144 |
145 | Parameters
146 | ----------
147 | args : Iterable[str], optional
148 | value of None, when passed to `parser.parse_args` causes the parser to
149 | read `sys.argv`
150 |
151 | Returns
152 | -------
153 | return_call : 0
154 | return call if the program completes successfully
155 |
156 | """
157 | parser = argparse.ArgumentParser()
158 |
159 | parser.add_argument(
160 | "--train-targets",
161 | dest="train_targets",
162 | required=True,
163 | help="training targets file",
164 | )
165 | parser.add_argument(
166 | "--train-features",
167 | dest="train_features",
168 | required=True,
169 | help="training features",
170 | )
171 | parser.add_argument(
172 | "--num-epochs",
173 | dest="num_epochs",
174 | required=False,
175 | default=100,
176 | type=int,
177 | help="number of epochs",
178 | )
179 | parser.add_argument(
180 | "--model-out",
181 | "-m",
182 | dest="model_out",
183 | required=True,
184 | help="model file name output",
185 | )
186 | parser.add_argument(
187 | "--use-gpu",
188 | dest="use_gpu",
189 | required=False,
190 | action="store_true",
191 | help="use GPU if available",
192 | )
193 | parser.add_argument(
194 | "--num-cores",
195 | dest="num_cores",
196 | required=False,
197 | default=10,
198 | type=int,
199 | help="Number of cores for parallel processing",
200 | )
201 | neural_net_params = parser.add_argument_group("Neural Net parameters")
202 | neural_net_params.add_argument(
203 | "--num-hidden-layers",
204 | default=5,
205 | required=False,
206 | type=int,
207 | help="number of hidden layers",
208 | )
209 | neural_net_params.add_argument(
210 | "--hidden-nodes-per-layer",
211 | type=int,
212 | required=False,
213 | default=1024,
214 | help="number of nodes in each hidden layer",
215 | )
216 | neural_net_params.add_argument(
217 | "--num-features",
218 | dest="num_features",
219 | default=2000,
220 | required=False,
221 | type=int,
222 | help="number of features to retain from training data",
223 | )
224 | neural_net_params.add_argument(
225 | "--threshold",
226 | dest="threshold",
227 | default=6432,
228 | required=False,
229 | type=float,
230 | help="threshold for SelectKBest feature selection",
231 | )
232 |
233 |
234 |         args = parser.parse_args(args)
235 |
236 | # CUDA for PyTorch
237 | device = "cpu"
238 | if args.use_gpu:
239 | use_cuda = torch.cuda.is_available()
240 | device = torch.device("cuda:0" if use_cuda else "cpu")
241 |
242 | logging.info(f"Using device: {device}")
243 |
244 | # read in features
245 | features = pd.read_table(args.train_features, compression="gzip")
246 | logging.info(f"reading input features of shape: {features.shape[0]} x {features.shape[1]}")
247 |
248 | # read in labels
249 | targets = pd.read_table(args.train_targets, compression="gzip")
250 | logging.info(f"reading input labels of shape: {targets.shape[0]} x {targets.shape[1]}")
251 |
252 | # split the data into training and test sets
253 | test_size = 0.25
254 | x, x_test, y, y_test = train_test_split(
255 | features,
256 | targets,
257 | stratify=targets,
258 | shuffle=True,
259 | test_size= test_size,
260 | random_state=111,
261 | )
262 |         logging.info(f"creating test split with test fraction: {test_size}")
263 |
264 | # Split the remaining data to train and validation
265 | x_train, x_val, y_train, y_val = train_test_split(
266 | x, y, stratify=y, test_size=0.2, shuffle=True, random_state=111
267 | )
268 |
269 | print("features size", features.shape)
270 | print("targets size", targets.shape)
271 |
272 | print("x_test", x_test.shape, " y_test ", y_test.shape)
273 | print("x", x.shape, " y ", y.shape)
274 |
275 | print("x_train", x_train.shape, " y_train ", y_train.shape)
276 | print("x_val", x_val.shape, " y_val ", y_val.shape)
277 | print("x_test", x_test.shape, " y_test ", y_test.shape)
278 |
279 |
280 |
281 | # Initialize the StandardScaler
282 | scaler = StandardScaler()
283 |
284 | # Fit the scaler to training data and transform it
285 | # and then transform val and test data w/ the fitted scaler object
286 | # (std. dev., variance, etc. are based on training data columns)
287 | scaled_features = scaler.fit_transform(x_train)
288 | x_train = pd.DataFrame(scaled_features, index = x_train.index, columns = x_train.columns)
289 | x_val = pd.DataFrame(scaler.transform(x_val), index = x_val.index, columns = x_val.columns)
290 | x_test = pd.DataFrame(scaler.transform(x_test), index = x_test.index, columns = x_test.columns)
291 | logging.info(f"normalizing the training input features")
292 |
293 |
294 |
295 | # feature selection based only on the training data
296 | # Select features according to the k highest F-values
297 | # from running ANOVA on y_train and x_train
298 | selected_features = []
299 | for label in y_train:
300 | selector = SelectKBest(f_classif, k = 'all')
301 | selector.fit(x_train, y_train[label])
302 | selected_features.append(list(selector.scores_))
303 |
304 | # select threshold that retains 2000 features
305 | threshold = args.threshold
306 |
307 | # # MeanCS
308 | logging.info(f"total number of features in input: {x_train.shape[1]}")
309 | selected_features2 = np.mean(selected_features, axis = 0) > threshold
310 | logging.info(f"number of features selected for training: {sum(selected_features2)}")
311 |
312 | # create new training, validation, and test datasets retaining only the 2000 top features
313 | # determined from the training data
314 | x_train2 = x_train.loc[:, selected_features2]
315 | x_val2 = x_val.loc[:, selected_features2]
316 | x_test2 = x_test.loc[:, selected_features2]
317 | features_used = x_train2.columns.values
318 | labels_used = y_val.columns.values
319 |
320 | logging.info(f"Using features : {str(features_used)}")
321 | logging.info(f"Using labels : {str(labels_used)}")
322 |
323 | # Initialize the StandardScaler
324 | #scaler = StandardScaler()
325 |
326 | # Fit the scaler to your data and transform it
327 | #x_train2 = scaler.fit_transform(x_train2)
328 | #x_val2 = scaler.fit_transform(x_val2)
329 | #logging.info(f"normalizing the training input features")
330 |
331 | y_train = np.asarray(y_train.values)
332 | y_val = np.asarray(y_val.values)
333 |
334 | print()
335 | print("x_train2", x_train2.shape)
336 | print("x_val2", x_val2.shape)
337 | print("x_test2", x_test2.shape)
338 |
339 |         # outline the neural network architecture - multilabel classifier
340 |         # 1 input layer, 5 hidden layers, 1 output layer
341 |         # include dropout for all hidden layers
342 | model = CustomModel(
343 | num_hidden_nodes_per_layer=args.hidden_nodes_per_layer,
344 | num_hidden_layers=args.num_hidden_layers,
345 | ).to(device)
346 |
347 | # Define loss function and optimizer
348 | criterion = nn.BCELoss()
349 | optimizer = optim.Adam(model.parameters(), lr=0.001)
350 | logging.info(f"optimizer Adam with learning rate: 0.001")
351 |
352 |         # Reduce the learning rate when validation loss plateaus (an LR scheduler, not true early stopping)
353 | early_stopping = torch.optim.lr_scheduler.ReduceLROnPlateau(
354 | optimizer, "min", patience=10
355 | )
356 |
357 | # Create an empty transform
358 | no_transform = transforms.Compose([])
359 |
360 | # dataset DataLoader
361 | x_train2 = np.asarray(x_train2)
362 | x_val2 = np.asarray(x_val2)
363 | print("xtrain2", x_train2.shape, y_train.shape)
364 |
365 | logging.info(f"loading training dataset into dataloader")
366 | dataset = CustomDataset(data=x_train2, targets=y_train, transform=None)
367 |
368 | batch_size = 10000
369 | train_data_loader = DataLoader(
370 | dataset, batch_size=batch_size, num_workers=args.num_cores, shuffle=True
371 | )
372 |
373 | logging.info(f"loading testing dataset into dataloader")
374 | val_dataset = CustomDataset(data=x_val2, targets=y_val, transform=None)
375 | val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True)
376 |
377 | # Train the model
378 | num_epochs = args.num_epochs
379 | logging.info(f"number of epochs for training: {num_epochs}")
380 | for epoch in range(num_epochs):
381 | model.train()
382 | train_loss = 0.0
383 |
384 | for inputs, targets in train_data_loader:
385 | inputs, targets = inputs.to(device), targets.to(device)
386 | optimizer.zero_grad()
387 | outputs = model(inputs)
388 | loss = criterion(outputs, targets)
389 |
390 | loss.backward()
391 | optimizer.step()
392 | train_loss += loss.item()
393 |
394 | model.eval()
395 | val_loss = 0.0
396 | with torch.no_grad():
397 | for inputs, targets in val_data_loader:
398 | inputs, targets = inputs.to(device), targets.to(device)
399 | outputs = model(inputs)
400 | loss = criterion(outputs, targets)
401 | val_loss += loss.item()
402 |
403 |             # Adjust the learning rate based on validation loss
404 | early_stopping.step(val_loss)
405 |
406 | logging.info(
407 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}"
408 | )
409 |
410 | print(
411 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}"
412 | )
413 |
414 | # assess the model on test data
415 | x_test2 = np.asarray(x_test2)
416 | x_test2 = torch.tensor(x_test2, dtype=torch.float32)
417 | logging.info(f"converting test inputs to torch.tensor")
418 |
419 | predictions_test = model(x_test2)
420 |
421 | # round predictions
422 | roundedTestPreds = np.round(predictions_test.detach().numpy())
423 |
424 | # print out performance metrics
425 | print(classification_report(y_test.values, roundedTestPreds))
426 |
427 | logging.info(f"Training finished successfully!")
428 |
429 | model_file = {}
430 | model_file["description"] = "neural net trained for predicting multilabels"
431 | model_file["features"] = features_used
432 | model_file["labels"] = labels_used
433 | model_file["model"] = model
434 | torch.save(model_file, args.model_out)
435 | logging.info(f"writing model file: {args.model_out}")
436 |
437 |
438 |
439 | @classmethod
440 | def predict(cls, args: Iterable[str] = None) -> int:
441 | """Predict the presence or absence of select KEGG modules on bacterial
442 | annotation data.
443 |
444 | Parameters
445 | ----------
446 | args : Iterable[str], optional
447 | value of None, when passed to `parser.parse_args` causes the parser to
448 | read `sys.argv`
449 |
450 | Returns
451 | -------
452 | return_call : 0
453 | return call if the program completes successfully
454 |
455 | """
456 |
457 | # disable tensorflow info messages
458 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
459 |
460 | parser = argparse.ArgumentParser()
461 |
462 | parser.add_argument(
463 | "--input",
464 | "-i",
465 | action = "extend",
466 | nargs = "+",
467 | dest="input",
468 | required=True,
469 | help="input file path(s) and name(s) [required]",
470 | )
471 | parser.add_argument(
472 | "--annotation-format",
473 | "-a",
474 | dest="annotation_format",
475 | required=True,
476 |             help="annotation format (kofamscan, kofamkoala, dram, or koala) [default: kofamscan]",
477 | )
478 | parser.add_argument(
479 | "--kegg-modules",
480 | "-k",
481 | dest="kegg_modules",
482 | required=False,
483 | default=None,
484 | action="extend",
485 | nargs="+",
486 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]",
487 | )
488 | parser.add_argument(
489 | "--output",
490 | "-o",
491 | dest="output",
492 | required=True,
493 | help="output file path and name [required]",
494 | )
495 |
496 |         args = parser.parse_args(args)
497 |
498 | module_dir = importlib.resources.files('metapathpredict')
499 | data_dir = module_dir.joinpath("data/")
500 |
501 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl")
502 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl")
503 |
504 | model_0_path = module_dir.joinpath("data/model_0.keras")
505 | model_1_path = module_dir.joinpath("data/model_1.keras")
506 |
507 | labels_path = module_dir.joinpath("data/labels.pkl")
508 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl")
509 |
510 | # with open(scaler_0_path, "rb") as f:
511 | # model_0_scaler = pickle.load(f)
512 | #
513 | # with open(scaler_1_path, "rb") as f:
514 | # model_1_scaler = pickle.load(f)
515 |
516 | with open(labels_path, "rb") as f:
517 | labels = pickle.load(f)
518 |
519 | with open(requiredCols_path, "rb") as f:
520 | requiredCols = pickle.load(f)
521 |
522 | #models = [torch.load(model_0_path), torch.load(model_1_path)]
523 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)]
524 |
525 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}")
526 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}")
527 |
528 | # logging.info(f"Reading model files from directory: {data_dir}")
529 | # logging.info(f"Reading scaler files from directory: {data_dir}")
530 |
531 |
532 | # load the input features
533 | files_list = InputData(files = args.input)
534 |
535 | if args.annotation_format == "kofamscan":
536 | files_list.read_kofamscan_detailed_tsv()
537 |
538 | elif args.annotation_format == "kofamkoala":
539 | files_list.read_kofamkoala()
540 |
541 | elif args.annotation_format == "dram":
542 | files_list.read_dram_annotation_tsv()
543 |
544 | elif args.annotation_format == "koala":
545 | files_list.read_koala_tsv()
546 |
547 | else:
548 |             logging.error('Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala"')
549 |             sys.exit(1)
550 |
551 | logging.info(f"Reading input files with format: {args.annotation_format}")
552 |
553 | # model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_)
554 | # model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_)
555 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols)))
556 |
557 | reqColsAll = requiredCols
558 |
559 | input_features = AnnotationList(
560 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2
561 | requiredColumnsModel0 = "blank", #model_0_scaler.feature_names_in_, # add list of all required columns for model #1
562 | requiredColumnsModel1 = "blank", #model_1_scaler.feature_names_in_, # add list of all required columns for model #2
563 | annotations = files_list.annotations)
564 |
565 | input_features.create_feature_df()
566 | input_features.check_feature_columns()
567 | # input_features.select_model_features()
568 | # input_features.transform_model_features(model_0_scaler, model_1_scaler)
569 |
570 | logging.info("Making KEGG module presence/absence predictions")
571 |
572 | predictions_list = []
573 | for prediction_iteration in range(2):
574 |
575 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32)
576 |
577 | # predict
578 | #predictions = models[x]['model'](features)
579 | logging.info(f"Model {prediction_iteration} is making predictions")
580 | predictions = models[prediction_iteration].predict(input_features.feature_df[prediction_iteration])
581 |
582 | # round predictions
583 | #roundedPreds = np.round(predictions.detach().numpy())
584 | roundedPreds = np.round(predictions)
585 |
586 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int)
587 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[prediction_iteration]).astype(int)
588 |
589 | predictions_list.append(predsDf)
590 |
591 | logging.info(f"Model {prediction_iteration} completed making predictions")
592 |
593 | logging.info("All done.")
594 |
595 | out_df = pd.concat(predictions_list, axis = 1)
596 |
597 | if args.kegg_modules is not None:
598 | if all(modules in out_df.columns for modules in args.kegg_modules):
599 | out_df = out_df[args.kegg_modules]
600 | else:
601 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""")
602 |
603 | out_df.insert(loc = 0, column = 'file', value = args.input)
604 |
605 | logging.info(f"Writing output to file: {args.output}")
606 | out_df.to_csv(args.output, sep='\t', index=None)
607 |
608 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}")
609 |
610 |
611 |
612 | @classmethod
613 | def show_available_modules(cls, args: Iterable[str] = None) -> int:
614 |
615 | """List available KEGG modules for presence/absence prediction.
616 |
617 | Parameters
618 | ----------
619 |         args : Iterable[str], optional
620 |             unused; accepted for consistency with the other command-line
621 |             entry points
622 |
623 | Returns
624 | -------
625 | return_call : 0
626 | return call if the program completes successfully
627 |
628 | """
629 |
630 | module_dir = importlib.resources.files('metapathpredict')
631 |
632 | metapathmodules_path = module_dir.joinpath("data/metapathmodules.pkl")
633 |
634 | with open(metapathmodules_path, "rb") as f:
635 | metapathmodules = pickle.load(f)
636 |
637 | pd.set_option('display.max_rows', None)
638 |         pd.set_option('display.max_colwidth', None)
639 |
640 | print(metapathmodules)
641 |
642 |
643 |
644 | @classmethod
645 | def predict_from_feature_table(cls, args: Iterable[str] = None) -> int:
646 | """Predict the presence or absence of select KEGG modules from an input
647 |         feature table of KEGG K numbers derived from bacterial annotation data.
648 | 
649 | Parameters
650 | ----------
651 |         args : Iterable[str], optional
652 |             a value of None, when passed to `parser.parse_args`, causes the
653 |             parser to read `sys.argv`
654 |
655 | Returns
656 | -------
657 | return_call : 0
658 | return call if the program completes successfully
659 |
660 | """
661 |
662 | # disable tensorflow info messages
663 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
664 |
665 | parser = argparse.ArgumentParser()
666 |
667 | parser.add_argument(
668 | "--input",
669 | "-i",
670 | dest="input",
671 | required=True,
672 |             help="input feature table file path and name [required]",
673 | )
674 | parser.add_argument(
675 | "--kegg-modules",
676 | "-k",
677 | dest="kegg_modules",
678 | required=False,
679 | default=None,
680 | action="extend",
681 | nargs="+",
682 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]",
683 | )
684 | parser.add_argument(
685 | "--output",
686 | "-o",
687 | dest="output",
688 | required=True,
689 | help="output file path and name [required]",
690 | )
691 |
692 | args = parser.parse_args()
693 |
694 | module_dir = importlib.resources.files('metapathpredict')
695 | data_dir = module_dir.joinpath("data/")
696 |
697 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl")
698 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl")
699 |
700 | model_0_path = module_dir.joinpath("data/model_0.keras")
701 | model_1_path = module_dir.joinpath("data/model_1.keras")
702 |
703 | labels_path = module_dir.joinpath("data/labels.pkl")
704 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl")
705 |
706 | # with open(scaler_0_path, "rb") as f:
707 | # model_0_scaler = pickle.load(f)
708 | #
709 | # with open(scaler_1_path, "rb") as f:
710 | # model_1_scaler = pickle.load(f)
711 |
712 | with open(labels_path, "rb") as f:
713 | labels = pickle.load(f)
714 |
715 | with open(requiredCols_path, "rb") as f:
716 | requiredCols = pickle.load(f)
717 |
718 | #models = [torch.load(model_0_path), torch.load(model_1_path)]
719 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)]
720 |
721 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}")
722 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}")
723 |
724 | # logging.info(f"Reading model files from directory: {data_dir}")
725 | # logging.info(f"Reading scaler files from directory: {data_dir}")
726 |
727 |
728 | # load the input features
729 | features = pd.read_csv(args.input, sep = "\t")
730 | # files_list = InputData(files = args.input)
731 | #
732 | # if args.annotation_format == "kofamscan":
733 | # files_list.read_kofamscan_detailed_tsv()
734 | #
735 | # elif args.annotation_format == "kofamkoala":
736 | # files_list.read_kofamkoala()
737 | #
738 | # elif args.annotation_format == "dram":
739 | # files_list.read_dram_annotation_tsv()
740 | #
741 | # elif args.annotation_format == "koala":
742 | # files_list.read_koala_tsv()
743 | #
744 | # else:
745 | # logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""")
746 | # sys.exit(0)
747 | #
748 | # logging.info(f"Reading input files with format: {args.annotation_format}")
749 |
750 | # model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_)
751 | # model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_)
752 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols)))
753 |
754 | #reqColsAll = np.ndarray.tolist(model_0_scaler.feature_names_in_)
755 | reqColsAll = requiredCols
756 |
757 |         input_features = AnnotationList(
758 |             requiredColumnsAll = reqColsAll, # all required feature columns across both models
759 |             requiredColumnsModel0 = "blank", # placeholder; per-model feature selection is not used here
760 |             requiredColumnsModel1 = "blank", # placeholder; per-model feature selection is not used here
761 |             annotations = "blank") # placeholder; the feature table is supplied directly
762 |
763 | #input_features.create_feature_df()
764 | input_features.feature_df = features
765 | input_features.check_feature_columns()
766 | # input_features.select_model_features()
767 | # input_features.transform_model_features(model_0_scaler, model_1_scaler)
768 |
769 | logging.info("Making KEGG module presence/absence predictions")
770 |
771 | predictions_list = []
772 | for x in range(2):
773 |
774 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32)
775 |
776 | # predict
777 | #predictions = models[x]['model'](features)
778 | logging.info(f"Model {x} is making predictions")
779 | predictions = models[x].predict(input_features.feature_df[x])
780 |
781 | # round predictions
782 | #roundedPreds = np.round(predictions.detach().numpy())
783 | roundedPreds = np.round(predictions)
784 |
785 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int)
786 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int)
787 |
788 | predictions_list.append(predsDf)
789 |
790 | logging.info(f"Model {x} completed making predictions")
791 |
792 |         logging.info("Finished making predictions")
793 |
794 | out_df = pd.concat(predictions_list, axis = 1)
795 |
796 | if args.kegg_modules is not None:
797 | if all(modules in out_df.columns for modules in args.kegg_modules):
798 | out_df = out_df[args.kegg_modules]
799 | else:
800 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""")
801 |
802 | out_df.insert(loc = 0, column = 'file', value = args.input)
803 |
804 | logging.info(f"Writing output to file: {args.output}")
805 | out_df.to_csv(args.output, sep='\t', index=None)
806 |
807 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}")
808 |
809 |
810 |
811 | @classmethod
812 | def predict_from_feature_table_fs_models(cls, args: Iterable[str] = None) -> int:
813 | """Predict the presence or absence of select KEGG modules from an input
814 |         feature table of KEGG K numbers, using the feature-selected models.
815 | 
816 | Parameters
817 | ----------
818 |         args : Iterable[str], optional
819 |             a value of None, when passed to `parser.parse_args`, causes the
820 |             parser to read `sys.argv`
821 |
822 | Returns
823 | -------
824 | return_call : 0
825 | return call if the program completes successfully
826 |
827 | """
828 |
829 | # disable tensorflow info messages
830 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
831 |
832 | parser = argparse.ArgumentParser()
833 |
834 | parser.add_argument(
835 | "--input",
836 | "-i",
837 | dest="input",
838 | required=True,
839 |             help="input feature table file path and name [required]",
840 | )
841 | parser.add_argument(
842 | "--kegg-modules",
843 | "-k",
844 | dest="kegg_modules",
845 | required=False,
846 | default=None,
847 | action="extend",
848 | nargs="+",
849 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]",
850 | )
851 | parser.add_argument(
852 | "--output",
853 | "-o",
854 | dest="output",
855 | required=True,
856 | help="output file path and name [required]",
857 | )
858 |
859 | args = parser.parse_args()
860 |
861 | module_dir = importlib.resources.files('metapathpredict')
862 | data_dir = module_dir.joinpath("data/")
863 |
864 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl")
865 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl")
866 |
867 | model_0_path = module_dir.joinpath("data/model_0.keras")
868 | model_1_path = module_dir.joinpath("data/model_1.keras")
869 |
870 | labels_path = module_dir.joinpath("data/labels.pkl")
871 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl")
872 |
873 | requiredColumnsModel0_path = module_dir.joinpath("data/requiredColumnsModel0.pkl")
874 | requiredColumnsModel1_path = module_dir.joinpath("data/requiredColumnsModel1.pkl")
875 |
876 | # with open(scaler_0_path, "rb") as f:
877 | # model_0_scaler = pickle.load(f)
878 | #
879 | # with open(scaler_1_path, "rb") as f:
880 | # model_1_scaler = pickle.load(f)
881 |
882 | with open(labels_path, "rb") as f:
883 | labels = pickle.load(f)
884 |
885 | with open(requiredCols_path, "rb") as f:
886 | requiredCols = pickle.load(f)
887 |
888 | with open(requiredColumnsModel0_path, "rb") as f:
889 | model_0_features = pickle.load(f)
890 |
891 | with open(requiredColumnsModel1_path, "rb") as f:
892 | model_1_features = pickle.load(f)
893 |
894 |
895 | #models = [torch.load(model_0_path), torch.load(model_1_path)]
896 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)]
897 |
898 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}")
899 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}")
900 |
901 | # logging.info(f"Reading model files from directory: {data_dir}")
902 | # logging.info(f"Reading scaler files from directory: {data_dir}")
903 |
904 |
905 | # load the input features
906 | features = pd.read_csv(args.input, sep = "\t")
907 | # files_list = InputData(files = args.input)
908 | #
909 | # if args.annotation_format == "kofamscan":
910 | # files_list.read_kofamscan_detailed_tsv()
911 | #
912 | # elif args.annotation_format == "kofamkoala":
913 | # files_list.read_kofamkoala()
914 | #
915 | # elif args.annotation_format == "dram":
916 | # files_list.read_dram_annotation_tsv()
917 | #
918 | # elif args.annotation_format == "koala":
919 | # files_list.read_koala_tsv()
920 | #
921 | # else:
922 | # logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""")
923 | # sys.exit(0)
924 | #
925 | # logging.info(f"Reading input files with format: {args.annotation_format}")
926 |
927 | # model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_)
928 | # model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_)
929 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols)))
930 |
931 | #reqColsAll = np.ndarray.tolist(model_0_scaler.feature_names_in_)
932 | reqColsAll = requiredCols
933 |
934 |         input_features = AnnotationList(
935 |             requiredColumnsAll = reqColsAll, # all required feature columns across both models
936 |             requiredColumnsModel0 = model_0_features, # required feature columns for the first model
937 |             requiredColumnsModel1 = model_1_features, # required feature columns for the second model
938 |             annotations = "blank") # placeholder; the feature table is supplied directly
939 |
940 | #input_features.create_feature_df()
941 | input_features.feature_df = features
942 | input_features.check_feature_columns()
943 | input_features.select_model_features()
944 | # input_features.transform_model_features(model_0_scaler, model_1_scaler)
945 |
946 | logging.info("Making KEGG module presence/absence predictions")
947 |
948 | predictions_list = []
949 | for x in range(2):
950 |
951 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32)
952 |
953 | # predict
954 | #predictions = models[x]['model'](features)
955 | logging.info(f"Model {x} is making predictions")
956 | predictions = models[x].predict(input_features.feature_df[x])
957 |
958 | # round predictions
959 | #roundedPreds = np.round(predictions.detach().numpy())
960 | roundedPreds = np.round(predictions)
961 |
962 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int)
963 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int)
964 |
965 | predictions_list.append(predsDf)
966 |
967 | logging.info(f"Model {x} completed making predictions")
968 |
969 |         logging.info("Finished making predictions")
970 |
971 | out_df = pd.concat(predictions_list, axis = 1)
972 |
973 | if args.kegg_modules is not None:
974 | if all(modules in out_df.columns for modules in args.kegg_modules):
975 | out_df = out_df[args.kegg_modules]
976 | else:
977 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""")
978 |
979 | out_df.insert(loc = 0, column = 'file', value = args.input)
980 |
981 | logging.info(f"Writing output to file: {args.output}")
982 | out_df.to_csv(args.output, sep='\t', index=None)
983 |
984 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}")
985 |
--------------------------------------------------------------------------------
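For reference, a minimal sketch of the tab-separated feature table that `predict_from_feature_table` appears to expect, based on the code above; the file name, genome labels, and K numbers below are placeholders, and `check_feature_columns()` later fills any missing required K-number columns with zeros and drops unused ones.

```python
# A minimal sketch, assuming one row per genome and one 0/1 column per KEGG
# K number; the K numbers and file names below are placeholders.
import pandas as pd

features = pd.DataFrame(
    {"K00001": [1, 0], "K00002": [0, 1], "K00003": [1, 1]},
    index=["genome_A.tsv", "genome_B.tsv"],
)

# predict_from_feature_table reads the table back with pd.read_csv(..., sep="\t"),
# so write it tab-separated; non-feature columns are dropped and missing required
# features are filled with 0 by check_feature_columns().
features.to_csv("features.tsv", sep="\t", index=False)
```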
/package/src/metapathpredict/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/__init__.py
--------------------------------------------------------------------------------
/package/src/metapathpredict/data/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/__init__.py
--------------------------------------------------------------------------------
/package/src/metapathpredict/data/labels.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/labels.pkl
--------------------------------------------------------------------------------
/package/src/metapathpredict/data/metapathmodules.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/metapathmodules.pkl
--------------------------------------------------------------------------------
/package/src/metapathpredict/data/requiredCols.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/requiredCols.pkl
--------------------------------------------------------------------------------
/package/src/metapathpredict/download_models.py:
--------------------------------------------------------------------------------
1 | #import pyxet
2 | import importlib
3 | import shutil
4 | from importlib import resources
5 | from huggingface_hub import hf_hub_download
6 |
7 |
8 | class Download:
9 | """Functions to download MetaPathPredict's machine learning models"""
10 |
11 | @classmethod
12 | def download_models(cls):
13 | """Downloads MetaPathPredict's models.
14 |
15 | Returns:
16 | None
17 |
18 | """
19 | print("Downloading MetaPathPredict models...")
20 | module_dir = resources.files('metapathpredict')
21 | data_dir = module_dir.joinpath("data/")
22 | # model_0_dl_path = "xet://dgellermcgrath/MetaPathPredict/main/package/src/metapathpredict/data/model_0.keras"
23 | # model_1_dl_path = "xet://dgellermcgrath/MetaPathPredict/main/package/src/metapathpredict/data/model_1.keras"
24 | model_0_install_path = module_dir.joinpath("data/MetaPathPredict_model_0.keras")
25 | model_1_install_path = module_dir.joinpath("data/MetaPathPredict_model_1.keras")
26 |
27 | model_0_renamed_dir_path = module_dir.joinpath("data/model_0.keras_directory")
28 | model_1_renamed_dir_path = module_dir.joinpath("data/model_1.keras_directory")
29 |
30 | model_0_initial_path = module_dir.joinpath("data/model_0.keras_directory/MetaPathPredict_model_0.keras")
31 | model_1_initial_path = module_dir.joinpath("data/model_1.keras_directory/MetaPathPredict_model_1.keras")
32 |
33 | model_0_final_path = module_dir.joinpath("data/model_0.keras")
34 | model_1_final_path = module_dir.joinpath("data/model_1.keras")
35 |
36 | download_destination = module_dir.joinpath("data/")
37 |
38 | hf_hub_download(repo_id="dgellermcgrath/MetaPathPredict", filename="MetaPathPredict_model_0.keras", local_dir=model_0_install_path, force_download=True)
39 | hf_hub_download(repo_id="dgellermcgrath/MetaPathPredict", filename="MetaPathPredict_model_1.keras", local_dir=model_1_install_path, force_download=True)
40 |
41 | # rename the model directories downloaded from HuggingFace
42 | shutil.move(model_0_install_path, model_0_renamed_dir_path)
43 | shutil.move(model_1_install_path, model_1_renamed_dir_path)
44 |
45 | # move the models out of their directories and rename them
46 | shutil.move(model_0_initial_path, model_0_final_path)
47 | shutil.move(model_1_initial_path, model_1_final_path)
48 |
49 | # remove the directories downloaded from HuggingFace
50 | shutil.rmtree(model_0_renamed_dir_path)
51 | shutil.rmtree(model_1_renamed_dir_path)
52 |
53 | # fs = pyxet.XetFS() # fsspec filesystem
54 | # fs.get(model_0_dl_path, str(model_0_install_path))
55 | # fs.get(model_1_dl_path, str(model_1_install_path))
56 | print("Models were downloaded to: " + str(download_destination))
57 | print("All done. Use MetaPathPredict -h to see how to make predictions.")
58 |
--------------------------------------------------------------------------------
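A quick sanity-check sketch for the download step above, assuming the package is installed and `DownloadModels` has already been run; it simply confirms that the renamed model files load with Keras from the package's data directory.

```python
# A minimal sketch, assuming metapathpredict is installed and DownloadModels
# has placed model_0.keras and model_1.keras in the package data directory.
from importlib import resources

import keras

data_dir = resources.files("metapathpredict").joinpath("data")
for name in ("model_0.keras", "model_1.keras"):
    model = keras.models.load_model(data_dir.joinpath(name))
    print(f"{name}: loaded model with {model.count_params()} parameters")
```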
/package/src/metapathpredict/utils.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import re
3 | import gzip
4 | import numpy as np
5 | import pandas as pd
6 |
7 |
8 | class InputData:
9 |
10 | """Data parsing functions of input data"""
11 |
12 |
13 |     def __init__(self, files, annotations = None):
14 |         self.files = files
15 |         self.annotations = annotations if annotations is not None else []  # avoid a shared mutable default
16 |
17 |     def read_kofamscan_detailed_tsv(self):
18 |         """Reads in one or more KofamScan detailed .tsv files, each with columns:
19 |         0: "surpassed_threshold", 1: "gene_identifier", 2: "k_number",
20 |         3: "adaptive_threshold", 4: "score", 5: "evalue", 6: "definition".
21 |         Keeps only rows where "surpassed_threshold" is "*". When a gene identifier
22 |         has multiple annotations, keeps the row with the highest "score". If a gene
23 |         identifier still has multiple rows sharing the same maximum score,
24 |         calculates the score-to-adaptive-threshold ratio and keeps the annotation
25 |         with the highest ratio.
26 | 
27 |         Returns:
28 |             None. Appends one annotation DataFrame per input file to `self.annotations`.
29 |         """
30 |
31 | if type(self.files) is str:
32 | self.files = [self.files]
33 |
34 | for file in self.files:
35 | lines = []
36 |
37 | if file.endswith(".gz"):
38 | with gzip.open(file, "rb") as f:
39 | for row in f:
40 | if row.decode().split("\t")[0] == "*":
41 | lines.append(row.decode().split("\t"))
42 | else:
43 | with open(file, "rb") as f:
44 | for row in f:
45 | if row.decode().split("\t")[0] == "*":
46 | lines.append(row.decode().split("\t"))
47 |
48 | data = pd.DataFrame(lines)
49 | data.rename(columns={0: "surpassed_threshold", 1: 'gene_identifier',
50 | 2: "k_number", 3: "adaptive_threshold", 4: "score",
51 | 5: "evalue", 6: "definition"}, inplace=True)
52 |
53 | data[["adaptive_threshold", "score", "evalue"]] = data[["adaptive_threshold", "score", "evalue"]].apply(pd.to_numeric, axis = 1)
54 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["score"] == group["score"].max()]).reset_index(level = 0, drop = True)
55 |
56 | data["group_size"] = data.groupby(["gene_identifier"]).transform("size")
57 |
58 | if data["group_size"].max() > 1:
59 | n_genes = (data[['gene_identifier', 'group_size']].drop_duplicates()['group_size'] > 1).sum()
60 | print(f"""{n_genes} gene(s) contained multiple annotations that surpassed the adaptive threshold.
61 | Picking the annotation with the highest score-to-adaptive_threshold ratio for these genes.""")
62 |
63 | data["ratio"] = data["score"] / data["adaptive_threshold"]
64 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["ratio"] == group["ratio"].max()]).reset_index(level = 0, drop = True)
65 |
66 | data = data.drop(["ratio"], axis = 1)
67 |
68 | data["file_name"] = file
69 | data = data[["file_name", "gene_identifier", "k_number", "definition"]]
70 |
71 | self.annotations.append(data)
72 |
73 |
74 |
75 |
76 |     def read_kofamkoala(self):
77 |         """Reads in one or more kofamKOALA .tsv files, each with columns:
78 |         0: "gene_identifier", 1: "k_number", 2: "adaptive_threshold", 3: "score",
79 |         4: "evalue", 5: "definition", 6: "definition_2". Keeps only rows where
80 |         "score" exceeds "adaptive_threshold", or, when no adaptive threshold is
81 |         available, rows with an e-value of 1e-50 or lower. When a gene identifier
82 |         has multiple retained annotations, keeps the row with the highest "score".
83 |         If a gene identifier still has multiple rows sharing the same maximum
84 |         score, keeps the annotation with the highest score-to-adaptive-threshold ratio.
85 | 
86 |         Returns:
87 |             None. Appends one annotation DataFrame per input file to `self.annotations`.
88 |         """
89 |
90 | if type(self.files) is str:
91 | self.files = [self.files]
92 |
93 | for file in self.files:
94 | lines = []
95 |
96 | if file.endswith(".gz"):
97 | with gzip.open(file, "rb") as f:
98 | for row in f:
99 | if row.decode().split("\t")[0] == "gene":
100 | continue
101 | elif row.decode().split("\t")[3] == "-":
102 | continue
103 | elif row.decode().split("\t")[2] == "-":
104 | if float(row.decode().split("\t")[4]) <= 1e-50:
105 | lines.append(row.decode().split("\t"))
106 | else:
107 | continue
108 | else:
109 | if float(row.decode().split("\t")[3]) > float(row.decode().split("\t")[2]):
110 | lines.append(row.decode().split("\t"))
111 | else:
112 | with open(file, "rb") as f:
113 | for row in f:
114 | if row.decode().split("\t")[0] == "gene":
115 | continue
116 | elif row.decode().split("\t")[3] == "-":
117 | continue
118 | elif row.decode().split("\t")[2] == "-":
119 | if float(row.decode().split("\t")[4]) <= 1e-50:
120 | lines.append(row.decode().split("\t"))
121 | else:
122 | continue
123 | else:
124 | if float(row.decode().split("\t")[3]) > float(row.decode().split("\t")[2]):
125 | lines.append(row.decode().split("\t"))
126 |
127 | data = pd.DataFrame(lines)
128 | data.rename(columns={0: "gene_identifier", 1: 'k_number',
129 | 2: "adaptive_threshold", 3: "score", 4: "evalue",
130 | 5: "definition", 6: "definition_2"}, inplace=True)
131 |
132 | data.loc[data["adaptive_threshold"] == "-", "adaptive_threshold"] = 1
133 |
134 | data[["adaptive_threshold", "score", "evalue"]] = data[["adaptive_threshold", "score", "evalue"]].apply(pd.to_numeric, axis = 1)
135 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["score"] == group["score"].max()]).reset_index(level = 0, drop = True)
136 |
137 | data["group_size"] = data.groupby(["gene_identifier"]).transform("size")
138 |
139 | if data["group_size"].max() > 1:
140 | n_genes = (data[['gene_identifier', 'group_size']].drop_duplicates()['group_size'] > 1).sum()
141 | print(f"""{n_genes} gene(s) contained multiple annotations that surpassed the adaptive threshold.
142 | Picking the annotation with the highest score-to-adaptive_threshold ratio for these genes.""")
143 |
144 | data["ratio"] = data["score"] / data["adaptive_threshold"]
145 | data = data.groupby("gene_identifier", group_keys = False).apply(lambda group: group.loc[group["ratio"] == group["ratio"].max()]).reset_index(level = 0, drop = True)
146 |
147 | data = data.drop(["ratio"], axis = 1)
148 |
149 | data["file_name"] = file
150 | data = data[["file_name", "gene_identifier", "k_number", "definition"]]
151 |
152 | self.annotations.append(data)
153 |
154 |
155 |
156 |     def read_dram_annotation_tsv(self):
157 |         """Reads in one or more DRAM annotation.tsv files, keeping the "gene_identifier",
158 |         "k_number", and "definition" columns along with the source "file_name". Keeps
159 |         only rows where a gene had a KEGG Ortholog annotation.
160 | 
161 |         Returns:
162 |             None. Appends one annotation DataFrame per input file to `self.annotations`.
163 |         """
164 |
165 | pattern = "K[0-9]{5}"
166 |
167 | if type(self.files) is str:
168 | self.files = [self.files]
169 |
170 | for file in self.files:
171 | lines = []
172 | if file.endswith(".gz"):
173 | with gzip.open(file, "rb") as f:
174 | for row in f:
175 | if re.match(pattern, row.decode().split("\t")[8]):
176 | lines.append(row.decode().split("\t"))
177 | else:
178 | with open(file, "rb") as f:
179 | for row in f:
180 | if re.match(pattern, row.decode().split("\t")[8]):
181 | lines.append(row.decode().split("\t"))
182 |
183 | data = pd.DataFrame(lines)[[0,8,9]]
184 | data.rename(columns={0: "gene_identifier", 8: 'k_number',
185 | 9: "definition"}, inplace=True)
186 | data["file_name"] = file
187 | data = data[["file_name", "gene_identifier", "k_number", "definition"]]
188 |
189 |
190 | self.annotations.append(data)
191 |
192 |
193 |
194 |     def read_koala_tsv(self):
195 |         """Reads in one or more blastKOALA or ghostKOALA .tsv files, keeping the
196 |         "gene_identifier", "k_number", and "definition" columns along with the
197 |         source "file_name". Keeps only rows where a gene had a KEGG Ortholog annotation.
198 | 
199 |         Returns:
200 |             None. Appends one annotation DataFrame per input file to `self.annotations`.
201 |         """
202 |
203 | pattern = "K[0-9]{5}"
204 |
205 | if type(self.files) is str:
206 | self.files = [self.files]
207 |
208 | for file in self.files:
209 | lines = []
210 | if file.endswith(".gz"):
211 | with gzip.open(file, "rb") as f:
212 | for row in f:
213 | if re.match(pattern, row.decode().split("\t")[1]):
214 | lines.append(row.decode().split("\t"))
215 | else:
216 | with open(file, "rb") as f:
217 | for row in f:
218 | if re.match(pattern, row.decode().split("\t")[1]):
219 | lines.append(row.decode().split("\t"))
220 |
221 | data = pd.DataFrame(lines)[[0,1,2]]
222 | data.rename(columns={0: "gene_identifier", 1: 'k_number',
223 | 2: "definition"}, inplace=True)
224 | data["file_name"] = file
225 | data = data[["file_name", "gene_identifier", "k_number", "definition"]]
226 |
227 | self.annotations.append(data)
228 |
229 |
230 |
231 | class AnnotationList:
232 |
233 | """Data formatting functions to feed formatted data to the MetaPathPredict function"""
234 |
235 |
236 |     def __init__(self, requiredColumnsAll, requiredColumnsModel0, requiredColumnsModel1, annotations, feature_df = None):
237 |         self.requiredColumnsAll = requiredColumnsAll # all required columns across both models
238 |         self.requiredColumnsModel0 = requiredColumnsModel0 # required columns for the first model
239 |         self.requiredColumnsModel1 = requiredColumnsModel1 # required columns for the second model
240 |         self.annotations = annotations
241 |         self.feature_df = feature_df if feature_df is not None else pd.DataFrame()  # avoid a shared mutable default
242 |
243 |
244 |
245 |     def create_feature_df(self):
246 |         """Converts a list of annotation DataFrames into a single presence/absence feature DataFrame.
247 | 
248 |         Returns:
249 |             None. Stores one row per input file and one column per KEGG K number in `self.feature_df`.
250 |         """
251 |
252 | for df in self.annotations:
253 | df["count"] = 1
254 | self.feature_df = pd.concat([self.feature_df, df], axis = 0)
255 |
256 | self.feature_df = self.feature_df.groupby(["file_name", "k_number"]).agg(count=("count", "sum")).reset_index().pivot_table(
257 | index = "file_name",
258 | columns = "k_number",
259 | values = "count",
260 | aggfunc = "first")
261 |
262 |         self.feature_df = self.feature_df.replace(np.nan, 0)
263 |         self.feature_df = self.feature_df.where(self.feature_df <= 1, 1)  # cap counts at 1 (presence/absence)
264 |
265 |
266 |
267 |     def check_feature_columns(self):
268 |         """Adds any missing required feature columns (filled with 0), drops unused columns, and reorders them.
269 | 
270 |         Returns:
271 |             None. Replaces `self.feature_df` with a two-element list holding one DataFrame per model.
272 |         """
273 |
274 | cols_to_add = [col for col in self.requiredColumnsAll if col not in self.feature_df.columns]
275 | #self.feature_df.loc[:, cols_to_add] = 0
276 | col_dict = dict.fromkeys(cols_to_add, 0)
277 | temp_df = pd.DataFrame(col_dict, index = self.feature_df.index)
278 | self.feature_df = pd.concat([self.feature_df, temp_df], axis = 1)
279 |
280 | cols_to_drop = [col for col in self.feature_df.columns if col not in self.requiredColumnsAll]
281 | self.feature_df.drop(cols_to_drop, axis = 1, inplace = True)
282 |
283 | self.feature_df = self.feature_df.reindex(self.requiredColumnsAll, axis = 1)
284 |
285 |         self.feature_df = [self.feature_df, self.feature_df]  # same aligned feature table, one entry per model
286 |
287 |
288 |
289 |     def select_model_features(self):
290 |         """Selects the required feature columns for each MetaPathPredict model.
291 | 
292 |         Returns:
293 |             None. Subsets each entry of `self.feature_df` in place.
294 |         """
295 | 
296 |         self.feature_df[0] = self.feature_df[0][self.requiredColumnsModel0]
297 |         self.feature_df[0] = self.feature_df[0].reindex(self.requiredColumnsModel0, axis = 1)
298 | 
299 |         self.feature_df[1] = self.feature_df[1][self.requiredColumnsModel1]
300 |         self.feature_df[1] = self.feature_df[1].reindex(self.requiredColumnsModel1, axis = 1)
301 |
302 |
303 |
304 | # def transform_model_features(self, scaler_0, scaler_1):
305 | # """Transforms all required columns for the specified MetaPathPredict model (both model #1 and model #2).
306 | #
307 | # Returns:
308 | # A Pandas DataFrame.
309 | # """
310 | #
311 | # scaled_features_0 = scaler_0.transform(self.feature_df[0])
312 | # self.feature_df[0] = pd.DataFrame(scaled_features_0, index = self.feature_df[0].index, columns = self.feature_df[0].columns)
313 | #
314 | # scaled_features_1 = scaler_1.transform(self.feature_df[1])
315 | # self.feature_df[1] = pd.DataFrame(scaled_features_1, index = self.feature_df[1].index, columns = self.feature_df[1].columns)
316 |
--------------------------------------------------------------------------------
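A short end-to-end sketch of how the `InputData` and `AnnotationList` classes above fit together; the annotation file name is a placeholder, and the three K numbers stand in for the list normally unpickled from `data/requiredCols.pkl`.

```python
# A minimal sketch, assuming "annotations.tsv.gz" is a KofamScan detailed-format
# file and that the K numbers below stand in for requiredCols.pkl.
from metapathpredict.utils import InputData, AnnotationList

files = InputData(files="annotations.tsv.gz")
files.read_kofamscan_detailed_tsv()   # score ties are broken by score/adaptive_threshold

required_columns = ["K00001", "K00002", "K00003"]  # placeholder for requiredCols.pkl
annotation_list = AnnotationList(
    requiredColumnsAll=required_columns,
    requiredColumnsModel0="blank",
    requiredColumnsModel1="blank",
    annotations=files.annotations,
)
annotation_list.create_feature_df()      # one row per file, one 0/1 column per K number
annotation_list.check_feature_columns()  # align to required columns; feature_df becomes [df, df]
print(annotation_list.feature_df[0].shape)
```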