├── .gitignore ├── MANIFEST.in ├── README.md ├── annotatation_examples ├── blastKoala_annotations.tsv.gz ├── dram_annotations.tsv.gz ├── ghostKoala_annotations.tsv.gz └── kofamscan_annotations.tsv.gz └── package ├── build └── lib │ └── metapathpredict │ └── cmdline_models.py ├── setup.py └── src ├── .DS_Store ├── metapathpredict.egg-info ├── PKG-INFO ├── SOURCES.txt ├── dependency_links.txt ├── entry_points.txt ├── requires.txt └── top_level.txt └── metapathpredict ├── .DS_Store ├── MetaPathPredict.py ├── __init__.py ├── data ├── __init__.py ├── labels.pkl ├── metapathmodules.pkl └── requiredCols.pkl ├── download_models.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | package/build 2 | metapathpredict.log 3 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | recursive-include src/metapathpredict/ *.py 2 | recursive-include src/metapathpredict/data *.pkl 3 | recursive-include src/metapathpredict/data *.keras 4 | recursive-include src/metapathpredict/data *.py 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MetaPathPredict 2 | 3 | The MetaPathPredict Python module utilizes deep learning models to predict the presence or absence of KEGG metabolic modules in bacterial genomes recovered from environmental sequencing efforts. 4 | 5 | ## Installation 6 | 7 | To run MetaPathPredict, download this repository and install it as a Python module (see download and installation instructions below): 8 | 9 | 10 | ### GitHub install: 11 | 12 | NOTE: [Conda](https://docs.conda.io/en/latest/) is required for this installation. 13 | 14 | 1. Open a Terminal/Command Prompt window and run the following command to download the 15 | GitHub repository to the desired location (note: change your current working directory first 16 | to the desired download location, e.g., `~/Downloads` on MacOS): 17 | `git clone https://github.com/d-mcgrath/MetaPathPredict.git` 18 | 19 | 1. NOTE: You can also download the repository zip file from GitHub 20 | 21 | 2. In a Terminal/Command Prompt window, run the following commands from the parent directory the MetaPathPredict repository was cloned to: 22 | ```bash 23 | conda create -n MetaPathPredict python=3.10.6 scikit-learn=1.3.0 tensorflow=2.10.0 numpy=1.23.4 pandas=1.5.2 keras=2.10.0 git=2.40.1 24 | ``` 25 | NOTE: You will be prompted (y/n) to confirm creating this conda environment. Now activate it: 26 | 27 | ```bash 28 | conda activate MetaPathPredict 29 | ``` 30 | 31 | 3. Install the `huggingface_hub` library: 32 | ```bash 33 | pip install --upgrade huggingface_hub 34 | ``` 35 | 36 | 4. Once complete, pip install MetaPathPredict: 37 | ```bash 38 | pip install MetaPathPredict/package 39 | ``` 40 | 41 | 5. Download MetaPathPredict's models by running the following command: 42 | ```bash 43 | DownloadModels 44 | ``` 45 | 46 | Note: MetaPathPredict is now installed in the `MetaPathPredict` conda environment. Activate the conda environment prior to any use of MetaPathPredict. 47 | 48 | ### pip install: 49 | [not available yet] 50 | 51 |
52 | 
53 | ## Functions
54 | 
55 | The following functions can be run on the command line:
56 | 
57 | - `MetaPathPredict` parses one or more input KEGG Ortholog gene annotation datasets (currently only bacterial genome data is supported) and predicts the presence or absence of [KEGG Modules](https://www.genome.jp/kegg/module.html). This function takes as input the .tsv output files from the [KofamScan](https://github.com/takaram/kofam_scan) and [DRAM](https://github.com/WrightonLabCSU/DRAM) gene annotation tools, as well as from the KEGG KOALA online annotation platforms [blastKOALA](https://www.kegg.jp/blastkoala/), [ghostKOALA](https://www.kegg.jp/ghostkoala/), and [kofamKOALA](https://www.genome.jp/tools/kofamkoala/). Run any of these tools first, then use one or more of their output .tsv files as input to MetaPathPredict.
58 | - A single file or multiple space-separated files can be passed to the `--input` parameter, or a wildcard can be used (e.g., /results/*.tsv). Include full or relative paths to the input file(s). A sample of each annotation file format that MetaPathPredict can process is included in this repository in the [annotatation_examples](annotatation_examples) folder. These sample annotation files can optionally be used to test the installation (see the sketch after this list).
59 | - The format of the input gene annotation files (kofamscan, kofamkoala, dram, or koala) must be specified with the `--annotation-format` parameter. Currently, only one annotation format can be used per run.
60 | - The path (full or relative) and file name for MetaPathPredict's output .tsv file must be specified with the `--output` parameter. MetaPathPredict does not create an output directory or assign a default output file name.
61 | - To restrict reconstruction and prediction to one or more specific KEGG modules, pass the module identifier(s) (e.g., M00001) as a space-separated list to the `--kegg-modules` argument.
62 | 
63 | - To view which KEGG modules MetaPathPredict can reconstruct and make predictions for, run `MetaPathModules` on the command line.
64 | 
65 |
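As a quick test of the installation, the bundled sample annotation files can be decompressed and passed to `MetaPathPredict`. The following is a minimal sketch (not part of the package) that assumes the repository was cloned into the current working directory, the `MetaPathPredict` conda environment is active, and `test_predictions.tsv` is an acceptable output name; adjust paths as needed.

```python
# Minimal installation test (sketch): decompress the bundled KofamScan sample
# annotations and run the MetaPathPredict command-line entry point on them.
# Assumes the repository was cloned into the current working directory and the
# MetaPathPredict conda environment is active; file/output names are examples.
import gzip
import shutil
import subprocess

compressed = "MetaPathPredict/annotatation_examples/kofamscan_annotations.tsv.gz"
decompressed = "kofamscan_annotations.tsv"

# gunzip the sample annotation file
with gzip.open(compressed, "rb") as src, open(decompressed, "wb") as dst:
    shutil.copyfileobj(src, dst)

# equivalent to: MetaPathPredict -i kofamscan_annotations.tsv -a kofamscan -o test_predictions.tsv
subprocess.run(
    ["MetaPathPredict", "-i", decompressed, "-a", "kofamscan", "-o", "test_predictions.tsv"],
    check=True,
)
```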
66 | 67 | ## Basic usage 68 | 69 | ``` 70 | # predict method for making KEGG module presence/absence predictions on input gene annotations 71 | 72 | usage: MetaPathPredict [-h] --input INPUT [INPUT ...] --annotation-format ANNOTATION_FORMAT 73 | [--kegg-modules KEGG_MODULES [KEGG_MODULES ...]] --output OUTPUT 74 | 75 | options: 76 | -h, --help show this help message and exit 77 | --input INPUT [INPUT ...], -i INPUT [INPUT ...] 78 | input file path(s) and name(s) [required] 79 | --annotation-format ANNOTATION_FORMAT, -a ANNOTATION_FORMAT 80 | annotation format (kofamscan, kofamkoala, dram, or koala) [default: 81 | kofamscan] 82 | --kegg-modules KEGG_MODULES [KEGG_MODULES ...], -k KEGG_MODULES [KEGG_MODULES ...] 83 | KEGG modules to predict [default: MetaPathPredict KEGG modules] 84 | --output OUTPUT, -o OUTPUT 85 | output file path and name [required] 86 | ``` 87 | 88 |
89 | 90 | ## Examples with sample datasets 91 | 92 | ``` 93 | # One KofamScan gene annotation dataset 94 | MetaPathPredict -i /path/to/kofamscan_annotations_1.tsv -a kofamscan -o /results/predictions.tsv 95 | 96 | # Three KofamScan gene annotation datasets, with predictions for modules M00001 and M00003 97 | MetaPathPredict \ 98 | -i kofamscan_annotations_1.tsv kofamscan_annotations_2.tsv kofamscan_annotations_3.tsv \ 99 | -a kofamscan \ 100 | -k M00001 M00003 \ 101 | -o /results/predictions.tsv 102 | 103 | # Multiple KofamScan datasets in a directory 104 | MetaPathPredict -i annotations/*.tsv -a kofamscan -o /results/predictions.tsv 105 | 106 | # One DRAM gene annotation dataset 107 | MetaPathPredict -i dram_annotation.tsv -a dram -o /results/predictions.tsv 108 | 109 | # Multiple DRAM datasets in a directory 110 | MetaPathPredict -i annotations/*.tsv -a dram -o /results/predictions.tsv 111 | ``` 112 | 113 |
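Because only one annotation format can be used per run, genomes annotated with different tools can be processed in separate runs and the resulting tables combined afterwards. Below is a minimal sketch of how that could be done with pandas; the file names are hypothetical, and both runs are assumed to have predicted the same (default) set of modules so the prediction columns match.

```python
# Sketch: combine prediction tables from separate MetaPathPredict runs
# (e.g., one run on KofamScan annotations and one on DRAM annotations).
# File names are hypothetical; both runs are assumed to predict the same
# modules, so the tables share identical prediction columns.
import pandas as pd

kofamscan_preds = pd.read_csv("/results/kofamscan_predictions.tsv", sep="\t")
dram_preds = pd.read_csv("/results/dram_predictions.tsv", sep="\t")

combined = pd.concat([kofamscan_preds, dram_preds], ignore_index=True)
combined.to_csv("/results/combined_predictions.tsv", sep="\t", index=False)
```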
114 | 115 | ## Understanding the output 116 | 117 | The output of running `MetaPathPredict` is a table. The first column, `file`, displays the full file name of each input gene annotation file. The remaining columns give the class predictions (module present = 1; module absent = 0) of KEGG modules. Each KEGG module occupies a single column in the table and is labelled by its module identifier. See a sample output below of four KEGG module predictions for three input annotation files: 118 | 119 | | file | M00001 | M00002 | M00003 | M00004 | 120 | |--------------------------------------|--------|--------|--------|--------| 121 | | /path/to/kofamscan_annotations_1.tsv | 1 | 1 | 0 | 1 | 122 | | /path/to/kofamscan_annotations_2.tsv | 0 | 1 | 0 | 0 | 123 | | /path/to/kofamscan_annotations_3.tsv | 1 | 0 | 0 | 0 | 124 | 125 |
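Since the output is a plain tab-separated table, it can be inspected downstream with standard tooling. A minimal sketch with pandas is shown below; the output path matches the examples above, and the summaries shown (per-module presence counts and per-genome module totals) are only one possible way to use the table.

```python
# Sketch: load a MetaPathPredict output table and summarize the predictions.
# The path matches the earlier examples; columns other than 'file' are KEGG
# module identifiers with 1 = predicted present, 0 = predicted absent.
import pandas as pd

preds = pd.read_csv("/results/predictions.tsv", sep="\t")

# number of input genomes predicted to contain each module
module_counts = preds.drop(columns="file").sum().sort_values(ascending=False)
print(module_counts.head())

# number of modules predicted present in each input genome
per_genome = preds.set_index("file").sum(axis=1)
print(per_genome)
```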
126 | 127 | ## Developer usage 128 | 129 | ``` 130 | # training method for MetaPathPredict's internal models 131 | 132 | usage: MetaPathTrain [-h] --train-targets TRAIN_TARGETS --train-features TRAIN_FEATURES 133 | [--num-epochs NUM_EPOCHS] --model-out MODEL_OUT [--use-gpu] 134 | [--num-cores NUM_CORES] [--num-hidden-layers NUM_HIDDEN_LAYERS] 135 | [--hidden-nodes-per-layer HIDDEN_NODES_PER_LAYER] 136 | [--num-features NUM_FEATURES] [--threshold THRESHOLD] 137 | 138 | options: 139 | -h, --help show this help message and exit 140 | --train-targets TRAIN_TARGETS 141 | training targets file 142 | --train-features TRAIN_FEATURES 143 | training features 144 | --num-epochs NUM_EPOCHS 145 | number of epochs 146 | --model-out MODEL_OUT, -m MODEL_OUT 147 | model file name output 148 | --use-gpu use GPU if available 149 | --num-cores NUM_CORES 150 | Number of cores for parallel processing 151 | 152 | Neural Net parameters: 153 | --num-hidden-layers NUM_HIDDEN_LAYERS 154 | number of hidden layers 155 | --hidden-nodes-per-layer HIDDEN_NODES_PER_LAYER 156 | number of nodes in each hidden layer 157 | --num-features NUM_FEATURES 158 | number of features to retain from training data 159 | --threshold THRESHOLD 160 | threshold for SelectKBest feature selection 161 | ``` 162 | -------------------------------------------------------------------------------- /annotatation_examples/blastKoala_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/blastKoala_annotations.tsv.gz -------------------------------------------------------------------------------- /annotatation_examples/dram_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/dram_annotations.tsv.gz -------------------------------------------------------------------------------- /annotatation_examples/ghostKoala_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/ghostKoala_annotations.tsv.gz -------------------------------------------------------------------------------- /annotatation_examples/kofamscan_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/kofamscan_annotations.tsv.gz -------------------------------------------------------------------------------- /package/build/lib/metapathpredict/cmdline_models.py: -------------------------------------------------------------------------------- 1 | """ 2 | Command Line Interface for MetaPathPredict Tools: 3 | ==================================== 4 | 5 | .. 
currentmodule:: metapathpredict 6 | 7 | class methods: 8 | MetaPathPredict methods 9 | """ 10 | 11 | import logging 12 | import argparse 13 | import datetime 14 | import pickle 15 | import os 16 | import sys 17 | import re 18 | import math 19 | import importlib 20 | from typing import Iterable, List, Dict, Set, Optional, Sequence 21 | from itertools import chain 22 | 23 | # disable tensorflow info messages 24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 25 | 26 | import sklearn 27 | import numpy as np 28 | import pandas as pd 29 | import keras 30 | from torchvision import transforms 31 | import torch.optim as optim 32 | from torch.utils.data import Dataset, DataLoader, TensorDataset 33 | from sklearn.model_selection import train_test_split 34 | from sklearn.preprocessing import StandardScaler 35 | from sklearn.feature_selection import SelectKBest, f_classif 36 | from sklearn.metrics import classification_report 37 | import torch 38 | import torch.nn as nn 39 | 40 | import warnings 41 | from sklearn.exceptions import InconsistentVersionWarning 42 | warnings.filterwarnings(action='ignore', category=InconsistentVersionWarning) 43 | 44 | from metapathpredict.utils import InputData 45 | from metapathpredict.utils import AnnotationList 46 | 47 | 48 | # CUDA for PyTorch 49 | use_cuda = torch.cuda.is_available() 50 | device = torch.device("cuda:0" if use_cuda else "cpu") 51 | # device = "cpu" 52 | 53 | torch.backends.cudnn.benchmark = True 54 | 55 | # Parameters 56 | params = {"batch_size": 64, "shuffle": True, "num_workers": 6} 57 | 58 | #Configure the logging system 59 | logging.basicConfig( 60 | filename='HISTORYlistener.log', 61 | level=logging.DEBUG, 62 | format='%(asctime)s %(levelname)s %(module)s - %(message)s', 63 | datefmt='%Y-%m-%d %H:%M:%S') 64 | 65 | root = logging.getLogger() 66 | root.setLevel(logging.DEBUG) 67 | 68 | handler = logging.StreamHandler(sys.stdout) 69 | handler.setLevel(logging.DEBUG) 70 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 71 | handler.setFormatter(formatter) 72 | root.addHandler(handler) 73 | 74 | 75 | 76 | class CustomDataset(Dataset): 77 | def __init__(self, data, targets, transform=None): 78 | print("type", type(data), data.shape) 79 | self.data = torch.tensor(data, dtype=torch.float32) 80 | self.targets = torch.tensor(targets, dtype=torch.float32) 81 | self.transform = transform 82 | 83 | def __len__(self): 84 | return len(self.data) 85 | 86 | def __getitem__(self, idx): 87 | features, target = self.data[idx], self.targets[idx] 88 | 89 | if self.transform: 90 | sample = self.transform(sample) 91 | 92 | return features, target 93 | 94 | 95 | 96 | class CustomModel(nn.Module): 97 | def __init__(self, num_hidden_nodes_per_layer=1024, num_hidden_layers=5): 98 | super(CustomModel, self).__init__() 99 | NUM_HIDDEN_NODES = num_hidden_nodes_per_layer 100 | self.NUM_HIDDEN_LAYERS = num_hidden_layers 101 | 102 | self.fc1 = nn.Linear(2000, NUM_HIDDEN_NODES) 103 | self.relu = nn.ReLU() 104 | self.dropout = nn.Dropout(0.1) 105 | 106 | # array of hidden layers 107 | self.fcs = [ 108 | nn.Linear(NUM_HIDDEN_NODES, NUM_HIDDEN_NODES) 109 | for i in range(num_hidden_layers) 110 | ] 111 | 112 | self.output_layer = nn.Linear(NUM_HIDDEN_NODES, 94) 113 | self.sigmoid = nn.Sigmoid() 114 | 115 | def forward(self, x): 116 | x = self.fc1(x) 117 | x = self.relu(x) 118 | x = self.dropout(x) 119 | 120 | for i in range(self.NUM_HIDDEN_LAYERS - 1): 121 | x = self.fcs[i](x) 122 | x = self.relu(x) 123 | x = self.dropout(x) 124 | 125 | x = 
self.fcs[self.NUM_HIDDEN_LAYERS - 1](x) 126 | x = self.relu(x) 127 | 128 | x = self.output_layer(x) 129 | x = self.sigmoid(x) 130 | return x 131 | 132 | 133 | 134 | class Models: 135 | 136 | """Platform-agnostic command line functions available in MetaPathPredict tools.""" 137 | 138 | @classmethod 139 | def train(cls, args: Iterable[str] = None) -> int: 140 | """Train a model from the input data . 141 | 142 | Writes out a DNN model in the keras forma 143 | 144 | Parameters 145 | ---------- 146 | args : Iterable[str], optional 147 | value of None, when passed to `parser.parse_args` causes the parser to 148 | read `sys.argv` 149 | 150 | Returns 151 | ------- 152 | return_call : 0 153 | return call if the program completes successfully 154 | 155 | """ 156 | parser = argparse.ArgumentParser() 157 | 158 | parser.add_argument( 159 | "--train-targets", 160 | dest="train_targets", 161 | required=True, 162 | help="training targets file", 163 | ) 164 | parser.add_argument( 165 | "--train-features", 166 | dest="train_features", 167 | required=True, 168 | help="training features", 169 | ) 170 | parser.add_argument( 171 | "--num-epochs", 172 | dest="num_epochs", 173 | required=False, 174 | default=100, 175 | type=int, 176 | help="number of epochs", 177 | ) 178 | parser.add_argument( 179 | "--model-out", 180 | "-m", 181 | dest="model_out", 182 | required=True, 183 | help="model file name output", 184 | ) 185 | parser.add_argument( 186 | "--use-gpu", 187 | dest="use_gpu", 188 | required=False, 189 | action="store_true", 190 | help="use GPU if available", 191 | ) 192 | parser.add_argument( 193 | "--num-cores", 194 | dest="num_cores", 195 | required=False, 196 | default=10, 197 | type=int, 198 | help="Number of cores for parallel processing", 199 | ) 200 | neural_net_params = parser.add_argument_group("Neural Net parameters") 201 | neural_net_params.add_argument( 202 | "--num-hidden-layers", 203 | default=5, 204 | required=False, 205 | type=int, 206 | help="number of hidden layers", 207 | ) 208 | neural_net_params.add_argument( 209 | "--hidden-nodes-per-layer", 210 | type=int, 211 | required=False, 212 | default=1024, 213 | help="number of nodes in each hidden layer", 214 | ) 215 | neural_net_params.add_argument( 216 | "--num-features", 217 | dest="num_features", 218 | default=2000, 219 | required=False, 220 | type=int, 221 | help="number of features to retain from training data", 222 | ) 223 | neural_net_params.add_argument( 224 | "--threshold", 225 | dest="threshold", 226 | default=6432, 227 | required=False, 228 | type=float, 229 | help="threshold for SelectKBest feature selection", 230 | ) 231 | 232 | 233 | args = parser.parse_args() 234 | 235 | # CUDA for PyTorch 236 | device = "cpu" 237 | if args.use_gpu: 238 | use_cuda = torch.cuda.is_available() 239 | device = torch.device("cuda:0" if use_cuda else "cpu") 240 | 241 | logging.info(f"Using device: {device}") 242 | 243 | # read in features 244 | features = pd.read_table(args.train_features, compression="gzip") 245 | logging.info(f"reading input features of shape: {features.shape[0]} x {features.shape[1]}") 246 | 247 | # read in labels 248 | targets = pd.read_table(args.train_targets, compression="gzip") 249 | logging.info(f"reading input labels of shape: {targets.shape[0]} x {targets.shape[1]}") 250 | 251 | # split the data into training and test sets 252 | test_size = 0.25 253 | x, x_test, y, y_test = train_test_split( 254 | features, 255 | targets, 256 | stratify=targets, 257 | shuffle=True, 258 | test_size= test_size, 259 | random_state=111, 260 | 
) 261 | logging.info(f"creating test size of: {test_size}%") 262 | 263 | # Split the remaining data to train and validation 264 | x_train, x_val, y_train, y_val = train_test_split( 265 | x, y, stratify=y, test_size=0.2, shuffle=True, random_state=111 266 | ) 267 | 268 | print("features size", features.shape) 269 | print("targets size", targets.shape) 270 | 271 | print("x_test", x_test.shape, " y_test ", y_test.shape) 272 | print("x", x.shape, " y ", y.shape) 273 | 274 | print("x_train", x_train.shape, " y_train ", y_train.shape) 275 | print("x_val", x_val.shape, " y_val ", y_val.shape) 276 | print("x_test", x_test.shape, " y_test ", y_test.shape) 277 | 278 | 279 | 280 | # Initialize the StandardScaler 281 | scaler = StandardScaler() 282 | 283 | # Fit the scaler to training data and transform it 284 | # and then transform val and test data w/ the fitted scaler object 285 | # (std. dev., variance, etc. are based on training data columns) 286 | scaled_features = scaler.fit_transform(x_train) 287 | x_train = pd.DataFrame(scaled_features, index = x_train.index, columns = x_train.columns) 288 | x_val = pd.DataFrame(scaler.transform(x_val), index = x_val.index, columns = x_val.columns) 289 | x_test = pd.DataFrame(scaler.transform(x_test), index = x_test.index, columns = x_test.columns) 290 | logging.info(f"normalizing the training input features") 291 | 292 | 293 | 294 | # feature selection based only on the training data 295 | # Select features according to the k highest F-values 296 | # from running ANOVA on y_train and x_train 297 | selected_features = [] 298 | for label in y_train: 299 | selector = SelectKBest(f_classif, k = 'all') 300 | selector.fit(x_train, y_train[label]) 301 | selected_features.append(list(selector.scores_)) 302 | 303 | # select threshold that retains 2000 features 304 | threshold = args.threshold 305 | 306 | # # MeanCS 307 | logging.info(f"total number of features in input: {x_train.shape[1]}") 308 | selected_features2 = np.mean(selected_features, axis = 0) > threshold 309 | logging.info(f"number of features selected for training: {sum(selected_features2)}") 310 | 311 | # create new training, validation, and test datasets retaining only the 2000 top features 312 | # determined from the training data 313 | x_train2 = x_train.loc[:, selected_features2] 314 | x_val2 = x_val.loc[:, selected_features2] 315 | x_test2 = x_test.loc[:, selected_features2] 316 | features_used = x_train2.columns.values 317 | labels_used = y_val.columns.values 318 | 319 | logging.info(f"Using features : {str(features_used)}") 320 | logging.info(f"Using labels : {str(labels_used)}") 321 | 322 | # Initialize the StandardScaler 323 | #scaler = StandardScaler() 324 | 325 | # Fit the scaler to your data and transform it 326 | #x_train2 = scaler.fit_transform(x_train2) 327 | #x_val2 = scaler.fit_transform(x_val2) 328 | #logging.info(f"normalizing the training input features") 329 | 330 | y_train = np.asarray(y_train.values) 331 | y_val = np.asarray(y_val.values) 332 | 333 | print() 334 | print("x_train2", x_train2.shape) 335 | print("x_val2", x_val2.shape) 336 | print("x_test2", x_test2.shape) 337 | 338 | # outline the neural network architecture - multilable classifier 339 | # 1 input layer, 5 hidden layers, 1 output layer 340 | # inclue dropout for all hidden layers 341 | model = CustomModel( 342 | num_hidden_nodes_per_layer=args.hidden_nodes_per_layer, 343 | num_hidden_layers=args.num_hidden_layers, 344 | ).to(device) 345 | 346 | # Define loss function and optimizer 347 | criterion = nn.BCELoss() 348 | 
optimizer = optim.Adam(model.parameters(), lr=0.001) 349 | logging.info(f"optimizer Adam with learning rate: 0.001") 350 | 351 | # Define early stopping 352 | early_stopping = torch.optim.lr_scheduler.ReduceLROnPlateau( 353 | optimizer, "min", patience=10 354 | ) 355 | 356 | # Create an empty transform 357 | no_transform = transforms.Compose([]) 358 | 359 | # dataset DataLoader 360 | x_train2 = np.asarray(x_train2) 361 | x_val2 = np.asarray(x_val2) 362 | print("xtrain2", x_train2.shape, y_train.shape) 363 | 364 | logging.info(f"loading training dataset into dataloader") 365 | dataset = CustomDataset(data=x_train2, targets=y_train, transform=None) 366 | 367 | batch_size = 10000 368 | train_data_loader = DataLoader( 369 | dataset, batch_size=batch_size, num_workers=args.num_cores, shuffle=True 370 | ) 371 | 372 | logging.info(f"loading testing dataset into dataloader") 373 | val_dataset = CustomDataset(data=x_val2, targets=y_val, transform=None) 374 | val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True) 375 | 376 | # Train the model 377 | num_epochs = args.num_epochs 378 | logging.info(f"number of epochs for training: {num_epochs}") 379 | for epoch in range(num_epochs): 380 | model.train() 381 | train_loss = 0.0 382 | 383 | for inputs, targets in train_data_loader: 384 | inputs, targets = inputs.to(device), targets.to(device) 385 | optimizer.zero_grad() 386 | outputs = model(inputs) 387 | loss = criterion(outputs, targets) 388 | 389 | loss.backward() 390 | optimizer.step() 391 | train_loss += loss.item() 392 | 393 | model.eval() 394 | val_loss = 0.0 395 | with torch.no_grad(): 396 | for inputs, targets in val_data_loader: 397 | inputs, targets = inputs.to(device), targets.to(device) 398 | outputs = model(inputs) 399 | loss = criterion(outputs, targets) 400 | val_loss += loss.item() 401 | 402 | # Update learning rate using early stopping 403 | early_stopping.step(val_loss) 404 | 405 | logging.info( 406 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 407 | ) 408 | 409 | print( 410 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 411 | ) 412 | 413 | # assess the model on test data 414 | x_test2 = np.asarray(x_test2) 415 | x_test2 = torch.tensor(x_test2, dtype=torch.float32) 416 | logging.info(f"converting test inputs to torch.tensor") 417 | 418 | predictions_test = model(x_test2) 419 | 420 | # round predictions 421 | roundedTestPreds = np.round(predictions_test.detach().numpy()) 422 | 423 | # print out performance metrics 424 | print(classification_report(y_test.values, roundedTestPreds)) 425 | 426 | logging.info(f"Training finished successfully!") 427 | 428 | model_file = {} 429 | model_file["description"] = "neural net trained for predicting multilabels" 430 | model_file["features"] = features_used 431 | model_file["labels"] = labels_used 432 | model_file["model"] = model 433 | torch.save(model_file, args.model_out) 434 | logging.info(f"writing model file: {args.model_out}") 435 | 436 | 437 | 438 | @classmethod 439 | def predict(cls, args: Iterable[str] = None) -> int: 440 | """Predict the presence or absence of select KEGG modules on bacterial 441 | annotation data. 
442 | 443 | Parameters 444 | ---------- 445 | args : Iterable[str], optional 446 | value of None, when passed to `parser.parse_args` causes the parser to 447 | read `sys.argv` 448 | 449 | Returns 450 | ------- 451 | return_call : 0 452 | return call if the program completes successfully 453 | 454 | """ 455 | 456 | # disable tensorflow info messages 457 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 458 | 459 | parser = argparse.ArgumentParser() 460 | 461 | parser.add_argument( 462 | "--input", 463 | "-i", 464 | action = "extend", 465 | nargs = "+", 466 | dest="input", 467 | required=True, 468 | help="input file path(s) and name(s) [required]", 469 | ) 470 | parser.add_argument( 471 | "--annotation-format", 472 | "-a", 473 | dest="annotation_format", 474 | required=True, 475 | help="annotation format (kofamscan, kofamscan-web, dram, or koala) [default: kofamscan]", 476 | ) 477 | parser.add_argument( 478 | "--kegg-modules", 479 | "-k", 480 | dest="kegg_modules", 481 | required=False, 482 | default=None, 483 | action="extend", 484 | nargs="+", 485 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 486 | ) 487 | parser.add_argument( 488 | "--output", 489 | "-o", 490 | dest="output", 491 | required=True, 492 | help="output file path and name [required]", 493 | ) 494 | 495 | args = parser.parse_args() 496 | 497 | module_dir = importlib.resources.files('metapathpredict') 498 | data_dir = module_dir.joinpath("data/") 499 | 500 | scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 501 | scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 502 | 503 | model_0_path = module_dir.joinpath("data/model_0.keras") 504 | model_1_path = module_dir.joinpath("data/model_1.keras") 505 | 506 | labels_path = module_dir.joinpath("data/labels.pkl") 507 | 508 | with open(scaler_0_path, "rb") as f: 509 | model_0_scaler = pickle.load(f) 510 | 511 | with open(scaler_1_path, "rb") as f: 512 | model_1_scaler = pickle.load(f) 513 | 514 | with open(labels_path, "rb") as f: 515 | labels = pickle.load(f) 516 | 517 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 518 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 519 | 520 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 521 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 522 | 523 | # logging.info(f"Reading model files from directory: {data_dir}") 524 | # logging.info(f"Reading scaler files from directory: {data_dir}") 525 | 526 | 527 | # load the input features 528 | files_list = InputData(files = args.input) 529 | 530 | if args.annotation_format == "kofamscan": 531 | files_list.read_kofamscan_detailed_tsv() 532 | 533 | elif args.annotation_format == "kofamkoala": 534 | files_list.read_kofamkoala() 535 | 536 | elif args.annotation_format == "dram": 537 | files_list.read_dram_annotation_tsv() 538 | 539 | elif args.annotation_format == "koala": 540 | files_list.read_koala_tsv() 541 | 542 | else: 543 | logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 544 | sys.exit(0) 545 | 546 | logging.info(f"Reading input files with format: {args.annotation_format}") 547 | 548 | model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_) 549 | model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_) 550 | reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 551 | 552 | input_features = AnnotationList( 553 | requiredColumnsAll = reqColsAll, # 
add list of all required columns for model #1 and model #2 554 | requiredColumnsModel0 = model_0_scaler.feature_names_in_, # add list of all required columns for model #1 555 | requiredColumnsModel1 = model_1_scaler.feature_names_in_, # add list of all required columns for model #2 556 | annotations = files_list.annotations) 557 | 558 | input_features.create_feature_df() 559 | input_features.check_feature_columns() 560 | input_features.select_model_features() 561 | input_features.transform_model_features(model_0_scaler, model_1_scaler) 562 | 563 | logging.info("Making KEGG module presence/absence predictions") 564 | 565 | predictions_list = [] 566 | for x in range(2): 567 | 568 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 569 | 570 | # predict 571 | #predictions = models[x]['model'](features) 572 | logging.info(f"Model {x} is making predictions") 573 | predictions = models[x].predict(input_features.feature_df[x]) 574 | 575 | # round predictions 576 | #roundedPreds = np.round(predictions.detach().numpy()) 577 | roundedPreds = np.round(predictions) 578 | 579 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 580 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int) 581 | 582 | predictions_list.append(predsDf) 583 | 584 | logging.info(f"Model {x} completed predictions") 585 | 586 | logging.info("All done.") 587 | 588 | out_df = pd.concat(predictions_list, axis = 1) 589 | 590 | if args.kegg_modules is not None: 591 | if all(modules in out_df.columns for modules in args.kegg_modules): 592 | out_df = out_df[args.kegg_modules] 593 | else: 594 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 595 | 596 | out_df.insert(loc = 0, column = 'file', value = args.input) 597 | 598 | logging.info(f"Writing output to file: {args.output}") 599 | out_df.to_csv(args.output, sep='\t', index=None) 600 | 601 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 602 | -------------------------------------------------------------------------------- /package/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import Extension, setup, find_packages 2 | import os 3 | 4 | CLASSIFIERS = [ 5 | "Development Status :: 4 - Beta", 6 | "Natural Language :: English", 7 | "License :: OSI Approved :: BSD License", 8 | "Operating System :: Linux, MacOS, Windows", 9 | "Programming Language :: Python :: 3.10.6+" 10 | ] 11 | 12 | setup( 13 | name="metapathpredict", 14 | description="Tool for predicting the presence or absence of KEGG modules in bacterial genomes", 15 | author="D. Geller-McGrath, K.M. Konwar, V.P. Edgcomb, M. Pachiadaki, J.W. Roddy, T.J. Wheeler, J.E. 
McDermott", 16 | author_email="dgellermcgrath@gmail.com, kishori82@gmail.com", 17 | package_dir={"": "src"}, 18 | packages=["metapathpredict"], 19 | package_data={"metapathpredict": ["data/*.*"]}, 20 | install_requires=[ 21 | "scikit-learn>=1.1.3", 22 | "tensorflow>=2.10.0", 23 | "numpy>=1.23.4", 24 | "pandas>=1.5.2", 25 | "keras>=2.10.0", 26 | "torchvision>=0.15.2", 27 | "torch>=2.0.1", 28 | ], 29 | entry_points={ 30 | "console_scripts": [ 31 | "MetaPathTrain = metapathpredict.MetaPathPredict:Models.train", 32 | "MetaPathPredict = metapathpredict.MetaPathPredict:Models.predict", 33 | "MetaPathModules = metapathpredict.MetaPathPredict:Models.show_available_modules", 34 | "DownloadModels = metapathpredict.download_models:Download.download_models", 35 | "PredictFromTable = metapathpredict.MetaPathPredict:Models.predict_from_feature_table", 36 | "PredictFromTableFs = metapathpredict.MetaPathPredict:Models.predict_from_feature_table_fs_models" 37 | ] 38 | }, 39 | classifiers=CLASSIFIERS, 40 | include_package_data=True, 41 | #ext_modules=cythonize("src/metapathpredict/cpp_mods.pyx") 42 | ) 43 | -------------------------------------------------------------------------------- /package/src/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/.DS_Store -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: metapathpredict 3 | Version: 0.0.0 4 | Summary: Tool for predicting the presence or absence of KEGG modules in bacterial genomes 5 | Author: D. Geller-McGrath, K.M. Konwar, V.P. Edgcomb, M. Pachiadaki, J.W. Roddy, T.J. Wheeler, J.E. 
McDermott 6 | Author-email: dgellermcgrath@gmail.com, kishori82@gmail.com 7 | Classifier: Development Status :: 4 - Beta 8 | Classifier: Natural Language :: English 9 | Classifier: License :: OSI Approved :: BSD License 10 | Classifier: Operating System :: Linux, MacOS, Windows 11 | Classifier: Programming Language :: Python :: 3.10.6+ 12 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | MANIFEST.in 2 | setup.py 3 | src/metapathpredict/MetaPathPredict.py 4 | src/metapathpredict/__init__.py 5 | src/metapathpredict/download_models.py 6 | src/metapathpredict/utils.py 7 | src/metapathpredict.egg-info/PKG-INFO 8 | src/metapathpredict.egg-info/SOURCES.txt 9 | src/metapathpredict.egg-info/dependency_links.txt 10 | src/metapathpredict.egg-info/entry_points.txt 11 | src/metapathpredict.egg-info/requires.txt 12 | src/metapathpredict.egg-info/top_level.txt 13 | src/metapathpredict/data/__init__.py 14 | src/metapathpredict/data/labels.pkl 15 | src/metapathpredict/data/metapathmodules.pkl 16 | src/metapathpredict/data/model_0.keras 17 | src/metapathpredict/data/model_1.keras 18 | src/metapathpredict/data/requiredCols.pkl -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/entry_points.txt: -------------------------------------------------------------------------------- 1 | [console_scripts] 2 | DownloadModels = metapathpredict.download_models:Download.download_models 3 | MetaPathModules = metapathpredict.MetaPathPredict:Models.show_available_modules 4 | MetaPathPredict = metapathpredict.MetaPathPredict:Models.predict 5 | MetaPathTrain = metapathpredict.MetaPathPredict:Models.train 6 | PredictFromTable = metapathpredict.MetaPathPredict:Models.predict_from_feature_table 7 | PredictFromTableFs = metapathpredict.MetaPathPredict:Models.predict_from_feature_table_fs_models 8 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | scikit-learn>=1.1.3 2 | tensorflow>=2.10.0 3 | numpy>=1.23.4 4 | pandas>=1.5.2 5 | keras>=2.10.0 6 | torchvision>=0.15.2 7 | torch>=2.0.1 8 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | metapathpredict 2 | -------------------------------------------------------------------------------- /package/src/metapathpredict/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/.DS_Store -------------------------------------------------------------------------------- /package/src/metapathpredict/MetaPathPredict.py: -------------------------------------------------------------------------------- 1 | """ 2 | Command Line Interface for MetaPathPredict Tools: 3 | ==================================== 4 | 5 | .. 
currentmodule:: metapathpredict 6 | 7 | class methods: 8 | MetaPathPredict methods 9 | """ 10 | 11 | import logging 12 | import argparse 13 | import datetime 14 | import pickle 15 | import os 16 | import sys 17 | import re 18 | import math 19 | import importlib 20 | from typing import Iterable, List, Dict, Set, Optional, Sequence 21 | from itertools import chain 22 | 23 | # disable tensorflow info messages 24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 25 | 26 | import sklearn 27 | import numpy as np 28 | import pandas as pd 29 | import keras 30 | from torchvision import transforms 31 | import torch.optim as optim 32 | from torch.utils.data import Dataset, DataLoader, TensorDataset 33 | from sklearn.model_selection import train_test_split 34 | from sklearn.preprocessing import StandardScaler 35 | from sklearn.feature_selection import SelectKBest, f_classif 36 | from sklearn.metrics import classification_report 37 | import torch 38 | import torch.nn as nn 39 | 40 | import warnings 41 | from sklearn.exceptions import InconsistentVersionWarning 42 | warnings.filterwarnings(action='ignore', category=InconsistentVersionWarning) 43 | 44 | from metapathpredict.utils import InputData 45 | from metapathpredict.utils import AnnotationList 46 | 47 | 48 | # CUDA for PyTorch 49 | use_cuda = torch.cuda.is_available() 50 | device = torch.device("cuda:0" if use_cuda else "cpu") 51 | # device = "cpu" 52 | 53 | torch.backends.cudnn.benchmark = True 54 | 55 | # Parameters 56 | params = {"batch_size": 64, "shuffle": True, "num_workers": 6} 57 | 58 | #Configure the logging system 59 | logging.basicConfig( 60 | filename='metapathpredict.log', 61 | level=logging.INFO, 62 | format="%(asctime)s %(levelname)s %(module)s - %(message)s", 63 | datefmt="%Y-%m-%d %H:%M:%S") 64 | 65 | root = logging.getLogger() 66 | root.setLevel(logging.INFO) 67 | 68 | handler = logging.StreamHandler(sys.stdout) 69 | handler.setLevel(logging.INFO) 70 | formatter = logging.Formatter("%(asctime)s %(levelname)s %(module)s - %(message)s", 71 | "%Y-%m-%d %H:%M:%S") 72 | handler.setFormatter(formatter) 73 | root.addHandler(handler) 74 | 75 | 76 | 77 | class CustomDataset(Dataset): 78 | def __init__(self, data, targets, transform=None): 79 | print("type", type(data), data.shape) 80 | self.data = torch.tensor(data, dtype=torch.float32) 81 | self.targets = torch.tensor(targets, dtype=torch.float32) 82 | self.transform = transform 83 | 84 | def __len__(self): 85 | return len(self.data) 86 | 87 | def __getitem__(self, idx): 88 | features, target = self.data[idx], self.targets[idx] 89 | 90 | if self.transform: 91 | sample = self.transform(sample) 92 | 93 | return features, target 94 | 95 | 96 | 97 | class CustomModel(nn.Module): 98 | def __init__(self, num_hidden_nodes_per_layer=1024, num_hidden_layers=5): 99 | super(CustomModel, self).__init__() 100 | NUM_HIDDEN_NODES = num_hidden_nodes_per_layer 101 | self.NUM_HIDDEN_LAYERS = num_hidden_layers 102 | 103 | self.fc1 = nn.Linear(2000, NUM_HIDDEN_NODES) 104 | self.relu = nn.ReLU() 105 | self.dropout = nn.Dropout(0.1) 106 | 107 | # array of hidden layers 108 | self.fcs = [ 109 | nn.Linear(NUM_HIDDEN_NODES, NUM_HIDDEN_NODES) 110 | for i in range(num_hidden_layers) 111 | ] 112 | 113 | self.output_layer = nn.Linear(NUM_HIDDEN_NODES, 94) 114 | self.sigmoid = nn.Sigmoid() 115 | 116 | def forward(self, x): 117 | x = self.fc1(x) 118 | x = self.relu(x) 119 | x = self.dropout(x) 120 | 121 | for i in range(self.NUM_HIDDEN_LAYERS - 1): 122 | x = self.fcs[i](x) 123 | x = self.relu(x) 124 | x = self.dropout(x) 125 
| 126 | x = self.fcs[self.NUM_HIDDEN_LAYERS - 1](x) 127 | x = self.relu(x) 128 | 129 | x = self.output_layer(x) 130 | x = self.sigmoid(x) 131 | return x 132 | 133 | 134 | 135 | class Models: 136 | 137 | """Platform-agnostic command line functions available in MetaPathPredict tools.""" 138 | 139 | @classmethod 140 | def train(cls, args: Iterable[str] = None) -> int: 141 | """Train a model from the input data . 142 | 143 | Writes out a DNN model in the keras forma 144 | 145 | Parameters 146 | ---------- 147 | args : Iterable[str], optional 148 | value of None, when passed to `parser.parse_args` causes the parser to 149 | read `sys.argv` 150 | 151 | Returns 152 | ------- 153 | return_call : 0 154 | return call if the program completes successfully 155 | 156 | """ 157 | parser = argparse.ArgumentParser() 158 | 159 | parser.add_argument( 160 | "--train-targets", 161 | dest="train_targets", 162 | required=True, 163 | help="training targets file", 164 | ) 165 | parser.add_argument( 166 | "--train-features", 167 | dest="train_features", 168 | required=True, 169 | help="training features", 170 | ) 171 | parser.add_argument( 172 | "--num-epochs", 173 | dest="num_epochs", 174 | required=False, 175 | default=100, 176 | type=int, 177 | help="number of epochs", 178 | ) 179 | parser.add_argument( 180 | "--model-out", 181 | "-m", 182 | dest="model_out", 183 | required=True, 184 | help="model file name output", 185 | ) 186 | parser.add_argument( 187 | "--use-gpu", 188 | dest="use_gpu", 189 | required=False, 190 | action="store_true", 191 | help="use GPU if available", 192 | ) 193 | parser.add_argument( 194 | "--num-cores", 195 | dest="num_cores", 196 | required=False, 197 | default=10, 198 | type=int, 199 | help="Number of cores for parallel processing", 200 | ) 201 | neural_net_params = parser.add_argument_group("Neural Net parameters") 202 | neural_net_params.add_argument( 203 | "--num-hidden-layers", 204 | default=5, 205 | required=False, 206 | type=int, 207 | help="number of hidden layers", 208 | ) 209 | neural_net_params.add_argument( 210 | "--hidden-nodes-per-layer", 211 | type=int, 212 | required=False, 213 | default=1024, 214 | help="number of nodes in each hidden layer", 215 | ) 216 | neural_net_params.add_argument( 217 | "--num-features", 218 | dest="num_features", 219 | default=2000, 220 | required=False, 221 | type=int, 222 | help="number of features to retain from training data", 223 | ) 224 | neural_net_params.add_argument( 225 | "--threshold", 226 | dest="threshold", 227 | default=6432, 228 | required=False, 229 | type=float, 230 | help="threshold for SelectKBest feature selection", 231 | ) 232 | 233 | 234 | args = parser.parse_args() 235 | 236 | # CUDA for PyTorch 237 | device = "cpu" 238 | if args.use_gpu: 239 | use_cuda = torch.cuda.is_available() 240 | device = torch.device("cuda:0" if use_cuda else "cpu") 241 | 242 | logging.info(f"Using device: {device}") 243 | 244 | # read in features 245 | features = pd.read_table(args.train_features, compression="gzip") 246 | logging.info(f"reading input features of shape: {features.shape[0]} x {features.shape[1]}") 247 | 248 | # read in labels 249 | targets = pd.read_table(args.train_targets, compression="gzip") 250 | logging.info(f"reading input labels of shape: {targets.shape[0]} x {targets.shape[1]}") 251 | 252 | # split the data into training and test sets 253 | test_size = 0.25 254 | x, x_test, y, y_test = train_test_split( 255 | features, 256 | targets, 257 | stratify=targets, 258 | shuffle=True, 259 | test_size= test_size, 260 | 
random_state=111, 261 | ) 262 | logging.info(f"creating test size of: {test_size}%") 263 | 264 | # Split the remaining data to train and validation 265 | x_train, x_val, y_train, y_val = train_test_split( 266 | x, y, stratify=y, test_size=0.2, shuffle=True, random_state=111 267 | ) 268 | 269 | print("features size", features.shape) 270 | print("targets size", targets.shape) 271 | 272 | print("x_test", x_test.shape, " y_test ", y_test.shape) 273 | print("x", x.shape, " y ", y.shape) 274 | 275 | print("x_train", x_train.shape, " y_train ", y_train.shape) 276 | print("x_val", x_val.shape, " y_val ", y_val.shape) 277 | print("x_test", x_test.shape, " y_test ", y_test.shape) 278 | 279 | 280 | 281 | # Initialize the StandardScaler 282 | scaler = StandardScaler() 283 | 284 | # Fit the scaler to training data and transform it 285 | # and then transform val and test data w/ the fitted scaler object 286 | # (std. dev., variance, etc. are based on training data columns) 287 | scaled_features = scaler.fit_transform(x_train) 288 | x_train = pd.DataFrame(scaled_features, index = x_train.index, columns = x_train.columns) 289 | x_val = pd.DataFrame(scaler.transform(x_val), index = x_val.index, columns = x_val.columns) 290 | x_test = pd.DataFrame(scaler.transform(x_test), index = x_test.index, columns = x_test.columns) 291 | logging.info(f"normalizing the training input features") 292 | 293 | 294 | 295 | # feature selection based only on the training data 296 | # Select features according to the k highest F-values 297 | # from running ANOVA on y_train and x_train 298 | selected_features = [] 299 | for label in y_train: 300 | selector = SelectKBest(f_classif, k = 'all') 301 | selector.fit(x_train, y_train[label]) 302 | selected_features.append(list(selector.scores_)) 303 | 304 | # select threshold that retains 2000 features 305 | threshold = args.threshold 306 | 307 | # # MeanCS 308 | logging.info(f"total number of features in input: {x_train.shape[1]}") 309 | selected_features2 = np.mean(selected_features, axis = 0) > threshold 310 | logging.info(f"number of features selected for training: {sum(selected_features2)}") 311 | 312 | # create new training, validation, and test datasets retaining only the 2000 top features 313 | # determined from the training data 314 | x_train2 = x_train.loc[:, selected_features2] 315 | x_val2 = x_val.loc[:, selected_features2] 316 | x_test2 = x_test.loc[:, selected_features2] 317 | features_used = x_train2.columns.values 318 | labels_used = y_val.columns.values 319 | 320 | logging.info(f"Using features : {str(features_used)}") 321 | logging.info(f"Using labels : {str(labels_used)}") 322 | 323 | # Initialize the StandardScaler 324 | #scaler = StandardScaler() 325 | 326 | # Fit the scaler to your data and transform it 327 | #x_train2 = scaler.fit_transform(x_train2) 328 | #x_val2 = scaler.fit_transform(x_val2) 329 | #logging.info(f"normalizing the training input features") 330 | 331 | y_train = np.asarray(y_train.values) 332 | y_val = np.asarray(y_val.values) 333 | 334 | print() 335 | print("x_train2", x_train2.shape) 336 | print("x_val2", x_val2.shape) 337 | print("x_test2", x_test2.shape) 338 | 339 | # outline the neural network architecture - multilable classifier 340 | # 1 input layer, 5 hidden layers, 1 output layer 341 | # inclue dropout for all hidden layers 342 | model = CustomModel( 343 | num_hidden_nodes_per_layer=args.hidden_nodes_per_layer, 344 | num_hidden_layers=args.num_hidden_layers, 345 | ).to(device) 346 | 347 | # Define loss function and optimizer 348 | 
criterion = nn.BCELoss() 349 | optimizer = optim.Adam(model.parameters(), lr=0.001) 350 | logging.info(f"optimizer Adam with learning rate: 0.001") 351 | 352 | # Define early stopping 353 | early_stopping = torch.optim.lr_scheduler.ReduceLROnPlateau( 354 | optimizer, "min", patience=10 355 | ) 356 | 357 | # Create an empty transform 358 | no_transform = transforms.Compose([]) 359 | 360 | # dataset DataLoader 361 | x_train2 = np.asarray(x_train2) 362 | x_val2 = np.asarray(x_val2) 363 | print("xtrain2", x_train2.shape, y_train.shape) 364 | 365 | logging.info(f"loading training dataset into dataloader") 366 | dataset = CustomDataset(data=x_train2, targets=y_train, transform=None) 367 | 368 | batch_size = 10000 369 | train_data_loader = DataLoader( 370 | dataset, batch_size=batch_size, num_workers=args.num_cores, shuffle=True 371 | ) 372 | 373 | logging.info(f"loading testing dataset into dataloader") 374 | val_dataset = CustomDataset(data=x_val2, targets=y_val, transform=None) 375 | val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True) 376 | 377 | # Train the model 378 | num_epochs = args.num_epochs 379 | logging.info(f"number of epochs for training: {num_epochs}") 380 | for epoch in range(num_epochs): 381 | model.train() 382 | train_loss = 0.0 383 | 384 | for inputs, targets in train_data_loader: 385 | inputs, targets = inputs.to(device), targets.to(device) 386 | optimizer.zero_grad() 387 | outputs = model(inputs) 388 | loss = criterion(outputs, targets) 389 | 390 | loss.backward() 391 | optimizer.step() 392 | train_loss += loss.item() 393 | 394 | model.eval() 395 | val_loss = 0.0 396 | with torch.no_grad(): 397 | for inputs, targets in val_data_loader: 398 | inputs, targets = inputs.to(device), targets.to(device) 399 | outputs = model(inputs) 400 | loss = criterion(outputs, targets) 401 | val_loss += loss.item() 402 | 403 | # Update learning rate using early stopping 404 | early_stopping.step(val_loss) 405 | 406 | logging.info( 407 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 408 | ) 409 | 410 | print( 411 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 412 | ) 413 | 414 | # assess the model on test data 415 | x_test2 = np.asarray(x_test2) 416 | x_test2 = torch.tensor(x_test2, dtype=torch.float32) 417 | logging.info(f"converting test inputs to torch.tensor") 418 | 419 | predictions_test = model(x_test2) 420 | 421 | # round predictions 422 | roundedTestPreds = np.round(predictions_test.detach().numpy()) 423 | 424 | # print out performance metrics 425 | print(classification_report(y_test.values, roundedTestPreds)) 426 | 427 | logging.info(f"Training finished successfully!") 428 | 429 | model_file = {} 430 | model_file["description"] = "neural net trained for predicting multilabels" 431 | model_file["features"] = features_used 432 | model_file["labels"] = labels_used 433 | model_file["model"] = model 434 | torch.save(model_file, args.model_out) 435 | logging.info(f"writing model file: {args.model_out}") 436 | 437 | 438 | 439 | @classmethod 440 | def predict(cls, args: Iterable[str] = None) -> int: 441 | """Predict the presence or absence of select KEGG modules on bacterial 442 | annotation data. 
443 | 444 | Parameters 445 | ---------- 446 | args : Iterable[str], optional 447 | value of None, when passed to `parser.parse_args` causes the parser to 448 | read `sys.argv` 449 | 450 | Returns 451 | ------- 452 | return_call : 0 453 | return call if the program completes successfully 454 | 455 | """ 456 | 457 | # disable tensorflow info messages 458 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 459 | 460 | parser = argparse.ArgumentParser() 461 | 462 | parser.add_argument( 463 | "--input", 464 | "-i", 465 | action = "extend", 466 | nargs = "+", 467 | dest="input", 468 | required=True, 469 | help="input file path(s) and name(s) [required]", 470 | ) 471 | parser.add_argument( 472 | "--annotation-format", 473 | "-a", 474 | dest="annotation_format", 475 | required=True, 476 | help="annotation format (kofamscan, kofamscan-web, dram, or koala) [default: kofamscan]", 477 | ) 478 | parser.add_argument( 479 | "--kegg-modules", 480 | "-k", 481 | dest="kegg_modules", 482 | required=False, 483 | default=None, 484 | action="extend", 485 | nargs="+", 486 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 487 | ) 488 | parser.add_argument( 489 | "--output", 490 | "-o", 491 | dest="output", 492 | required=True, 493 | help="output file path and name [required]", 494 | ) 495 | 496 | args = parser.parse_args() 497 | 498 | module_dir = importlib.resources.files('metapathpredict') 499 | data_dir = module_dir.joinpath("data/") 500 | 501 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 502 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 503 | 504 | model_0_path = module_dir.joinpath("data/model_0.keras") 505 | model_1_path = module_dir.joinpath("data/model_1.keras") 506 | 507 | labels_path = module_dir.joinpath("data/labels.pkl") 508 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl") 509 | 510 | # with open(scaler_0_path, "rb") as f: 511 | # model_0_scaler = pickle.load(f) 512 | # 513 | # with open(scaler_1_path, "rb") as f: 514 | # model_1_scaler = pickle.load(f) 515 | 516 | with open(labels_path, "rb") as f: 517 | labels = pickle.load(f) 518 | 519 | with open(requiredCols_path, "rb") as f: 520 | requiredCols = pickle.load(f) 521 | 522 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 523 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 524 | 525 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 526 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 527 | 528 | # logging.info(f"Reading model files from directory: {data_dir}") 529 | # logging.info(f"Reading scaler files from directory: {data_dir}") 530 | 531 | 532 | # load the input features 533 | files_list = InputData(files = args.input) 534 | 535 | if args.annotation_format == "kofamscan": 536 | files_list.read_kofamscan_detailed_tsv() 537 | 538 | elif args.annotation_format == "kofamkoala": 539 | files_list.read_kofamkoala() 540 | 541 | elif args.annotation_format == "dram": 542 | files_list.read_dram_annotation_tsv() 543 | 544 | elif args.annotation_format == "koala": 545 | files_list.read_koala_tsv() 546 | 547 | else: 548 | logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 549 | sys.exit(0) 550 | 551 | logging.info(f"Reading input files with format: {args.annotation_format}") 552 | 553 | # model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_) 554 | # model_1_cols = 
np.ndarray.tolist(model_1_scaler.feature_names_in_) 555 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 556 | 557 | reqColsAll = requiredCols 558 | 559 | input_features = AnnotationList( 560 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2 561 | requiredColumnsModel0 = "blank", #model_0_scaler.feature_names_in_, # add list of all required columns for model #1 562 | requiredColumnsModel1 = "blank", #model_1_scaler.feature_names_in_, # add list of all required columns for model #2 563 | annotations = files_list.annotations) 564 | 565 | input_features.create_feature_df() 566 | input_features.check_feature_columns() 567 | # input_features.select_model_features() 568 | # input_features.transform_model_features(model_0_scaler, model_1_scaler) 569 | 570 | logging.info("Making KEGG module presence/absence predictions") 571 | 572 | predictions_list = [] 573 | for prediction_iteration in range(2): 574 | 575 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 576 | 577 | # predict 578 | #predictions = models[x]['model'](features) 579 | logging.info(f"Model {prediction_iteration} is making predictions") 580 | predictions = models[prediction_iteration].predict(input_features.feature_df[prediction_iteration]) 581 | 582 | # round predictions 583 | #roundedPreds = np.round(predictions.detach().numpy()) 584 | roundedPreds = np.round(predictions) 585 | 586 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 587 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[prediction_iteration]).astype(int) 588 | 589 | predictions_list.append(predsDf) 590 | 591 | logging.info(f"Model {prediction_iteration} completed making predictions") 592 | 593 | logging.info("All done.") 594 | 595 | out_df = pd.concat(predictions_list, axis = 1) 596 | 597 | if args.kegg_modules is not None: 598 | if all(modules in out_df.columns for modules in args.kegg_modules): 599 | out_df = out_df[args.kegg_modules] 600 | else: 601 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 602 | 603 | out_df.insert(loc = 0, column = 'file', value = args.input) 604 | 605 | logging.info(f"Writing output to file: {args.output}") 606 | out_df.to_csv(args.output, sep='\t', index=None) 607 | 608 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 609 | 610 | 611 | 612 | @classmethod 613 | def show_available_modules(cls, args: Iterable[str] = None) -> int: 614 | 615 | """List available KEGG modules for presence/absence prediction. 
616 | 617 | Parameters 618 | ---------- 619 | args : Iterable[str], optional 620 | value of None, when passed to `parser.parse_args` causes the parser to 621 | read `sys.argv` 622 | 623 | Returns 624 | ------- 625 | return_call : 0 626 | return call if the program completes successfully 627 | 628 | """ 629 | 630 | module_dir = importlib.resources.files('metapathpredict') 631 | 632 | metapathmodules_path = module_dir.joinpath("data/metapathmodules.pkl") 633 | 634 | with open(metapathmodules_path, "rb") as f: 635 | metapathmodules = pickle.load(f) 636 | 637 | pd.set_option('display.max_rows', None) 638 | pd.set_option('max_colwidth', None) 639 | 640 | print(metapathmodules) 641 | 642 | 643 | 644 | @classmethod 645 | def predict_from_feature_table(cls, args: Iterable[str] = None) -> int: 646 | """Predict the presence or absence of select KEGG modules on bacterial 647 | annotation data -- from an input feature table of KEGG K numbers 648 | 649 | Parameters 650 | ---------- 651 | args : Iterable[str], optional 652 | value of None, when passed to `parser.parse_args` causes the parser to 653 | read `sys.argv` 654 | 655 | Returns 656 | ------- 657 | return_call : 0 658 | return call if the program completes successfully 659 | 660 | """ 661 | 662 | # disable tensorflow info messages 663 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 664 | 665 | parser = argparse.ArgumentParser() 666 | 667 | parser.add_argument( 668 | "--input", 669 | "-i", 670 | dest="input", 671 | required=True, 672 | help="input file path(s) and name(s) [required]", 673 | ) 674 | parser.add_argument( 675 | "--kegg-modules", 676 | "-k", 677 | dest="kegg_modules", 678 | required=False, 679 | default=None, 680 | action="extend", 681 | nargs="+", 682 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 683 | ) 684 | parser.add_argument( 685 | "--output", 686 | "-o", 687 | dest="output", 688 | required=True, 689 | help="output file path and name [required]", 690 | ) 691 | 692 | args = parser.parse_args() 693 | 694 | module_dir = importlib.resources.files('metapathpredict') 695 | data_dir = module_dir.joinpath("data/") 696 | 697 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 698 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 699 | 700 | model_0_path = module_dir.joinpath("data/model_0.keras") 701 | model_1_path = module_dir.joinpath("data/model_1.keras") 702 | 703 | labels_path = module_dir.joinpath("data/labels.pkl") 704 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl") 705 | 706 | # with open(scaler_0_path, "rb") as f: 707 | # model_0_scaler = pickle.load(f) 708 | # 709 | # with open(scaler_1_path, "rb") as f: 710 | # model_1_scaler = pickle.load(f) 711 | 712 | with open(labels_path, "rb") as f: 713 | labels = pickle.load(f) 714 | 715 | with open(requiredCols_path, "rb") as f: 716 | requiredCols = pickle.load(f) 717 | 718 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 719 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 720 | 721 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 722 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 723 | 724 | # logging.info(f"Reading model files from directory: {data_dir}") 725 | # logging.info(f"Reading scaler files from directory: {data_dir}") 726 | 727 | 728 | # load the input features 729 | features = pd.read_csv(args.input, sep = "\t") 730 | # files_list = InputData(files = args.input) 731 | # 732 | 
# if args.annotation_format == "kofamscan": 733 | # files_list.read_kofamscan_detailed_tsv() 734 | # 735 | # elif args.annotation_format == "kofamkoala": 736 | # files_list.read_kofamkoala() 737 | # 738 | # elif args.annotation_format == "dram": 739 | # files_list.read_dram_annotation_tsv() 740 | # 741 | # elif args.annotation_format == "koala": 742 | # files_list.read_koala_tsv() 743 | # 744 | # else: 745 | # logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 746 | # sys.exit(0) 747 | # 748 | # logging.info(f"Reading input files with format: {args.annotation_format}") 749 | 750 | # model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_) 751 | # model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_) 752 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 753 | 754 | #reqColsAll = np.ndarray.tolist(model_0_scaler.feature_names_in_) 755 | reqColsAll = requiredCols 756 | 757 | input_features = AnnotationList( 758 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2 759 | requiredColumnsModel0 = "blank", # add list of all required columns for model #1 760 | requiredColumnsModel1 = "blank", # add list of all required columns for model #2 761 | annotations = "blank") 762 | 763 | #input_features.create_feature_df() 764 | input_features.feature_df = features 765 | input_features.check_feature_columns() 766 | # input_features.select_model_features() 767 | # input_features.transform_model_features(model_0_scaler, model_1_scaler) 768 | 769 | logging.info("Making KEGG module presence/absence predictions") 770 | 771 | predictions_list = [] 772 | for x in range(2): 773 | 774 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 775 | 776 | # predict 777 | #predictions = models[x]['model'](features) 778 | logging.info(f"Model {x} is making predictions") 779 | predictions = models[x].predict(input_features.feature_df[x]) 780 | 781 | # round predictions 782 | #roundedPreds = np.round(predictions.detach().numpy()) 783 | roundedPreds = np.round(predictions) 784 | 785 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 786 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int) 787 | 788 | predictions_list.append(predsDf) 789 | 790 | logging.info(f"Model {x} completed making predictions") 791 | 792 | logging.info("All done.") 793 | 794 | out_df = pd.concat(predictions_list, axis = 1) 795 | 796 | if args.kegg_modules is not None: 797 | if all(modules in out_df.columns for modules in args.kegg_modules): 798 | out_df = out_df[args.kegg_modules] 799 | else: 800 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 801 | 802 | out_df.insert(loc = 0, column = 'file', value = args.input) 803 | 804 | logging.info(f"Writing output to file: {args.output}") 805 | out_df.to_csv(args.output, sep='\t', index=None) 806 | 807 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 808 | 809 | 810 | 811 | @classmethod 812 | def predict_from_feature_table_fs_models(cls, args: Iterable[str] = None) -> int: 813 | """Predict the presence or absence of select KEGG modules on bacterial 814 | annotation data -- from an input feature table of KEGG K numbers 815 | 816 | Parameters 817 | ---------- 818 | args : Iterable[str], optional 819 | value of None, when passed to `parser.parse_args` causes the parser to 
820 | read `sys.argv` 821 | 822 | Returns 823 | ------- 824 | return_call : 0 825 | return call if the program completes successfully 826 | 827 | """ 828 | 829 | # disable tensorflow info messages 830 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 831 | 832 | parser = argparse.ArgumentParser() 833 | 834 | parser.add_argument( 835 | "--input", 836 | "-i", 837 | dest="input", 838 | required=True, 839 | help="input file path(s) and name(s) [required]", 840 | ) 841 | parser.add_argument( 842 | "--kegg-modules", 843 | "-k", 844 | dest="kegg_modules", 845 | required=False, 846 | default=None, 847 | action="extend", 848 | nargs="+", 849 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 850 | ) 851 | parser.add_argument( 852 | "--output", 853 | "-o", 854 | dest="output", 855 | required=True, 856 | help="output file path and name [required]", 857 | ) 858 | 859 | args = parser.parse_args() 860 | 861 | module_dir = importlib.resources.files('metapathpredict') 862 | data_dir = module_dir.joinpath("data/") 863 | 864 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 865 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 866 | 867 | model_0_path = module_dir.joinpath("data/model_0.keras") 868 | model_1_path = module_dir.joinpath("data/model_1.keras") 869 | 870 | labels_path = module_dir.joinpath("data/labels.pkl") 871 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl") 872 | 873 | requiredColumnsModel0_path = module_dir.joinpath("data/requiredColumnsModel0.pkl") 874 | requiredColumnsModel1_path = module_dir.joinpath("data/requiredColumnsModel1.pkl") 875 | 876 | # with open(scaler_0_path, "rb") as f: 877 | # model_0_scaler = pickle.load(f) 878 | # 879 | # with open(scaler_1_path, "rb") as f: 880 | # model_1_scaler = pickle.load(f) 881 | 882 | with open(labels_path, "rb") as f: 883 | labels = pickle.load(f) 884 | 885 | with open(requiredCols_path, "rb") as f: 886 | requiredCols = pickle.load(f) 887 | 888 | with open(requiredColumnsModel0_path, "rb") as f: 889 | model_0_features = pickle.load(f) 890 | 891 | with open(requiredColumnsModel1_path, "rb") as f: 892 | model_1_features = pickle.load(f) 893 | 894 | 895 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 896 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 897 | 898 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 899 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 900 | 901 | # logging.info(f"Reading model files from directory: {data_dir}") 902 | # logging.info(f"Reading scaler files from directory: {data_dir}") 903 | 904 | 905 | # load the input features 906 | features = pd.read_csv(args.input, sep = "\t") 907 | # files_list = InputData(files = args.input) 908 | # 909 | # if args.annotation_format == "kofamscan": 910 | # files_list.read_kofamscan_detailed_tsv() 911 | # 912 | # elif args.annotation_format == "kofamkoala": 913 | # files_list.read_kofamkoala() 914 | # 915 | # elif args.annotation_format == "dram": 916 | # files_list.read_dram_annotation_tsv() 917 | # 918 | # elif args.annotation_format == "koala": 919 | # files_list.read_koala_tsv() 920 | # 921 | # else: 922 | # logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 923 | # sys.exit(0) 924 | # 925 | # logging.info(f"Reading input files with format: {args.annotation_format}") 926 | 927 | # model_0_cols = 
np.ndarray.tolist(model_0_scaler.feature_names_in_) 928 | # model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_) 929 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 930 | 931 | #reqColsAll = np.ndarray.tolist(model_0_scaler.feature_names_in_) 932 | reqColsAll = requiredCols 933 | 934 | input_features = AnnotationList( 935 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2 936 | requiredColumnsModel0 = model_0_features, # add list of all required columns for model #1 937 | requiredColumnsModel1 = model_1_features, # add list of all required columns for model #2 938 | annotations = "blank") 939 | 940 | #input_features.create_feature_df() 941 | input_features.feature_df = features 942 | input_features.check_feature_columns() 943 | input_features.select_model_features() 944 | # input_features.transform_model_features(model_0_scaler, model_1_scaler) 945 | 946 | logging.info("Making KEGG module presence/absence predictions") 947 | 948 | predictions_list = [] 949 | for x in range(2): 950 | 951 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 952 | 953 | # predict 954 | #predictions = models[x]['model'](features) 955 | logging.info(f"Model {x} is making predictions") 956 | predictions = models[x].predict(input_features.feature_df[x]) 957 | 958 | # round predictions 959 | #roundedPreds = np.round(predictions.detach().numpy()) 960 | roundedPreds = np.round(predictions) 961 | 962 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 963 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int) 964 | 965 | predictions_list.append(predsDf) 966 | 967 | logging.info(f"Model {x} completed making predictions") 968 | 969 | logging.info("All done.") 970 | 971 | out_df = pd.concat(predictions_list, axis = 1) 972 | 973 | if args.kegg_modules is not None: 974 | if all(modules in out_df.columns for modules in args.kegg_modules): 975 | out_df = out_df[args.kegg_modules] 976 | else: 977 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 978 | 979 | out_df.insert(loc = 0, column = 'file', value = args.input) 980 | 981 | logging.info(f"Writing output to file: {args.output}") 982 | out_df.to_csv(args.output, sep='\t', index=None) 983 | 984 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 985 | -------------------------------------------------------------------------------- /package/src/metapathpredict/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/__init__.py -------------------------------------------------------------------------------- /package/src/metapathpredict/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/__init__.py -------------------------------------------------------------------------------- /package/src/metapathpredict/data/labels.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/labels.pkl 
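Note on the feature-table entry points defined in cmdline_models.py above: `predict_from_feature_table` and `predict_from_feature_table_fs_models` read the `--input` file with `pd.read_csv(..., sep="\t")` and hand it straight to `AnnotationList.check_feature_columns`, so the table is assumed to be a genome-by-KO presence/absence matrix like the one `create_feature_df` builds from annotations. The sketch below shows one way such a table might be produced; the K numbers and file name are illustrative placeholders, not values shipped with the package.

```python
import pandas as pd

# Hypothetical example of the layout check_feature_columns expects:
# rows are genomes, columns are KEGG K numbers, values are 0 (absent) or 1 (present).
feature_table = pd.DataFrame(
    [[1, 0, 1],
     [0, 1, 1]],
    columns=["K00001", "K00002", "K00003"],  # placeholder K numbers
)

# Write a tab-separated file to pass via --input. Required K-number columns that
# are missing from the table are added as all-zero columns by check_feature_columns,
# and columns the models do not use are dropped.
feature_table.to_csv("feature_table.tsv", sep="\t", index=False)
```

Because these entry points take a single `--input` table per run, every output row is labeled with that one input path rather than a per-genome identifier from the table itself.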
-------------------------------------------------------------------------------- /package/src/metapathpredict/data/metapathmodules.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/metapathmodules.pkl -------------------------------------------------------------------------------- /package/src/metapathpredict/data/requiredCols.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/requiredCols.pkl -------------------------------------------------------------------------------- /package/src/metapathpredict/download_models.py: -------------------------------------------------------------------------------- 1 | #import pyxet 2 | import importlib 3 | import shutil 4 | from importlib import resources 5 | from huggingface_hub import hf_hub_download 6 | 7 | 8 | class Download: 9 | """Functions to download MetaPathPredict's machine learning models""" 10 | 11 | @classmethod 12 | def download_models(cls): 13 | """Downloads MetaPathPredict's models. 14 | 15 | Returns: 16 | None 17 | 18 | """ 19 | print("Downloading MetaPathPredict models...") 20 | module_dir = resources.files('metapathpredict') 21 | data_dir = module_dir.joinpath("data/") 22 | # model_0_dl_path = "xet://dgellermcgrath/MetaPathPredict/main/package/src/metapathpredict/data/model_0.keras" 23 | # model_1_dl_path = "xet://dgellermcgrath/MetaPathPredict/main/package/src/metapathpredict/data/model_1.keras" 24 | model_0_install_path = module_dir.joinpath("data/MetaPathPredict_model_0.keras") 25 | model_1_install_path = module_dir.joinpath("data/MetaPathPredict_model_1.keras") 26 | 27 | model_0_renamed_dir_path = module_dir.joinpath("data/model_0.keras_directory") 28 | model_1_renamed_dir_path = module_dir.joinpath("data/model_1.keras_directory") 29 | 30 | model_0_initial_path = module_dir.joinpath("data/model_0.keras_directory/MetaPathPredict_model_0.keras") 31 | model_1_initial_path = module_dir.joinpath("data/model_1.keras_directory/MetaPathPredict_model_1.keras") 32 | 33 | model_0_final_path = module_dir.joinpath("data/model_0.keras") 34 | model_1_final_path = module_dir.joinpath("data/model_1.keras") 35 | 36 | download_destination = module_dir.joinpath("data/") 37 | 38 | hf_hub_download(repo_id="dgellermcgrath/MetaPathPredict", filename="MetaPathPredict_model_0.keras", local_dir=model_0_install_path, force_download=True) 39 | hf_hub_download(repo_id="dgellermcgrath/MetaPathPredict", filename="MetaPathPredict_model_1.keras", local_dir=model_1_install_path, force_download=True) 40 | 41 | # rename the model directories downloaded from HuggingFace 42 | shutil.move(model_0_install_path, model_0_renamed_dir_path) 43 | shutil.move(model_1_install_path, model_1_renamed_dir_path) 44 | 45 | # move the models out of their directories and rename them 46 | shutil.move(model_0_initial_path, model_0_final_path) 47 | shutil.move(model_1_initial_path, model_1_final_path) 48 | 49 | # remove the directories downloaded from HuggingFace 50 | shutil.rmtree(model_0_renamed_dir_path) 51 | shutil.rmtree(model_1_renamed_dir_path) 52 | 53 | # fs = pyxet.XetFS() # fsspec filesystem 54 | # fs.get(model_0_dl_path, str(model_0_install_path)) 55 | # fs.get(model_1_dl_path, str(model_1_install_path)) 56 | print("Models were downloaded to: " 
+ str(download_destination)) 57 | print("All done. Use MetaPathPredict -h to see how to make predictions.") 58 | -------------------------------------------------------------------------------- /package/src/metapathpredict/utils.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import re 3 | import gzip 4 | import numpy as np 5 | import pandas as pd 6 | 7 | 8 | class InputData: 9 | 10 | """Data parsing functions of input data""" 11 | 12 | 13 | def __init__(self, files, annotations = []): 14 | self.files = files 15 | self.annotations = annotations 16 | 17 | def read_kofamscan_detailed_tsv(self): 18 | """Reads in multiple .tsv files, each with columns: 0: "surpassed_threshold", 19 | 1: 'gene_name', 2: "k_number", 3: "adaptive_threshold", 4: "score", 20 | 5: "evalue", 6: "definition". Keeps only rows where "surpassed_threshold" is 21 | equal to "*". When there are duplicate values in "gene name", keeps the 22 | row containing the highest value in the "score" column. If column "gene name" 23 | contains multiple rows with the same maximum value, calculates the 24 | score-to-adaptive-threshold ratio, and picks the annotation with the highest 25 | ratio. 26 | 27 | Returns: 28 | A list of lists, where each inner list is the annotation data from one file. 29 | """ 30 | 31 | if type(self.files) is str: 32 | self.files = [self.files] 33 | 34 | for file in self.files: 35 | lines = [] 36 | 37 | if file.endswith(".gz"): 38 | with gzip.open(file, "rb") as f: 39 | for row in f: 40 | if row.decode().split("\t")[0] == "*": 41 | lines.append(row.decode().split("\t")) 42 | else: 43 | with open(file, "rb") as f: 44 | for row in f: 45 | if row.decode().split("\t")[0] == "*": 46 | lines.append(row.decode().split("\t")) 47 | 48 | data = pd.DataFrame(lines) 49 | data.rename(columns={0: "surpassed_threshold", 1: 'gene_identifier', 50 | 2: "k_number", 3: "adaptive_threshold", 4: "score", 51 | 5: "evalue", 6: "definition"}, inplace=True) 52 | 53 | data[["adaptive_threshold", "score", "evalue"]] = data[["adaptive_threshold", "score", "evalue"]].apply(pd.to_numeric, axis = 1) 54 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["score"] == group["score"].max()]).reset_index(level = 0, drop = True) 55 | 56 | data["group_size"] = data.groupby(["gene_identifier"]).transform("size") 57 | 58 | if data["group_size"].max() > 1: 59 | n_genes = (data[['gene_identifier', 'group_size']].drop_duplicates()['group_size'] > 1).sum() 60 | print(f"""{n_genes} gene(s) contained multiple annotations that surpassed the adaptive threshold. 61 | Picking the annotation with the highest score-to-adaptive_threshold ratio for these genes.""") 62 | 63 | data["ratio"] = data["score"] / data["adaptive_threshold"] 64 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["ratio"] == group["ratio"].max()]).reset_index(level = 0, drop = True) 65 | 66 | data = data.drop(["ratio"], axis = 1) 67 | 68 | data["file_name"] = file 69 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 70 | 71 | self.annotations.append(data) 72 | 73 | 74 | 75 | 76 | def read_kofamkoala(self): 77 | """Reads in multiple .tsv files, each with columns: 0: "gene_identifier", 78 | 1: 'k_number', 2: "adaptive_threshold", 3: "score", 4: "evalue", 79 | 5: "definition", 6: "definition_2". Keeps only rows where 80 | "surpassed_threshold" is equal to "*". 
When there are duplicate values in 81 | "gene name", keeps the row containing the highest value in the "score" 82 | column. If column "gene name" contains multiple rows with the same maximum 83 | value, calculates the score-to-adaptive-threshold ratio, and picks the 84 | annotation with the highest ratio. 85 | 86 | Returns: 87 | A list of lists, where each inner list is the annotation data from one file. 88 | """ 89 | 90 | if type(self.files) is str: 91 | self.files = [self.files] 92 | 93 | for file in self.files: 94 | lines = [] 95 | 96 | if file.endswith(".gz"): 97 | with gzip.open(file, "rb") as f: 98 | for row in f: 99 | if row.decode().split("\t")[0] == "gene": 100 | continue 101 | elif row.decode().split("\t")[3] == "-": 102 | continue 103 | elif row.decode().split("\t")[2] == "-": 104 | if float(row.decode().split("\t")[4]) <= 1e-50: 105 | lines.append(row.decode().split("\t")) 106 | else: 107 | continue 108 | else: 109 | if float(row.decode().split("\t")[3]) > float(row.decode().split("\t")[2]): 110 | lines.append(row.decode().split("\t")) 111 | else: 112 | with open(file, "rb") as f: 113 | for row in f: 114 | if row.decode().split("\t")[0] == "gene": 115 | continue 116 | elif row.decode().split("\t")[3] == "-": 117 | continue 118 | elif row.decode().split("\t")[2] == "-": 119 | if float(row.decode().split("\t")[4]) <= 1e-50: 120 | lines.append(row.decode().split("\t")) 121 | else: 122 | continue 123 | else: 124 | if float(row.decode().split("\t")[3]) > float(row.decode().split("\t")[2]): 125 | lines.append(row.decode().split("\t")) 126 | 127 | data = pd.DataFrame(lines) 128 | data.rename(columns={0: "gene_identifier", 1: 'k_number', 129 | 2: "adaptive_threshold", 3: "score", 4: "evalue", 130 | 5: "definition", 6: "definition_2"}, inplace=True) 131 | 132 | data.loc[data["adaptive_threshold"] == "-", "adaptive_threshold"] = 1 133 | 134 | data[["adaptive_threshold", "score", "evalue"]] = data[["adaptive_threshold", "score", "evalue"]].apply(pd.to_numeric, axis = 1) 135 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["score"] == group["score"].max()]).reset_index(level = 0, drop = True) 136 | 137 | data["group_size"] = data.groupby(["gene_identifier"]).transform("size") 138 | 139 | if data["group_size"].max() > 1: 140 | n_genes = (data[['gene_identifier', 'group_size']].drop_duplicates()['group_size'] > 1).sum() 141 | print(f"""{n_genes} gene(s) contained multiple annotations that surpassed the adaptive threshold. 142 | Picking the annotation with the highest score-to-adaptive_threshold ratio for these genes.""") 143 | 144 | data["ratio"] = data["score"] / data["adaptive_threshold"] 145 | data = data.groupby("gene_identifier", group_keys = False).apply(lambda group: group.loc[group["ratio"] == group["ratio"].max()]).reset_index(level = 0, drop = True) 146 | 147 | data = data.drop(["ratio"], axis = 1) 148 | 149 | data["file_name"] = file 150 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 151 | 152 | self.annotations.append(data) 153 | 154 | 155 | 156 | def read_dram_annotation_tsv(self): 157 | """Reads in multiple DRAM annotation.tsv files, keeping the "gene_identifier" 158 | as column 0, "k_number"" as column 1, and "definition" as column 2. Keeps 159 | only rows where a gene had a KEGG Ortholog annotation. 160 | 161 | Returns: 162 | A list of lists, where each inner list is the annotation data from one file. 
163 | """ 164 | 165 | pattern = "K[0-9]{5}" 166 | 167 | if type(self.files) is str: 168 | self.files = [self.files] 169 | 170 | for file in self.files: 171 | lines = [] 172 | if file.endswith(".gz"): 173 | with gzip.open(file, "rb") as f: 174 | for row in f: 175 | if re.match(pattern, row.decode().split("\t")[8]): 176 | lines.append(row.decode().split("\t")) 177 | else: 178 | with open(file, "rb") as f: 179 | for row in f: 180 | if re.match(pattern, row.decode().split("\t")[8]): 181 | lines.append(row.decode().split("\t")) 182 | 183 | data = pd.DataFrame(lines)[[0,8,9]] 184 | data.rename(columns={0: "gene_identifier", 8: 'k_number', 185 | 9: "definition"}, inplace=True) 186 | data["file_name"] = file 187 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 188 | 189 | 190 | self.annotations.append(data) 191 | 192 | 193 | 194 | def read_koala_tsv(self): 195 | """Reads in multiple blastKoala or ghostKoala .tsv files, keeping the 196 | "gene_identifier" as column 0, "k_number"" as column 1, and "definition" as 197 | column 2. Keeps only rows where a gene had a KEGG Ortholog annotation. 198 | 199 | Returns: 200 | A list of lists, where each inner list is the annotation data from one file. 201 | """ 202 | 203 | pattern = "K[0-9]{5}" 204 | 205 | if type(self.files) is str: 206 | self.files = [self.files] 207 | 208 | for file in self.files: 209 | lines = [] 210 | if file.endswith(".gz"): 211 | with gzip.open(file, "rb") as f: 212 | for row in f: 213 | if re.match(pattern, row.decode().split("\t")[1]): 214 | lines.append(row.decode().split("\t")) 215 | else: 216 | with open(file, "rb") as f: 217 | for row in f: 218 | if re.match(pattern, row.decode().split("\t")[1]): 219 | lines.append(row.decode().split("\t")) 220 | 221 | data = pd.DataFrame(lines)[[0,1,2]] 222 | data.rename(columns={0: "gene_identifier", 1: 'k_number', 223 | 2: "definition"}, inplace=True) 224 | data["file_name"] = file 225 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 226 | 227 | self.annotations.append(data) 228 | 229 | 230 | 231 | class AnnotationList: 232 | 233 | """Data formatting functions to feed formatted data to the MetaPathPredict function""" 234 | 235 | 236 | def __init__(self, requiredColumnsAll, requiredColumnsModel0, requiredColumnsModel1, annotations, feature_df = pd.DataFrame()): 237 | self.requiredColumnsAll = requiredColumnsAll # all required columns for model #1 and model #2 238 | self.requiredColumnsModel0 = requiredColumnsModel0 # list of all required columns for model #1 239 | self.requiredColumnsModel1 = requiredColumnsModel1 # list of all required columns for model #2 240 | self.annotations = annotations 241 | self.feature_df = feature_df 242 | 243 | 244 | 245 | def create_feature_df(self): 246 | """Converts as list of annotations into a Pandas feature DataFrame. 247 | 248 | Returns: 249 | A Pandas DataFrame. 
250 | """ 251 | 252 | for df in self.annotations: 253 | df["count"] = 1 254 | self.feature_df = pd.concat([self.feature_df, df], axis = 0) 255 | 256 | self.feature_df = self.feature_df.groupby(["file_name", "k_number"]).agg(count=("count", "sum")).reset_index().pivot_table( 257 | index = "file_name", 258 | columns = "k_number", 259 | values = "count", 260 | aggfunc = "first") 261 | 262 | self.feature_df = self.feature_df.replace(np.NaN, 0) 263 | self.feature_df = self.feature_df.where(self.feature_df <= 1, 1) 264 | 265 | 266 | 267 | def check_feature_columns(self): 268 | """Checks that all required columns are present for both of MetaPathPredict's models. 269 | 270 | Returns: 271 | A Pandas DataFrame. 272 | """ 273 | 274 | cols_to_add = [col for col in self.requiredColumnsAll if col not in self.feature_df.columns] 275 | #self.feature_df.loc[:, cols_to_add] = 0 276 | col_dict = dict.fromkeys(cols_to_add, 0) 277 | temp_df = pd.DataFrame(col_dict, index = self.feature_df.index) 278 | self.feature_df = pd.concat([self.feature_df, temp_df], axis = 1) 279 | 280 | cols_to_drop = [col for col in self.feature_df.columns if col not in self.requiredColumnsAll] 281 | self.feature_df.drop(cols_to_drop, axis = 1, inplace = True) 282 | 283 | self.feature_df = self.feature_df.reindex(self.requiredColumnsAll, axis = 1) 284 | 285 | self.feature_df = [self.feature_df, self.feature_df] 286 | 287 | 288 | 289 | # def select_model_features(self): 290 | # """Selects all required columns for the specified MetaPathPredict model (both model #1 and model #2). 291 | # 292 | # Returns: 293 | # A Pandas DataFrame. 294 | # """ 295 | # 296 | # self.feature_df[0] = self.feature_df[0][self.requiredColumnsModel0] 297 | # self.feature_df[0] = self.feature_df[0].reindex(self.requiredColumnsModel0, axis = 1) 298 | # 299 | # self.feature_df[1] = self.feature_df[1][self.requiredColumnsModel1] 300 | # self.feature_df[1] = self.feature_df[1].reindex(self.requiredColumnsModel1, axis = 1) 301 | 302 | 303 | 304 | # def transform_model_features(self, scaler_0, scaler_1): 305 | # """Transforms all required columns for the specified MetaPathPredict model (both model #1 and model #2). 306 | # 307 | # Returns: 308 | # A Pandas DataFrame. 309 | # """ 310 | # 311 | # scaled_features_0 = scaler_0.transform(self.feature_df[0]) 312 | # self.feature_df[0] = pd.DataFrame(scaled_features_0, index = self.feature_df[0].index, columns = self.feature_df[0].columns) 313 | # 314 | # scaled_features_1 = scaler_1.transform(self.feature_df[1]) 315 | # self.feature_df[1] = pd.DataFrame(scaled_features_1, index = self.feature_df[1].index, columns = self.feature_df[1].columns) 316 | --------------------------------------------------------------------------------
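For readers who want to drive the parsing and feature-building classes in utils.py directly from Python rather than through the `MetaPathPredict` command, the following sketch mirrors what cmdline_models.py does internally; the input path is a placeholder, and the trained models would still be loaded separately with `keras.models.load_model` as shown above.

```python
import pickle
from importlib import resources

from metapathpredict.utils import InputData, AnnotationList

# Parse one or more KofamScan detailed-format TSV files (path is illustrative only).
files = InputData(files=["genome_A_kofamscan.tsv.gz"])
files.read_kofamscan_detailed_tsv()

# Load the packaged list of required feature columns, as cmdline_models.py does.
module_dir = resources.files("metapathpredict")
with open(module_dir.joinpath("data/requiredCols.pkl"), "rb") as f:
    required_cols = pickle.load(f)

# Build the genome-by-KO presence/absence features for both models.
features = AnnotationList(
    requiredColumnsAll=required_cols,
    requiredColumnsModel0="blank",  # unused in this workflow, mirroring the CLI code
    requiredColumnsModel1="blank",
    annotations=files.annotations,
)
features.create_feature_df()
features.check_feature_columns()

# features.feature_df is now a two-element list of identical DataFrames,
# one per model, ready to pass to each model's .predict() call.
```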