├── .gitignore ├── MANIFEST.in ├── README.md ├── annotatation_examples ├── blastKoala_annotations.tsv.gz ├── dram_annotations.tsv.gz ├── ghostKoala_annotations.tsv.gz └── kofamscan_annotations.tsv.gz └── package ├── build └── lib │ └── metapathpredict │ └── cmdline_models.py ├── setup.py └── src ├── .DS_Store ├── metapathpredict.egg-info ├── PKG-INFO ├── SOURCES.txt ├── dependency_links.txt ├── entry_points.txt ├── requires.txt └── top_level.txt └── metapathpredict ├── .DS_Store ├── MetaPathPredict.py ├── __init__.py ├── data ├── __init__.py ├── labels.pkl ├── metapathmodules.pkl └── requiredCols.pkl ├── download_models.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | package/build 2 | metapathpredict.log 3 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | recursive-include src/metapathpredict/ *.py 2 | recursive-include src/metapathpredict/data *.pkl 3 | recursive-include src/metapathpredict/data *.keras 4 | recursive-include src/metapathpredict/data *.py 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # MetaPathPredict 2 | 3 | The MetaPathPredict Python module utilizes deep learning models to predict the presence or absence of KEGG metabolic modules in bacterial genomes recovered from environmental sequencing efforts. 4 | 5 | ## Installation 6 | 7 | To run MetaPathPredict, download this repository and install it as a Python module (see download and installation instructions below): 8 | 9 | 10 | ### GitHub install: 11 | 12 | NOTE: [Conda](https://docs.conda.io/en/latest/) is required for this installation. 13 | 14 | 1. Open a Terminal/Command Prompt window and run the following command to download the 15 | GitHub repository to the desired location (note: change your current working directory first 16 | to the desired download location, e.g., `~/Downloads` on MacOS): 17 | `git clone https://github.com/d-mcgrath/MetaPathPredict.git` 18 | 19 | 1. NOTE: You can also download the repository zip file from GitHub 20 | 21 | 2. In a Terminal/Command Prompt window, run the following commands from the parent directory the MetaPathPredict repository was cloned to: 22 | ```bash 23 | conda create -n MetaPathPredict python=3.10.6 scikit-learn=1.3.0 tensorflow=2.10.0 numpy=1.23.4 pandas=1.5.2 keras=2.10.0 git=2.40.1 24 | ``` 25 | NOTE: You will be prompted (y/n) to confirm creating this conda environment. Now activate it: 26 | 27 | ```bash 28 | conda activate MetaPathPredict 29 | ``` 30 | 31 | 3. Install the `huggingface_hub` library: 32 | ```bash 33 | pip install --upgrade huggingface_hub 34 | ``` 35 | 36 | 4. Once complete, pip install MetaPathPredict: 37 | ```bash 38 | pip install MetaPathPredict/package 39 | ``` 40 | 41 | 5. Download MetaPathPredict's models by running the following command: 42 | ```bash 43 | DownloadModels 44 | ``` 45 | 46 | Note: MetaPathPredict is now installed in the `MetaPathPredict` conda environment. Activate the conda environment prior to any use of MetaPathPredict. 47 | 48 | ### pip install: 49 | [not available yet] 50 | 51 |
52 | 
53 | ## Functions
54 | 
55 | The following functions can be run on the command line:
56 | 
57 | - `MetaPathPredict` parses one or more input KEGG Ortholog gene annotation datasets (currently only bacterial genome data is supported) and predicts the presence or absence of [KEGG Modules](https://www.genome.jp/kegg/module.html). This function takes as input the .tsv output files from the [KofamScan](https://github.com/takaram/kofam_scan) and [DRAM](https://github.com/WrightonLabCSU/DRAM) gene annotation tools, as well as from the KEGG KOALA online annotation platforms [blastKOALA](https://www.kegg.jp/blastkoala/), [ghostKOALA](https://www.kegg.jp/ghostkoala/), and [kofamKOALA](https://www.genome.jp/tools/kofamkoala/). Run any of these tools first, then use one or more of their output .tsv files as input to MetaPathPredict.
58 | - A single file or multiple space-separated files can be passed to the `--input` parameter, or a wildcard can be used (e.g., /results/*.tsv). Include full or relative paths to the input file(s). A sample of each annotation file format that MetaPathPredict can process is included in this repository in the [annotatation_examples](annotatation_examples) folder. These sample annotation files can optionally be used to test the installation (see the sketch after this list).
59 | - The format of the input gene annotation files (kofamscan, kofamkoala, dram, or koala) must be specified with the `--annotation-format` parameter. Currently, only one annotation format can be used per run.
60 | - The path (full or relative) and file name for MetaPathPredict's output .tsv file must be specified with the `--output` parameter. MetaPathPredict does not create an output directory or assign a default output file name.
61 | - To restrict reconstruction and prediction to one or more specific KEGG modules, pass the module identifier(s) (e.g., M00001) as a space-separated list to the `--kegg-modules` argument.
62 | 
63 | - To view which KEGG modules MetaPathPredict can reconstruct and make predictions for, run `MetaPathModules` on the command line.
64 | 
65 |
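As a quick test of the installation, the bundled sample annotation files can be decompressed and passed to `MetaPathPredict`. The following is a minimal sketch (not part of the package) that assumes the repository was cloned into the current working directory, the `MetaPathPredict` conda environment is active, and `test_predictions.tsv` is an acceptable output name; adjust paths as needed.

```python
# Minimal installation test (sketch): decompress the bundled KofamScan sample
# annotations and run the MetaPathPredict command-line entry point on them.
# Assumes the repository was cloned into the current working directory and the
# MetaPathPredict conda environment is active; file/output names are examples.
import gzip
import shutil
import subprocess

compressed = "MetaPathPredict/annotatation_examples/kofamscan_annotations.tsv.gz"
decompressed = "kofamscan_annotations.tsv"

# gunzip the sample annotation file
with gzip.open(compressed, "rb") as src, open(decompressed, "wb") as dst:
    shutil.copyfileobj(src, dst)

# equivalent to: MetaPathPredict -i kofamscan_annotations.tsv -a kofamscan -o test_predictions.tsv
subprocess.run(
    ["MetaPathPredict", "-i", decompressed, "-a", "kofamscan", "-o", "test_predictions.tsv"],
    check=True,
)
```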
66 | 67 | ## Basic usage 68 | 69 | ``` 70 | # predict method for making KEGG module presence/absence predictions on input gene annotations 71 | 72 | usage: MetaPathPredict [-h] --input INPUT [INPUT ...] --annotation-format ANNOTATION_FORMAT 73 | [--kegg-modules KEGG_MODULES [KEGG_MODULES ...]] --output OUTPUT 74 | 75 | options: 76 | -h, --help show this help message and exit 77 | --input INPUT [INPUT ...], -i INPUT [INPUT ...] 78 | input file path(s) and name(s) [required] 79 | --annotation-format ANNOTATION_FORMAT, -a ANNOTATION_FORMAT 80 | annotation format (kofamscan, kofamkoala, dram, or koala) [default: 81 | kofamscan] 82 | --kegg-modules KEGG_MODULES [KEGG_MODULES ...], -k KEGG_MODULES [KEGG_MODULES ...] 83 | KEGG modules to predict [default: MetaPathPredict KEGG modules] 84 | --output OUTPUT, -o OUTPUT 85 | output file path and name [required] 86 | ``` 87 | 88 |
89 | 90 | ## Examples with sample datasets 91 | 92 | ``` 93 | # One KofamScan gene annotation dataset 94 | MetaPathPredict -i /path/to/kofamscan_annotations_1.tsv -a kofamscan -o /results/predictions.tsv 95 | 96 | # Three KofamScan gene annotation datasets, with predictions for modules M00001 and M00003 97 | MetaPathPredict \ 98 | -i kofamscan_annotations_1.tsv kofamscan_annotations_2.tsv kofamscan_annotations_3.tsv \ 99 | -a kofamscan \ 100 | -k M00001 M00003 \ 101 | -o /results/predictions.tsv 102 | 103 | # Multiple KofamScan datasets in a directory 104 | MetaPathPredict -i annotations/*.tsv -a kofamscan -o /results/predictions.tsv 105 | 106 | # One DRAM gene annotation dataset 107 | MetaPathPredict -i dram_annotation.tsv -a dram -o /results/predictions.tsv 108 | 109 | # Multiple DRAM datasets in a directory 110 | MetaPathPredict -i annotations/*.tsv -a dram -o /results/predictions.tsv 111 | ``` 112 | 113 |
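Because only one annotation format can be used per run, genomes annotated with different tools can be processed in separate runs and the resulting tables combined afterwards. Below is a minimal sketch of how that could be done with pandas; the file names are hypothetical, and both runs are assumed to have predicted the same (default) set of modules so the prediction columns match.

```python
# Sketch: combine prediction tables from separate MetaPathPredict runs
# (e.g., one run on KofamScan annotations and one on DRAM annotations).
# File names are hypothetical; both runs are assumed to predict the same
# modules, so the tables share identical prediction columns.
import pandas as pd

kofamscan_preds = pd.read_csv("/results/kofamscan_predictions.tsv", sep="\t")
dram_preds = pd.read_csv("/results/dram_predictions.tsv", sep="\t")

combined = pd.concat([kofamscan_preds, dram_preds], ignore_index=True)
combined.to_csv("/results/combined_predictions.tsv", sep="\t", index=False)
```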
114 | 115 | ## Understanding the output 116 | 117 | The output of running `MetaPathPredict` is a table. The first column, `file`, displays the full file name of each input gene annotation file. The remaining columns give the class predictions (module present = 1; module absent = 0) of KEGG modules. Each KEGG module occupies a single column in the table and is labelled by its module identifier. See a sample output below of four KEGG module predictions for three input annotation files: 118 | 119 | | file | M00001 | M00002 | M00003 | M00004 | 120 | |--------------------------------------|--------|--------|--------|--------| 121 | | /path/to/kofamscan_annotations_1.tsv | 1 | 1 | 0 | 1 | 122 | | /path/to/kofamscan_annotations_2.tsv | 0 | 1 | 0 | 0 | 123 | | /path/to/kofamscan_annotations_3.tsv | 1 | 0 | 0 | 0 | 124 | 125 |
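Since the output is a plain tab-separated table, it can be inspected downstream with standard tooling. A minimal sketch with pandas is shown below; the output path matches the examples above, and the summaries shown (per-module presence counts and per-genome module totals) are only one possible way to use the table.

```python
# Sketch: load a MetaPathPredict output table and summarize the predictions.
# The path matches the earlier examples; columns other than 'file' are KEGG
# module identifiers with 1 = predicted present, 0 = predicted absent.
import pandas as pd

preds = pd.read_csv("/results/predictions.tsv", sep="\t")

# number of input genomes predicted to contain each module
module_counts = preds.drop(columns="file").sum().sort_values(ascending=False)
print(module_counts.head())

# number of modules predicted present in each input genome
per_genome = preds.set_index("file").sum(axis=1)
print(per_genome)
```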
126 | 127 | ## Developer usage 128 | 129 | ``` 130 | # training method for MetaPathPredict's internal models 131 | 132 | usage: MetaPathTrain [-h] --train-targets TRAIN_TARGETS --train-features TRAIN_FEATURES 133 | [--num-epochs NUM_EPOCHS] --model-out MODEL_OUT [--use-gpu] 134 | [--num-cores NUM_CORES] [--num-hidden-layers NUM_HIDDEN_LAYERS] 135 | [--hidden-nodes-per-layer HIDDEN_NODES_PER_LAYER] 136 | [--num-features NUM_FEATURES] [--threshold THRESHOLD] 137 | 138 | options: 139 | -h, --help show this help message and exit 140 | --train-targets TRAIN_TARGETS 141 | training targets file 142 | --train-features TRAIN_FEATURES 143 | training features 144 | --num-epochs NUM_EPOCHS 145 | number of epochs 146 | --model-out MODEL_OUT, -m MODEL_OUT 147 | model file name output 148 | --use-gpu use GPU if available 149 | --num-cores NUM_CORES 150 | Number of cores for parallel processing 151 | 152 | Neural Net parameters: 153 | --num-hidden-layers NUM_HIDDEN_LAYERS 154 | number of hidden layers 155 | --hidden-nodes-per-layer HIDDEN_NODES_PER_LAYER 156 | number of nodes in each hidden layer 157 | --num-features NUM_FEATURES 158 | number of features to retain from training data 159 | --threshold THRESHOLD 160 | threshold for SelectKBest feature selection 161 | ``` 162 | -------------------------------------------------------------------------------- /annotatation_examples/blastKoala_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/blastKoala_annotations.tsv.gz -------------------------------------------------------------------------------- /annotatation_examples/dram_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/dram_annotations.tsv.gz -------------------------------------------------------------------------------- /annotatation_examples/ghostKoala_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/ghostKoala_annotations.tsv.gz -------------------------------------------------------------------------------- /annotatation_examples/kofamscan_annotations.tsv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/annotatation_examples/kofamscan_annotations.tsv.gz -------------------------------------------------------------------------------- /package/build/lib/metapathpredict/cmdline_models.py: -------------------------------------------------------------------------------- 1 | """ 2 | Command Line Interface for MetaPathPredict Tools: 3 | ==================================== 4 | 5 | .. 
currentmodule:: metapathpredict 6 | 7 | class methods: 8 | MetaPathPredict methods 9 | """ 10 | 11 | import logging 12 | import argparse 13 | import datetime 14 | import pickle 15 | import os 16 | import sys 17 | import re 18 | import math 19 | import importlib 20 | from typing import Iterable, List, Dict, Set, Optional, Sequence 21 | from itertools import chain 22 | 23 | # disable tensorflow info messages 24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 25 | 26 | import sklearn 27 | import numpy as np 28 | import pandas as pd 29 | import keras 30 | from torchvision import transforms 31 | import torch.optim as optim 32 | from torch.utils.data import Dataset, DataLoader, TensorDataset 33 | from sklearn.model_selection import train_test_split 34 | from sklearn.preprocessing import StandardScaler 35 | from sklearn.feature_selection import SelectKBest, f_classif 36 | from sklearn.metrics import classification_report 37 | import torch 38 | import torch.nn as nn 39 | 40 | import warnings 41 | from sklearn.exceptions import InconsistentVersionWarning 42 | warnings.filterwarnings(action='ignore', category=InconsistentVersionWarning) 43 | 44 | from metapathpredict.utils import InputData 45 | from metapathpredict.utils import AnnotationList 46 | 47 | 48 | # CUDA for PyTorch 49 | use_cuda = torch.cuda.is_available() 50 | device = torch.device("cuda:0" if use_cuda else "cpu") 51 | # device = "cpu" 52 | 53 | torch.backends.cudnn.benchmark = True 54 | 55 | # Parameters 56 | params = {"batch_size": 64, "shuffle": True, "num_workers": 6} 57 | 58 | #Configure the logging system 59 | logging.basicConfig( 60 | filename='HISTORYlistener.log', 61 | level=logging.DEBUG, 62 | format='%(asctime)s %(levelname)s %(module)s - %(message)s', 63 | datefmt='%Y-%m-%d %H:%M:%S') 64 | 65 | root = logging.getLogger() 66 | root.setLevel(logging.DEBUG) 67 | 68 | handler = logging.StreamHandler(sys.stdout) 69 | handler.setLevel(logging.DEBUG) 70 | formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s') 71 | handler.setFormatter(formatter) 72 | root.addHandler(handler) 73 | 74 | 75 | 76 | class CustomDataset(Dataset): 77 | def __init__(self, data, targets, transform=None): 78 | print("type", type(data), data.shape) 79 | self.data = torch.tensor(data, dtype=torch.float32) 80 | self.targets = torch.tensor(targets, dtype=torch.float32) 81 | self.transform = transform 82 | 83 | def __len__(self): 84 | return len(self.data) 85 | 86 | def __getitem__(self, idx): 87 | features, target = self.data[idx], self.targets[idx] 88 | 89 | if self.transform: 90 | sample = self.transform(sample) 91 | 92 | return features, target 93 | 94 | 95 | 96 | class CustomModel(nn.Module): 97 | def __init__(self, num_hidden_nodes_per_layer=1024, num_hidden_layers=5): 98 | super(CustomModel, self).__init__() 99 | NUM_HIDDEN_NODES = num_hidden_nodes_per_layer 100 | self.NUM_HIDDEN_LAYERS = num_hidden_layers 101 | 102 | self.fc1 = nn.Linear(2000, NUM_HIDDEN_NODES) 103 | self.relu = nn.ReLU() 104 | self.dropout = nn.Dropout(0.1) 105 | 106 | # array of hidden layers 107 | self.fcs = [ 108 | nn.Linear(NUM_HIDDEN_NODES, NUM_HIDDEN_NODES) 109 | for i in range(num_hidden_layers) 110 | ] 111 | 112 | self.output_layer = nn.Linear(NUM_HIDDEN_NODES, 94) 113 | self.sigmoid = nn.Sigmoid() 114 | 115 | def forward(self, x): 116 | x = self.fc1(x) 117 | x = self.relu(x) 118 | x = self.dropout(x) 119 | 120 | for i in range(self.NUM_HIDDEN_LAYERS - 1): 121 | x = self.fcs[i](x) 122 | x = self.relu(x) 123 | x = self.dropout(x) 124 | 125 | x = 
self.fcs[self.NUM_HIDDEN_LAYERS - 1](x) 126 | x = self.relu(x) 127 | 128 | x = self.output_layer(x) 129 | x = self.sigmoid(x) 130 | return x 131 | 132 | 133 | 134 | class Models: 135 | 136 | """Platform-agnostic command line functions available in MetaPathPredict tools.""" 137 | 138 | @classmethod 139 | def train(cls, args: Iterable[str] = None) -> int: 140 | """Train a model from the input data . 141 | 142 | Writes out a DNN model in the keras forma 143 | 144 | Parameters 145 | ---------- 146 | args : Iterable[str], optional 147 | value of None, when passed to `parser.parse_args` causes the parser to 148 | read `sys.argv` 149 | 150 | Returns 151 | ------- 152 | return_call : 0 153 | return call if the program completes successfully 154 | 155 | """ 156 | parser = argparse.ArgumentParser() 157 | 158 | parser.add_argument( 159 | "--train-targets", 160 | dest="train_targets", 161 | required=True, 162 | help="training targets file", 163 | ) 164 | parser.add_argument( 165 | "--train-features", 166 | dest="train_features", 167 | required=True, 168 | help="training features", 169 | ) 170 | parser.add_argument( 171 | "--num-epochs", 172 | dest="num_epochs", 173 | required=False, 174 | default=100, 175 | type=int, 176 | help="number of epochs", 177 | ) 178 | parser.add_argument( 179 | "--model-out", 180 | "-m", 181 | dest="model_out", 182 | required=True, 183 | help="model file name output", 184 | ) 185 | parser.add_argument( 186 | "--use-gpu", 187 | dest="use_gpu", 188 | required=False, 189 | action="store_true", 190 | help="use GPU if available", 191 | ) 192 | parser.add_argument( 193 | "--num-cores", 194 | dest="num_cores", 195 | required=False, 196 | default=10, 197 | type=int, 198 | help="Number of cores for parallel processing", 199 | ) 200 | neural_net_params = parser.add_argument_group("Neural Net parameters") 201 | neural_net_params.add_argument( 202 | "--num-hidden-layers", 203 | default=5, 204 | required=False, 205 | type=int, 206 | help="number of hidden layers", 207 | ) 208 | neural_net_params.add_argument( 209 | "--hidden-nodes-per-layer", 210 | type=int, 211 | required=False, 212 | default=1024, 213 | help="number of nodes in each hidden layer", 214 | ) 215 | neural_net_params.add_argument( 216 | "--num-features", 217 | dest="num_features", 218 | default=2000, 219 | required=False, 220 | type=int, 221 | help="number of features to retain from training data", 222 | ) 223 | neural_net_params.add_argument( 224 | "--threshold", 225 | dest="threshold", 226 | default=6432, 227 | required=False, 228 | type=float, 229 | help="threshold for SelectKBest feature selection", 230 | ) 231 | 232 | 233 | args = parser.parse_args() 234 | 235 | # CUDA for PyTorch 236 | device = "cpu" 237 | if args.use_gpu: 238 | use_cuda = torch.cuda.is_available() 239 | device = torch.device("cuda:0" if use_cuda else "cpu") 240 | 241 | logging.info(f"Using device: {device}") 242 | 243 | # read in features 244 | features = pd.read_table(args.train_features, compression="gzip") 245 | logging.info(f"reading input features of shape: {features.shape[0]} x {features.shape[1]}") 246 | 247 | # read in labels 248 | targets = pd.read_table(args.train_targets, compression="gzip") 249 | logging.info(f"reading input labels of shape: {targets.shape[0]} x {targets.shape[1]}") 250 | 251 | # split the data into training and test sets 252 | test_size = 0.25 253 | x, x_test, y, y_test = train_test_split( 254 | features, 255 | targets, 256 | stratify=targets, 257 | shuffle=True, 258 | test_size= test_size, 259 | random_state=111, 260 | 
) 261 | logging.info(f"creating test size of: {test_size}%") 262 | 263 | # Split the remaining data to train and validation 264 | x_train, x_val, y_train, y_val = train_test_split( 265 | x, y, stratify=y, test_size=0.2, shuffle=True, random_state=111 266 | ) 267 | 268 | print("features size", features.shape) 269 | print("targets size", targets.shape) 270 | 271 | print("x_test", x_test.shape, " y_test ", y_test.shape) 272 | print("x", x.shape, " y ", y.shape) 273 | 274 | print("x_train", x_train.shape, " y_train ", y_train.shape) 275 | print("x_val", x_val.shape, " y_val ", y_val.shape) 276 | print("x_test", x_test.shape, " y_test ", y_test.shape) 277 | 278 | 279 | 280 | # Initialize the StandardScaler 281 | scaler = StandardScaler() 282 | 283 | # Fit the scaler to training data and transform it 284 | # and then transform val and test data w/ the fitted scaler object 285 | # (std. dev., variance, etc. are based on training data columns) 286 | scaled_features = scaler.fit_transform(x_train) 287 | x_train = pd.DataFrame(scaled_features, index = x_train.index, columns = x_train.columns) 288 | x_val = pd.DataFrame(scaler.transform(x_val), index = x_val.index, columns = x_val.columns) 289 | x_test = pd.DataFrame(scaler.transform(x_test), index = x_test.index, columns = x_test.columns) 290 | logging.info(f"normalizing the training input features") 291 | 292 | 293 | 294 | # feature selection based only on the training data 295 | # Select features according to the k highest F-values 296 | # from running ANOVA on y_train and x_train 297 | selected_features = [] 298 | for label in y_train: 299 | selector = SelectKBest(f_classif, k = 'all') 300 | selector.fit(x_train, y_train[label]) 301 | selected_features.append(list(selector.scores_)) 302 | 303 | # select threshold that retains 2000 features 304 | threshold = args.threshold 305 | 306 | # # MeanCS 307 | logging.info(f"total number of features in input: {x_train.shape[1]}") 308 | selected_features2 = np.mean(selected_features, axis = 0) > threshold 309 | logging.info(f"number of features selected for training: {sum(selected_features2)}") 310 | 311 | # create new training, validation, and test datasets retaining only the 2000 top features 312 | # determined from the training data 313 | x_train2 = x_train.loc[:, selected_features2] 314 | x_val2 = x_val.loc[:, selected_features2] 315 | x_test2 = x_test.loc[:, selected_features2] 316 | features_used = x_train2.columns.values 317 | labels_used = y_val.columns.values 318 | 319 | logging.info(f"Using features : {str(features_used)}") 320 | logging.info(f"Using labels : {str(labels_used)}") 321 | 322 | # Initialize the StandardScaler 323 | #scaler = StandardScaler() 324 | 325 | # Fit the scaler to your data and transform it 326 | #x_train2 = scaler.fit_transform(x_train2) 327 | #x_val2 = scaler.fit_transform(x_val2) 328 | #logging.info(f"normalizing the training input features") 329 | 330 | y_train = np.asarray(y_train.values) 331 | y_val = np.asarray(y_val.values) 332 | 333 | print() 334 | print("x_train2", x_train2.shape) 335 | print("x_val2", x_val2.shape) 336 | print("x_test2", x_test2.shape) 337 | 338 | # outline the neural network architecture - multilable classifier 339 | # 1 input layer, 5 hidden layers, 1 output layer 340 | # inclue dropout for all hidden layers 341 | model = CustomModel( 342 | num_hidden_nodes_per_layer=args.hidden_nodes_per_layer, 343 | num_hidden_layers=args.num_hidden_layers, 344 | ).to(device) 345 | 346 | # Define loss function and optimizer 347 | criterion = nn.BCELoss() 348 | 
optimizer = optim.Adam(model.parameters(), lr=0.001) 349 | logging.info(f"optimizer Adam with learning rate: 0.001") 350 | 351 | # Define early stopping 352 | early_stopping = torch.optim.lr_scheduler.ReduceLROnPlateau( 353 | optimizer, "min", patience=10 354 | ) 355 | 356 | # Create an empty transform 357 | no_transform = transforms.Compose([]) 358 | 359 | # dataset DataLoader 360 | x_train2 = np.asarray(x_train2) 361 | x_val2 = np.asarray(x_val2) 362 | print("xtrain2", x_train2.shape, y_train.shape) 363 | 364 | logging.info(f"loading training dataset into dataloader") 365 | dataset = CustomDataset(data=x_train2, targets=y_train, transform=None) 366 | 367 | batch_size = 10000 368 | train_data_loader = DataLoader( 369 | dataset, batch_size=batch_size, num_workers=args.num_cores, shuffle=True 370 | ) 371 | 372 | logging.info(f"loading testing dataset into dataloader") 373 | val_dataset = CustomDataset(data=x_val2, targets=y_val, transform=None) 374 | val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True) 375 | 376 | # Train the model 377 | num_epochs = args.num_epochs 378 | logging.info(f"number of epochs for training: {num_epochs}") 379 | for epoch in range(num_epochs): 380 | model.train() 381 | train_loss = 0.0 382 | 383 | for inputs, targets in train_data_loader: 384 | inputs, targets = inputs.to(device), targets.to(device) 385 | optimizer.zero_grad() 386 | outputs = model(inputs) 387 | loss = criterion(outputs, targets) 388 | 389 | loss.backward() 390 | optimizer.step() 391 | train_loss += loss.item() 392 | 393 | model.eval() 394 | val_loss = 0.0 395 | with torch.no_grad(): 396 | for inputs, targets in val_data_loader: 397 | inputs, targets = inputs.to(device), targets.to(device) 398 | outputs = model(inputs) 399 | loss = criterion(outputs, targets) 400 | val_loss += loss.item() 401 | 402 | # Update learning rate using early stopping 403 | early_stopping.step(val_loss) 404 | 405 | logging.info( 406 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 407 | ) 408 | 409 | print( 410 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 411 | ) 412 | 413 | # assess the model on test data 414 | x_test2 = np.asarray(x_test2) 415 | x_test2 = torch.tensor(x_test2, dtype=torch.float32) 416 | logging.info(f"converting test inputs to torch.tensor") 417 | 418 | predictions_test = model(x_test2) 419 | 420 | # round predictions 421 | roundedTestPreds = np.round(predictions_test.detach().numpy()) 422 | 423 | # print out performance metrics 424 | print(classification_report(y_test.values, roundedTestPreds)) 425 | 426 | logging.info(f"Training finished successfully!") 427 | 428 | model_file = {} 429 | model_file["description"] = "neural net trained for predicting multilabels" 430 | model_file["features"] = features_used 431 | model_file["labels"] = labels_used 432 | model_file["model"] = model 433 | torch.save(model_file, args.model_out) 434 | logging.info(f"writing model file: {args.model_out}") 435 | 436 | 437 | 438 | @classmethod 439 | def predict(cls, args: Iterable[str] = None) -> int: 440 | """Predict the presence or absence of select KEGG modules on bacterial 441 | annotation data. 
442 | 443 | Parameters 444 | ---------- 445 | args : Iterable[str], optional 446 | value of None, when passed to `parser.parse_args` causes the parser to 447 | read `sys.argv` 448 | 449 | Returns 450 | ------- 451 | return_call : 0 452 | return call if the program completes successfully 453 | 454 | """ 455 | 456 | # disable tensorflow info messages 457 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 458 | 459 | parser = argparse.ArgumentParser() 460 | 461 | parser.add_argument( 462 | "--input", 463 | "-i", 464 | action = "extend", 465 | nargs = "+", 466 | dest="input", 467 | required=True, 468 | help="input file path(s) and name(s) [required]", 469 | ) 470 | parser.add_argument( 471 | "--annotation-format", 472 | "-a", 473 | dest="annotation_format", 474 | required=True, 475 | help="annotation format (kofamscan, kofamscan-web, dram, or koala) [default: kofamscan]", 476 | ) 477 | parser.add_argument( 478 | "--kegg-modules", 479 | "-k", 480 | dest="kegg_modules", 481 | required=False, 482 | default=None, 483 | action="extend", 484 | nargs="+", 485 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 486 | ) 487 | parser.add_argument( 488 | "--output", 489 | "-o", 490 | dest="output", 491 | required=True, 492 | help="output file path and name [required]", 493 | ) 494 | 495 | args = parser.parse_args() 496 | 497 | module_dir = importlib.resources.files('metapathpredict') 498 | data_dir = module_dir.joinpath("data/") 499 | 500 | scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 501 | scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 502 | 503 | model_0_path = module_dir.joinpath("data/model_0.keras") 504 | model_1_path = module_dir.joinpath("data/model_1.keras") 505 | 506 | labels_path = module_dir.joinpath("data/labels.pkl") 507 | 508 | with open(scaler_0_path, "rb") as f: 509 | model_0_scaler = pickle.load(f) 510 | 511 | with open(scaler_1_path, "rb") as f: 512 | model_1_scaler = pickle.load(f) 513 | 514 | with open(labels_path, "rb") as f: 515 | labels = pickle.load(f) 516 | 517 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 518 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 519 | 520 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 521 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 522 | 523 | # logging.info(f"Reading model files from directory: {data_dir}") 524 | # logging.info(f"Reading scaler files from directory: {data_dir}") 525 | 526 | 527 | # load the input features 528 | files_list = InputData(files = args.input) 529 | 530 | if args.annotation_format == "kofamscan": 531 | files_list.read_kofamscan_detailed_tsv() 532 | 533 | elif args.annotation_format == "kofamkoala": 534 | files_list.read_kofamkoala() 535 | 536 | elif args.annotation_format == "dram": 537 | files_list.read_dram_annotation_tsv() 538 | 539 | elif args.annotation_format == "koala": 540 | files_list.read_koala_tsv() 541 | 542 | else: 543 | logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 544 | sys.exit(0) 545 | 546 | logging.info(f"Reading input files with format: {args.annotation_format}") 547 | 548 | model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_) 549 | model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_) 550 | reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 551 | 552 | input_features = AnnotationList( 553 | requiredColumnsAll = reqColsAll, # 
add list of all required columns for model #1 and model #2 554 | requiredColumnsModel0 = model_0_scaler.feature_names_in_, # add list of all required columns for model #1 555 | requiredColumnsModel1 = model_1_scaler.feature_names_in_, # add list of all required columns for model #2 556 | annotations = files_list.annotations) 557 | 558 | input_features.create_feature_df() 559 | input_features.check_feature_columns() 560 | input_features.select_model_features() 561 | input_features.transform_model_features(model_0_scaler, model_1_scaler) 562 | 563 | logging.info("Making KEGG module presence/absence predictions") 564 | 565 | predictions_list = [] 566 | for x in range(2): 567 | 568 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 569 | 570 | # predict 571 | #predictions = models[x]['model'](features) 572 | logging.info(f"Model {x} is making predictions") 573 | predictions = models[x].predict(input_features.feature_df[x]) 574 | 575 | # round predictions 576 | #roundedPreds = np.round(predictions.detach().numpy()) 577 | roundedPreds = np.round(predictions) 578 | 579 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 580 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int) 581 | 582 | predictions_list.append(predsDf) 583 | 584 | logging.info(f"Model {x} completed predictions") 585 | 586 | logging.info("All done.") 587 | 588 | out_df = pd.concat(predictions_list, axis = 1) 589 | 590 | if args.kegg_modules is not None: 591 | if all(modules in out_df.columns for modules in args.kegg_modules): 592 | out_df = out_df[args.kegg_modules] 593 | else: 594 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 595 | 596 | out_df.insert(loc = 0, column = 'file', value = args.input) 597 | 598 | logging.info(f"Writing output to file: {args.output}") 599 | out_df.to_csv(args.output, sep='\t', index=None) 600 | 601 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 602 | -------------------------------------------------------------------------------- /package/setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import Extension, setup, find_packages 2 | import os 3 | 4 | CLASSIFIERS = [ 5 | "Development Status :: 4 - Beta", 6 | "Natural Language :: English", 7 | "License :: OSI Approved :: BSD License", 8 | "Operating System :: Linux, MacOS, Windows", 9 | "Programming Language :: Python :: 3.10.6+" 10 | ] 11 | 12 | setup( 13 | name="metapathpredict", 14 | description="Tool for predicting the presence or absence of KEGG modules in bacterial genomes", 15 | author="D. Geller-McGrath, K.M. Konwar, V.P. Edgcomb, M. Pachiadaki, J.W. Roddy, T.J. Wheeler, J.E. 
McDermott", 16 | author_email="dgellermcgrath@gmail.com, kishori82@gmail.com", 17 | package_dir={"": "src"}, 18 | packages=["metapathpredict"], 19 | package_data={"metapathpredict": ["data/*.*"]}, 20 | install_requires=[ 21 | "scikit-learn>=1.1.3", 22 | "tensorflow>=2.10.0", 23 | "numpy>=1.23.4", 24 | "pandas>=1.5.2", 25 | "keras>=2.10.0", 26 | "torchvision>=0.15.2", 27 | "torch>=2.0.1", 28 | ], 29 | entry_points={ 30 | "console_scripts": [ 31 | "MetaPathTrain = metapathpredict.MetaPathPredict:Models.train", 32 | "MetaPathPredict = metapathpredict.MetaPathPredict:Models.predict", 33 | "MetaPathModules = metapathpredict.MetaPathPredict:Models.show_available_modules", 34 | "DownloadModels = metapathpredict.download_models:Download.download_models", 35 | "PredictFromTable = metapathpredict.MetaPathPredict:Models.predict_from_feature_table", 36 | "PredictFromTableFs = metapathpredict.MetaPathPredict:Models.predict_from_feature_table_fs_models" 37 | ] 38 | }, 39 | classifiers=CLASSIFIERS, 40 | include_package_data=True, 41 | #ext_modules=cythonize("src/metapathpredict/cpp_mods.pyx") 42 | ) 43 | -------------------------------------------------------------------------------- /package/src/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/.DS_Store -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: metapathpredict 3 | Version: 0.0.0 4 | Summary: Tool for predicting the presence or absence of KEGG modules in bacterial genomes 5 | Author: D. Geller-McGrath, K.M. Konwar, V.P. Edgcomb, M. Pachiadaki, J.W. Roddy, T.J. Wheeler, J.E. 
McDermott 6 | Author-email: dgellermcgrath@gmail.com, kishori82@gmail.com 7 | Classifier: Development Status :: 4 - Beta 8 | Classifier: Natural Language :: English 9 | Classifier: License :: OSI Approved :: BSD License 10 | Classifier: Operating System :: Linux, MacOS, Windows 11 | Classifier: Programming Language :: Python :: 3.10.6+ 12 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | MANIFEST.in 2 | setup.py 3 | src/metapathpredict/MetaPathPredict.py 4 | src/metapathpredict/__init__.py 5 | src/metapathpredict/download_models.py 6 | src/metapathpredict/utils.py 7 | src/metapathpredict.egg-info/PKG-INFO 8 | src/metapathpredict.egg-info/SOURCES.txt 9 | src/metapathpredict.egg-info/dependency_links.txt 10 | src/metapathpredict.egg-info/entry_points.txt 11 | src/metapathpredict.egg-info/requires.txt 12 | src/metapathpredict.egg-info/top_level.txt 13 | src/metapathpredict/data/__init__.py 14 | src/metapathpredict/data/labels.pkl 15 | src/metapathpredict/data/metapathmodules.pkl 16 | src/metapathpredict/data/model_0.keras 17 | src/metapathpredict/data/model_1.keras 18 | src/metapathpredict/data/requiredCols.pkl -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/entry_points.txt: -------------------------------------------------------------------------------- 1 | [console_scripts] 2 | DownloadModels = metapathpredict.download_models:Download.download_models 3 | MetaPathModules = metapathpredict.MetaPathPredict:Models.show_available_modules 4 | MetaPathPredict = metapathpredict.MetaPathPredict:Models.predict 5 | MetaPathTrain = metapathpredict.MetaPathPredict:Models.train 6 | PredictFromTable = metapathpredict.MetaPathPredict:Models.predict_from_feature_table 7 | PredictFromTableFs = metapathpredict.MetaPathPredict:Models.predict_from_feature_table_fs_models 8 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | scikit-learn>=1.1.3 2 | tensorflow>=2.10.0 3 | numpy>=1.23.4 4 | pandas>=1.5.2 5 | keras>=2.10.0 6 | torchvision>=0.15.2 7 | torch>=2.0.1 8 | -------------------------------------------------------------------------------- /package/src/metapathpredict.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | metapathpredict 2 | -------------------------------------------------------------------------------- /package/src/metapathpredict/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/.DS_Store -------------------------------------------------------------------------------- /package/src/metapathpredict/MetaPathPredict.py: -------------------------------------------------------------------------------- 1 | """ 2 | Command Line Interface for MetaPathPredict Tools: 3 | ==================================== 4 | 5 | .. 
currentmodule:: metapathpredict 6 | 7 | class methods: 8 | MetaPathPredict methods 9 | """ 10 | 11 | import logging 12 | import argparse 13 | import datetime 14 | import pickle 15 | import os 16 | import sys 17 | import re 18 | import math 19 | import importlib 20 | from typing import Iterable, List, Dict, Set, Optional, Sequence 21 | from itertools import chain 22 | 23 | # disable tensorflow info messages 24 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 25 | 26 | import sklearn 27 | import numpy as np 28 | import pandas as pd 29 | import keras 30 | from torchvision import transforms 31 | import torch.optim as optim 32 | from torch.utils.data import Dataset, DataLoader, TensorDataset 33 | from sklearn.model_selection import train_test_split 34 | from sklearn.preprocessing import StandardScaler 35 | from sklearn.feature_selection import SelectKBest, f_classif 36 | from sklearn.metrics import classification_report 37 | import torch 38 | import torch.nn as nn 39 | 40 | import warnings 41 | from sklearn.exceptions import InconsistentVersionWarning 42 | warnings.filterwarnings(action='ignore', category=InconsistentVersionWarning) 43 | 44 | from metapathpredict.utils import InputData 45 | from metapathpredict.utils import AnnotationList 46 | 47 | 48 | # CUDA for PyTorch 49 | use_cuda = torch.cuda.is_available() 50 | device = torch.device("cuda:0" if use_cuda else "cpu") 51 | # device = "cpu" 52 | 53 | torch.backends.cudnn.benchmark = True 54 | 55 | # Parameters 56 | params = {"batch_size": 64, "shuffle": True, "num_workers": 6} 57 | 58 | #Configure the logging system 59 | logging.basicConfig( 60 | filename='metapathpredict.log', 61 | level=logging.INFO, 62 | format="%(asctime)s %(levelname)s %(module)s - %(message)s", 63 | datefmt="%Y-%m-%d %H:%M:%S") 64 | 65 | root = logging.getLogger() 66 | root.setLevel(logging.INFO) 67 | 68 | handler = logging.StreamHandler(sys.stdout) 69 | handler.setLevel(logging.INFO) 70 | formatter = logging.Formatter("%(asctime)s %(levelname)s %(module)s - %(message)s", 71 | "%Y-%m-%d %H:%M:%S") 72 | handler.setFormatter(formatter) 73 | root.addHandler(handler) 74 | 75 | 76 | 77 | class CustomDataset(Dataset): 78 | def __init__(self, data, targets, transform=None): 79 | print("type", type(data), data.shape) 80 | self.data = torch.tensor(data, dtype=torch.float32) 81 | self.targets = torch.tensor(targets, dtype=torch.float32) 82 | self.transform = transform 83 | 84 | def __len__(self): 85 | return len(self.data) 86 | 87 | def __getitem__(self, idx): 88 | features, target = self.data[idx], self.targets[idx] 89 | 90 | if self.transform: 91 | sample = self.transform(sample) 92 | 93 | return features, target 94 | 95 | 96 | 97 | class CustomModel(nn.Module): 98 | def __init__(self, num_hidden_nodes_per_layer=1024, num_hidden_layers=5): 99 | super(CustomModel, self).__init__() 100 | NUM_HIDDEN_NODES = num_hidden_nodes_per_layer 101 | self.NUM_HIDDEN_LAYERS = num_hidden_layers 102 | 103 | self.fc1 = nn.Linear(2000, NUM_HIDDEN_NODES) 104 | self.relu = nn.ReLU() 105 | self.dropout = nn.Dropout(0.1) 106 | 107 | # array of hidden layers 108 | self.fcs = [ 109 | nn.Linear(NUM_HIDDEN_NODES, NUM_HIDDEN_NODES) 110 | for i in range(num_hidden_layers) 111 | ] 112 | 113 | self.output_layer = nn.Linear(NUM_HIDDEN_NODES, 94) 114 | self.sigmoid = nn.Sigmoid() 115 | 116 | def forward(self, x): 117 | x = self.fc1(x) 118 | x = self.relu(x) 119 | x = self.dropout(x) 120 | 121 | for i in range(self.NUM_HIDDEN_LAYERS - 1): 122 | x = self.fcs[i](x) 123 | x = self.relu(x) 124 | x = self.dropout(x) 125 
| 126 | x = self.fcs[self.NUM_HIDDEN_LAYERS - 1](x) 127 | x = self.relu(x) 128 | 129 | x = self.output_layer(x) 130 | x = self.sigmoid(x) 131 | return x 132 | 133 | 134 | 135 | class Models: 136 | 137 | """Platform-agnostic command line functions available in MetaPathPredict tools.""" 138 | 139 | @classmethod 140 | def train(cls, args: Iterable[str] = None) -> int: 141 | """Train a model from the input data . 142 | 143 | Writes out a DNN model in the keras forma 144 | 145 | Parameters 146 | ---------- 147 | args : Iterable[str], optional 148 | value of None, when passed to `parser.parse_args` causes the parser to 149 | read `sys.argv` 150 | 151 | Returns 152 | ------- 153 | return_call : 0 154 | return call if the program completes successfully 155 | 156 | """ 157 | parser = argparse.ArgumentParser() 158 | 159 | parser.add_argument( 160 | "--train-targets", 161 | dest="train_targets", 162 | required=True, 163 | help="training targets file", 164 | ) 165 | parser.add_argument( 166 | "--train-features", 167 | dest="train_features", 168 | required=True, 169 | help="training features", 170 | ) 171 | parser.add_argument( 172 | "--num-epochs", 173 | dest="num_epochs", 174 | required=False, 175 | default=100, 176 | type=int, 177 | help="number of epochs", 178 | ) 179 | parser.add_argument( 180 | "--model-out", 181 | "-m", 182 | dest="model_out", 183 | required=True, 184 | help="model file name output", 185 | ) 186 | parser.add_argument( 187 | "--use-gpu", 188 | dest="use_gpu", 189 | required=False, 190 | action="store_true", 191 | help="use GPU if available", 192 | ) 193 | parser.add_argument( 194 | "--num-cores", 195 | dest="num_cores", 196 | required=False, 197 | default=10, 198 | type=int, 199 | help="Number of cores for parallel processing", 200 | ) 201 | neural_net_params = parser.add_argument_group("Neural Net parameters") 202 | neural_net_params.add_argument( 203 | "--num-hidden-layers", 204 | default=5, 205 | required=False, 206 | type=int, 207 | help="number of hidden layers", 208 | ) 209 | neural_net_params.add_argument( 210 | "--hidden-nodes-per-layer", 211 | type=int, 212 | required=False, 213 | default=1024, 214 | help="number of nodes in each hidden layer", 215 | ) 216 | neural_net_params.add_argument( 217 | "--num-features", 218 | dest="num_features", 219 | default=2000, 220 | required=False, 221 | type=int, 222 | help="number of features to retain from training data", 223 | ) 224 | neural_net_params.add_argument( 225 | "--threshold", 226 | dest="threshold", 227 | default=6432, 228 | required=False, 229 | type=float, 230 | help="threshold for SelectKBest feature selection", 231 | ) 232 | 233 | 234 | args = parser.parse_args() 235 | 236 | # CUDA for PyTorch 237 | device = "cpu" 238 | if args.use_gpu: 239 | use_cuda = torch.cuda.is_available() 240 | device = torch.device("cuda:0" if use_cuda else "cpu") 241 | 242 | logging.info(f"Using device: {device}") 243 | 244 | # read in features 245 | features = pd.read_table(args.train_features, compression="gzip") 246 | logging.info(f"reading input features of shape: {features.shape[0]} x {features.shape[1]}") 247 | 248 | # read in labels 249 | targets = pd.read_table(args.train_targets, compression="gzip") 250 | logging.info(f"reading input labels of shape: {targets.shape[0]} x {targets.shape[1]}") 251 | 252 | # split the data into training and test sets 253 | test_size = 0.25 254 | x, x_test, y, y_test = train_test_split( 255 | features, 256 | targets, 257 | stratify=targets, 258 | shuffle=True, 259 | test_size= test_size, 260 | 
random_state=111, 261 | ) 262 | logging.info(f"creating test size of: {test_size}%") 263 | 264 | # Split the remaining data to train and validation 265 | x_train, x_val, y_train, y_val = train_test_split( 266 | x, y, stratify=y, test_size=0.2, shuffle=True, random_state=111 267 | ) 268 | 269 | print("features size", features.shape) 270 | print("targets size", targets.shape) 271 | 272 | print("x_test", x_test.shape, " y_test ", y_test.shape) 273 | print("x", x.shape, " y ", y.shape) 274 | 275 | print("x_train", x_train.shape, " y_train ", y_train.shape) 276 | print("x_val", x_val.shape, " y_val ", y_val.shape) 277 | print("x_test", x_test.shape, " y_test ", y_test.shape) 278 | 279 | 280 | 281 | # Initialize the StandardScaler 282 | scaler = StandardScaler() 283 | 284 | # Fit the scaler to training data and transform it 285 | # and then transform val and test data w/ the fitted scaler object 286 | # (std. dev., variance, etc. are based on training data columns) 287 | scaled_features = scaler.fit_transform(x_train) 288 | x_train = pd.DataFrame(scaled_features, index = x_train.index, columns = x_train.columns) 289 | x_val = pd.DataFrame(scaler.transform(x_val), index = x_val.index, columns = x_val.columns) 290 | x_test = pd.DataFrame(scaler.transform(x_test), index = x_test.index, columns = x_test.columns) 291 | logging.info(f"normalizing the training input features") 292 | 293 | 294 | 295 | # feature selection based only on the training data 296 | # Select features according to the k highest F-values 297 | # from running ANOVA on y_train and x_train 298 | selected_features = [] 299 | for label in y_train: 300 | selector = SelectKBest(f_classif, k = 'all') 301 | selector.fit(x_train, y_train[label]) 302 | selected_features.append(list(selector.scores_)) 303 | 304 | # select threshold that retains 2000 features 305 | threshold = args.threshold 306 | 307 | # # MeanCS 308 | logging.info(f"total number of features in input: {x_train.shape[1]}") 309 | selected_features2 = np.mean(selected_features, axis = 0) > threshold 310 | logging.info(f"number of features selected for training: {sum(selected_features2)}") 311 | 312 | # create new training, validation, and test datasets retaining only the 2000 top features 313 | # determined from the training data 314 | x_train2 = x_train.loc[:, selected_features2] 315 | x_val2 = x_val.loc[:, selected_features2] 316 | x_test2 = x_test.loc[:, selected_features2] 317 | features_used = x_train2.columns.values 318 | labels_used = y_val.columns.values 319 | 320 | logging.info(f"Using features : {str(features_used)}") 321 | logging.info(f"Using labels : {str(labels_used)}") 322 | 323 | # Initialize the StandardScaler 324 | #scaler = StandardScaler() 325 | 326 | # Fit the scaler to your data and transform it 327 | #x_train2 = scaler.fit_transform(x_train2) 328 | #x_val2 = scaler.fit_transform(x_val2) 329 | #logging.info(f"normalizing the training input features") 330 | 331 | y_train = np.asarray(y_train.values) 332 | y_val = np.asarray(y_val.values) 333 | 334 | print() 335 | print("x_train2", x_train2.shape) 336 | print("x_val2", x_val2.shape) 337 | print("x_test2", x_test2.shape) 338 | 339 | # outline the neural network architecture - multilable classifier 340 | # 1 input layer, 5 hidden layers, 1 output layer 341 | # inclue dropout for all hidden layers 342 | model = CustomModel( 343 | num_hidden_nodes_per_layer=args.hidden_nodes_per_layer, 344 | num_hidden_layers=args.num_hidden_layers, 345 | ).to(device) 346 | 347 | # Define loss function and optimizer 348 | 
criterion = nn.BCELoss() 349 | optimizer = optim.Adam(model.parameters(), lr=0.001) 350 | logging.info(f"optimizer Adam with learning rate: 0.001") 351 | 352 | # Define early stopping 353 | early_stopping = torch.optim.lr_scheduler.ReduceLROnPlateau( 354 | optimizer, "min", patience=10 355 | ) 356 | 357 | # Create an empty transform 358 | no_transform = transforms.Compose([]) 359 | 360 | # dataset DataLoader 361 | x_train2 = np.asarray(x_train2) 362 | x_val2 = np.asarray(x_val2) 363 | print("xtrain2", x_train2.shape, y_train.shape) 364 | 365 | logging.info(f"loading training dataset into dataloader") 366 | dataset = CustomDataset(data=x_train2, targets=y_train, transform=None) 367 | 368 | batch_size = 10000 369 | train_data_loader = DataLoader( 370 | dataset, batch_size=batch_size, num_workers=args.num_cores, shuffle=True 371 | ) 372 | 373 | logging.info(f"loading testing dataset into dataloader") 374 | val_dataset = CustomDataset(data=x_val2, targets=y_val, transform=None) 375 | val_data_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True) 376 | 377 | # Train the model 378 | num_epochs = args.num_epochs 379 | logging.info(f"number of epochs for training: {num_epochs}") 380 | for epoch in range(num_epochs): 381 | model.train() 382 | train_loss = 0.0 383 | 384 | for inputs, targets in train_data_loader: 385 | inputs, targets = inputs.to(device), targets.to(device) 386 | optimizer.zero_grad() 387 | outputs = model(inputs) 388 | loss = criterion(outputs, targets) 389 | 390 | loss.backward() 391 | optimizer.step() 392 | train_loss += loss.item() 393 | 394 | model.eval() 395 | val_loss = 0.0 396 | with torch.no_grad(): 397 | for inputs, targets in val_data_loader: 398 | inputs, targets = inputs.to(device), targets.to(device) 399 | outputs = model(inputs) 400 | loss = criterion(outputs, targets) 401 | val_loss += loss.item() 402 | 403 | # Update learning rate using early stopping 404 | early_stopping.step(val_loss) 405 | 406 | logging.info( 407 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 408 | ) 409 | 410 | print( 411 | f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}" 412 | ) 413 | 414 | # assess the model on test data 415 | x_test2 = np.asarray(x_test2) 416 | x_test2 = torch.tensor(x_test2, dtype=torch.float32) 417 | logging.info(f"converting test inputs to torch.tensor") 418 | 419 | predictions_test = model(x_test2) 420 | 421 | # round predictions 422 | roundedTestPreds = np.round(predictions_test.detach().numpy()) 423 | 424 | # print out performance metrics 425 | print(classification_report(y_test.values, roundedTestPreds)) 426 | 427 | logging.info(f"Training finished successfully!") 428 | 429 | model_file = {} 430 | model_file["description"] = "neural net trained for predicting multilabels" 431 | model_file["features"] = features_used 432 | model_file["labels"] = labels_used 433 | model_file["model"] = model 434 | torch.save(model_file, args.model_out) 435 | logging.info(f"writing model file: {args.model_out}") 436 | 437 | 438 | 439 | @classmethod 440 | def predict(cls, args: Iterable[str] = None) -> int: 441 | """Predict the presence or absence of select KEGG modules on bacterial 442 | annotation data. 
443 | 444 | Parameters 445 | ---------- 446 | args : Iterable[str], optional 447 | value of None, when passed to `parser.parse_args` causes the parser to 448 | read `sys.argv` 449 | 450 | Returns 451 | ------- 452 | return_call : 0 453 | return call if the program completes successfully 454 | 455 | """ 456 | 457 | # disable tensorflow info messages 458 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 459 | 460 | parser = argparse.ArgumentParser() 461 | 462 | parser.add_argument( 463 | "--input", 464 | "-i", 465 | action = "extend", 466 | nargs = "+", 467 | dest="input", 468 | required=True, 469 | help="input file path(s) and name(s) [required]", 470 | ) 471 | parser.add_argument( 472 | "--annotation-format", 473 | "-a", 474 | dest="annotation_format", 475 | required=True, 476 | help="annotation format (kofamscan, kofamscan-web, dram, or koala) [default: kofamscan]", 477 | ) 478 | parser.add_argument( 479 | "--kegg-modules", 480 | "-k", 481 | dest="kegg_modules", 482 | required=False, 483 | default=None, 484 | action="extend", 485 | nargs="+", 486 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 487 | ) 488 | parser.add_argument( 489 | "--output", 490 | "-o", 491 | dest="output", 492 | required=True, 493 | help="output file path and name [required]", 494 | ) 495 | 496 | args = parser.parse_args() 497 | 498 | module_dir = importlib.resources.files('metapathpredict') 499 | data_dir = module_dir.joinpath("data/") 500 | 501 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 502 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 503 | 504 | model_0_path = module_dir.joinpath("data/model_0.keras") 505 | model_1_path = module_dir.joinpath("data/model_1.keras") 506 | 507 | labels_path = module_dir.joinpath("data/labels.pkl") 508 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl") 509 | 510 | # with open(scaler_0_path, "rb") as f: 511 | # model_0_scaler = pickle.load(f) 512 | # 513 | # with open(scaler_1_path, "rb") as f: 514 | # model_1_scaler = pickle.load(f) 515 | 516 | with open(labels_path, "rb") as f: 517 | labels = pickle.load(f) 518 | 519 | with open(requiredCols_path, "rb") as f: 520 | requiredCols = pickle.load(f) 521 | 522 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 523 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 524 | 525 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 526 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 527 | 528 | # logging.info(f"Reading model files from directory: {data_dir}") 529 | # logging.info(f"Reading scaler files from directory: {data_dir}") 530 | 531 | 532 | # load the input features 533 | files_list = InputData(files = args.input) 534 | 535 | if args.annotation_format == "kofamscan": 536 | files_list.read_kofamscan_detailed_tsv() 537 | 538 | elif args.annotation_format == "kofamkoala": 539 | files_list.read_kofamkoala() 540 | 541 | elif args.annotation_format == "dram": 542 | files_list.read_dram_annotation_tsv() 543 | 544 | elif args.annotation_format == "koala": 545 | files_list.read_koala_tsv() 546 | 547 | else: 548 | logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 549 | sys.exit(0) 550 | 551 | logging.info(f"Reading input files with format: {args.annotation_format}") 552 | 553 | # model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_) 554 | # model_1_cols = 
np.ndarray.tolist(model_1_scaler.feature_names_in_) 555 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 556 | 557 | reqColsAll = requiredCols 558 | 559 | input_features = AnnotationList( 560 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2 561 | requiredColumnsModel0 = "blank", #model_0_scaler.feature_names_in_, # add list of all required columns for model #1 562 | requiredColumnsModel1 = "blank", #model_1_scaler.feature_names_in_, # add list of all required columns for model #2 563 | annotations = files_list.annotations) 564 | 565 | input_features.create_feature_df() 566 | input_features.check_feature_columns() 567 | # input_features.select_model_features() 568 | # input_features.transform_model_features(model_0_scaler, model_1_scaler) 569 | 570 | logging.info("Making KEGG module presence/absence predictions") 571 | 572 | predictions_list = [] 573 | for prediction_iteration in range(2): 574 | 575 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 576 | 577 | # predict 578 | #predictions = models[x]['model'](features) 579 | logging.info(f"Model {prediction_iteration} is making predictions") 580 | predictions = models[prediction_iteration].predict(input_features.feature_df[prediction_iteration]) 581 | 582 | # round predictions 583 | #roundedPreds = np.round(predictions.detach().numpy()) 584 | roundedPreds = np.round(predictions) 585 | 586 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 587 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[prediction_iteration]).astype(int) 588 | 589 | predictions_list.append(predsDf) 590 | 591 | logging.info(f"Model {prediction_iteration} completed making predictions") 592 | 593 | logging.info("All done.") 594 | 595 | out_df = pd.concat(predictions_list, axis = 1) 596 | 597 | if args.kegg_modules is not None: 598 | if all(modules in out_df.columns for modules in args.kegg_modules): 599 | out_df = out_df[args.kegg_modules] 600 | else: 601 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 602 | 603 | out_df.insert(loc = 0, column = 'file', value = args.input) 604 | 605 | logging.info(f"Writing output to file: {args.output}") 606 | out_df.to_csv(args.output, sep='\t', index=None) 607 | 608 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 609 | 610 | 611 | 612 | @classmethod 613 | def show_available_modules(cls, args: Iterable[str] = None) -> int: 614 | 615 | """List available KEGG modules for presence/absence prediction. 
616 | 617 | Parameters 618 | ---------- 619 | args : Iterable[str], optional 620 | value of None, when passed to `parser.parse_args` causes the parser to 621 | read `sys.argv` 622 | 623 | Returns 624 | ------- 625 | return_call : 0 626 | return call if the program completes successfully 627 | 628 | """ 629 | 630 | module_dir = importlib.resources.files('metapathpredict') 631 | 632 | metapathmodules_path = module_dir.joinpath("data/metapathmodules.pkl") 633 | 634 | with open(metapathmodules_path, "rb") as f: 635 | metapathmodules = pickle.load(f) 636 | 637 | pd.set_option('display.max_rows', None) 638 | pd.set_option('max_colwidth', None) 639 | 640 | print(metapathmodules) 641 | 642 | 643 | 644 | @classmethod 645 | def predict_from_feature_table(cls, args: Iterable[str] = None) -> int: 646 | """Predict the presence or absence of select KEGG modules on bacterial 647 | annotation data -- from an input feature table of KEGG K numbers 648 | 649 | Parameters 650 | ---------- 651 | args : Iterable[str], optional 652 | value of None, when passed to `parser.parse_args` causes the parser to 653 | read `sys.argv` 654 | 655 | Returns 656 | ------- 657 | return_call : 0 658 | return call if the program completes successfully 659 | 660 | """ 661 | 662 | # disable tensorflow info messages 663 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 664 | 665 | parser = argparse.ArgumentParser() 666 | 667 | parser.add_argument( 668 | "--input", 669 | "-i", 670 | dest="input", 671 | required=True, 672 | help="input file path(s) and name(s) [required]", 673 | ) 674 | parser.add_argument( 675 | "--kegg-modules", 676 | "-k", 677 | dest="kegg_modules", 678 | required=False, 679 | default=None, 680 | action="extend", 681 | nargs="+", 682 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 683 | ) 684 | parser.add_argument( 685 | "--output", 686 | "-o", 687 | dest="output", 688 | required=True, 689 | help="output file path and name [required]", 690 | ) 691 | 692 | args = parser.parse_args() 693 | 694 | module_dir = importlib.resources.files('metapathpredict') 695 | data_dir = module_dir.joinpath("data/") 696 | 697 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 698 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 699 | 700 | model_0_path = module_dir.joinpath("data/model_0.keras") 701 | model_1_path = module_dir.joinpath("data/model_1.keras") 702 | 703 | labels_path = module_dir.joinpath("data/labels.pkl") 704 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl") 705 | 706 | # with open(scaler_0_path, "rb") as f: 707 | # model_0_scaler = pickle.load(f) 708 | # 709 | # with open(scaler_1_path, "rb") as f: 710 | # model_1_scaler = pickle.load(f) 711 | 712 | with open(labels_path, "rb") as f: 713 | labels = pickle.load(f) 714 | 715 | with open(requiredCols_path, "rb") as f: 716 | requiredCols = pickle.load(f) 717 | 718 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 719 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 720 | 721 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 722 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 723 | 724 | # logging.info(f"Reading model files from directory: {data_dir}") 725 | # logging.info(f"Reading scaler files from directory: {data_dir}") 726 | 727 | 728 | # load the input features 729 | features = pd.read_csv(args.input, sep = "\t") 730 | # files_list = InputData(files = args.input) 731 | # 732 | 
# if args.annotation_format == "kofamscan": 733 | # files_list.read_kofamscan_detailed_tsv() 734 | # 735 | # elif args.annotation_format == "kofamkoala": 736 | # files_list.read_kofamkoala() 737 | # 738 | # elif args.annotation_format == "dram": 739 | # files_list.read_dram_annotation_tsv() 740 | # 741 | # elif args.annotation_format == "koala": 742 | # files_list.read_koala_tsv() 743 | # 744 | # else: 745 | # logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 746 | # sys.exit(0) 747 | # 748 | # logging.info(f"Reading input files with format: {args.annotation_format}") 749 | 750 | # model_0_cols = np.ndarray.tolist(model_0_scaler.feature_names_in_) 751 | # model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_) 752 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 753 | 754 | #reqColsAll = np.ndarray.tolist(model_0_scaler.feature_names_in_) 755 | reqColsAll = requiredCols 756 | 757 | input_features = AnnotationList( 758 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2 759 | requiredColumnsModel0 = "blank", # add list of all required columns for model #1 760 | requiredColumnsModel1 = "blank", # add list of all required columns for model #2 761 | annotations = "blank") 762 | 763 | #input_features.create_feature_df() 764 | input_features.feature_df = features 765 | input_features.check_feature_columns() 766 | # input_features.select_model_features() 767 | # input_features.transform_model_features(model_0_scaler, model_1_scaler) 768 | 769 | logging.info("Making KEGG module presence/absence predictions") 770 | 771 | predictions_list = [] 772 | for x in range(2): 773 | 774 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 775 | 776 | # predict 777 | #predictions = models[x]['model'](features) 778 | logging.info(f"Model {x} is making predictions") 779 | predictions = models[x].predict(input_features.feature_df[x]) 780 | 781 | # round predictions 782 | #roundedPreds = np.round(predictions.detach().numpy()) 783 | roundedPreds = np.round(predictions) 784 | 785 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 786 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int) 787 | 788 | predictions_list.append(predsDf) 789 | 790 | logging.info(f"Model {x} completed making predictions") 791 | 792 | logging.info("All done.") 793 | 794 | out_df = pd.concat(predictions_list, axis = 1) 795 | 796 | if args.kegg_modules is not None: 797 | if all(modules in out_df.columns for modules in args.kegg_modules): 798 | out_df = out_df[args.kegg_modules] 799 | else: 800 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 801 | 802 | out_df.insert(loc = 0, column = 'file', value = args.input) 803 | 804 | logging.info(f"Writing output to file: {args.output}") 805 | out_df.to_csv(args.output, sep='\t', index=None) 806 | 807 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 808 | 809 | 810 | 811 | @classmethod 812 | def predict_from_feature_table_fs_models(cls, args: Iterable[str] = None) -> int: 813 | """Predict the presence or absence of select KEGG modules on bacterial 814 | annotation data -- from an input feature table of KEGG K numbers 815 | 816 | Parameters 817 | ---------- 818 | args : Iterable[str], optional 819 | value of None, when passed to `parser.parse_args` causes the parser to 
820 | read `sys.argv` 821 | 822 | Returns 823 | ------- 824 | return_call : 0 825 | return call if the program completes successfully 826 | 827 | """ 828 | 829 | # disable tensorflow info messages 830 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 831 | 832 | parser = argparse.ArgumentParser() 833 | 834 | parser.add_argument( 835 | "--input", 836 | "-i", 837 | dest="input", 838 | required=True, 839 | help="input file path(s) and name(s) [required]", 840 | ) 841 | parser.add_argument( 842 | "--kegg-modules", 843 | "-k", 844 | dest="kegg_modules", 845 | required=False, 846 | default=None, 847 | action="extend", 848 | nargs="+", 849 | help="KEGG modules to predict [default: MetaPathPredict KEGG modules]", 850 | ) 851 | parser.add_argument( 852 | "--output", 853 | "-o", 854 | dest="output", 855 | required=True, 856 | help="output file path and name [required]", 857 | ) 858 | 859 | args = parser.parse_args() 860 | 861 | module_dir = importlib.resources.files('metapathpredict') 862 | data_dir = module_dir.joinpath("data/") 863 | 864 | # scaler_0_path = module_dir.joinpath("data/model_0_scaler.pkl") 865 | # scaler_1_path = module_dir.joinpath("data/model_1_scaler.pkl") 866 | 867 | model_0_path = module_dir.joinpath("data/model_0.keras") 868 | model_1_path = module_dir.joinpath("data/model_1.keras") 869 | 870 | labels_path = module_dir.joinpath("data/labels.pkl") 871 | requiredCols_path = module_dir.joinpath("data/requiredCols.pkl") 872 | 873 | requiredColumnsModel0_path = module_dir.joinpath("data/requiredColumnsModel0.pkl") 874 | requiredColumnsModel1_path = module_dir.joinpath("data/requiredColumnsModel1.pkl") 875 | 876 | # with open(scaler_0_path, "rb") as f: 877 | # model_0_scaler = pickle.load(f) 878 | # 879 | # with open(scaler_1_path, "rb") as f: 880 | # model_1_scaler = pickle.load(f) 881 | 882 | with open(labels_path, "rb") as f: 883 | labels = pickle.load(f) 884 | 885 | with open(requiredCols_path, "rb") as f: 886 | requiredCols = pickle.load(f) 887 | 888 | with open(requiredColumnsModel0_path, "rb") as f: 889 | model_0_features = pickle.load(f) 890 | 891 | with open(requiredColumnsModel1_path, "rb") as f: 892 | model_1_features = pickle.load(f) 893 | 894 | 895 | #models = [torch.load(model_0_path), torch.load(model_1_path)] 896 | models = [keras.models.load_model(model_0_path), keras.models.load_model(model_1_path)] 897 | 898 | # logging.info(f"reading model files: {args.model_in[0]}, {args.model_in[1]}") 899 | # logging.info(f"reading scaler files: {args.scaler_in[0]}, {args.scaler_in[1]}") 900 | 901 | # logging.info(f"Reading model files from directory: {data_dir}") 902 | # logging.info(f"Reading scaler files from directory: {data_dir}") 903 | 904 | 905 | # load the input features 906 | features = pd.read_csv(args.input, sep = "\t") 907 | # files_list = InputData(files = args.input) 908 | # 909 | # if args.annotation_format == "kofamscan": 910 | # files_list.read_kofamscan_detailed_tsv() 911 | # 912 | # elif args.annotation_format == "kofamkoala": 913 | # files_list.read_kofamkoala() 914 | # 915 | # elif args.annotation_format == "dram": 916 | # files_list.read_dram_annotation_tsv() 917 | # 918 | # elif args.annotation_format == "koala": 919 | # files_list.read_koala_tsv() 920 | # 921 | # else: 922 | # logging.error("""Did not recognize annotation format; use "kofamscan", "kofamkoala", "dram", or "koala""""") 923 | # sys.exit(0) 924 | # 925 | # logging.info(f"Reading input files with format: {args.annotation_format}") 926 | 927 | # model_0_cols = 
np.ndarray.tolist(model_0_scaler.feature_names_in_) 928 | # model_1_cols = np.ndarray.tolist(model_1_scaler.feature_names_in_) 929 | # reqColsAll = list(set(model_0_cols).union(set(model_1_cols))) 930 | 931 | #reqColsAll = np.ndarray.tolist(model_0_scaler.feature_names_in_) 932 | reqColsAll = requiredCols 933 | 934 | input_features = AnnotationList( 935 | requiredColumnsAll = reqColsAll, # add list of all required columns for model #1 and model #2 936 | requiredColumnsModel0 = model_0_features, # add list of all required columns for model #1 937 | requiredColumnsModel1 = model_1_features, # add list of all required columns for model #2 938 | annotations = "blank") 939 | 940 | #input_features.create_feature_df() 941 | input_features.feature_df = features 942 | input_features.check_feature_columns() 943 | input_features.select_model_features() 944 | # input_features.transform_model_features(model_0_scaler, model_1_scaler) 945 | 946 | logging.info("Making KEGG module presence/absence predictions") 947 | 948 | predictions_list = [] 949 | for x in range(2): 950 | 951 | #features = torch.tensor(np.asarray(input_features.feature_df[x]), dtype=torch.float32) 952 | 953 | # predict 954 | #predictions = models[x]['model'](features) 955 | logging.info(f"Model {x} is making predictions") 956 | predictions = models[x].predict(input_features.feature_df[x]) 957 | 958 | # round predictions 959 | #roundedPreds = np.round(predictions.detach().numpy()) 960 | roundedPreds = np.round(predictions) 961 | 962 | #predsDf = pd.DataFrame(data = roundedPreds, columns = models[x]['labels']).astype(int) 963 | predsDf = pd.DataFrame(data = roundedPreds, columns = labels[x]).astype(int) 964 | 965 | predictions_list.append(predsDf) 966 | 967 | logging.info(f"Model {x} completed making predictions") 968 | 969 | logging.info("All done.") 970 | 971 | out_df = pd.concat(predictions_list, axis = 1) 972 | 973 | if args.kegg_modules is not None: 974 | if all(modules in out_df.columns for modules in args.kegg_modules): 975 | out_df = out_df[args.kegg_modules] 976 | else: 977 | logging.error("""Did not recognize one or more KEGG modules specified with --kegg-modules; keeping all prediction columns""") 978 | 979 | out_df.insert(loc = 0, column = 'file', value = args.input) 980 | 981 | logging.info(f"Writing output to file: {args.output}") 982 | out_df.to_csv(args.output, sep='\t', index=None) 983 | 984 | #logging.info(f"Output matrix size: {out_df.shape[0]} x {out_df.shape[1]}") 985 | -------------------------------------------------------------------------------- /package/src/metapathpredict/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/__init__.py -------------------------------------------------------------------------------- /package/src/metapathpredict/data/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/__init__.py -------------------------------------------------------------------------------- /package/src/metapathpredict/data/labels.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/labels.pkl 
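Note on the feature-table entry points defined in cmdline_models.py above: `predict_from_feature_table` and `predict_from_feature_table_fs_models` read the `--input` file with `pd.read_csv(..., sep="\t")` and hand it straight to `AnnotationList.check_feature_columns`, so the table is assumed to be a genome-by-KO presence/absence matrix like the one `create_feature_df` builds from annotations. The sketch below shows one way such a table might be produced; the K numbers and file name are illustrative placeholders, not values shipped with the package.

```python
import pandas as pd

# Hypothetical example of the layout check_feature_columns expects:
# rows are genomes, columns are KEGG K numbers, values are 0 (absent) or 1 (present).
feature_table = pd.DataFrame(
    [[1, 0, 1],
     [0, 1, 1]],
    columns=["K00001", "K00002", "K00003"],  # placeholder K numbers
)

# Write a tab-separated file to pass via --input. Required K-number columns that
# are missing from the table are added as all-zero columns by check_feature_columns,
# and columns the models do not use are dropped.
feature_table.to_csv("feature_table.tsv", sep="\t", index=False)
```

Because these entry points take a single `--input` table per run, every output row is labeled with that one input path rather than a per-genome identifier from the table itself.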
-------------------------------------------------------------------------------- /package/src/metapathpredict/data/metapathmodules.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/metapathmodules.pkl -------------------------------------------------------------------------------- /package/src/metapathpredict/data/requiredCols.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/d-mcgrath/MetaPathPredict/0f276eaa693be2bc4c735d04b50548949ce2079c/package/src/metapathpredict/data/requiredCols.pkl -------------------------------------------------------------------------------- /package/src/metapathpredict/download_models.py: -------------------------------------------------------------------------------- 1 | #import pyxet 2 | import importlib 3 | import shutil 4 | from importlib import resources 5 | from huggingface_hub import hf_hub_download 6 | 7 | 8 | class Download: 9 | """Functions to download MetaPathPredict's machine learning models""" 10 | 11 | @classmethod 12 | def download_models(cls): 13 | """Downloads MetaPathPredict's models. 14 | 15 | Returns: 16 | None 17 | 18 | """ 19 | print("Downloading MetaPathPredict models...") 20 | module_dir = resources.files('metapathpredict') 21 | data_dir = module_dir.joinpath("data/") 22 | # model_0_dl_path = "xet://dgellermcgrath/MetaPathPredict/main/package/src/metapathpredict/data/model_0.keras" 23 | # model_1_dl_path = "xet://dgellermcgrath/MetaPathPredict/main/package/src/metapathpredict/data/model_1.keras" 24 | model_0_install_path = module_dir.joinpath("data/MetaPathPredict_model_0.keras") 25 | model_1_install_path = module_dir.joinpath("data/MetaPathPredict_model_1.keras") 26 | 27 | model_0_renamed_dir_path = module_dir.joinpath("data/model_0.keras_directory") 28 | model_1_renamed_dir_path = module_dir.joinpath("data/model_1.keras_directory") 29 | 30 | model_0_initial_path = module_dir.joinpath("data/model_0.keras_directory/MetaPathPredict_model_0.keras") 31 | model_1_initial_path = module_dir.joinpath("data/model_1.keras_directory/MetaPathPredict_model_1.keras") 32 | 33 | model_0_final_path = module_dir.joinpath("data/model_0.keras") 34 | model_1_final_path = module_dir.joinpath("data/model_1.keras") 35 | 36 | download_destination = module_dir.joinpath("data/") 37 | 38 | hf_hub_download(repo_id="dgellermcgrath/MetaPathPredict", filename="MetaPathPredict_model_0.keras", local_dir=model_0_install_path, force_download=True) 39 | hf_hub_download(repo_id="dgellermcgrath/MetaPathPredict", filename="MetaPathPredict_model_1.keras", local_dir=model_1_install_path, force_download=True) 40 | 41 | # rename the model directories downloaded from HuggingFace 42 | shutil.move(model_0_install_path, model_0_renamed_dir_path) 43 | shutil.move(model_1_install_path, model_1_renamed_dir_path) 44 | 45 | # move the models out of their directories and rename them 46 | shutil.move(model_0_initial_path, model_0_final_path) 47 | shutil.move(model_1_initial_path, model_1_final_path) 48 | 49 | # remove the directories downloaded from HuggingFace 50 | shutil.rmtree(model_0_renamed_dir_path) 51 | shutil.rmtree(model_1_renamed_dir_path) 52 | 53 | # fs = pyxet.XetFS() # fsspec filesystem 54 | # fs.get(model_0_dl_path, str(model_0_install_path)) 55 | # fs.get(model_1_dl_path, str(model_1_install_path)) 56 | print("Models were downloaded to: " 
+ str(download_destination)) 57 | print("All done. Use MetaPathPredict -h to see how to make predictions.") 58 | -------------------------------------------------------------------------------- /package/src/metapathpredict/utils.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import re 3 | import gzip 4 | import numpy as np 5 | import pandas as pd 6 | 7 | 8 | class InputData: 9 | 10 | """Data parsing functions of input data""" 11 | 12 | 13 | def __init__(self, files, annotations = []): 14 | self.files = files 15 | self.annotations = annotations 16 | 17 | def read_kofamscan_detailed_tsv(self): 18 | """Reads in multiple .tsv files, each with columns: 0: "surpassed_threshold", 19 | 1: 'gene_name', 2: "k_number", 3: "adaptive_threshold", 4: "score", 20 | 5: "evalue", 6: "definition". Keeps only rows where "surpassed_threshold" is 21 | equal to "*". When there are duplicate values in "gene name", keeps the 22 | row containing the highest value in the "score" column. If column "gene name" 23 | contains multiple rows with the same maximum value, calculates the 24 | score-to-adaptive-threshold ratio, and picks the annotation with the highest 25 | ratio. 26 | 27 | Returns: 28 | A list of lists, where each inner list is the annotation data from one file. 29 | """ 30 | 31 | if type(self.files) is str: 32 | self.files = [self.files] 33 | 34 | for file in self.files: 35 | lines = [] 36 | 37 | if file.endswith(".gz"): 38 | with gzip.open(file, "rb") as f: 39 | for row in f: 40 | if row.decode().split("\t")[0] == "*": 41 | lines.append(row.decode().split("\t")) 42 | else: 43 | with open(file, "rb") as f: 44 | for row in f: 45 | if row.decode().split("\t")[0] == "*": 46 | lines.append(row.decode().split("\t")) 47 | 48 | data = pd.DataFrame(lines) 49 | data.rename(columns={0: "surpassed_threshold", 1: 'gene_identifier', 50 | 2: "k_number", 3: "adaptive_threshold", 4: "score", 51 | 5: "evalue", 6: "definition"}, inplace=True) 52 | 53 | data[["adaptive_threshold", "score", "evalue"]] = data[["adaptive_threshold", "score", "evalue"]].apply(pd.to_numeric, axis = 1) 54 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["score"] == group["score"].max()]).reset_index(level = 0, drop = True) 55 | 56 | data["group_size"] = data.groupby(["gene_identifier"]).transform("size") 57 | 58 | if data["group_size"].max() > 1: 59 | n_genes = (data[['gene_identifier', 'group_size']].drop_duplicates()['group_size'] > 1).sum() 60 | print(f"""{n_genes} gene(s) contained multiple annotations that surpassed the adaptive threshold. 61 | Picking the annotation with the highest score-to-adaptive_threshold ratio for these genes.""") 62 | 63 | data["ratio"] = data["score"] / data["adaptive_threshold"] 64 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["ratio"] == group["ratio"].max()]).reset_index(level = 0, drop = True) 65 | 66 | data = data.drop(["ratio"], axis = 1) 67 | 68 | data["file_name"] = file 69 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 70 | 71 | self.annotations.append(data) 72 | 73 | 74 | 75 | 76 | def read_kofamkoala(self): 77 | """Reads in multiple .tsv files, each with columns: 0: "gene_identifier", 78 | 1: 'k_number', 2: "adaptive_threshold", 3: "score", 4: "evalue", 79 | 5: "definition", 6: "definition_2". Keeps only rows where 80 | "surpassed_threshold" is equal to "*". 
When there are duplicate values in 81 | "gene name", keeps the row containing the highest value in the "score" 82 | column. If column "gene name" contains multiple rows with the same maximum 83 | value, calculates the score-to-adaptive-threshold ratio, and picks the 84 | annotation with the highest ratio. 85 | 86 | Returns: 87 | A list of lists, where each inner list is the annotation data from one file. 88 | """ 89 | 90 | if type(self.files) is str: 91 | self.files = [self.files] 92 | 93 | for file in self.files: 94 | lines = [] 95 | 96 | if file.endswith(".gz"): 97 | with gzip.open(file, "rb") as f: 98 | for row in f: 99 | if row.decode().split("\t")[0] == "gene": 100 | continue 101 | elif row.decode().split("\t")[3] == "-": 102 | continue 103 | elif row.decode().split("\t")[2] == "-": 104 | if float(row.decode().split("\t")[4]) <= 1e-50: 105 | lines.append(row.decode().split("\t")) 106 | else: 107 | continue 108 | else: 109 | if float(row.decode().split("\t")[3]) > float(row.decode().split("\t")[2]): 110 | lines.append(row.decode().split("\t")) 111 | else: 112 | with open(file, "rb") as f: 113 | for row in f: 114 | if row.decode().split("\t")[0] == "gene": 115 | continue 116 | elif row.decode().split("\t")[3] == "-": 117 | continue 118 | elif row.decode().split("\t")[2] == "-": 119 | if float(row.decode().split("\t")[4]) <= 1e-50: 120 | lines.append(row.decode().split("\t")) 121 | else: 122 | continue 123 | else: 124 | if float(row.decode().split("\t")[3]) > float(row.decode().split("\t")[2]): 125 | lines.append(row.decode().split("\t")) 126 | 127 | data = pd.DataFrame(lines) 128 | data.rename(columns={0: "gene_identifier", 1: 'k_number', 129 | 2: "adaptive_threshold", 3: "score", 4: "evalue", 130 | 5: "definition", 6: "definition_2"}, inplace=True) 131 | 132 | data.loc[data["adaptive_threshold"] == "-", "adaptive_threshold"] = 1 133 | 134 | data[["adaptive_threshold", "score", "evalue"]] = data[["adaptive_threshold", "score", "evalue"]].apply(pd.to_numeric, axis = 1) 135 | data = data.groupby("gene_identifier").apply(lambda group: group.loc[group["score"] == group["score"].max()]).reset_index(level = 0, drop = True) 136 | 137 | data["group_size"] = data.groupby(["gene_identifier"]).transform("size") 138 | 139 | if data["group_size"].max() > 1: 140 | n_genes = (data[['gene_identifier', 'group_size']].drop_duplicates()['group_size'] > 1).sum() 141 | print(f"""{n_genes} gene(s) contained multiple annotations that surpassed the adaptive threshold. 142 | Picking the annotation with the highest score-to-adaptive_threshold ratio for these genes.""") 143 | 144 | data["ratio"] = data["score"] / data["adaptive_threshold"] 145 | data = data.groupby("gene_identifier", group_keys = False).apply(lambda group: group.loc[group["ratio"] == group["ratio"].max()]).reset_index(level = 0, drop = True) 146 | 147 | data = data.drop(["ratio"], axis = 1) 148 | 149 | data["file_name"] = file 150 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 151 | 152 | self.annotations.append(data) 153 | 154 | 155 | 156 | def read_dram_annotation_tsv(self): 157 | """Reads in multiple DRAM annotation.tsv files, keeping the "gene_identifier" 158 | as column 0, "k_number"" as column 1, and "definition" as column 2. Keeps 159 | only rows where a gene had a KEGG Ortholog annotation. 160 | 161 | Returns: 162 | A list of lists, where each inner list is the annotation data from one file. 
163 | """ 164 | 165 | pattern = "K[0-9]{5}" 166 | 167 | if type(self.files) is str: 168 | self.files = [self.files] 169 | 170 | for file in self.files: 171 | lines = [] 172 | if file.endswith(".gz"): 173 | with gzip.open(file, "rb") as f: 174 | for row in f: 175 | if re.match(pattern, row.decode().split("\t")[8]): 176 | lines.append(row.decode().split("\t")) 177 | else: 178 | with open(file, "rb") as f: 179 | for row in f: 180 | if re.match(pattern, row.decode().split("\t")[8]): 181 | lines.append(row.decode().split("\t")) 182 | 183 | data = pd.DataFrame(lines)[[0,8,9]] 184 | data.rename(columns={0: "gene_identifier", 8: 'k_number', 185 | 9: "definition"}, inplace=True) 186 | data["file_name"] = file 187 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 188 | 189 | 190 | self.annotations.append(data) 191 | 192 | 193 | 194 | def read_koala_tsv(self): 195 | """Reads in multiple blastKoala or ghostKoala .tsv files, keeping the 196 | "gene_identifier" as column 0, "k_number"" as column 1, and "definition" as 197 | column 2. Keeps only rows where a gene had a KEGG Ortholog annotation. 198 | 199 | Returns: 200 | A list of lists, where each inner list is the annotation data from one file. 201 | """ 202 | 203 | pattern = "K[0-9]{5}" 204 | 205 | if type(self.files) is str: 206 | self.files = [self.files] 207 | 208 | for file in self.files: 209 | lines = [] 210 | if file.endswith(".gz"): 211 | with gzip.open(file, "rb") as f: 212 | for row in f: 213 | if re.match(pattern, row.decode().split("\t")[1]): 214 | lines.append(row.decode().split("\t")) 215 | else: 216 | with open(file, "rb") as f: 217 | for row in f: 218 | if re.match(pattern, row.decode().split("\t")[1]): 219 | lines.append(row.decode().split("\t")) 220 | 221 | data = pd.DataFrame(lines)[[0,1,2]] 222 | data.rename(columns={0: "gene_identifier", 1: 'k_number', 223 | 2: "definition"}, inplace=True) 224 | data["file_name"] = file 225 | data = data[["file_name", "gene_identifier", "k_number", "definition"]] 226 | 227 | self.annotations.append(data) 228 | 229 | 230 | 231 | class AnnotationList: 232 | 233 | """Data formatting functions to feed formatted data to the MetaPathPredict function""" 234 | 235 | 236 | def __init__(self, requiredColumnsAll, requiredColumnsModel0, requiredColumnsModel1, annotations, feature_df = pd.DataFrame()): 237 | self.requiredColumnsAll = requiredColumnsAll # all required columns for model #1 and model #2 238 | self.requiredColumnsModel0 = requiredColumnsModel0 # list of all required columns for model #1 239 | self.requiredColumnsModel1 = requiredColumnsModel1 # list of all required columns for model #2 240 | self.annotations = annotations 241 | self.feature_df = feature_df 242 | 243 | 244 | 245 | def create_feature_df(self): 246 | """Converts as list of annotations into a Pandas feature DataFrame. 247 | 248 | Returns: 249 | A Pandas DataFrame. 
250 | """ 251 | 252 | for df in self.annotations: 253 | df["count"] = 1 254 | self.feature_df = pd.concat([self.feature_df, df], axis = 0) 255 | 256 | self.feature_df = self.feature_df.groupby(["file_name", "k_number"]).agg(count=("count", "sum")).reset_index().pivot_table( 257 | index = "file_name", 258 | columns = "k_number", 259 | values = "count", 260 | aggfunc = "first") 261 | 262 | self.feature_df = self.feature_df.replace(np.NaN, 0) 263 | self.feature_df = self.feature_df.where(self.feature_df <= 1, 1) 264 | 265 | 266 | 267 | def check_feature_columns(self): 268 | """Checks that all required columns are present for both of MetaPathPredict's models. 269 | 270 | Returns: 271 | A Pandas DataFrame. 272 | """ 273 | 274 | cols_to_add = [col for col in self.requiredColumnsAll if col not in self.feature_df.columns] 275 | #self.feature_df.loc[:, cols_to_add] = 0 276 | col_dict = dict.fromkeys(cols_to_add, 0) 277 | temp_df = pd.DataFrame(col_dict, index = self.feature_df.index) 278 | self.feature_df = pd.concat([self.feature_df, temp_df], axis = 1) 279 | 280 | cols_to_drop = [col for col in self.feature_df.columns if col not in self.requiredColumnsAll] 281 | self.feature_df.drop(cols_to_drop, axis = 1, inplace = True) 282 | 283 | self.feature_df = self.feature_df.reindex(self.requiredColumnsAll, axis = 1) 284 | 285 | self.feature_df = [self.feature_df, self.feature_df] 286 | 287 | 288 | 289 | # def select_model_features(self): 290 | # """Selects all required columns for the specified MetaPathPredict model (both model #1 and model #2). 291 | # 292 | # Returns: 293 | # A Pandas DataFrame. 294 | # """ 295 | # 296 | # self.feature_df[0] = self.feature_df[0][self.requiredColumnsModel0] 297 | # self.feature_df[0] = self.feature_df[0].reindex(self.requiredColumnsModel0, axis = 1) 298 | # 299 | # self.feature_df[1] = self.feature_df[1][self.requiredColumnsModel1] 300 | # self.feature_df[1] = self.feature_df[1].reindex(self.requiredColumnsModel1, axis = 1) 301 | 302 | 303 | 304 | # def transform_model_features(self, scaler_0, scaler_1): 305 | # """Transforms all required columns for the specified MetaPathPredict model (both model #1 and model #2). 306 | # 307 | # Returns: 308 | # A Pandas DataFrame. 309 | # """ 310 | # 311 | # scaled_features_0 = scaler_0.transform(self.feature_df[0]) 312 | # self.feature_df[0] = pd.DataFrame(scaled_features_0, index = self.feature_df[0].index, columns = self.feature_df[0].columns) 313 | # 314 | # scaled_features_1 = scaler_1.transform(self.feature_df[1]) 315 | # self.feature_df[1] = pd.DataFrame(scaled_features_1, index = self.feature_df[1].index, columns = self.feature_df[1].columns) 316 | --------------------------------------------------------------------------------
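For readers who want to drive the parsing and feature-building classes in utils.py directly from Python rather than through the `MetaPathPredict` command, the following sketch mirrors what cmdline_models.py does internally; the input path is a placeholder, and the trained models would still be loaded separately with `keras.models.load_model` as shown above.

```python
import pickle
from importlib import resources

from metapathpredict.utils import InputData, AnnotationList

# Parse one or more KofamScan detailed-format TSV files (path is illustrative only).
files = InputData(files=["genome_A_kofamscan.tsv.gz"])
files.read_kofamscan_detailed_tsv()

# Load the packaged list of required feature columns, as cmdline_models.py does.
module_dir = resources.files("metapathpredict")
with open(module_dir.joinpath("data/requiredCols.pkl"), "rb") as f:
    required_cols = pickle.load(f)

# Build the genome-by-KO presence/absence features for both models.
features = AnnotationList(
    requiredColumnsAll=required_cols,
    requiredColumnsModel0="blank",  # unused in this workflow, mirroring the CLI code
    requiredColumnsModel1="blank",
    annotations=files.annotations,
)
features.create_feature_df()
features.check_feature_columns()

# features.feature_df is now a two-element list of identical DataFrames,
# one per model, ready to pass to each model's .predict() call.
```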