├── python ├── sgt-package │ ├── sgt.egg-info │ │ ├── top_level.txt │ │ ├── dependency_links.txt │ │ └── SOURCES.txt │ ├── sgt_cran2367_bugfix_1.egg-info │ │ ├── top_level.txt │ │ ├── dependency_links.txt │ │ ├── SOURCES.txt │ │ └── PKG-INFO │ ├── sgt_cran2367_bugfix_2.egg-info │ │ ├── top_level.txt │ │ ├── dependency_links.txt │ │ └── SOURCES.txt │ ├── sgt │ │ ├── __init__.py │ │ └── sgt.py │ ├── build │ │ └── lib │ │ │ └── sgt │ │ │ └── __init__.py │ ├── output_19_1.png │ ├── output_23_1.png │ ├── dist │ │ ├── sgt-2.0.0.tar.gz │ │ ├── sgt-2.0.1.tar.gz │ │ ├── sgt-2.0.2.tar.gz │ │ ├── sgt-2.0.3.tar.gz │ │ ├── sgt-2.0.0b15.tar.gz │ │ ├── sgt-2.0.0b16.tar.gz │ │ ├── sgt-2.0.0b17.tar.gz │ │ ├── sgt-2.0.0b18.tar.gz │ │ ├── sgt-2.0.0b19.tar.gz │ │ ├── sgt-2.0.0b20.tar.gz │ │ ├── sgt-2.0.0b21.tar.gz │ │ ├── sgt-2.0.0-py3-none-any.whl │ │ ├── sgt-2.0.1-py3-none-any.whl │ │ ├── sgt-2.0.2-py3-none-any.whl │ │ ├── sgt-2.0.3-py3-none-any.whl │ │ ├── sgt-2.0.0b15-py3-none-any.whl │ │ ├── sgt-2.0.0b16-py3-none-any.whl │ │ ├── sgt-2.0.0b17-py3-none-any.whl │ │ ├── sgt-2.0.0b18-py3-none-any.whl │ │ ├── sgt-2.0.0b19-py3-none-any.whl │ │ ├── sgt-2.0.0b20-py3-none-any.whl │ │ └── sgt-2.0.0b21-py3-none-any.whl │ ├── setup.py │ └── LICENSE ├── output_23_1.png └── __pycache__ │ ├── sgtdev.cpython-37.pyc │ └── sgttemp.cpython-37.pyc ├── output_23_1.png ├── .gitignore~ ├── .gitignore ├── R ├── main.R ├── kmeans.R └── sgt.R ├── data └── darpa_data.csv └── README.md /python/sgt-package/sgt.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | sgt 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/top_level.txt: 
-------------------------------------------------------------------------------- 1 | sgt 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_2.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | sgt 2 | -------------------------------------------------------------------------------- /output_23_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/output_23_1.png -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_2.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "2.0.3" 2 | 3 | from .sgt import SGT -------------------------------------------------------------------------------- /python/output_23_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/output_23_1.png -------------------------------------------------------------------------------- /python/sgt-package/build/lib/sgt/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "2.0.3" 2 | 3 | from .sgt import SGT -------------------------------------------------------------------------------- /python/sgt-package/output_19_1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/output_19_1.png -------------------------------------------------------------------------------- /python/sgt-package/output_23_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/output_23_1.png -------------------------------------------------------------------------------- /python/__pycache__/sgtdev.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/__pycache__/sgtdev.cpython-37.pyc -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.1.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.1.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.2.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.3.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.3.tar.gz 
-------------------------------------------------------------------------------- /python/__pycache__/sgttemp.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/__pycache__/sgttemp.cpython-37.pyc -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b15.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b15.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b16.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b16.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b17.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b17.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b18.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b18.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b19.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b19.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b20.tar.gz: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b20.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b21.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b21.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.1-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.1-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.2-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.2-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.3-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.3-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b15-py3-none-any.whl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b15-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b16-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b16-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b17-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b17-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b18-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b18-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b19-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b19-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b20-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b20-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b21-py3-none-any.whl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b21-py3-none-any.whl -------------------------------------------------------------------------------- /.gitignore~: -------------------------------------------------------------------------------- 1 | *__pycache__/ 2 | *.Rproj* 3 | *.Rhistory 4 | *logs/ 5 | *.DS_Store 6 | *.ipynb_checkpoints 7 | *build/ 8 | *sgt.egg-info 9 | *sgt_cran* 10 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.Rproj* 2 | *.Rhistory 3 | *logs/ 4 | *.DS_Store 5 | *.ipynb_checkpoints 6 | *build/ 7 | *sgt.egg-info 8 | *sgt_cran* 9 | *archive/ 10 | .Rproj.user 11 | -------------------------------------------------------------------------------- /python/sgt-package/sgt.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | README.md 2 | setup.py 3 | sgt/__init__.py 4 | sgt/sgt.py 5 | sgt.egg-info/PKG-INFO 6 | sgt.egg-info/SOURCES.txt 7 | sgt.egg-info/dependency_links.txt 8 | sgt.egg-info/top_level.txt -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | README.md 2 | setup.py 3 | sgt/__init__.py 4 | sgt_cran2367_bugfix_1.egg-info/PKG-INFO 5 | sgt_cran2367_bugfix_1.egg-info/SOURCES.txt 6 | sgt_cran2367_bugfix_1.egg-info/dependency_links.txt 7 | sgt_cran2367_bugfix_1.egg-info/top_level.txt -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_2.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | README.md 2 | setup.py 3 | sgt/__init__.py 4 | sgt_cran2367_bugfix_2.egg-info/PKG-INFO 5 | sgt_cran2367_bugfix_2.egg-info/SOURCES.txt 6 | 
sgt_cran2367_bugfix_2.egg-info/dependency_links.txt 7 | sgt_cran2367_bugfix_2.egg-info/top_level.txt -------------------------------------------------------------------------------- /python/sgt-package/setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md", "r") as fh: 4 | long_description = fh.read() 5 | 6 | setuptools.setup( 7 | name="sgt", 8 | version="2.0.3", 9 | author="Chitta Ranjan", 10 | author_email="cran2367@gmail.com", 11 | description="Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long and short term patterns embedded in SGT can be tuned without any increase in the computation.", 12 | long_description=long_description, 13 | long_description_content_type="text/markdown", 14 | url="https://github.com/cran2367/sgt", 15 | packages=setuptools.find_packages(), 16 | classifiers=[ 17 | "Programming Language :: Python :: 3", 18 | "License :: OSI Approved :: MIT License", 19 | "Operating System :: OS Independent", 20 | ], 21 | ) -------------------------------------------------------------------------------- /python/sgt-package/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2018 The Python Packaging Authority 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of 
the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. -------------------------------------------------------------------------------- /R/main.R: -------------------------------------------------------------------------------- 1 | library(matrixStats) 2 | library(dplyr) 3 | source('sgt.R', echo = F) 4 | source('kmeans.R', echo = F) 5 | 6 | ###################################################################### 7 | ######## Validate SGT output with a simple sequence example ########## 8 | ###################################################################### 9 | 10 | alphabet_set <- c("A", "B", "C") 11 | alphabet_set_size <- length(alphabet_set) 12 | 13 | seq <- "BBACACAABA" 14 | 15 | kappa <- 5 16 | 17 | ###### Algorithm 1 ###### 18 | sgt_parts_alg1 <- f_sgt_parts(sequence = seq, kappa = kappa, alphabet_set_size = alphabet_set_size) 19 | print(sgt_parts_alg1) 20 | 21 | sgt <- f_SGT(W_kappa = sgt_parts_alg1$W_kappa, W0 = sgt_parts_alg1$W0, 22 | Len = sgt_parts_alg1$Len, kappa = kappa) # Set Len = NULL for length-sensitive SGT. 
23 | print(sgt) 24 | 25 | ###### Algorithm 2 ###### 26 | seq_split <- f_seq_split(sequence = seq) 27 | seq_alphabet_positions <- f_get_alphabet_positions(sequence_split = seq_split, alphabet_set = alphabet_set) 28 | 29 | sgt_parts_alg2 <- f_sgt_parts_using_element_positions(seq_alphabet_positions = seq_alphabet_positions, 30 | alphabet_set = alphabet_set, 31 | kappa = kappa) 32 | 33 | sgt <- f_SGT(W_kappa = sgt_parts_alg2$W_kappa, W0 = sgt_parts_alg2$W0, 34 | Len = sgt_parts_alg2$Len, kappa = kappa) # Set Len = NULL for length-sensitive SGT. 35 | 36 | 37 | ############################################################################ 38 | ######## Demo: Performing a Clustering operation on a seq dataset ########## 39 | ############################################################################ 40 | 41 | ## The dataset contains all roman letters, A-Z. 42 | dataset <- read.csv("../data/simulated-sequence-dataset.csv", header = T, stringsAsFactors = F) 43 | 44 | sgt_parts_sequences_in_dataset <- f_SGT_for_each_sequence_in_dataset(sequence_dataset = dataset, 45 | kappa = 5, alphabet_set = LETTERS, 46 | spp = NULL, sgt_using_alphabet_positions = T) 47 | 48 | 49 | input_data <- f_create_input_kmeans(all_seq_sgt_parts = sgt_parts_sequences_in_dataset, 50 | length_normalize = T, 51 | alphabet_set_size = 26, 52 | kappa = 5, trace = TRUE, 53 | inv.powered = T) 54 | K = 5 55 | clustering_output <- f_kmeans(input_data = input_data, K = K, alphabet_set_size = 26, trace = T) 56 | 57 | cc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output$class), K = K, type = "f1") 58 | print(cc) 59 | 60 | ######## Clustering on Principal Components of SGT features ######## 61 | num_pcs <- 5 # Number of principal components we want 62 | input_data_pcs <- f_pcs(input_data = input_data, PCs = num_pcs)$input_data_pcs 63 | 64 | clustering_output_pcs <- f_kmeans(input_data = input_data_pcs, K = K, alphabet_set_size = sqrt(num_pcs), trace = F) 65 | 66 | cc <- 
f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output_pcs$class), K = K, type = "f1") 67 | print(cc) 68 | -------------------------------------------------------------------------------- /R/kmeans.R: -------------------------------------------------------------------------------- 1 | #### In this file, we have all the functions required for performing a k-means clustering. The k-means will be performed on the SGT vectors. Here, 'class' and 'cluster' of a sequence mean the same thing. 2 | 3 | f_centroid <- function(Ks, alphabet_set_size, input_data, class) 4 | { 5 | # For any given classes, we find the centroids. 6 | # Inputs 7 | # Ks Vector of names of the classes. Typically, it is denoted by a scalar K, and cluster names are 1:K. But sometimes it's not 1:K (e.g. if one of the clusters is dropped midway through clustering). 8 | # alphabet_set_size The number of alphabets the sequences are made of. 9 | # input_data A matrix, where a row is an SGT vector for a sequence and a column is one SGT feature. 10 | # class A vector having the class assignment for each sequence. 11 | 12 | K <- length(Ks) 13 | centroid <- matrix(rep(0, K * (alphabet_set_size * alphabet_set_size)), nrow = K) 14 | rownames(centroid) <- Ks 15 | 16 | for(k in Ks) 17 | { 18 | centroid[toString(k),] <- t(t(input_data) %*% (class == k) / sum(class == k)) 19 | } 20 | 21 | return(centroid) 22 | } 23 | 24 | 25 | f_class <- function(Ks, n, input_data, centroid, asgnmt_threshold = 999999) 26 | { 27 | # For given centroids, we assign a class to each sequence 28 | # Inputs 29 | # Ks Vector of names of the classes. 30 | # n Number of sequences 31 | # input_data A matrix, where a row is an SGT vector for a sequence and a column is one SGT feature. 32 | # centroid Matrix containing the centroids. Each row is a centroid for a cluster. 33 | # asgnmt_threshold This assignment threshold is there for experimental purposes. It is not needed for regular operations; thus, it defaults to a very high value.
34 | 35 | Z <- NULL 36 | for(k in Ks) 37 | { 38 | tmp <- input_data - t(matrix(rep(centroid[toString(k),], n), ncol = n)) 39 | Z <- cbind(Z, rowSums(abs(tmp))) # We use the L1 norm for distances. 40 | } 41 | 42 | colnames(Z) <- Ks 43 | class <- Ks[max.col(-1*Z, ties.method = "random")] # Update classes by assigning each sequence to the class it is closest to 44 | 45 | wss <- sum(abs(do.call(pmin,data.frame(Z)))) 46 | 47 | class.last <- max(as.numeric(Ks)) 48 | threshold.violation <- c(do.call(pmin,data.frame(Z)) > asgnmt_threshold) 49 | if(any(threshold.violation)) 50 | { 51 | class[threshold.violation] <- class.last + 1 52 | Ks <- c(Ks, (class.last + 1)) 53 | } 54 | 55 | out <- list(class = class, Ks = Ks, wss = wss, Z = Z) 56 | return(out) 57 | } 58 | 59 | 60 | f_NA_centroid_exception <- function(Ks, centroid, trace = FALSE) 61 | { 62 | # If there are too many classes, sometimes a class does not get any data point assigned; we should remove it. This is an important function to handle these exceptions. 63 | if(is.na(sum(sum(centroid)))) 64 | { 65 | if(trace){print("inside nan")} 66 | centroid <- centroid[!is.na(centroid[,1]), ] # Remove the centroid rows with NA 67 | Ks <- strtoi(rownames(centroid)) 68 | } 69 | out <- list(Ks = Ks, centroid = centroid) 70 | 71 | return(out) 72 | } 73 | 74 | f_create_input_kmeans <- function(all_seq_sgt_parts, length_normalize = FALSE, alphabet_set_size, kappa, trace = TRUE, inv.powered = T) 75 | { 76 | # Creating the input data for feeding into the kmeans function 77 | # Inputs 78 | # all_seq_sgt_parts The transform parts for the sequences in a dataset 79 | # length_normalize Is True for the length-insensitive variant of SGT [1] 80 | # alphabet_set_size The number of alphabets that make up the sequences in the dataset. 81 | # kappa The tuning parameter 82 | # inv.powered Is True if we want to take the kappa-th root of the SGT, as shown in Algorithm 1 [1].
83 | 84 | n.seq <- dim(all_seq_sgt_parts$W0_all)[3] 85 | 86 | # Find the SGT for each sequence 87 | sgt_mat_all <- array(rep(0,n.seq * alphabet_set_size * alphabet_set_size), 88 | dim=c(alphabet_set_size, alphabet_set_size, n.seq)) 89 | 90 | for(ind in 1:n.seq) 91 | { 92 | if(trace){print(paste(ind,"in",n.seq))} 93 | if(length_normalize == TRUE) 94 | { 95 | sgt_mat_all[ , ,ind] <- f_SGT(W_kappa = all_seq_sgt_parts$W_kappa_all[[ind]], 96 | W0 = all_seq_sgt_parts$W0_all[,,ind], 97 | kappa = kappa, 98 | Len = all_seq_sgt_parts$Len_all[ind], 99 | inv.powered = inv.powered) 100 | }else{ # Not length normalize 101 | sgt_mat_all[ , ,ind] <- f_SGT(W_kappa = all_seq_sgt_parts$W_kappa_all[[ind]], 102 | W0 = all_seq_sgt_parts$W0_all[,,ind], 103 | kappa = kappa, 104 | Len = NULL, 105 | inv.powered = inv.powered) 106 | } 107 | } 108 | 109 | # Vectorize the sequence alphabet_set_size x alphabet_set_size statistics (mean in this case) 110 | # Code taken for this from http://stackoverflow.com/questions/4022195/transform-a-3d-array-into-a-matrix-in-r 111 | 112 | input_data <- aperm(sgt_mat_all, c(3,2,1)) 113 | dim(input_data) <- c(n.seq, alphabet_set_size * alphabet_set_size) 114 | 115 | return(input_data) 116 | } 117 | 118 | 119 | f_kmeans_procedure <- function(input_data, K, alphabet_set_size = 26, max_iteration = 50, trace = TRUE) 120 | { 121 | # This function will perform the centroid based kmeans clustering using Manhattan distance. 
122 | # Inputs 123 | # input_data The input data matrix, each row a data point and the columns are its features 124 | # K The number of clusters 125 | 126 | set.seed(12) # To ensure reproducibility 127 | n <- nrow(input_data) 128 | 129 | # Step 0: Initialization 130 | if(K <= n) 131 | { 132 | # Making sure at least one member is given to each cluster in the beginning 133 | class.tmp <- 1:K 134 | class.tmp2 <- sample.int(n = K, size = (n - K), replace = T) 135 | class <- c(class.tmp, class.tmp2) 136 | class.tmp2 <- sample.int(n = K, size = (n - K), replace = T) # Another initialization for class.old 137 | class.old <- c(class.tmp, class.tmp2) 138 | } else{ 139 | stop("K is greater than n. Terminating!") 140 | } 141 | 142 | Ks <- 1:K # List of cluster 143 | centroid <- f_centroid(Ks = Ks, alphabet_set_size = alphabet_set_size, input_data = input_data, class = class) 144 | 145 | out_NA <- f_NA_centroid_exception(Ks = Ks, centroid = centroid, trace = trace) 146 | Ks <- out_NA$Ks 147 | centroid <- out_NA$centroid 148 | 149 | # Iterations for clustering 150 | class.changes <- 10 # arbitrary 151 | epsilon <- 100 # arbitrary 152 | counter <- 0 153 | class.changes.check <- 0 154 | 155 | while(class.changes != 0 && counter <= max_iteration) 156 | { 157 | counter <- counter + 1 158 | class.old <- class 159 | 160 | # Step 1: Getting the centroid for each class 161 | centroid <- f_centroid(Ks = Ks, alphabet_set_size = alphabet_set_size, input_data = input_data, class = class) 162 | 163 | # Exception handling: If a centroid does not get any data point assigned 164 | out_NA <- f_NA_centroid_exception(Ks = Ks, centroid = centroid) 165 | Ks <- out_NA$Ks 166 | centroid <- out_NA$centroid 167 | 168 | 169 | # Step 2: Assign (update) class to each data point based on its distance from the centroids 170 | class.out <- f_class(Ks = Ks, n = n, input_data = input_data, centroid = centroid) 171 | class <- class.out$class 172 | Ks <- class.out$Ks 173 | wss <- class.out$wss 174 | Z <- 
class.out$Z 175 | 176 | # Iteration differences 177 | class.changes <- sum(class != class.old) 178 | 179 | if(trace) 180 | { 181 | print(paste("Iteration", counter, "in", max_iteration, "--Class chgs: ", class.changes, "wss: ", round(wss,2), "and K is ", length(Ks))) 182 | } 183 | } 184 | return(list(class = class, centroid = centroid, Ks = Ks, wss = wss, Z = Z)) 185 | } 186 | 187 | 188 | f_kmeans <- function(input_data, K, alphabet_set_size = 26, max_iteration = 50, trace = TRUE, K_fixed = T) 189 | { 190 | if(K_fixed){ 191 | 192 | check <- 0 193 | while(check != K) 194 | { 195 | class <- f_kmeans_procedure(input_data = input_data, K = K, alphabet_set_size = alphabet_set_size, trace = trace) 196 | check <- length(levels(factor(class$class))) 197 | } 198 | }else{ 199 | class <- f_kmeans_procedure(input_data = input_data, K = K, alphabet_set_size = alphabet_set_size, trace = trace) 200 | } 201 | return(class) 202 | } 203 | 204 | 205 | f_get_ss <- function(input_data) 206 | { 207 | n <- nrow(input_data) 208 | ybar <- t(rowMeans(t(input_data))) 209 | ss <- 0 210 | for(i in 1:n) 211 | { 212 | ss <- ss + sum(abs(input_data[i,] - ybar)) 213 | } 214 | return(ss) 215 | } 216 | 217 | 218 | f_pcs <- function(input_data, PCs = 50) 219 | { 220 | nc <- ncol(input_data) 221 | nr <- nrow(input_data) 222 | mu <- t(rowMeans(t(input_data))) 223 | Sigma <- cov(input_data) 224 | 225 | eg <- eigen(Sigma) 226 | lam <- eg$values 227 | lam.perc <- lam/sum(lam) 228 | lam.perc.cum <- cumsum(lam.perc) 229 | 230 | print(paste(PCs, "PCs explain", round(lam.perc.cum[PCs]*100, 2),"percentage of variance")) 231 | 232 | V <- eg$vectors 233 | tmp <- sqrt(matrix(rep(lam, nc), nrow = nc, byrow = TRUE)) 234 | V.norm <- V / tmp 235 | V.norm.reduced <- V.norm[, 1:PCs] 236 | input_data_pcs <- (input_data - matrix(mu, nrow = nrow(input_data), ncol = nc)) %*% V.norm.reduced 237 | 238 | return(list(input_data_pcs = input_data_pcs, lam = lam, lam.perc.cum = lam.perc.cum)) 239 | } 
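The k-means in kmeans.R alternates between f_centroid (per-cluster means) and f_class (reassignment by L1, i.e. Manhattan, distance). As a cross-language illustration, the two core steps can be sketched in NumPy; `centroids` and `assign` are hypothetical helper names for this sketch, not functions from the repository.

```python
import numpy as np

def centroids(X, labels, ks):
    # Mean of the rows currently assigned to each cluster (mirrors f_centroid).
    return np.array([X[labels == k].mean(axis=0) for k in ks])

def assign(X, C):
    # L1 (Manhattan) distance of every row of X to every centroid in C (mirrors f_class).
    Z = np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2)  # shape (n, K)
    labels = Z.argmin(axis=1)   # each point goes to its closest centroid
    wss = Z.min(axis=1).sum()   # within-cluster sum of L1 distances
    return labels, wss

# Two well-separated pairs of points cluster cleanly in one pass.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
C = centroids(X, np.array([0, 0, 1, 1]), [0, 1])
labels, wss = assign(X, C)  # labels [0, 0, 1, 1], wss 2.0
```

The R code loops this assignment until no class changes, with f_NA_centroid_exception dropping clusters that lose all their members; the sketch shows only one iteration of the two steps.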
-------------------------------------------------------------------------------- /python/sgt-package/sgt/sgt.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from itertools import chain 4 | from itertools import product as iterproduct 5 | import warnings 6 | 7 | class SGT(): 8 | ''' 9 | Compute embedding of a single or a collection of discrete item 10 | sequences. A discrete item sequence is a sequence made from a set 11 | of discrete elements, also known as the alphabet set. For example, 12 | suppose the alphabet set is the set of roman letters, 13 | {A, B, ..., Z}. This set is made of discrete elements. Examples of 14 | sequences from such a set are AABADDSA, UADSFJPFFFOIHOUGD, etc. 15 | Such sequence datasets are commonly found in the online industry, 16 | for example, item purchase history, where the alphabet set is 17 | the set of all product items. Sequence datasets are abundant in 18 | bioinformatics as protein sequences. 19 | Using the embeddings created here, classification and clustering 20 | models can be built for sequence datasets. 21 | Read more in https://arxiv.org/pdf/1608.03533.pdf 22 | 23 | Parameters 24 | ---------- 25 | Input 26 | 27 | alphabets Optional, except if mode is 'spark'. 28 | The set of alphabets that make up all 29 | the sequences in the dataset. If not passed, the 30 | alphabet set is automatically computed as the 31 | unique set of elements that make up all the sequences. 32 | A list or 1d-array of the set of elements that make up the 33 | sequences. For example, np.array(["A", "B", "C"]). 34 | If mode is 'spark', the alphabets are necessary. 35 | 36 | kappa Tuning parameter, kappa > 0, to change the extraction of 37 | long-term dependency. The higher the value, the lesser 38 | the long-term dependency captured in the embedding. 39 | Typical values for kappa are 1, 5, 10. 40 | 41 | lengthsensitive Default False.
This is set to True if the embedding 42 | should have the information of the length of the sequence. 43 | If set to False, then the embedding of two sequences with 44 | similar patterns but different lengths will be the same. 45 | lengthsensitive = False is similar to length-normalization. 46 | 47 | flatten Default True. If True, the SGT embedding is flattened and returned as 48 | a vector. Otherwise, it is returned as a matrix with the row and column 49 | names the same as the alphabets. The matrix form is used for 50 | interpretation purposes, especially to understand how the alphabets 51 | are "related". Otherwise, for applying machine learning or deep 52 | learning algorithms, the embedding vectors are required. 53 | 54 | mode Choices in {'default', 'multiprocessing'}. 55 | 56 | processors Used if mode is 'multiprocessing'. By default, the 57 | number of processors used in multiprocessing is 58 | the number available - 1. 59 | ''' 60 | 61 | def __init__(self, 62 | alphabets=[], 63 | kappa=1, 64 | lengthsensitive=False, 65 | flatten=True, 66 | mode='default', 67 | processors=None, 68 | lazy=False): 69 | 70 | self.alphabets = alphabets 71 | 72 | if len(self.alphabets) != 0: 73 | self.feature_names = self.__set_feature_names(self.alphabets) 74 | 75 | self.kappa = kappa 76 | self.lengthsensitive = lengthsensitive 77 | self.flatten = flatten 78 | self.mode = mode 79 | self.processors = processors 80 | 81 | if self.processors is None: 82 | import os 83 | self.processors = os.cpu_count() - 1 84 | 85 | self.lazy = lazy 86 | 87 | def getpositions(self, sequence, alphabets): 88 | ''' 89 | Compute the index positions of elements in the sequence 90 | given the alphabet set. 91 | 92 | Returns a list of tuples [(value, positions)] 93 | ''' 94 | positions = [(v, np.where(sequence == v)) 95 | for v in alphabets if v in sequence] 96 | 97 | return positions 98 | 99 | def fit(self, sequence): 100 | ''' 101 | Extract Sequence Graph Transform features using Algorithm-2.
102 | 103 | sequence An array of discrete elements. For example, 104 | np.array(["B","B","A","C","A","C","A","A","B","A"]). 105 | 106 | return: sgt matrix or vector (depending on flatten being False or True) 107 | 108 | ''' 109 | 110 | sequence = np.array(sequence) 111 | 112 | if(len(self.alphabets) == 0): 113 | self.alphabets = self.estimate_alphabets(sequence) 114 | self.feature_names = self.__set_feature_names(self.alphabets) 115 | 116 | size = len(self.alphabets) 117 | l = 0 118 | W0, Wk = np.zeros((size, size)), np.zeros((size, size)) 119 | positions = self.getpositions(sequence, self.alphabets) 120 | 121 | alphabets_in_sequence = np.unique(sequence) 122 | 123 | for u in alphabets_in_sequence: 124 | index = [p[0] for p in positions].index(u) 125 | 126 | U = np.array(positions[index][1]).ravel() 127 | 128 | for v in alphabets_in_sequence: 129 | index = [p[0] for p in positions].index(v) 130 | 131 | V2 = np.array(positions[index][1]).ravel() 132 | 133 | C = [(i, j) for i in U for j in V2 if j > i] 134 | 135 | cu = np.array([ic[0] for ic in C]) 136 | cv = np.array([ic[1] for ic in C]) 137 | 138 | # Insertion positions 139 | pos_i = self.alphabets.index(u) 140 | pos_j = self.alphabets.index(v) 141 | 142 | W0[pos_i, pos_j] = len(C) 143 | 144 | Wk[pos_i, pos_j] = np.sum(np.exp(-self.kappa * np.abs(cu - cv))) 145 | 146 | l += U.shape[0] # running total of the sequence length 147 | 148 | if self.lengthsensitive: 149 | W0 /= l 150 | 151 | W0[np.where(W0 == 0)] = 1e7 # avoid divide-by-zero; Wk is also 0 at these entries, so sgt stays ~0 there 152 | 153 | sgt = np.power(np.divide(Wk, W0), 1/self.kappa) 154 | 155 | if(self.flatten): 156 | sgt = pd.Series(sgt.flatten(), index=self.feature_names) 157 | else: 158 | sgt = pd.DataFrame(sgt, 159 | columns=self.alphabets, 160 | index=self.alphabets) 161 | return sgt 162 | 163 | def __flatten(self, listOfLists): 164 | "Flatten one level of nesting" 165 | flat = [x for sublist in listOfLists for x in sublist] 166 | return flat 167 | 168 | def estimate_alphabets(self, corpus): 169 | if len(corpus) > 1e5:
170 | # sys was never imported in this module, so sys.exit(1) would itself fail; raise instead 171 | raise ValueError("Too many sequences. Pass the alphabet list as an input.") 172 | else: 173 | return np.unique(np.asarray(self.__flatten(corpus))).tolist() 174 | 175 | def set_alphabets(self, corpus): 176 | self.alphabets = self.estimate_alphabets(corpus) 177 | self.feature_names = self.__set_feature_names(self.alphabets) 178 | return self 179 | 180 | def get_alphabets(self): 181 | return self.alphabets 182 | 183 | def get_feature_names(self): 184 | return self.feature_names 185 | 186 | def __fit_to_list(self, sequence): 187 | return list(self.fit(sequence)) 188 | 189 | def __set_feature_names(self, alphabets): 190 | return list(iterproduct(alphabets, alphabets)) 191 | 192 | def fit_transform(self, corpus): 193 | ''' 194 | Inputs: 195 | corpus A pandas DataFrame with columns 'id' and 'sequence', where each sequence is a list of alphabets. 196 | ''' 197 | 198 | if(len(self.alphabets) == 0): 199 | self.alphabets = self.estimate_alphabets(corpus['sequence']) 200 | self.feature_names = self.__set_feature_names(self.alphabets) 201 | 202 | if self.mode=='default': 203 | sgt = corpus.apply(lambda x: [x['id']] + list(self.fit(x['sequence'])), 204 | axis=1, 205 | result_type='expand') 206 | sgt.columns = ['id'] + self.feature_names 207 | return sgt 208 | elif self.mode=='multiprocessing': 209 | # Import 210 | from pandarallel import pandarallel 211 | # Initialization 212 | pandarallel.initialize(nb_workers=self.processors) 213 | sgt = corpus.parallel_apply(lambda x: [x['id']] + 214 | list(self.fit(x['sequence'])), 215 | axis=1, 216 | result_type='expand') 217 | sgt.columns = ['id'] + self.feature_names 218 | return sgt 219 | 220 | def transform(self, corpus): 221 | ''' 222 | Inputs: 223 | corpus A pandas DataFrame with columns 'id' and 'sequence', where each sequence is a list of alphabets. 224 | ''' 225 | 226 | ''' 227 | The difference between fit_transform and transform is: 228 | in transform() the alphabets are already known; 229 | in fit_transform() the alphabets are not known, so they 230 | are computed.
231 | The computation in fit is essentially getting the 232 | alphabet set. 233 | ''' 234 | 235 | if self.mode=='default': 236 | sgt = corpus.apply(lambda x: [x['id']] + list(self.fit(x['sequence'])), 237 | axis=1, 238 | result_type='expand') 239 | sgt.columns = ['id'] + self.feature_names 240 | return sgt 241 | elif self.mode=='multiprocessing': 242 | # Import 243 | from pandarallel import pandarallel 244 | # Initialization 245 | pandarallel.initialize(nb_workers=self.processors) 246 | sgt = corpus.parallel_apply(lambda x: [x['id']] + 247 | list(self.fit(x['sequence'])), 248 | axis=1, 249 | result_type='expand') 250 | sgt.columns = ['id'] + self.feature_names 251 | return sgt -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: sgt-cran2367-bugfix-1 3 | Version: 1.0.0 4 | Summary: Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. With SGT you can tune the amount of short- to long-term patterns extracted in the embeddings without any increase in computation. 5 | Home-page: https://github.com/cran2367/sgt 6 | Author: Chitta Ranjan 7 | Author-email: cran2367@gmail.com 8 | License: UNKNOWN 9 | Description: # Sequence Graph Transform (SGT) 10 | 11 | #### Maintained by: Chitta Ranjan, PhD (cran2367@gmail.com). 12 | 13 | 14 | This is the open-source code repository for SGT. Sequence Graph Transform extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. Importantly, SGT has a low computational cost and can extract any amount of short- to long-term patterns without any increase in computation.
These properties are proved theoretically and demonstrated on real data in this paper: https://arxiv.org/abs/1608.03533. 15 | 16 | If using this code or dataset, please cite the following: 17 | 18 | [1] Ranjan, Chitta, Samaneh Ebrahimi, and Kamran Paynabar. "Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining." arXiv preprint arXiv:1608.03533 (2016). 19 | 20 | @article{ranjan2016sequence, 21 | title={Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining}, 22 | author={Ranjan, Chitta and Ebrahimi, Samaneh and Paynabar, Kamran}, 23 | journal={arXiv preprint arXiv:1608.03533}, 24 | year={2016} 25 | } 26 | 27 | ## Quick validation of your code 28 | Apply the algorithm on a sequence `BBACACAABA`. The parts of SGT, W(0) and W(\kappa), in Algorithm 1 & 2 in [1], and the resulting SGT estimate will be (line-by-line execution of `main.R`): 29 | 30 | ``` 31 | alphabet_set <- c("A", "B", "C") # Denoted by variable V in [1] 32 | seq <- "BBACACAABA" 33 | 34 | kappa <- 5 35 | ###### Algorithm 1 ###### 36 | sgt_parts_alg1 <- f_sgt_parts(sequence = seq, kappa = kappa, alphabet_set_size = length(alphabet_set)) 37 | print(sgt_parts_alg1) 38 | ``` 39 | 40 | *Result* 41 | ``` 42 | $W0 43 | A B C 44 | A 10 4 3 45 | B 11 3 4 46 | C 7 2 1 47 | 48 | $W_kappa 49 | A B C 50 | A 0.006874761 6.783349e-03 1.347620e-02 51 | B 0.013521602 6.737947e-03 4.570791e-05 52 | C 0.013521604 3.059162e-07 4.539993e-05 53 | ``` 54 | 55 | ``` 56 | sgt <- f_SGT(W_kappa = sgt_parts_alg1$W_kappa, W0 = sgt_parts_alg1$W0, 57 | Len = sgt_parts_alg1$Len, kappa = kappa) # Set Len = NULL for length-sensitive SGT. 58 | print(sgt) 59 | ``` 60 | 61 | *Result* 62 | ``` 63 | A B C 64 | A 0.3693614 0.44246287 0.5376371 65 | B 0.4148844 0.46803816 0.1627745 66 | C 0.4541361 0.06869332 0.2144920 67 | 68 | ``` 69 | 70 | Similarly, the execution for Algorithm-2 is shown in `main.R`. 
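For readers who prefer Python, the W0, W_kappa, and SGT numbers above can be cross-checked with a short self-contained NumPy sketch. This is our own illustration of Algorithm 1 (not a file from this repository; the function and variable names are ours):

```python
import numpy as np

def sgt_parts(seq, kappa, alphabet):
    # W0[u, v]: number of pairs (i, j) with i < j, seq[i] = u, seq[j] = v
    # Wk[u, v]: the same pairs, weighted by exp(-kappa * (j - i))
    n = len(alphabet)
    idx = {a: k for k, a in enumerate(alphabet)}
    W0, Wk = np.zeros((n, n)), np.zeros((n, n))
    for i in range(len(seq) - 1):
        for j in range(i + 1, len(seq)):
            u, v = idx[seq[i]], idx[seq[j]]
            W0[u, v] += 1
            Wk[u, v] += np.exp(-kappa * (j - i))
    return W0, Wk

kappa = 5
W0, Wk = sgt_parts("BBACACAABA", kappa, alphabet=["A", "B", "C"])

# Length-normalized SGT: (W_kappa / (W0 / Len))^(1/kappa); entries with W0 = 0 stay 0
Len = len("BBACACAABA")
sgt = np.zeros_like(W0)
nz = W0 > 0
sgt[nz] = (Wk[nz] / (W0[nz] / Len)) ** (1 / kappa)
print(W0)   # matches the $W0 table above
print(sgt)  # matches the SGT table above, e.g. sgt[0, 0] ~ 0.3694
```

Comparing `W0` and `sgt` against the R output above is a quick sanity check that an independent implementation is correct.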
71 | 72 | ## Illustration and use of the code 73 | Open file `main.R` and execute it line by line to understand the process. In this sample execution, we present SGT estimation from either of the two algorithms presented in [1]. The first part is for understanding the SGT computation process. 74 | 75 | In the next part we demonstrate sequence clustering using SGT on a synthesized sample dataset. The sequence lengths in the dataset range between 45 and 711 with a uniform distribution (hence, the average length is ~378). Similar sequences in the dataset share similar patterns, i.e., common substrings. These common substrings can be of any length. Also, the order of the instances of these substrings is arbitrary and random in different sequences. For example, the following two sequences have common patterns. One common substring in both is `QZTA`, which occurs at arbitrary positions in both sequences. The two sequences have other common substrings as well. Besides these commonalities, there is a significant amount of noise present in the sequences. On average, about 40% of the letters in all sequences in the dataset are noise.
76 | 77 | ``` 78 | AKQZTAEEYTDZUXXIRZSTAYFUIXCPDZUXMCSMEMVDVGMTDRDDEJWNDGDPSVPKJHKQBRKMXHHNLUBXBMHISQ 79 | WEHGXGDDCADPVKESYQXGRLRZSTAYFUOQZTAWTBRKMXHHNWYRYBRKMXHHNPRNRBRKMXHHNPBMHIUSVXBMHI 80 | WXQRZSTAYFUCWRZSTAYFUJEJDZUXPUEMVDVGMTOHUDZUXLOQSKESYQXGRCTLBRKMXHHNNJZDZUXTFWZKES 81 | YQXGRUATSNDGDPWEBNIQZMBNIQKESYQXGRSZTTPTZWRMEMVDVGMTAPBNIRPSADZUXJTEDESOKPTLJEMZTD 82 | LUIPSMZTDLUIWYDELISBRKMXHHNMADEDXKESYQXGRWEFRZSTAYFUDNDGDPKYEKPTSXMKNDGDPUTIQJHKSD 83 | ZUXVMZTDLUINFNDGDPMQZTAPPKBMHIUQIUBMHIEKKJHK 84 | ``` 85 | 86 | ``` 87 | SDBRKMXHHNRATBMHIYDZUXMTRMZTDLUIEKDEIBQZTAZOAMZTDLUILHGXGDDCAZEXJHKTDOOHGXGDDCAKZH 88 | NEMVDVGMTIHZXDEROEQDEGZPPTDBCLBMHIJMMKESYQXGRGDPTNBRKMXHHNGCBYNDGDPKMWKBMHIDQDZUXI 89 | HKVBMHINQZTAHBRKMXHHNIRBRKMXHHNDISDZUXWBOYEMVDVGMTNTAQZTA 90 | ``` 91 | 92 | Identifying similar sequences with good accuracy, and also low false positives (calling sequences similar when they are not), is difficult in such situations due to: 93 | 94 | 1. _Different lengths of the sequences_: because of the differing lengths, figuring out that two sequences share the same inherent pattern is not straightforward. Normalizing the pattern features by the sequence length is a non-trivial problem. 95 | 96 | 2. _Commonalities are not in order_: as shown in the above example sequences, the common substrings can be anywhere. This makes methods such as alignment-based approaches infeasible. 97 | 98 | 3. _Significant amount of noise_: a good amount of noise is a nemesis to most sequence similarity algorithms. It often results in high false positives. 99 | 100 | ### SGT Clustering 101 | 102 | The dataset here is a good example of the above challenges. We run clustering on the dataset in `main.R`. The sequences in the dataset are from 5 (=K) clusters. We use this ground truth about the number of clusters as input to our execution below. Although, in reality, the true number of clusters is unknown for a dataset, here we are demonstrating the SGT implementation.
Regardless, using the _random search procedure_ discussed in Sec. SGT-ALGORITHM in [1], we could recover the number of clusters as 5. For simplicity, that step is kept out of this demonstration. 103 | 104 | > Other state-of-the-art sequence clustering methods had significantly poorer performance even with the true number of clusters (K=5). HMM had good performance but significantly higher computation time. 105 | 106 | 107 | ``` 108 | ## The dataset contains all Roman letters, A-Z. 109 | dataset <- read.csv("dataset.csv", header = T, stringsAsFactors = F) 110 | 111 | sgt_parts_sequences_in_dataset <- f_SGT_for_each_sequence_in_dataset(sequence_dataset = dataset, 112 | kappa = 5, alphabet_set = LETTERS, 113 | spp = NULL, sgt_using_alphabet_positions = T) 114 | 115 | 116 | input_data <- f_create_input_kmeans(all_seq_sgt_parts = sgt_parts_sequences_in_dataset, 117 | length_normalize = T, 118 | alphabet_set_size = 26, 119 | kappa = 5, trace = TRUE, 120 | inv.powered = T) 121 | K = 5 122 | clustering_output <- f_kmeans(input_data = input_data, K = K, alphabet_set_size = 26, trace = T) 123 | 124 | cc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output$class), K = K, type = "f1") 125 | print(cc) 126 | ``` 127 | *Result* 128 | ``` 129 | $cc 130 | Confusion Matrix and Statistics 131 | 132 | Reference 133 | Prediction a b c d e 134 | a 50 0 0 0 0 135 | b 0 66 0 0 0 136 | c 0 0 60 0 0 137 | d 0 0 0 55 0 138 | e 0 0 0 0 68 139 | 140 | Overall Statistics 141 | 142 | Accuracy : 1 143 | 95% CI : (0.9877, 1) 144 | No Information Rate : 0.2274 145 | P-Value [Acc > NIR] : < 2.2e-16 146 | 147 | Kappa : 1 148 | Mcnemar's Test P-Value : NA 149 | 150 | Statistics by Class: 151 | 152 | Class: a Class: b Class: c Class: d Class: e 153 | Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000 154 | Specificity 1.0000 1.0000 1.0000 1.0000 1.0000 155 | Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 156 | Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 157 |
Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 158 | Detection Rate 0.1672 0.2207 0.2007 0.1839 0.2274 159 | Detection Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 160 | Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000 161 | 162 | $F1 163 | F1 164 | 1 165 | ``` 166 | 167 | As we can see, the clustering result is accurate with no false positives. The F1 score is 1.0. 168 | 169 | > Note: Do not run the function `f_clustering_accuracy` when `K` is large (> 7), because it performs a permutation operation that becomes expensive. 170 | 171 | ### PCA on SGT & Clustering 172 | 173 | To demonstrate PCA on SGT for dimension reduction followed by clustering, we added another code snippet. PCA becomes more important on datasets where the SGTs are sparse. A sparse SGT arises when the alphabet set is large but the observed sequences contain only a few of those alphabets. For example, the alphabet set for a sequence dataset of music-listening histories will have thousands to millions of songs, but a single sequence will contain only a few of them. 174 | 175 | ``` 176 | ######## Clustering on Principal Components of SGT features ######## 177 | num_pcs <- 5 # Number of principal components we want 178 | input_data_pcs <- f_pcs(input_data = input_data, PCs = num_pcs)$input_data_pcs 179 | 180 | clustering_output_pcs <- f_kmeans(input_data = input_data_pcs, K = K, alphabet_set_size = sqrt(num_pcs), trace = F) 181 | 182 | cc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output_pcs$class), K = K, type = "f1") 183 | print(cc) 184 | ``` 185 | 186 | *Result* 187 | ``` 188 | $cc 189 | Confusion Matrix and Statistics 190 | 191 | Reference 192 | Prediction a b c d e 193 | a 50 0 0 0 0 194 | b 0 66 0 0 0 195 | c 0 0 60 0 0 196 | d 0 0 0 55 0 197 | e 0 0 0 0 68 198 | 199 | Overall Statistics 200 | 201 | Accuracy : 1 202 | 95% CI : (0.9877, 1) 203 | No Information Rate : 0.2274 204 | P-Value [Acc > NIR] : < 2.2e-16 205 | 206 | Kappa : 1 207 | Mcnemar's Test
P-Value : NA 208 | 209 | Statistics by Class: 210 | 211 | Class: a Class: b Class: c Class: d Class: e 212 | Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000 213 | Specificity 1.0000 1.0000 1.0000 1.0000 1.0000 214 | Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 215 | Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 216 | Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 217 | Detection Rate 0.1672 0.2207 0.2007 0.1839 0.2274 218 | Detection Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 219 | Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000 220 | 221 | $F1 222 | F1 223 | 1 224 | ``` 225 | 226 | The clustering result remains accurate when clustering on the principal components of the sequences' SGTs. 227 | 228 | 229 | ----------------------- 230 | #### Comments: 231 | 1. Simplicity: SGT is simple to implement. No numerical optimization or other solution-search algorithm is required to estimate SGT. This makes it deterministic and powerful. 232 | 2. Length sensitive: The length-sensitive version of SGT can be easily tried by changing the marked arguments in `main.R`. 233 | 234 | #### Note: 235 | 1. Small alphabet set: If the alphabet set is small (< 4), SGT's performance may not be good. This is because the feature space becomes too small. 236 | 2. Faster implementation: The provided code is research-level code, not optimized for speed. Significant speed improvements can be made, e.g., by multithreading the SGT estimation across sequences in a dataset. 237 | 238 | #### Additional resource: 239 | Python implementation: Please refer to 240 | 241 | https://github.com/datashinobi/Sequence-Graph-transform 242 | 243 | Thanks to Yassine for providing the Python implementation.
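The macro-F1 computed by `f_get_f1` above is easy to port to other languages. A minimal Python sketch of the same computation (our own illustration, mirroring the R function, not part of this repository):

```python
import numpy as np

def macro_f1(confusion):
    # Per-class F1 from a K x K confusion matrix (Prediction rows, Reference
    # columns, as printed by caret's confusionMatrix), averaged over classes.
    # Note: per-class F1 = 2*tp / (2*tp + fp + fn) is unchanged if fp and fn
    # are swapped, so the row/column orientation does not affect the result.
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # counted as class k, but actually another
    fn = confusion.sum(axis=1) - tp  # actually class k, but counted as another
    return float(np.mean(2 * tp / (2 * tp + fp + fn)))

# The diagonal confusion matrix reported above gives a perfect score
print(macro_f1(np.diag([50, 66, 60, 55, 68])))  # 1.0
```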
244 | Platform: UNKNOWN 245 | Classifier: Programming Language :: Python :: 3 246 | Classifier: License :: OSI Approved :: MIT License 247 | Classifier: Operating System :: OS Independent 248 | Description-Content-Type: text/markdown 249 | -------------------------------------------------------------------------------- /R/sgt.R: -------------------------------------------------------------------------------- 1 | ## This lookup table will be needed throughout for converting an integer to its corresponding alphabet 2 | alphabet_lookup <- data.frame(Integer=1:26, Alphabet = LETTERS) 3 | 4 | #################################################################################### 5 | ############### Algorithm 1: Parsing a sequence to obtain its SGT. ################# 6 | #################################################################################### 7 | 8 | f_sgt_parts <- function(sequence, kappa = 3, alphabet_set_size = 26, lag = 0, skip_same_char = FALSE, long_seq = FALSE, long_seq_ele_limits = NULL, spp = NULL) 9 | { 10 | 11 | # The inputs 12 | # Sequence Any given sequence (with padding) 13 | # kappa The tuning param 14 | # alphabet_set_size The number of alphabets the sequences are made up of 15 | # skip_same_char Especially for element clustering it does not make sense to count repeated characters. For example, in AAABC... the actual transition from A to B should count as just one; if we don't skip the repetition it will count as 3 (unnecessarily inflating the transitions). Hence, skip repetition (value = TRUE) for element clustering. 16 | # long_seq TRUE if sequences have many more alphabets (not just A-Z). 17 | # long_seq_ele_limits The limits of the alphabets in long sequences. For example,
in grayscale images it will be c(0,255) 18 | 19 | 20 | if(length(sequence) == 1) 21 | { 22 | s_split <- f_seq_split(sequence, spp = spp) 23 | } else{ 24 | s_split <- sequence # already split 25 | } 26 | 27 | 28 | if(long_seq == FALSE) 29 | { 30 | rnames <- cnames <- c(levels(alphabet_lookup[,'Alphabet']))[1:alphabet_set_size] 31 | }else{ 32 | alphabet_set_size <- long_seq_ele_limits[2] - long_seq_ele_limits[1] + 1 33 | rnames <- cnames <- seq(long_seq_ele_limits[1], long_seq_ele_limits[2], 1) 34 | } 35 | 36 | Len <- length(s_split) 37 | 38 | Ls <- list() 39 | 40 | # Just the W0 corresponding to m = 0, and the M-th moment 41 | iter_set <- c(0, kappa) 42 | 43 | for(m in iter_set) # The m=0 corresponds to W0 44 | { 45 | mat_Lm <- matrix(rep(0,alphabet_set_size*alphabet_set_size), nrow=alphabet_set_size) 46 | rownames(mat_Lm) <- rnames 47 | colnames(mat_Lm) <- cnames 48 | 49 | for(i in 1:(Len-1)) 50 | { 51 | if(skip_same_char == TRUE && s_split[i] == s_split[i+1]) 52 | { 53 | # Skip this iteration when the next event in the sequence is the same as the current one 54 | next 55 | } 56 | 57 | 58 | for(j in (i+1):length(s_split)) 59 | { 60 | if(abs(j-i)>lag) 61 | { 62 | mat_Lm[s_split[i], s_split[j]] <- mat_Lm[s_split[i], s_split[j]] + exp(-1*((abs(j-i) - lag)*m)) 63 | 64 | } 65 | } 66 | } 67 | 68 | Ls[[length(Ls) + 1]] <- mat_Lm 69 | } 70 | 71 | output <- list(Len = Len, W0 = Ls[[1]], W_kappa = Ls[[2]]) 72 | return(output) 73 | } 74 | 75 | 76 | ############################################################################################################## 77 | ############### Algorithm 2: Extract SGT features by scanning alphabet positions of a sequence ############### 78 | ############################################################################################################## 79 | 80 | f_get_alphabet_positions <- function(sequence_split, alphabet_set) 81 | { 82 | # This function corresponds to the one defined in Line 1 in Algorithm 2 in [1] 83 | # Inputs 84 | # sequence_split A
sequence is passed as a vector of alphabets. It is called sequence split because a string sequence is split into its alphabets (in the same order) 85 | # alphabet_set The set of alphabets the sequence is made of. 86 | 87 | positions <- list() 88 | for(e in alphabet_set) 89 | { 90 | positions[[e]] <- which(sequence_split == e) 91 | } 92 | return(positions) 93 | } 94 | 95 | f_sgt_parts_using_alphabet_positions <- function(seq_alphabet_positions, alphabet_set, kappa = 12, lag = 0, skip_same_char = F) 96 | { 97 | ### See the comments for the input parameters in the f_seq_transform function. 98 | # seq_alphabet_positions A list of index positions of all alphabets in the sequence 99 | # alphabet_set The set of alphabets possible in the sequence. The long_seq and long_seq_ele_limits parameters are not needed here because the alphabets are given. 100 | 101 | Len <- sum(unlist(lapply(seq_alphabet_positions, function(x) length(x)))) # The sequence length 102 | 103 | Ls <- list() 104 | 105 | iter_set <- c(0, kappa) 106 | 107 | alphabet_set_size <- length(alphabet_set) 108 | for(m in iter_set) 109 | { 110 | mat_Lm <- matrix(rep(0,alphabet_set_size*alphabet_set_size), nrow=alphabet_set_size) 111 | rownames(mat_Lm) <- colnames(mat_Lm) <- alphabet_set 112 | 113 | for(i in alphabet_set) 114 | { 115 | for(j in alphabet_set) 116 | { 117 | enumerated_combos <- arrange(expand.grid(i = seq_alphabet_positions[[i]], 118 | j = seq_alphabet_positions[[j]]), 119 | i) # arrange() comes from the dplyr package 120 | x <- c(enumerated_combos[,"j"]-enumerated_combos[,"i"]) 121 | x.positives <- x[x>0] # Only the positive x's correspond to forward transitions in the sequence; the rest mean element j occurred before element i.
122 | mat_Lm[i,j] <- sum(exp(-1*m*x.positives)) # Line 15 in Algorithm 2 in [1] 123 | } 124 | } 125 | 126 | Ls[[length(Ls) + 1]] <- mat_Lm 127 | } 128 | 129 | output <- list(Len = Len, W0 = Ls[[1]], W_kappa = Ls[[2]]) 130 | return(output) 131 | } 132 | 133 | 134 | #################################################################################### 135 | ##### Yield SGT output from the SGT parts computed from either algorithm 1 or 2 #### 136 | #################################################################################### 137 | 138 | f_SGT <- function(W_kappa, W0, kappa, Len = NULL, inv.powered = T) 139 | { 140 | ## This function computes the resulting SGT from the SGT parts found in the function f_sgt_parts(). 141 | # Inputs 142 | # W_kappa See Algorithm 1 in [1] 143 | # W0 See Algorithm 1 in [1] 144 | # Len Length of the sequence 145 | # inv.powered TRUE if we want to take the kappa-th root of the SGT, as shown in Algorithm 1 [1]. 146 | 147 | if(!is.null(Len)) # Normalizing for the length 148 | { 149 | W0 <- W0/Len 150 | W0[W0 == 0] <- NA 151 | } 152 | 153 | tmp <- W_kappa/W0 154 | 155 | tmp[is.na(tmp)] <- 0 156 | 157 | SGT_mat <- tmp 158 | 159 | if(inv.powered){ 160 | SGT_mat <- Math.invpow(SGT_mat, pow = kappa) 161 | } 162 | 163 | return(SGT_mat) 164 | } 165 | 166 | 167 | f_SGT_for_each_sequence_in_dataset <- function(sequence_dataset, kappa = 3, alphabet_set = LETTERS, lag = 0, skip_same_char = FALSE, long_seq = FALSE, long_seq_ele_limits = NULL, spp = NULL, sgt_using_alphabet_positions = F, trace = T) 168 | { 169 | # The inputs 170 | # Sequence_dataset Either a vector with each element as a string (a sequence), or a dataframe with the sequences under column name 'seq'. 171 | # kappa The tuning param 172 | # alphabet_set_size The number of alphabets the sequences are made up of 173 | # skip_same_char Especially for element clustering it does not make sense to count repeated characters. For example, AAABC...
the actual transition from A to B should count as just one; if we don't skip the repetition it will count as 3 (unnecessarily inflating the transitions). Hence, skip repetition (value = TRUE) for element clustering. 174 | # long_seq TRUE if sequences have many more alphabets (not just A-Z). 175 | # long_seq_ele_limits The limits of the alphabets in long sequences. For example, in grayscale images it will be c(0,255) 176 | # sgt_using_alphabet_positions 177 | # If TRUE, then the alternate algorithm (Algorithm 2 in the paper) will be used. 178 | 179 | n.seq <- nrow(sequence_dataset) 180 | 181 | alphabet_set_size <- length(alphabet_set) 182 | 183 | Len_all <- array(rep(0,n.seq), dim = c(n.seq)) 184 | W0_all <- array(rep(0,n.seq*alphabet_set_size*alphabet_set_size), dim=c(alphabet_set_size,alphabet_set_size,n.seq)) 185 | 186 | W_kappa_all <- list() 187 | 188 | if(ncol(sequence_dataset) > 1) 189 | { 190 | sequences <- sequence_dataset[, 'seq'] 191 | }else{ 192 | sequences <- sequence_dataset 193 | } 194 | 195 | for(i in 1:n.seq) 196 | { 197 | if(trace){print(paste("Sequence",i,"of",n.seq))} 198 | 199 | if(!sgt_using_alphabet_positions) 200 | { 201 | sgt_parts <- f_sgt_parts(sequence = sequences[i], kappa = kappa, 202 | alphabet_set_size = alphabet_set_size, lag = lag, 203 | skip_same_char = skip_same_char, 204 | long_seq = long_seq, long_seq_ele_limits = long_seq_ele_limits) 205 | }else{ 206 | s_split <- f_seq_split(sequence = sequences[i], spp = spp) 207 | seq_alphabet_positions <- f_get_alphabet_positions(sequence_split = s_split, alphabet_set = alphabet_set) 208 | sgt_parts <- f_sgt_parts_using_alphabet_positions(seq_alphabet_positions = seq_alphabet_positions, 209 | alphabet_set = alphabet_set, 210 | kappa = kappa, 211 | lag = lag, skip_same_char = skip_same_char) 212 | } 213 | 214 | tmp <- sgt_parts$W0 215 | 216 | Len_all[i] <- sgt_parts$Len 217 | W0_all[, , i] <- tmp 218 | 219 | W_kappa_all[[length(W_kappa_all) + 1]] <- sgt_parts$W_kappa 220 | } 221 |
dimnames(W0_all) <- list(rownames(tmp), colnames(tmp), c(sequence_dataset[,1])) 222 | 223 | output <- list(Len_all = Len_all, W0_all = W0_all, W_kappa_all = W_kappa_all) 224 | return(output) 225 | } 226 | 227 | 228 | 229 | ################################################################################ 230 | ############################ Auxiliary functions ############################ 231 | ################################################################################ 232 | 233 | ## Get alphabet for an integer 234 | f_get_alphabet <- function(integer) 235 | { 236 | return(levels(factor(alphabet_lookup[alphabet_lookup[,'Integer']==integer, 'Alphabet']))) 237 | } 238 | 239 | 240 | f_seq_split <- function(sequence, spp = NULL) 241 | { 242 | ## Split a sequence into a vector of alphabets. The order of alphabets is retained. Usually we get a sequence as a long string. This function just splits it to be further processed for SGT. 243 | ## Inputs 244 | # sequence A sequence, e.g. "FSDFSFIFFSAOPDSA" 245 | # spp The separator of alphabets in the sequence. In the above example it is NULL. 246 | 247 | ## Output 248 | # s_split The input sequence returned as a vector of alphabets in the same order. 
249 | if(!is.null(spp)) 250 | { 251 | tmp <- strsplit(x = sequence, split = spp) 252 | s_split <- tmp 253 | }else{ 254 | countCharOccurrences <- function(char, s) { 255 | s2 <- gsub(char,"",s) 256 | return (nchar(s) - nchar(s2)) 257 | } 258 | 259 | if(countCharOccurrences("-",sequence) > 1) 260 | { 261 | tmp <- strsplit(sequence, "-") 262 | s_split <- tmp 263 | }else if(countCharOccurrences("-",sequence) == 1) 264 | { 265 | tmp <- strsplit(sequence, "-") 266 | tmp <- tmp[[1]][1] 267 | s_split <- strsplit(tmp,"") 268 | }else if(length(grep(" ",sequence))) 269 | { 270 | s_split <- strsplit(sequence," ") 271 | }else if(length(grep("~",sequence))) 272 | { 273 | s_split <- strsplit(sequence,"~") 274 | }else 275 | { 276 | s_split <- strsplit(sequence,"") 277 | } 278 | } 279 | 280 | s_split <- s_split[[1]] 281 | s_split <- s_split[s_split != ""] 282 | return(s_split) 283 | } 284 | 285 | 286 | Math.invpow <- function(x, pow) { 287 | sign(x) * abs(x)^(1/pow) 288 | } 289 | 290 | Math.pow <- function(x, pow) { 291 | out <- 1 292 | if(pow > 0){ 293 | for(p in 1:pow) 294 | { 295 | out <- out * x 296 | } 297 | }else if(pow == 0){ 298 | out <- 1 299 | } 300 | return(out) 301 | } 302 | 303 | Math.matrixpow <- function(x, pow) { 304 | out <- x 305 | if(pow > 0){ 306 | for(p in 1:pow) 307 | { 308 | out <- out %*% x 309 | } 310 | }else if(pow == 0){ 311 | out <- 1 312 | } 313 | return(out) 314 | } 315 | 316 | Math.matrix_norm <- function(mat, norm) 317 | { 318 | if(norm == 1) 319 | { 320 | out <- abs(mat) 321 | }else{ 322 | out <- mat^norm 323 | } 324 | return(out) 325 | } 326 | 327 | Math.standardize <- function(x, y) { 328 | y[y == 0] <- NA 329 | out <- x/y 330 | out[is.na(out)] <- 0 331 | return(out) 332 | } 333 | 334 | 335 | f_get_f1 <- function(confusion) 336 | { 337 | ## In this function we find F1 score from a confusion matrix. 
This will also be used to select a clustering model in the function f_clustering_accuracy() 338 | K <- ncol(confusion) 339 | f1 <- NULL 340 | for(k in 1:K) 341 | { 342 | tp <- confusion[k,k] # True pos 343 | fp <- sum(confusion[, k])-confusion[k, k] # False pos 344 | fn <- sum(confusion[k, ])-confusion[k, k] # False neg 345 | tmp <- 2*tp / (2 * tp + fn + fp) 346 | f1 <- c(f1, tmp) 347 | } 348 | return (mean(f1)) 349 | } 350 | 351 | f_clustering_accuracy <- function(actual, pred, K = 2, type = "f1", trace = F, do.permutation = T) 352 | { 353 | ### In this function we will find the accuracy of clustering from any clustering method. 354 | ## Inputs 355 | # actual A vector of actual clusters 356 | # pred A vector of estimated clusters 357 | # K Number of clusters or classes 358 | # type Best confusion selection method, type = c("accuracy", "f1") 359 | library(gtools); library(caret) # confusionMatrix() comes from the caret package 360 | x <- letters[actual] 361 | y <- letters[pred] 362 | out_cc <- confusionMatrix(x,y) 363 | out_f1 <- NA 364 | if(type == "f1") 365 | { 366 | # out_f1 <- 2*(out_cc$byClass["Pos Pred Value"]*out_cc$byClass["Sensitivity"]/(out_cc$byClass["Pos Pred Value"]+out_cc$byClass["Sensitivity"])) 367 | out_f1 <- f_get_f1(confusion = out_cc$table) 368 | } 369 | 370 | if(do.permutation) 371 | { 372 | possibilities <- permutations(K,K,letters[1:K]) # Depending on the version, one of these two lines (this and the one below) works 373 | # possibilities <- matrix(letters[permutations(K)], ncol = K) 374 | 375 | ## We are trying all possibilities because the digit of the assigned class does not matter; any naming is fine. Thus, at the end of the day, the one with the best accuracy is the right one.
376 | for(poss in 2:nrow(possibilities)){ # Number of other (hence, starting from 2) possibilities 377 | if(trace){print(paste("Trying possibility",poss,sep="-"))} 378 | for(k in 1:K){ 379 | y[pred==k] <- possibilities[poss,k] 380 | } 381 | 382 | tmp <- confusionMatrix(x,y) 383 | if(type == "f1") 384 | { 385 | tmp_f1 <- f_get_f1(confusion = tmp$table) 386 | flag <- (tmp_f1 > out_f1) 387 | }else if(type == "accuracy"){ 388 | flag <- (tmp$overall["Accuracy"] > out_cc$overall["Accuracy"]) 389 | } 390 | 391 | if(flag){ # Choosing based on F1 score instead of accuracy 392 | out_cc <- tmp 393 | if(type=="f1"){ 394 | out_f1 <- tmp_f1 395 | names(out_f1) <- "F1" 396 | } 397 | 398 | if(trace){print(paste("Selecting possibility",poss,sep="-"))} 399 | } 400 | } 401 | } 402 | 403 | return(list(cc = out_cc, F1=out_f1)) 404 | } 405 | 406 | f_reorder_class_assignment <- function(class) 407 | { 408 | ## In this function we reorder the assigned classes in clustering, such that they are ordered with consecutive class labels for easier clustering accuracy check 409 | conse_class <- 1 410 | class_map <- matrix(c(class[1], conse_class), nrow = 1) 411 | final_class <- matrix(c(class[1], conse_class), nrow = 1) 412 | 413 | for(i in 2:length(class)) 414 | { 415 | if(class[i] == class[i-1]) 416 | { 417 | final_class <- rbind(final_class, cbind(class[i], class_map[class_map[,1]==class[i], 2])) 418 | }else{ 419 | if(class[i] %in% class_map[,1]) 420 | { 421 | final_class <- rbind(final_class, cbind(class[i], class_map[class_map[,1]==class[i], 2])) 422 | }else{ 423 | conse_class <- conse_class + 1 424 | final_class <- rbind(final_class, cbind(class[i], conse_class)) 425 | class_map <- rbind(class_map, cbind(class[i], conse_class)) 426 | } 427 | } 428 | } 429 | 430 | out <- list(class_mapped = final_class, consecutive_class = final_class[,2]) 431 | return(out) 432 | } 433 | 434 | 435 | f_seq_len_mu_var <- function(sequences) 436 | { 437 | # A function that will find the mean and var of the 
sequence lengths 438 | seq.lens <- NULL 439 | for(i in 1:nrow(sequences)) 440 | { 441 | tmp <- f_seq_split(sequences[i,'seq']) 442 | seq.lens <- c(seq.lens, length(tmp)) 443 | } 444 | seq.lens.mu <- mean(seq.lens) 445 | seq.lens.var <- var(seq.lens) 446 | 447 | out <- list(seq.lens.mu = seq.lens.mu, seq.lens.var = seq.lens.var, seq.lens = seq.lens) 448 | return(out) 449 | } 450 | -------------------------------------------------------------------------------- /data/darpa_data.csv: -------------------------------------------------------------------------------- 1 | "timeduration","seqlen","seq","class" 2 | 552,575,"1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~4~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~4~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~4~1~2~3~3~3~3~3~3~1~4~5",0 3 | 
22,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 4 | 524,43,"19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19",0 5 | 14,311,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 6 | 0,45,"5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~2~5~3~3~5~5~5~5~18",0 7 | 71,136,"15~3~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~2~16~3~14~16~14~2~14~30~31~5~32~30~33~2~3~3~5~5~30~33~2~3~3~2~3~5~16~3~3~3~16~16~16~16~14~30~31~5~32~30~33~16~17~34~5~33~35~36~30~33~14~37~30~5~33~5~5~5~5~18",0 8 | 
23,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 9 | 6,156,"5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~2~5~2~2~3~5~2~2~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~5~5~5~5~5~5~18",0 10 | 2,74,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~5~5~5~18",0 11 | 23,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 12 | 
19,311,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 13 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 14 | 49,632,"38~2~11~11~12~11~5~2~11~5~12~2~3~5~22~39~20~40~20~23~5~5~14~2~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~6~5~6~5~6~5~6~5~5~5~5~2~3~40~15~3~3~15~3~6~5~2~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~16~5~5~5~2~28~15~3~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~3~5~5~5~5~5~5~2~5~5~5~2~5~2~5~2~5~2~5~2~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~2~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2
~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~2~15~3~5~5~5~5~5~5~5~5~5~18",0 15 | 3,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 16 | 2,89,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~2~5~2~2~3~2~2~2~2~2~2~2~3~5~2~3~3~3~3~3~3~3~41~1~41~1~5~5~5~5~5~5~18",0 17 | 3,26,"5~3~6~5~3~3~4~5~5~2~5~17~6~5~6~5~5~39~5~5~5~5~5~5~5~18",0 18 | 3,67,"14~16~2~2~6~6~16~16~17~6~17~2~3~5~2~3~5~16~3~6~6~2~3~5~22~20~16~16~16~20~20~38~17~6~3~5~6~41~2~5~17~2~3~3~3~3~5~3~2~5~5~33~5~4~39~5~5~5~5~5~5~5~5~5~5~5~18",0 19 | 3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 20 | 5,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 21 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 22 | 1,56,"15~3~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~3~2~2~14~2~2~2~14~3~2~5~5~5~5~5~5~5~18",0 23 | 
3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 24 | 12,286,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~16~17~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~34~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 25 | 
69,1773,"15~3~5~5~5~5~5~5~5~5~5~10~2~11~2~2~11~11~12~11~5~2~2~2~11~11~12~11~5~2~2~2~11~11~12~11~11~5~2~2~2~11~11~12~11~5~2~2~2~11~11~12~11~5~2~2~2~11~11~12~11~11~5~2~11~11~12~11~5~2~2~11~5~2~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~16~16~16~16~16~16~14~16~16~42~16~16~14~14~14~14~3~2~5~3~3~3~3~3~3~3~3~3~3~3~3~3~3~2~3~5~2~16~16~2~3~5~30~30~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~2~16~16~3~5~3~3~16~42~42~42~30~42~42~42~30~16~17~35~5~2~33~16~16~16~16~16~16~16~14~30~16~16~3~3~3~3~16~16~16~16~16~16~42~3~3~3~3~3~3~16~17~35~5~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3
~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~16~16~14~16~16~17~2~5~43~31~5~16~17~35~5~2~5~33~33~3~30~3~3~3~3~3~31~5~3~3~3~3~3~3~3~3~3~3~3~15~3~15~3~6~3~3~6~3~3~31~5~33~5~5~5~5~18",0 26 | 
18,199,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~2~3~5~2~2~13~3~3~14~2~3~3~3~3~3~3~1~3~1~3~3~3~4~5~3~3~15~41~2~2~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 27 | 0,49,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~2~11~44~12~5~5~5~5~18",0 28 | 1,88,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~2~16~3~14~14~30~5~5~5~5~18",0 29 | 12,268,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~16~16~2~5~2~5~2~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 30 | 2,69,"14~16~2~2~6~6~16~16~17~6~17~2~3~5~2~3~5~2~3~5~3~6~6~2~3~5~22~20~16~16~16~20~20~38~17~6~3~5~6~41~2~5~17~2~3~3~3~3~5~3~2~5~5~33~5~4~39~5~5~5~5~5~5~5~5~5~5~5~18",0 31 | 
3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 32 | 3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 33 | 1,56,"15~3~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~5~3~3~2~2~5~5~5~5~5~18",0 34 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 35 | 1,91,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~2~11~11~12~11~5~12~2~3~5~12~12~2~5~2~16~16~16~5~3~3~2~5~16~16~16~5~5~5~18",0 36 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 37 | 
2,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 38 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 39 | 2,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 40 | 5,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 41 | 1,96,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~2~3~5~2~11~11~12~11~5~12~2~2~2~16~16~16~16~5~5~12~12~5~3~3~2~5~16~16~16~16~16~16~5~5~5~5~18",0 42 | 3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 43 | 
1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 44 | 1,83,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~2~11~11~12~11~5~12~2~3~5~12~12~2~5~2~5~3~3~5~5~5~18",0 45 | 2,48,"5~5~5~14~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~15~15~39~38~3~3~4~3~15~3~5~5~5~5~18",0 46 | 2,229,"10~2~11~2~11~11~12~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~2~2~3~5~2~3~3~3~3~3~3~40~2~5~3~2~3~5~2~40~2~2~3~42~16~2~3~5~5~23~14~16~2~2~6~6~16~16~2~3~5~17~6~17~3~3~2~16~5~16~16~2~2~14~3~5~5~2~5~2~3~5~2~3~5~17~6~3~3~3~5~17~2~6~41~2~2~3~3~3~3~5~3~2~5~5~33~5~4~39~5~5~5~5~5~5~5~5~5~18",0 47 | 4,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 48 | 1,172,"2~11~11~12~11~5~12~2~3~5~22~38~39~6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~2~3~3~3~2~16~17~2~3~5~2~5~3~39~27~28~5~39~3~5~23~2~3~5~2~3~5~2~3~5~2~3~5~2~2~3~5~2~3~2~3~5~2~3~5~5~5~5~5~5~5~18",0 49 | 
3,189,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~35~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 50 | 1,59,"5~5~5~14~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~15~15~39~3~5~16~16~2~5~2~5~2~2~3~30~30~5~5~6~5~5~5~5~5~5~18",0 51 | 18,336,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 52 | 0,108,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~42~30~30~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~2~16~2~11~11~12~11~5~2~11~11~12~11~2~11~5~2~11~11~12~11~5~2~12~12~12~12~12~12~5~12~5~5~5~18",0 53 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 54 | 
0,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 55 | 1,74,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~2~16~5~5~5~18",0 56 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 57 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 58 | 40,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 59 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 60 | 3,12,"16~2~3~5~22~22~16~14~14~4~16~33",0 61 | 
42,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 62 | 11,289,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~16~16~2~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 63 | 0,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 64 | 
15,335,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~27~2~5~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 65 | 2,188,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~6~5~6~5~6~5~2~3~5~2~16~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~8~2~3~2~3~5~2~3~5~2~3~5~2~3~5~3~45~9~5~5~5~5~5~5~18",1 66 | 44,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 67 | 
21,353,"15~2~5~5~5~5~5~6~5~6~5~6~5~5~2~2~6~6~17~17~5~5~6~6~5~5~2~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~46~8~2~3~5~2~3~2~16~5~16~16~16~16~16~2~3~3~3~3~2~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~2~3~5~2~3~5~5~2~3~5~2~3~5~2~3~5~47~9~16~2~11~11~12~11~5~12~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~5~5~5~5~18",0 68 | 14,199,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~2~3~5~2~2~13~3~3~14~2~3~3~3~3~3~3~1~3~1~3~3~3~4~5~3~3~15~41~2~2~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 69 | 48,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 70 | 
0,89,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~2~5~2~2~3~2~2~2~2~2~2~2~3~5~2~3~3~3~3~3~3~3~41~1~41~1~5~5~5~5~5~5~18",0 71 | 2,202,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~3~5~2~3~3~3~3~3~3~3~3~6~29~1~1~6~5~6~5~6~5~2~16~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~8~2~3~2~3~5~2~3~5~2~3~5~2~3~5~3~45~9~5~5~5~5~5~5~5~18",1 72 | 1,49,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~2~11~44~12~5~5~5~5~18",0 73 | 13,289,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~16~16~2~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 74 | 53,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 
75 | 0,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 76 | 20,336,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 77 | 12,335,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~27~2~5~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 78 | 2,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 79 | 
1,74,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~2~16~5~5~5~18",0 80 | 2,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 81 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 82 | 60,40,"5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~5~5~5~5~18",0 83 | 1,89,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~2~3~5~2~11~11~12~11~5~12~2~2~2~16~16~16~5~5~12~12~5~3~3~2~5~5~5~5~5~18",0 84 | 11,268,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~16~16~2~5~2~5~2~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 85 | 
53,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 86 | 45,577,"15~2~5~5~5~5~5~6~5~6~5~6~5~5~2~2~6~6~17~17~5~5~6~6~5~5~2~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~46~8~2~3~5~2~3~2~16~5~16~16~16~16~16~2~3~3~3~3~2~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~2~3~5~2~3~5~5~2~3~5~2~3~5~2~3~5~47~9~16~2~11~11~12~11~5~12~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~23~2~5~16~2~11~11~12~11~5~12~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~5~2~3~5~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~5~5~17~17~5~5~35~34~2~3~5~2~3~5~2~3~5~26~2~3~5~2~3~5~2~3~5~47~9~4~2~3~5~2~3~5~2~3~5~48~9~5~5~5~5~18",0 87 | 
57,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 88 | 60,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 89 | 0,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 90 | 
46,632,"38~2~11~11~12~11~5~2~11~5~12~2~3~5~22~39~20~40~20~23~5~5~14~2~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~6~5~6~5~6~5~6~5~5~5~5~2~3~40~15~3~3~15~3~6~5~2~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~16~5~5~5~2~28~15~3~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~3~5~5~5~5~5~5~2~5~5~5~2~5~2~5~2~5~2~5~2~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~2~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~2~15~3~5~5~5~5~5~5~5~5~5~18",0 91 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 92 | 
71,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 93 | 0,61,"15~3~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~30~30~30~30~30~5~3~3~2~2~5~5~5~5~5~18",0 94 | 84,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 95 | 11,286,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~16~17~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~34~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 96 | 1,40,"27~4~6~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~3~3~5~5~5~18",0 97 | 
1,70,"5~6~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~3~16~2~2~14~2~2~14~3~2~5~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~5~5~5~5~5~5~5~18",0 98 | 86,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 99 | 100,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 100 | 5,163,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~17~2~5~2~6~41~5~5~5~5~5~5~5~5~18",1 101 | 0,22,"3~1~2~3~3~3~3~3~3~3~3~3~6~6~6~1~6~1~3~3~41~5",0 102 | 3,119,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~5~5~5~18",1 103 | 
4,136,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~8~5~17~2~5~2~5~5~5~5~18",1 104 | 5,149,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~8~49~2~2~5~2~5~3~3~2~3~5~2~6~6~5~6~5~5~5~5~5~5~18",1 105 | 4,124,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~12~5~2~12~5~2~12~5~2~12~5~2~5~12~3~3~3~3~2~3~5~2~5~5~5~5~18",1 106 | 4,129,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~12~5~2~12~5~2~12~5~2~12~5~2~5~12~3~3~3~3~3~17~2~5~2~6~41~5~5~5~5~5~5~18",1 107 | 100,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 108 | 0,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 109 | 
2,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 110 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 111 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 112 | 1,89,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~2~3~5~2~11~11~12~11~5~12~2~2~2~16~16~16~5~5~12~12~5~3~3~2~5~5~5~5~5~18",0 113 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sequence Graph Transform (SGT) — Sequence Embedding for Clustering, Classification, and Search 2 | 3 | #### Maintained by: Chitta Ranjan 4 | Email: 5 | | LinkedIn: [https://www.linkedin.com/in/chitta-ranjan-b0851911/](https://www.linkedin.com/in/chitta-ranjan-b0851911/) 6 | 7 | 8 | The following will cover, 9 | 10 | 1. [SGT Class Definition](#sgt-class-def) 11 | 2. [Installation](#install-sgt) 12 | 3. [Test Examples](#installation-test-examples) 13 | 4. [Sequence Clustering Example](#sequence-clustering) 14 | 5. [Sequence Classification Example](#sequence-classification) 15 | 6. [Sequence Search Example](#sequence-search) 16 | 7. 
[SGT - Spark for Distributed Computing](#sgt-spark)
8. [Datasets](#datasets)


## SGT Class Definition

Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long- and short-term patterns embedded by SGT can be tuned without any increase in computation.


```
class SGT():
    '''
    Compute the embedding of a single discrete item sequence, or of a
    collection of such sequences. A discrete item sequence is a sequence
    made from a set of discrete elements, also known as the alphabet set.
    For example, suppose the alphabet set is the set of roman letters,
    {A, B, ..., Z}. This set is made of discrete elements. Examples of
    sequences from such a set are AABADDSA, UADSFJPFFFOIHOUGD, etc.
    Such sequence datasets are commonly found in the online industry,
    for example, item purchase history, where the alphabet set is the
    set of all product items. Sequence datasets are also abundant in
    bioinformatics as protein sequences.
    Using the embeddings created here, classification and clustering
    models can be built for sequence datasets.
    Read more in https://arxiv.org/pdf/1608.03533.pdf

    Parameters
    ----------
    alphabets        Optional, except if mode is 'spark'.
                     The set of alphabets that make up all the sequences
                     in the dataset. If not passed, the alphabet set is
                     automatically computed as the unique set of elements
                     that make up all the sequences: a list or 1d-array
                     of the elements, for example,
                     np.array(["A", "B", "C"]).
                     If mode is 'spark', the alphabets are required.

    kappa            Tuning parameter, kappa > 0, that changes the
                     extraction of long-term dependency. The higher the
                     value, the less long-term dependency is captured in
                     the embedding. Typical values for kappa are 1, 5, 10.
    lengthsensitive  Default False. Set to True if the embedding should
                     carry information about the length of the sequence.
                     If set to False, the embeddings of two sequences
                     with a similar pattern but different lengths will
                     be the same. lengthsensitive=False is similar to
                     length normalization.

    flatten          Default True. If True, the SGT embedding is
                     flattened and returned as a vector. Otherwise, it
                     is returned as a matrix whose row and column names
                     are the alphabets. The matrix form is useful for
                     interpretation, especially to understand how the
                     alphabets are "related". For applying machine
                     learning or deep learning algorithms, the embedding
                     vectors are required.

    mode             Choices in {'default', 'multiprocessing'}.
                     Note: 'multiprocessing' mode requires the
                     pandas>=1.0.3 and pandarallel libraries.

    processors       Used if mode is 'multiprocessing'. By default, the
                     number of processors used is the number available
                     minus one.
    '''


Methods
-------
def fit(sequence)

    Extract Sequence Graph Transform features using Algorithm-2 in
    https://arxiv.org/abs/1608.03533.

    Input:
    sequence         An array of discrete elements. For example,
                     np.array(["B","B","A","C","A","C","A","A","B","A"]).

    Output:
    sgt embedding    SGT matrix or vector (depending on flatten=False
                     or True) of the sequence.


--
def fit_transform(corpus)

    Extract SGT embeddings for all sequences in a corpus. It finds the
    alphabets encompassing all the sequences in the corpus, if not
    provided. However, if the mode is 'spark', the alphabet list has to
    be given explicitly in the SGT object declaration.

    Input:
    corpus           A list of sequences. Each sequence is a list of
                     alphabets.

    Output:
    sgt embedding of all sequences in the corpus.
--
def transform(corpus)

    Find the SGT embeddings of a new data sample belonging to the same
    population as the corpus that was fitted initially.
```

## Install SGT

Install SGT in Python by running:

```$ pip install sgt```


```python
import sgt
sgt.__version__
from sgt import SGT
```

    '2.0.0'


```python
# -*- coding: utf-8 -*-
# Authors: Chitta Ranjan
#
# License: BSD 3 clause
```


## Installation Test Examples

The following are a few test examples to verify the installation.


```python
# Learning an SGT embedding as a matrix with
# rows and columns as the sequence alphabets.
# This embedding shows the relationship between
# the alphabets. The higher the value, the
# stronger the relationship.
import numpy as np

sgt = SGT(flatten=False)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)
```
|   | A | B | C |
|---|---|---|---|
| A | 0.090616 | 0.131002 | 0.261849 |
| B | 0.086569 | 0.123042 | 0.052544 |
| C | 0.137142 | 0.028263 | 0.135335 |
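The flattened form of this embedding carries the same nine values in row-major, (row, column)-pair order. A quick sanity check using the matrix values above:

```python
import numpy as np

# Matrix form of the embedding (values from the table above)
M = np.array([[0.090616, 0.131002, 0.261849],
              [0.086569, 0.123042, 0.052544],
              [0.137142, 0.028263, 0.135335]])

# flatten=True returns the same values in row-major (C) order,
# indexed by (row_alphabet, col_alphabet) pairs
v = M.flatten()
print(v[:3])  # the (A, A), (A, B), (A, C) entries
```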
```python
# SGT embedding to a vector. The vector
# embedding is useful for directly applying
# a machine learning algorithm.

sgt = SGT(flatten=True)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)
```

    (A, A)    0.090616
    (A, B)    0.131002
    (A, C)    0.261849
    (B, A)    0.086569
    (B, B)    0.123042
    (B, C)    0.052544
    (C, A)    0.137142
    (C, B)    0.028263
    (C, C)    0.135335
    dtype: float64


```python
'''
SGT embedding on a corpus of sequences.
Test the two processing modes within the
SGT class: 'default' and 'multiprocessing'.
'''

# A sample corpus of two sequences.
import pandas as pd

corpus = pd.DataFrame([[1, ["B","B","A","C","A","C","A","A","B","A"]],
                       [2, ["C", "Z", "Z", "Z", "D"]]],
                      columns=['id', 'sequence'])
corpus
```
|   | id | sequence |
|---|----|----------|
| 0 | 1 | [B, B, A, C, A, C, A, A, B, A] |
| 1 | 2 | [C, Z, Z, Z, D] |
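`fit_transform` expects a DataFrame with an `id` column and a `sequence` column whose entries are lists of alphabets. If the sequences start out as raw strings, they can be split with `map(list)`, as the protein example later in this README does:

```python
import pandas as pd

# Raw string sequences, as they often arrive from a CSV file
raw = pd.DataFrame({'id': [1, 2],
                    'sequence': ['BBACACAABA', 'CZZZD']})

# Split each string into a list of single-character alphabets
raw['sequence'] = raw['sequence'].map(list)
print(raw['sequence'][1])  # ['C', 'Z', 'Z', 'Z', 'D']
```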
```python
# Learning the sgt embeddings as vectors for
# all sequences in a corpus.
# mode: 'default'
sgt = SGT(kappa=1,
          flatten=True,
          lengthsensitive=False,
          mode='default')
sgt.fit_transform(corpus)
```
|   | id | (A, A) | (A, B) | (A, C) | (A, D) | (A, Z) | (B, A) | (B, B) | (B, C) | (B, D) | ... | (D, A) | (D, B) | (D, C) | (D, D) | (D, Z) | (Z, A) | (Z, B) | (Z, C) | (Z, D) | (Z, Z) |
|---|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | 1.0 | 0.090616 | 0.131002 | 0.261849 | 0.0 | 0.0 | 0.086569 | 0.123042 | 0.052544 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 1 | 2.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.184334 | 0.290365 |

2 rows × 26 columns
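Because `fit_transform` returns one fixed-length vector per sequence, standard vector tools apply directly to its output. A small sketch of computing pairwise distances between sequence embeddings, using a truncated stand-in for the frame above (only three of its 26 columns, for brevity):

```python
import numpy as np
import pandas as pd

# Stand-in for the fit_transform output: one row per sequence id,
# with a small subset of the (u, v) feature columns.
emb = pd.DataFrame({'id': [1.0, 2.0],
                    '(A, A)': [0.090616, 0.0],
                    '(Z, Z)': [0.0, 0.290365]})

X = emb.set_index('id').values  # one embedding vector per sequence

# Pairwise Euclidean distances between the sequence embeddings
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(D.shape)  # one distance per pair of sequences
```

The resulting distance matrix can feed nearest-neighbor search or clustering directly.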
```python
# Learning the sgt embeddings as vectors for
# all sequences in a corpus.
# mode: 'multiprocessing'

import pandarallel  # required library for multiprocessing

sgt = SGT(kappa=1,
          flatten=True,
          lengthsensitive=False,
          mode='multiprocessing')

sgt.fit_transform(corpus)
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
|   | id | (A, A) | (A, B) | (A, C) | (A, D) | (A, Z) | (B, A) | (B, B) | (B, C) | (B, D) | ... | (D, A) | (D, B) | (D, C) | (D, D) | (D, Z) | (Z, A) | (Z, B) | (Z, C) | (Z, D) | (Z, Z) |
|---|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | 1.0 | 0.090616 | 0.131002 | 0.261849 | 0.0 | 0.0 | 0.086569 | 0.123042 | 0.052544 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 1 | 2.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.184334 | 0.290365 |

2 rows × 26 columns
## Load Libraries for Illustrative Examples


```python
from sgt import SGT

import numpy as np
import pandas as pd
from itertools import chain
from itertools import product as iterproduct
import warnings

import pickle

########
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
import sklearn.metrics
import time

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(7)  # fix the random seed for reproducibility
```


## Sequence Clustering

Clustering is a form of unsupervised learning from sequences. For example, in

- user weblog sequences: clustering weblogs segments users into groups with similar browsing behavior. This helps in targeted marketing, anomaly detection, and other web customizations.

- protein sequences: clustering proteins with similar structures helps researchers study the commonalities between species. It also enables faster lookups in some search algorithms.

In the following, clustering on a protein sequence dataset is shown.
### Protein Sequence Clustering

The data used here is taken from www.uniprot.org, a public database of proteins. The data contains protein sequences and their functions.


```python
# Loading data
corpus = pd.read_csv('data/protein_classification.csv')

# Data preprocessing
corpus = corpus.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
corpus
```
|   | id | sequence |
|---|----|----------|
| 0 | M7MCX3 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| 1 | K6PL84 | [M, E, I, E, K, N, Y, R, M, N, S, L, F, E, F, ... |
| 2 | R4W5V3 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| 3 | T2A126 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| 4 | L0SHD5 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| ... | ... | ... |
| 2107 | A0A081R612 | [M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ... |
| 2108 | A0A081QQM2 | [M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ... |
| 2109 | J1A517 | [M, M, R, Q, A, Q, K, L, Q, K, Q, M, E, Q, S, ... |
| 2110 | F5U1T6 | [M, M, N, M, Q, S, M, M, K, Q, A, Q, K, L, Q, ... |
| 2111 | J3A2T7 | [M, M, N, M, Q, N, M, M, K, Q, A, Q, K, L, Q, ... |

2112 rows × 2 columns
```python
%%time
# Compute SGT embeddings
sgt_ = SGT(kappa=1,
           lengthsensitive=False,
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
    CPU times: user 324 ms, sys: 68 ms, total: 392 ms
    Wall time: 9.02 s


```python
sgtembedding_df
```
|   | id | (A, A) | (A, C) | (A, D) | (A, E) | (A, F) | (A, G) | (A, H) | (A, I) | (A, K) | ... | (Y, M) | (Y, N) | (Y, P) | (Y, Q) | (Y, R) | (Y, S) | (Y, T) | (Y, V) | (Y, W) | (Y, Y) |
|---|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | M7MCX3 | 0.020180 | 0.0 | 0.009635 | 0.013529 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| 1 | K6PL84 | 0.001604 | 0.0 | 0.012637 | 0.006323 | 0.006224 | 0.004819 | 3.560677e-03 | 0.001124 | 0.012136 | ... | 0.135335 | 0.006568 | 0.038901 | 0.011298 | 0.012578 | 0.009913 | 0.001079 | 0.000023 | 0.0 | 0.007728 |
| 2 | R4W5V3 | 0.012448 | 0.0 | 0.008408 | 0.016363 | 0.027469 | 0.003205 | 2.944887e-10 | 0.004249 | 0.013013 | ... | 0.008114 | 0.007128 | 0.000000 | 0.000203 | 0.001757 | 0.022736 | 0.000249 | 0.012652 | 0.0 | 0.008533 |
| 3 | T2A126 | 0.010545 | 0.0 | 0.012560 | 0.014212 | 0.013728 | 0.000000 | 2.944887e-10 | 0.007223 | 0.000309 | ... | 0.000325 | 0.009669 | 0.000000 | 0.003182 | 0.001904 | 0.015607 | 0.000577 | 0.007479 | 0.0 | 0.008648 |
| 4 | L0SHD5 | 0.020180 | 0.0 | 0.008628 | 0.015033 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2107 | A0A081R612 | 0.014805 | 0.0 | 0.004159 | 0.017541 | 0.012701 | 0.013099 | 0.000000e+00 | 0.017043 | 0.004732 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2108 | A0A081QQM2 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.016806 | 0.005406 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2109 | J1A517 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.014500 | 0.005406 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2110 | F5U1T6 | 0.015209 | 0.0 | 0.005175 | 0.023888 | 0.011410 | 0.011510 | 0.000000e+00 | 0.021145 | 0.009280 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2111 | J3A2T7 | 0.005240 | 0.0 | 0.012301 | 0.013178 | 0.014744 | 0.014705 | 0.000000e+00 | 0.000981 | 0.007957 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |

2112 rows × 401 columns
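The column count follows directly from the protein alphabet: 20 amino acids give 20 × 20 = 400 ordered-pair features, plus the `id` column. A quick check:

```python
# 20 amino acids -> one feature per ordered (u, v) alphabet pair,
# plus the id column, giving the 401 columns seen above
n_alphabets = 20
n_features = n_alphabets ** 2
print(n_features + 1)  # 401
```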
```python
# Set the id column as the dataframe index
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df
```
| id | (A, A) | (A, C) | (A, D) | (A, E) | (A, F) | (A, G) | (A, H) | (A, I) | (A, K) | (A, L) | ... | (Y, M) | (Y, N) | (Y, P) | (Y, Q) | (Y, R) | (Y, S) | (Y, T) | (Y, V) | (Y, W) | (Y, Y) |
|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| M7MCX3 | 0.020180 | 0.0 | 0.009635 | 0.013529 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | 0.021703 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| K6PL84 | 0.001604 | 0.0 | 0.012637 | 0.006323 | 0.006224 | 0.004819 | 3.560677e-03 | 0.001124 | 0.012136 | 0.018427 | ... | 0.135335 | 0.006568 | 0.038901 | 0.011298 | 0.012578 | 0.009913 | 0.001079 | 0.000023 | 0.0 | 0.007728 |
| R4W5V3 | 0.012448 | 0.0 | 0.008408 | 0.016363 | 0.027469 | 0.003205 | 2.944887e-10 | 0.004249 | 0.013013 | 0.031118 | ... | 0.008114 | 0.007128 | 0.000000 | 0.000203 | 0.001757 | 0.022736 | 0.000249 | 0.012652 | 0.0 | 0.008533 |
| T2A126 | 0.010545 | 0.0 | 0.012560 | 0.014212 | 0.013728 | 0.000000 | 2.944887e-10 | 0.007223 | 0.000309 | 0.028531 | ... | 0.000325 | 0.009669 | 0.000000 | 0.003182 | 0.001904 | 0.015607 | 0.000577 | 0.007479 | 0.0 | 0.008648 |
| L0SHD5 | 0.020180 | 0.0 | 0.008628 | 0.015033 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | 0.021703 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| A0A081R612 | 0.014805 | 0.0 | 0.004159 | 0.017541 | 0.012701 | 0.013099 | 0.000000e+00 | 0.017043 | 0.004732 | 0.014904 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| A0A081QQM2 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.016806 | 0.005406 | 0.014083 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| J1A517 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.014500 | 0.005406 | 0.014083 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| F5U1T6 | 0.015209 | 0.0 | 0.005175 | 0.023888 | 0.011410 | 0.011510 | 0.000000e+00 | 0.021145 | 0.009280 | 0.017466 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| J3A2T7 | 0.005240 | 0.0 | 0.012301 | 0.013178 | 0.014744 | 0.014705 | 0.000000e+00 | 0.000981 | 0.007957 | 0.017112 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
1281 |

2112 rows × 400 columns

1282 |
We perform PCA on the sequence embeddings and then do k-means clustering.

```python
pca = PCA(n_components=2)
pca.fit(sgtembedding_df)

X = pca.transform(sgtembedding_df)

print(np.sum(pca.explained_variance_ratio_))
df = pd.DataFrame(data=X, columns=['x1', 'x2'])
df.head()
```

    0.6432744907364981
| | x1 | x2 |
|---|---|---|
| 0 | 0.384913 | -0.269873 |
| 1 | 0.022764 | 0.135995 |
| 2 | 0.177792 | -0.172454 |
| 3 | 0.168074 | -0.147334 |
| 4 | 0.383616 | -0.271163 |
```python
kmeans = KMeans(n_clusters=3, max_iter=300)
kmeans.fit(df)

labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_

fig = plt.figure(figsize=(5, 5))
colmap = {1: 'r', 2: 'g', 3: 'b'}
colors = list(map(lambda x: colmap[x+1], labels))
plt.scatter(df['x1'], df['x2'], color=colors, alpha=0.5, edgecolor=colors)
```

![png](output_23_1.png)

## Sequence Classification using Deep Learning in TensorFlow

The protein data set used above is also labeled; the labels represent the protein functions. There are other labeled sequence data sets as well. For example, DARPA shared an intrusion weblog data set containing weblog sequences, with a positive label if a log represents a network intrusion.

Such problems call for supervised learning. Classification is the supervised learning task we demonstrate here.

### Protein Sequence Classification

The data set is taken from https://www.uniprot.org . The protein sequences in the data set have one of two functions,
- Binds to DNA and alters its conformation. May be involved in regulation of gene expression, nucleoid organization and DNA protection.
- Might take part in the signal recognition particle (SRP) pathway. This is inferred from the conservation of its genetic proximity to ftsY/ffh. May be a regulatory protein.

There are a total of 2113 samples. The sequence lengths vary between 80 and 130.
```python
# Loading data
data = pd.read_csv('data/protein_classification.csv')

# Data preprocessing
y = data['Function [CC]']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

#### Sequence embeddings

```python
# Sequence embedding
sgt_ = SGT(kappa=1,
           lengthsensitive=False,
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
X = sgtembedding_df.set_index('id')
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

We will perform a 10-fold cross-validation to measure the performance of the classification model.
```python
kfold = 10
X = X
y = encoded_y

random_state = 1

test_F1 = np.zeros(kfold)
skf = KFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0
epochs = 50
batch_size = 128

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(32))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0)

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)

    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k += 1

print('Average f1 score', np.mean(test_F1))
```

    Average f1 score 1.0

### Weblog Classification for Intrusion Detection

This data sample is taken from https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset.
It is network intrusion data containing audit logs, with any attack marked as a positive label. Since network intrusion is a rare event, the data is unbalanced. Here we will,
- build a sequence classification model to predict a network intrusion.

Each sequence in the data is a series of activities, for example, {login, password}. The _alphabets_ in the input data sequences are already encoded into integers.
The original sequence data file is also present in the `/data` directory.

```python
# Loading data
data = pd.read_csv('data/darpa_data.csv')
data.columns
```

    Index(['timeduration', 'seqlen', 'seq', 'class'], dtype='object')

```python
data['id'] = data.index
```

```python
# Data preprocessing
y = data['class']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:, ['id', 'seq']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

#### Sequence embeddings
In this data, the sequence embeddings should be **length-sensitive**.

The lengths are important here because sequences with similar patterns but different lengths can have different labels. Consider a simple example of two sessions: `{login, pswd, login, pswd,...}` and `{login, pswd,...(repeated several times)..., login, pswd}`.

While the first session could be a regular user mistyping the password once, the second is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

Therefore, `lengthsensitive=True` is used here.

```python
# Sequence embedding
sgt_ = SGT(kappa=5,
           lengthsensitive=True,
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
| id | (0, 0) | (0, 1) | (0, 2) | (0, 3) | (0, 4) | (0, 5) | (0, 6) | (0, 7) | (0, 8) | (0, 9) | ... | (~, 1) | (~, 2) | (~, 3) | (~, 4) | (~, 5) | (~, 6) | (~, 7) | (~, 8) | (~, 9) | (~, ~) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.485034 | 0.486999 | 0.485802 | 0.483097 | 0.483956 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.178609 |
| 1.0 | 0.000000 | 0.025622 | 0.228156 | 0.000000e+00 | 0.000000e+00 | 1.310714e-09 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.447620 | 0.452097 | 0.464568 | 0.367296 | 0.525141 | 0.455018 | 0.374364 | 0.414081 | 0.549981 | 0.172479 |
| 2.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.525605 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.193359 | 0.071469 |
| 3.0 | 0.077999 | 0.208974 | 0.230338 | 1.830519e-01 | 1.200926e-17 | 1.696880e-01 | 0.093646 | 7.985870e-02 | 2.896813e-05 | 3.701710e-05 | ... | 0.474072 | 0.468353 | 0.463594 | 0.177507 | 0.551270 | 0.418652 | 0.309652 | 0.384657 | 0.378225 | 0.170362 |
| 4.0 | 0.000000 | 0.023695 | 0.217819 | 2.188276e-33 | 0.000000e+00 | 6.075992e-11 | 0.000000 | 0.000000e+00 | 5.681668e-39 | 0.000000e+00 | ... | 0.464120 | 0.468229 | 0.452170 | 0.000000 | 0.501242 | 0.000000 | 0.300534 | 0.161961 | 0.000000 | 0.167082 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 106.0 | 0.000000 | 0.024495 | 0.219929 | 2.035190e-17 | 1.073271e-18 | 5.656994e-11 | 0.000000 | 0.000000e+00 | 5.047380e-29 | 0.000000e+00 | ... | 0.502213 | 0.544343 | 0.477281 | 0.175901 | 0.461103 | 0.000000 | 0.000000 | 0.162796 | 0.000000 | 0.167687 |
| 107.0 | 0.110422 | 0.227478 | 0.217549 | 1.723963e-01 | 1.033292e-14 | 3.896725e-07 | 0.083685 | 2.940589e-08 | 8.864072e-02 | 4.813990e-29 | ... | 0.490398 | 0.522016 | 0.466808 | 0.470603 | 0.479795 | 0.480057 | 0.194888 | 0.172397 | 0.164873 | 0.172271 |
| 108.0 | 0.005646 | 0.202424 | 0.196786 | 2.281242e-01 | 1.133936e-01 | 1.862098e-01 | 0.000000 | 1.212869e-01 | 9.180520e-08 | 0.000000e+00 | ... | 0.432834 | 0.434953 | 0.439615 | 0.390864 | 0.481764 | 0.600875 | 0.166766 | 0.165368 | 0.000000 | 0.171729 |
| 109.0 | 0.000000 | 0.025616 | 0.238176 | 3.889176e-55 | 1.332427e-60 | 1.408003e-09 | 0.000000 | 9.845377e-60 | 0.000000e+00 | 0.000000e+00 | ... | 0.421318 | 0.439985 | 0.467953 | 0.440951 | 0.527165 | 0.864717 | 0.407155 | 0.399335 | 0.251304 | 0.171885 |
| 110.0 | 0.000000 | 0.022868 | 0.203513 | 9.273472e-64 | 0.000000e+00 | 1.240870e-09 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.478090 | 0.454871 | 0.459109 | 0.000000 | 0.490534 | 0.370357 | 0.000000 | 0.162997 | 0.000000 | 0.162089 |

111 rows × 121 columns
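The length-sensitivity argument above can be made concrete with a small, hypothetical sketch (unrelated to the SGT internals): the two example sessions share exactly the same set of transitions, so a purely pattern-based view cannot separate them, while their lengths differ sharply.

```python
# Hypothetical sketch: two sessions with identical transition
# patterns but very different lengths.
normal = ['login', 'pswd', 'login', 'pswd']
attack = ['login', 'pswd'] * 50  # repeated password guessing

def transitions(seq):
    """Set of (current, next) event pairs in a sequence."""
    return set(zip(seq, seq[1:]))

# The transition sets are identical...
print(transitions(normal) == transitions(attack))  # True
# ...so only a length-sensitive embedding can tell them apart.
print(len(normal), len(attack))  # 4 100
```

This is why a length-insensitive embedding would map both sessions to nearly the same point, whereas `lengthsensitive=True` keeps them distinguishable.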
#### Applying PCA on the embeddings
The embeddings are sparse and high-dimensional. PCA is, therefore, applied for dimension reduction.

```python
from sklearn.decomposition import PCA
pca = PCA(n_components=35)
pca.fit(sgtembedding_df)
X = pca.transform(sgtembedding_df)
print(np.sum(pca.explained_variance_ratio_))
```

    0.9962446146783123

#### Building a Multi-Layer Perceptron Classifier
The PCA transforms of the embeddings are used directly as inputs to an MLP classifier.

```python
kfold = 3
random_state = 11

X = X
y = encoded_y

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0
epochs = 300
batch_size = 15

# class_weight = {0 : 1., 1: 1.,}  # The weights can be changed and made
# inversely proportional to the class size to improve the accuracy.
class_weight = {0 : 0.12, 1: 0.88,}

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = Sequential()
    model.add(Dense(128, input_shape=(X_train.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.summary()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    start_time = time.time()
    model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, class_weight=class_weight)
    end_time = time.time()
    time_k[k] = end_time - start_time

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)
    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k += 1
```

    Model: "sequential_10"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense_30 (Dense)             (None, 128)               4608
    _________________________________________________________________
    activation_30 (Activation)   (None, 128)               0
    _________________________________________________________________
    dropout_20 (Dropout)         (None, 128)               0
    _________________________________________________________________
    dense_31 (Dense)             (None, 1)                 129
    _________________________________________________________________
    activation_31 (Activation)   (None, 1)                 0
    =================================================================
    Total params: 4,737
    Trainable params: 4,737
    Non-trainable params: 0
    _________________________________________________________________
    WARNING:tensorflow:sample_weight modes were coerced from
      ...
        to
      ['...']
    Train on 74 samples
    Epoch 1/300
    74/74 [==============================] - 0s 7ms/sample - loss: 0.1487 - accuracy: 0.5270
    Epoch 2/300
    74/74 [==============================] - 0s 120us/sample - loss: 0.1421 - accuracy: 0.5000
    ...
    74/74 [==============================] - 0s 118us/sample - loss: 0.0299 - accuracy: 0.8784
    Epoch 300/300
    74/74 [==============================] - 0s 133us/sample - loss: 0.0296 - accuracy: 0.8649

```python
print('Average f1 score', np.mean(test_F1))
print('Average Run time', np.mean(time_k))
```

    Average f1 score 0.6341880341880342
    Average Run time 3.880180994669596

#### Building an LSTM Classifier on the sequences for comparison
We build an LSTM classifier on the raw sequences to compare the accuracy.

```python
X = data['seq']
encoded_X = np.ndarray(shape=(len(X),), dtype=list)
for i in range(0, len(X)):
    encoded_X[i] = X.iloc[i].split("~")
X
```

    0      1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~...
    1      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
    2      19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~1...
    3      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
    4      5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~...
                                 ...
    106    10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~3...
    107    5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~...
    108    6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~...
    109    6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5...
    110    5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~...
    Name: seq, Length: 111, dtype: object

```python
max_seq_length = np.max(data['seqlen'])
encoded_X = tf.keras.preprocessing.sequence.pad_sequences(encoded_X, maxlen=max_seq_length)
```

```python
kfold = 3
random_state = 11

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)

epochs = 50
batch_size = 15
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0

for train_index, test_index in skf.split(encoded_X, y):
    X_train, X_test = encoded_X[train_index], encoded_X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    embedding_vector_length = 32
    top_words = 50
    model = Sequential()
    model.add(Embedding(top_words, embedding_vector_length, input_length=max_seq_length))
    model.add(LSTM(32))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()

    start_time = time.time()
    model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)
    end_time = time.time()
    time_k[k] = end_time - start_time

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)
    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k += 1
```

    Model: "sequential_13"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding (Embedding)        (None, 1773, 32)          1600
    _________________________________________________________________
    lstm (LSTM)                  (None, 32)                8320
    _________________________________________________________________
    dense_36 (Dense)             (None, 1)                 33
    _________________________________________________________________
    activation_36 (Activation)   (None, 1)                 0
    =================================================================
    Total params: 9,953
    Trainable params: 9,953
    Non-trainable params: 0
    _________________________________________________________________
    Train on 74 samples
    Epoch 1/50
    74/74 [==============================] - 5s 72ms/sample - loss: 0.6894 - accuracy: 0.5676
    Epoch 2/50
    74/74 [==============================] - 4s 48ms/sample - loss: 0.6590 - accuracy: 0.8784
    ...
    Epoch 50/50
    74/74 [==============================] - 4s 51ms/sample - loss: 0.1596 - accuracy: 0.9324

```python
print('Average f1 score', np.mean(test_F1))
print('Average Run time', np.mean(time_k))
```

    Average f1 score 0.36111111111111116
    Average Run time 192.46954011917114

We find that the LSTM classifier gives a significantly lower F1 score. This may be improved by changing the model. However, the SGT embedding works with a small, unbalanced data set without the need of a complicated classifier model.

LSTM models typically require more data for training and also have significantly higher computation time. In the runs above, the LSTM model averaged about 192 secs per fold while the MLP model averaged under 4 secs.

## Sequence Search

Sequence data sets are generally large. For example, the listening histories of more than 70M users of music streaming services, such as Pandora, form huge sequence data sets. Protein databases can be even larger. For instance, the Uniprot data repository has more than 177M sequences.

Searching for similar sequences in such large databases is challenging. SGT embedding provides a simple solution.
In the following, it is shown on a protein data set that SGT embedding can be used to compute the similarity between a query sequence and the sequence corpus with a dot product. The sequences with the highest dot products are returned as the most similar to the query.

### Protein Sequence Search

In the following, a sample of 10k protein sequences is used for illustration. The data is taken from https://www.uniprot.org .

```python
# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
corpus.head(3)
```
| | id | sequence |
|---|---|---|
| 0 | I2WKR6 | [M, V, H, K, S, D, S, D, E, L, A, A, L, R, A, ... |
| 1 | A0A2A6M8K9 | [M, Q, E, S, L, V, V, R, R, E, T, H, I, A, A, ... |
| 2 | A0A3G5KEC3 | [M, A, S, G, A, Y, S, K, Y, L, F, Q, I, I, G, ... |
```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids

# Alphabets are known and inputted
# as arguments for faster computation
sgt_ = SGT(alphabets=alphabets,
           lengthsensitive=True,
           kappa=1,
           flatten=True,
           mode='multiprocessing')

sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

```python
'''
Search proteins similar to a query protein.
The approach is to find the SGT embedding of the
query protein and find its similarity with the
embeddings of the protein database.
'''

query_protein = 'MSHVFPIVIDDNFLSPQDLVSAARSGCSLRLHTGVVDKIDRAHRFVLEIAGAEALHYGINTGFGSLCTTHIDPADLSTLQHNLLKSHACGVGPTVSEEVSRVVTLIKLLTFRTGNSGVSLSTVNRIIDLWNHGVVGAIAQKGTVGASGDLAPLAHLFLPLIGLGQVWHRGVLRPSREVMDELKLAPLTLQPKDGLCLTNGVQYLNAWGALSTVRAKRLVALADLCAAMSMMGFSAARSFIEAQIHQTSLHPERGHVALHLRTLTHGSNHADLPHCNPAMEDPYSFRCAPQVHGAARQVVGYLETVIGNECNSVSDNPLVFPDTRQILTCGNLHGQSTAFALDFAAIGITDLSNISERRTYQLLSGQNGLPGFLVAKPGLNSGFMVVQYTSAALLNENKVLSNPASVDTIPTCHLQEDHVSMGGTSAYKLQTILDNCETILAIELMTACQAIDMNPGLQLSERGRAIYEAVREEIPFVKEDHLMAGLISKSRDLCQHSTVIAQQLAEMQAQ'

# Step 1. Compute sgt embedding for the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Compute the dot product of query embedding
# with the protein embedding database.
similarity = sgtembedding_df.dot(query_protein_sgt_embedding)

# Step 3. Return the top k protein names based on similarity.
similarity.sort_values(ascending=False)
```

    id
    K0ZGN5        2773.749663
    A0A0Y1CPH7    1617.451379
    A0A5R8LCJ1    1566.833152
    A0A290WY40    1448.772820
    A0A073K6N6    1392.267250
                     ...
    A0A1S7UBK4     160.074989
    A0A2S7T1R9     156.580584
    A0A0E0UQV6     155.834932
    A0A1Y5Y0S0     148.862049
    B0NRP3         117.656497
    Length: 10063, dtype: float64

## SGT - Spark for Distributed Computing

As mentioned in the previous section, sequence data sets can be large. SGT complexity is linear in the number of sequences in a data set. Still, if the data is large, the computation time becomes high. For example, for a set of 1M protein sequences the default SGT mode takes over 24 hours.

Using distributed computing with Spark, the runtime can be significantly reduced. For instance, SGT-Spark on the same 1M protein data set took less than 29 minutes.

In the following, the Spark implementation of SGT is shown. First, it is applied to a smaller 10k data set for comparison. Then it is applied to the 1M data set without any syntactical change.

```python
'''
Load the data and remove header.
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-10k.tab')

header = data.first()  # extract header
data = data.filter(lambda row: row != header)  # filter out header
data.take(1)  # See one sample
```

```
['I2WKR6\tI2WKR6_ECOLX\tunreviewed\tType III restriction enzyme, res subunit (EC 3.1.21.5)\tEC90111_4246\tEscherichia coli 9.0111\t786\tMVHKSDSDELAALRAENVRLVSLLEAHGIEWRRKPQSPVPRVSVLSTNEKVALFRRLFRGRDDVWALRWESKTSGKSGYSPACANEWQLGICGKPRIKCGDCAHRQLIPVSDLVIYHHLAGTHTAGMYPLLEDDSCYFLAVDFDEAEWQKDASAFMRSCDELGVPAALEISRSRQGAHVWIFFASRVSAREARRLGTAIISYTCSRTRQLRLGSYDRLFPNQDTMPKGGFGNLIALPLQKRPRELGGSVFVDMNLQPYPDQWAFLVSVIPMNVQDIEPTILRATGSIHPLDVNFINEEDLGTPWEEKKSSGNRLNIAVTEPLIITLANQIYFEKAQLPQALVNRLIRLAAFPNPEFYKAQAMRMSVWNKPRVIGCAENYPQHIALPRGCLDSALSFLRYNNIAAELIDKRFAGTECNAVFTGNLRAEQEEAVSALLRYDTGVLCAPTAFGKTVTAAAVIARRKVNTLILVHRTELLKQWQERLAVFLQVGDSIGIIGGGKHKPCGNIDIAVVQSISRHGEVEPLVRNYGQIIVDECHHIGAVSFSAILKETNARYLLGLTATPIRRDGLHPIIFMYCGAIRHTAARPKESLHNLEVLTRSRFTSGHLPSDARIQDIFREIALDHDRTVAIAEEAMKAFGQGRKVLVLTERTDHLDDIASVMNTLKLSPFVLHSRLSKKKRTMLISGLNALPPDSPRILLSTGRLIGEGFDHPPLDTLILAMPVSWKGTLQQYAGRLHREHTGKSDVRIIDFVDTAYPVLLRMWDKRQRGYKAMGYRIVADGEGLSF']
```

```python
# Repartition for increasing the parallel processes
data = data.repartition(500)
```

```python
def preprocessing(line):
    '''
    Original data are lines where each line has \t
    separated values. We are interested in preserving
    the first value (entry id), tmp[0], and the last value
    (the sequence), tmp[-1].
    '''
    tmp = line.split('\t')
    id = tmp[0]
    sequence = list(tmp[-1])
    return (id, sequence)

processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample
```

    Out[5]: [('A0A2E9WIJ1',
      ['M',
       'Y',
       'I',
       'F',
       'L',
       'T',
       'L',
       ...
       'A',
       'K',
       'L',
       'D',
       'K',
       'N',
       'D'])]
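Outside Spark, the same tab-splitting logic in `preprocessing` can be sanity-checked on a single line (the entry id and fields below are made up for illustration):

```python
def preprocessing(line):
    # Keep the first field (entry id) and the last field (the sequence),
    # mirroring the Spark map function above.
    tmp = line.split('\t')
    return (tmp[0], list(tmp[-1]))

# Hypothetical record with the same tab-separated layout as the .tab file.
sample = 'Q0TEST1\tQ0TEST1_ECOLX\tunreviewed\tSome protein\tMVHK'
entry_id, sequence = preprocessing(sample)
print(entry_id)   # Q0TEST1
print(sequence)   # ['M', 'V', 'H', 'K']
```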
```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids
```

```python
'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets,
               kappa=1,
               lengthsensitive=True,
               flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 29.66 seconds -- by cranjan@processminer.com at 4/22/2020, 12:31:23 PM on databricks
```

### Compare with the default SGT mode

```python
# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

```python
sgt_ = sgt.SGT(alphabets=alphabets,
               lengthsensitive=True,
               kappa=1,
               flatten=True,
               mode='default')

sgtembedding_df = sgt_.fit_transform(corpus)
# Command took 13.08 minutes -- by cranjan@processminer.com at 4/22/2020, 1:48:02 PM on databricks
```

### 1M Protein Database

The 1M protein sequence data set is embedded here. The data set is available [here](https://mega.nz/file/1qAXhSAS#l7E60cLJzMGtFQzeHZL9PI8yX4tRQcAMFRN2xeHK81w).

```python
'''
Load the data and remove header.
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-1M.tab')

header = data.first()  # extract header
data = data.filter(lambda row: row != header)  # filter out header
data.take(1)  # See one sample
```

```python
# Repartition for increasing the parallel processes
data = data.repartition(10000)
```

```python
processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample

# [('A0A2E9WIJ1',
#   ['M','Y','I','F','L','T','L','A','L','F','S',...,'F','S','I','F','A','K','L','D','K','N','D'])]
```

```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids
```

```python
'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets,
               kappa=1,
               lengthsensitive=True,
               flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 28.98 minutes -- by cranjan@processminer.com at 4/22/2020, 3:16:41 PM on databricks
```

```python
'''OPTIONAL.
Save the embeddings for future use or
production deployment.
'''
# Save for deployment
# pickle.dump(sgtembeddings,
#             open("data/protein-sgt-embeddings-1M.pkl", "wb"))

# Reload when needed
# sgtembeddings = pickle.load(open("data/protein-sgt-embeddings-1M.pkl", "rb"))
```

The pickle dump is shared [here](https://mega.nz/file/hiAxAAoI#SStAIn_FZjAHvXSpXfdy8VpISG6rusHRf9HlUSqwcsw).

### Sequence Search using SGT - Spark

Since `sgtembeddings` on the 1M data set is large, it is recommended to use distributed computing to find similar proteins during a search.

```python
sgtembeddings_rdd = sc.parallelize(list(dict(sgtembeddings).items()))
sgtembeddings_rdd = sgtembeddings_rdd.repartition(10000)
```

```python
'''
Search for proteins similar to a query protein.
The approach is to compute the SGT embedding of the
query protein and measure its similarity with the
embeddings of the proteins in the database.
'''

query_protein = 'MSHVFPIVIDDNFLSPQDLVSAARSGCSLRLHTGVVDKIDRAHRFVLEIAGAEALHYGINTGFGSLCTTHIDPADLSTLQHNLLKSHACGVGPTVSEEVSRVVTLIKLLTFRTGNSGVSLSTVNRIIDLWNHGVVGAIAQKGTVGASGDLAPLAHLFLPLIGLGQVWHRGVLRPSREVMDELKLAPLTLQPKDGLCLTNGVQYLNAWGALSTVRAKRLVALADLCAAMSMMGFSAARSFIEAQIHQTSLHPERGHVALHLRTLTHGSNHADLPHCNPAMEDPYSFRCAPQVHGAARQVVGYLETVIGNECNSVSDNPLVFPDTRQILTCGNLHGQSTAFALDFAAIGITDLSNISERRTYQLLSGQNGLPGFLVAKPGLNSGFMVVQYTSAALLNENKVLSNPASVDTIPTCHLQEDHVSMGGTSAYKLQTILDNCETILAIELMTACQAIDMNPGLQLSERGRAIYEAVREEIPFVKEDHLMAGLISKSRDLCQHSTVIAQQLAEMQAQ'

# Step 1. Compute the SGT embedding of the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Broadcast the embedding to the cluster.
query_protein_sgt_embedding_broadcasted = sc.broadcast(list(query_protein_sgt_embedding))

# Step 3. Compute the similarity between each sequence embedding and the query.
similarity = sgtembeddings_rdd.map(lambda x: (x[0],
                                              np.dot(query_protein_sgt_embedding_broadcasted.value,
                                                     x[1]))).collect()

# Step 4. Show the sequences most similar to the query.
pd.DataFrame(similarity).sort_values(by=1, ascending=False)
```

## Datasets

The data sets provided with this release are:

### Simulated Sequence Dataset

A benchmark simulated sequence data set with labels is provided. There are 5 labels and a total of 300 samples. The sequence lengths range from 50 to 800.

Location:

`data/simulated-sequence-dataset.csv`

### Protein Dataset - 2k

A protein sequence data set taken from https://www.uniprot.org. The data set contains reviewed and annotated proteins. The fields in the data set are:

- Entry
- Entry name
- Status
- Protein names
- Gene names
- Organism
- Length
- Sequence
- Function [CC]
- Features
- Taxonomic lineage (all)
- Protein families

There are a total of 2113 samples (protein sequences). Each protein has one of the following two functions:

- Binds to DNA and alters its conformation. May be involved in regulation of gene expression, nucleoid organization and DNA protection.
- Might take part in the signal recognition particle (SRP) pathway. This is inferred from the conservation of its genetic proximity to ftsY/ffh. May be a regulatory protein.

The data set has about a 40:60 class distribution.
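The class split can be verified with pandas once the file is loaded; a minimal sketch on a toy frame (the `function` column and its two label values are illustrative stand-ins — in the actual file the protein function appears in the `Function [CC]` field):

```python
import pandas as pd

# Toy stand-in for data/protein_classification.csv: the 'function'
# column and its values are illustrative, mimicking the ~40:60 split.
df = pd.DataFrame({'function': ['DNA-binding'] * 4 + ['SRP-pathway'] * 6})

# Normalized value counts give the class distribution directly.
distribution = df['function'].value_counts(normalize=True)
print(distribution.to_dict())  # {'SRP-pathway': 0.6, 'DNA-binding': 0.4}
```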
Location:

`data/protein_classification.csv`

### Darpa Weblog Network Intrusion Dataset

This is processed weblog data provided by DARPA in the DARPA Intrusion Detection Evaluation Dataset. The link to it is shared by MIT at https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset.

The available data set is a weblog dump with timestamps. The logs are converted to sequences and shared here. A sequence is labeled 1 if it was a potential intrusion, and 0 otherwise.

The data has 112 samples with an imbalanced class distribution of about 10% positively labeled samples.

The available fields are:

- timeduration
- seqlen
- seq
- class

Location:

`data/darpa_data.csv`

### Protein Sequence - 10k, 1M and 3M

Three protein sequence data sets of size 10k, 1 million, and 3 million are provided. The 10k data set is available in GitHub, while the latter two are available publicly [here](https://mega.nz/folder/MqAzmKqS#2jqJKJifOgnFACP9GqX6QQ).

The fields in these data sets are:

- Entry
- Entry name
- Protein names
- Gene names
- Organism
- Length
- Sequence

The source of these data sets is https://www.uniprot.org.

Location:

10k: `data/protein-uniprot-reviewed-Ano-10k.tab`

1M: `https://mega.nz/folder/MqAzmKqS#2jqJKJifOgnFACP9GqX6QQ/file/t7YlUQTK`

3M: `https://mega.nz/folder/MqAzmKqS#2jqJKJifOgnFACP9GqX6QQ/file/InAzwYDa`
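The Spark-based similarity search described earlier reduces to a plain dot-product ranking over the embedding database. A minimal local sketch of that logic, without Spark, using fixed 3-dimensional vectors as illustrative stand-ins for real SGT embeddings (the IDs and values below are synthetic):

```python
import numpy as np

# Synthetic embedding database: keys mimic protein IDs, vectors mimic
# SGT embeddings. These values are illustrative, not real SGT output.
db = {
    'P1': np.array([1.0, 0.0, 0.0]),
    'P2': np.array([0.0, 1.0, 0.0]),
    'P3': np.array([0.8, 0.6, 0.0]),
}
query = np.array([0.7, 0.7, 0.1])  # query embedding, closest to 'P3'

# Dot-product similarity against every database embedding, then rank,
# mirroring Steps 3 and 4 of the Spark version.
similarity = sorted(((pid, float(np.dot(query, emb))) for pid, emb in db.items()),
                    key=lambda t: t[1], reverse=True)
best_id = similarity[0][0]
print(best_id)  # 'P3'
```

The same ranking scales to the 1M database by swapping the local dictionary for the broadcast-and-`map` pattern shown in the Spark section.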