├── python ├── sgt-package │ ├── sgt.egg-info │ │ ├── top_level.txt │ │ ├── dependency_links.txt │ │ └── SOURCES.txt │ ├── sgt_cran2367_bugfix_1.egg-info │ │ ├── top_level.txt │ │ ├── dependency_links.txt │ │ ├── SOURCES.txt │ │ └── PKG-INFO │ ├── sgt_cran2367_bugfix_2.egg-info │ │ ├── top_level.txt │ │ ├── dependency_links.txt │ │ └── SOURCES.txt │ ├── sgt │ │ ├── __init__.py │ │ └── sgt.py │ ├── build │ │ └── lib │ │ │ └── sgt │ │ │ └── __init__.py │ ├── output_19_1.png │ ├── output_23_1.png │ ├── dist │ │ ├── sgt-2.0.0.tar.gz │ │ ├── sgt-2.0.1.tar.gz │ │ ├── sgt-2.0.2.tar.gz │ │ ├── sgt-2.0.3.tar.gz │ │ ├── sgt-2.0.0b15.tar.gz │ │ ├── sgt-2.0.0b16.tar.gz │ │ ├── sgt-2.0.0b17.tar.gz │ │ ├── sgt-2.0.0b18.tar.gz │ │ ├── sgt-2.0.0b19.tar.gz │ │ ├── sgt-2.0.0b20.tar.gz │ │ ├── sgt-2.0.0b21.tar.gz │ │ ├── sgt-2.0.0-py3-none-any.whl │ │ ├── sgt-2.0.1-py3-none-any.whl │ │ ├── sgt-2.0.2-py3-none-any.whl │ │ ├── sgt-2.0.3-py3-none-any.whl │ │ ├── sgt-2.0.0b15-py3-none-any.whl │ │ ├── sgt-2.0.0b16-py3-none-any.whl │ │ ├── sgt-2.0.0b17-py3-none-any.whl │ │ ├── sgt-2.0.0b18-py3-none-any.whl │ │ ├── sgt-2.0.0b19-py3-none-any.whl │ │ ├── sgt-2.0.0b20-py3-none-any.whl │ │ └── sgt-2.0.0b21-py3-none-any.whl │ ├── setup.py │ └── LICENSE ├── output_23_1.png └── __pycache__ │ ├── sgtdev.cpython-37.pyc │ └── sgttemp.cpython-37.pyc ├── output_23_1.png ├── .gitignore~ ├── .gitignore ├── R ├── main.R ├── kmeans.R └── sgt.R ├── data └── darpa_data.csv └── README.md /python/sgt-package/sgt.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | sgt 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/top_level.txt: 
-------------------------------------------------------------------------------- 1 | sgt 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_2.egg-info/top_level.txt: -------------------------------------------------------------------------------- 1 | sgt 2 | -------------------------------------------------------------------------------- /output_23_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/output_23_1.png -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_2.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /python/sgt-package/sgt/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "2.0.3" 2 | 3 | from .sgt import SGT -------------------------------------------------------------------------------- /python/output_23_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/output_23_1.png -------------------------------------------------------------------------------- /python/sgt-package/build/lib/sgt/__init__.py: -------------------------------------------------------------------------------- 1 | __version__ = "2.0.3" 2 | 3 | from .sgt import SGT -------------------------------------------------------------------------------- /python/sgt-package/output_19_1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/output_19_1.png -------------------------------------------------------------------------------- /python/sgt-package/output_23_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/output_23_1.png -------------------------------------------------------------------------------- /python/__pycache__/sgtdev.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/__pycache__/sgtdev.cpython-37.pyc -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.1.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.1.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.2.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.3.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.3.tar.gz 
-------------------------------------------------------------------------------- /python/__pycache__/sgttemp.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/__pycache__/sgttemp.cpython-37.pyc -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b15.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b15.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b16.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b16.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b17.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b17.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b18.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b18.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b19.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b19.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b20.tar.gz: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b20.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b21.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b21.tar.gz -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.1-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.1-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.2-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.2-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.3-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.3-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b15-py3-none-any.whl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b15-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b16-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b16-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b17-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b17-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b18-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b18-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b19-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b19-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b20-py3-none-any.whl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b20-py3-none-any.whl -------------------------------------------------------------------------------- /python/sgt-package/dist/sgt-2.0.0b21-py3-none-any.whl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/cran2367/sgt/HEAD/python/sgt-package/dist/sgt-2.0.0b21-py3-none-any.whl -------------------------------------------------------------------------------- /.gitignore~: -------------------------------------------------------------------------------- 1 | *__pycache__/ 2 | *.Rproj* 3 | *.Rhistory 4 | *logs/ 5 | *.DS_Store 6 | *.ipynb_checkpoints 7 | *build/ 8 | *sgt.egg-info 9 | *sgt_cran* 10 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.Rproj* 2 | *.Rhistory 3 | *logs/ 4 | *.DS_Store 5 | *.ipynb_checkpoints 6 | *build/ 7 | *sgt.egg-info 8 | *sgt_cran* 9 | *archive/ 10 | .Rproj.user 11 | -------------------------------------------------------------------------------- /python/sgt-package/sgt.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | README.md 2 | setup.py 3 | sgt/__init__.py 4 | sgt/sgt.py 5 | sgt.egg-info/PKG-INFO 6 | sgt.egg-info/SOURCES.txt 7 | sgt.egg-info/dependency_links.txt 8 | sgt.egg-info/top_level.txt -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | README.md 2 | setup.py 3 | sgt/__init__.py 4 | sgt_cran2367_bugfix_1.egg-info/PKG-INFO 5 | sgt_cran2367_bugfix_1.egg-info/SOURCES.txt 6 | sgt_cran2367_bugfix_1.egg-info/dependency_links.txt 7 | sgt_cran2367_bugfix_1.egg-info/top_level.txt -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_2.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | README.md 2 | setup.py 3 | sgt/__init__.py 4 | sgt_cran2367_bugfix_2.egg-info/PKG-INFO 5 | sgt_cran2367_bugfix_2.egg-info/SOURCES.txt 6 | 
sgt_cran2367_bugfix_2.egg-info/dependency_links.txt 7 | sgt_cran2367_bugfix_2.egg-info/top_level.txt -------------------------------------------------------------------------------- /python/sgt-package/setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md", "r") as fh: 4 | long_description = fh.read() 5 | 6 | setuptools.setup( 7 | name="sgt", 8 | version="2.0.3", 9 | author="Chitta Ranjan", 10 | author_email="cran2367@gmail.com", 11 | description="Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long and short term patterns embedded in SGT can be tuned without any increase in the computation.", 12 | long_description=long_description, 13 | long_description_content_type="text/markdown", 14 | url="https://github.com/cran2367/sgt", 15 | packages=setuptools.find_packages(), 16 | classifiers=[ 17 | "Programming Language :: Python :: 3", 18 | "License :: OSI Approved :: MIT License", 19 | "Operating System :: OS Independent", 20 | ], 21 | ) -------------------------------------------------------------------------------- /python/sgt-package/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2018 The Python Packaging Authority 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of 
the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. -------------------------------------------------------------------------------- /R/main.R: -------------------------------------------------------------------------------- 1 | library(matrixStats) 2 | library(dplyr) 3 | source('sgt.R', echo = F) 4 | source('kmeans.R', echo = F) 5 | 6 | ###################################################################### 7 | ######## Validate SGT output with a simple sequence example ########## 8 | ###################################################################### 9 | 10 | alphabet_set <- c("A", "B", "C") 11 | alphabet_set_size <- length(alphabet_set) 12 | 13 | seq <- "BBACACAABA" 14 | 15 | kappa <- 5 16 | 17 | ###### Algorithm 1 ###### 18 | sgt_parts_alg1 <- f_sgt_parts(sequence = seq, kappa = kappa, alphabet_set_size = alphabet_set_size) 19 | print(sgt_parts_alg1) 20 | 21 | sgt <- f_SGT(W_kappa = sgt_parts_alg1$W_kappa, W0 = sgt_parts_alg1$W0, 22 | Len = sgt_parts_alg1$Len, kappa = kappa) # Set Len = NULL for length-sensitive SGT. 
23 | print(sgt) 24 | 25 | ###### Algorithm 2 ###### 26 | seq_split <- f_seq_split(sequence = seq) 27 | seq_alphabet_positions <- f_get_alphabet_positions(sequence_split = seq_split, alphabet_set = alphabet_set) 28 | 29 | sgt_parts_alg2 <- f_sgt_parts_using_element_positions(seq_alphabet_positions = seq_alphabet_positions, 30 | alphabet_set = alphabet_set, 31 | kappa = kappa) 32 | 33 | sgt <- f_SGT(W_kappa = sgt_parts_alg2$W_kappa, W0 = sgt_parts_alg2$W0, 34 | Len = sgt_parts_alg2$Len, kappa = kappa) # Set Len = NULL for length-sensitive SGT. 35 | 36 | 37 | ############################################################################ 38 | ######## Demo: Performing a Clustering operation on a seq dataset ########## 39 | ############################################################################ 40 | 41 | ## The dataset contains all roman letters, A-Z. 42 | dataset <- read.csv("../data/simulated-sequence-dataset.csv", header = T, stringsAsFactors = F) 43 | 44 | sgt_parts_sequences_in_dataset <- f_SGT_for_each_sequence_in_dataset(sequence_dataset = dataset, 45 | kappa = 5, alphabet_set = LETTERS, 46 | spp = NULL, sgt_using_alphabet_positions = T) 47 | 48 | 49 | input_data <- f_create_input_kmeans(all_seq_sgt_parts = sgt_parts_sequences_in_dataset, 50 | length_normalize = T, 51 | alphabet_set_size = 26, 52 | kappa = 5, trace = TRUE, 53 | inv.powered = T) 54 | K = 5 55 | clustering_output <- f_kmeans(input_data = input_data, K = K, alphabet_set_size = 26, trace = T) 56 | 57 | cc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output$class), K = K, type = "f1") 58 | print(cc) 59 | 60 | ######## Clustering on Principal Components of SGT features ######## 61 | num_pcs <- 5 # Number of principal components we want 62 | input_data_pcs <- f_pcs(input_data = input_data, PCs = num_pcs)$input_data_pcs 63 | 64 | clustering_output_pcs <- f_kmeans(input_data = input_data_pcs, K = K, alphabet_set_size = sqrt(num_pcs), trace = F) 65 | 66 | cc <- 
f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output_pcs$class), K = K, type = "f1") 67 | print(cc) 68 | -------------------------------------------------------------------------------- /R/kmeans.R: -------------------------------------------------------------------------------- 1 | #### In this file, we have all the functions required for performing a k-means clustering. The k-means will be performed on the SGT vectors. Here, 'class' and 'cluster' of a sequence mean the same thing. 2 | 3 | f_centroid <- function(Ks, alphabet_set_size, input_data, class) 4 | { 5 | # For any given classes, we find the centroids. 6 | # Inputs 7 | # Ks Vector of names of the classes. Typically, it is denoted by a scalar K, and cluster names are 1:K. But sometimes it's not 1:K (e.g. if one of the clusters is dropped midway through clustering). 8 | # alphabet_set_size The number of alphabets the sequences are made of. 9 | # input_data A matrix, where a row is an SGT vector for a sequence and a column is one SGT feature. 10 | # class A vector having the class assignment for each sequence. 11 | 12 | K <- length(Ks) 13 | centroid <- matrix(rep(0, K * (alphabet_set_size * alphabet_set_size)), nrow = K) 14 | rownames(centroid) <- Ks 15 | 16 | for(k in Ks) 17 | { 18 | centroid[toString(k),] <- t(t(input_data) %*% (class == k) / sum(class == k)) 19 | } 20 | 21 | return(centroid) 22 | } 23 | 24 | 25 | f_class <- function(Ks, n, input_data, centroid, asgnmt_threshold = 999999) 26 | { 27 | # For given centroids, we assign a class to each sequence 28 | # Inputs 29 | # Ks Vector of names of the classes. 30 | # n Number of sequences 31 | # input_data A matrix, where a row is an SGT vector for a sequence and a column is one SGT feature. 32 | # centroid Matrix containing the centroids. Each row is a centroid for a cluster. 33 | # asgnmt_threshold This assignment threshold is there for experimental purposes. It is not needed for regular operations; thus, it defaults to a very high value.
34 | 35 | Z <- NULL 36 | for(k in Ks) 37 | { 38 | tmp <- input_data - t(matrix(rep(centroid[toString(k),], n), ncol = n)) 39 | Z <- cbind(Z, rowSums(abs(tmp))) # We use the L1 norm for distances. 40 | } 41 | 42 | colnames(Z) <- Ks 43 | class <- Ks[max.col(-1*Z, ties.method = "random")] # Update classes by assigning each sequence to the class it is closest to 44 | 45 | wss <- sum(abs(do.call(pmin,data.frame(Z)))) 46 | 47 | class.last <- max(as.numeric(Ks)) 48 | threshold.violation <- c(do.call(pmin,data.frame(Z)) > asgnmt_threshold) 49 | if(any(threshold.violation)) 50 | { 51 | class[threshold.violation] <- class.last + 1 52 | Ks <- c(Ks, (class.last + 1)) 53 | } 54 | 55 | out <- list(class = class, Ks = Ks, wss = wss, Z = Z) 56 | return(out) 57 | } 58 | 59 | 60 | f_NA_centroid_exception <- function(Ks, centroid, trace = FALSE) 61 | { 62 | # If there are too many classes, sometimes a class does not get any data point assigned; we should remove it. This is an important function to handle these exceptions. 63 | if(is.na(sum(sum(centroid)))) 64 | { 65 | if(trace){print("inside nan")} 66 | centroid <- centroid[!is.na(centroid[,1]), ] # Remove the centroid rows with NA 67 | Ks <- strtoi(rownames(centroid)) 68 | } 69 | out <- list(Ks = Ks, centroid = centroid) 70 | 71 | return(out) 72 | } 73 | 74 | f_create_input_kmeans <- function(all_seq_sgt_parts, length_normalize = FALSE, alphabet_set_size, kappa, trace = TRUE, inv.powered = T) 75 | { 76 | # Creating the input data for feeding into the kmeans function 77 | # Inputs 78 | # all_seq_sgt_parts The transform parts for the sequences in a dataset 79 | # length_normalize Is True for the length-insensitive variant of SGT [1] 80 | # alphabet_set_size The number of alphabets that make up the sequences in the dataset. 81 | # kappa The tuning parameter 82 | # inv.powered Is True if we want to take the kappa-th root of the SGT, as shown in Algorithm 1 [1].
83 | 84 | n.seq <- dim(all_seq_sgt_parts$W0_all)[3] 85 | 86 | # Find the SGT for each sequence 87 | sgt_mat_all <- array(rep(0,n.seq * alphabet_set_size * alphabet_set_size), 88 | dim=c(alphabet_set_size, alphabet_set_size, n.seq)) 89 | 90 | for(ind in 1:n.seq) 91 | { 92 | if(trace){print(paste(ind,"in",n.seq))} 93 | if(length_normalize == TRUE) 94 | { 95 | sgt_mat_all[ , ,ind] <- f_SGT(W_kappa = all_seq_sgt_parts$W_kappa_all[[ind]], 96 | W0 = all_seq_sgt_parts$W0_all[,,ind], 97 | kappa = kappa, 98 | Len = all_seq_sgt_parts$Len_all[ind], 99 | inv.powered = inv.powered) 100 | }else{ # Not length normalize 101 | sgt_mat_all[ , ,ind] <- f_SGT(W_kappa = all_seq_sgt_parts$W_kappa_all[[ind]], 102 | W0 = all_seq_sgt_parts$W0_all[,,ind], 103 | kappa = kappa, 104 | Len = NULL, 105 | inv.powered = inv.powered) 106 | } 107 | } 108 | 109 | # Vectorize the sequence alphabet_set_size x alphabet_set_size statistics (mean in this case) 110 | # Code taken for this from http://stackoverflow.com/questions/4022195/transform-a-3d-array-into-a-matrix-in-r 111 | 112 | input_data <- aperm(sgt_mat_all, c(3,2,1)) 113 | dim(input_data) <- c(n.seq, alphabet_set_size * alphabet_set_size) 114 | 115 | return(input_data) 116 | } 117 | 118 | 119 | f_kmeans_procedure <- function(input_data, K, alphabet_set_size = 26, max_iteration = 50, trace = TRUE) 120 | { 121 | # This function will perform the centroid based kmeans clustering using Manhattan distance. 
122 | # Inputs 123 | # input_data The input data matrix, each row a data point and the columns are its features 124 | # K The number of clusters 125 | 126 | set.seed(12) # To ensure reproducibility 127 | n <- nrow(input_data) 128 | 129 | # Step 0: Initialization 130 | if(K <= n) 131 | { 132 | # Making sure at least one member is given to each cluster in the beginning 133 | class.tmp <- 1:K 134 | class.tmp2 <- sample.int(n = K, size = (n - K), replace = T) 135 | class <- c(class.tmp, class.tmp2) 136 | class.tmp2 <- sample.int(n = K, size = (n - K), replace = T) # Another initialization for class.old 137 | class.old <- c(class.tmp, class.tmp2) 138 | } else{ 139 | stop("K is greater than n. Terminating!") 140 | } 141 | 142 | Ks <- 1:K # List of cluster 143 | centroid <- f_centroid(Ks = Ks, alphabet_set_size = alphabet_set_size, input_data = input_data, class = class) 144 | 145 | out_NA <- f_NA_centroid_exception(Ks = Ks, centroid = centroid, trace = trace) 146 | Ks <- out_NA$Ks 147 | centroid <- out_NA$centroid 148 | 149 | # Iterations for clustering 150 | class.changes <- 10 # arbitrary 151 | epsilon <- 100 # arbitrary 152 | counter <- 0 153 | class.changes.check <- 0 154 | 155 | while(class.changes != 0 && counter <= max_iteration) 156 | { 157 | counter <- counter + 1 158 | class.old <- class 159 | 160 | # Step 1: Getting the centroid for each class 161 | centroid <- f_centroid(Ks = Ks, alphabet_set_size = alphabet_set_size, input_data = input_data, class = class) 162 | 163 | # Exception handling: If a centroid does not get any data point assigned 164 | out_NA <- f_NA_centroid_exception(Ks = Ks, centroid = centroid) 165 | Ks <- out_NA$Ks 166 | centroid <- out_NA$centroid 167 | 168 | 169 | # Step 2: Assign (update) class to each data point based on its distance from the centroids 170 | class.out <- f_class(Ks = Ks, n = n, input_data = input_data, centroid = centroid) 171 | class <- class.out$class 172 | Ks <- class.out$Ks 173 | wss <- class.out$wss 174 | Z <- 
class.out$Z 175 | 176 | # Iteration differences 177 | class.changes <- sum(class != class.old) 178 | 179 | if(trace) 180 | { 181 | print(paste("Iteration", counter, "in", max_iteration, "--Class chgs: ", class.changes, "wss: ", round(wss,2), "and K is ", length(Ks))) 182 | } 183 | } 184 | return(list(class = class, centroid = centroid, Ks = Ks, wss = wss, Z = Z)) 185 | } 186 | 187 | 188 | f_kmeans <- function(input_data, K, alphabet_set_size = 26, max_iteration = 50, trace = TRUE, K_fixed = T) 189 | { 190 | if(K_fixed){ 191 | 192 | check <- 0 193 | while(check != K) 194 | { 195 | class <- f_kmeans_procedure(input_data = input_data, K = K, alphabet_set_size = alphabet_set_size, trace = trace) 196 | check <- length(levels(factor(class$class))) 197 | } 198 | }else{ 199 | class <- f_kmeans_procedure(input_data = input_data, K = K, alphabet_set_size = alphabet_set_size, trace = trace) 200 | } 201 | return(class) 202 | } 203 | 204 | 205 | f_get_ss <- function(input_data) 206 | { 207 | n <- nrow(input_data) 208 | ybar <- t(rowMeans(t(input_data))) 209 | ss <- 0 210 | for(i in 1:n) 211 | { 212 | ss <- ss + sum(abs(input_data[i,] - ybar)) 213 | } 214 | return(ss) 215 | } 216 | 217 | 218 | f_pcs <- function(input_data, PCs = 50) 219 | { 220 | nc <- ncol(input_data) 221 | nr <- nrow(input_data) 222 | mu <- t(rowMeans(t(input_data))) 223 | Sigma <- cov(input_data) 224 | 225 | eg <- eigen(Sigma) 226 | lam <- eg$values 227 | lam.perc <- lam/sum(lam) 228 | lam.perc.cum <- cumsum(lam.perc) 229 | 230 | print(paste(PCs, "PCs explain", round(lam.perc.cum[PCs]*100, 2),"percentage of variance")) 231 | 232 | V <- eg$vectors 233 | tmp <- sqrt(matrix(rep(lam, nc), nrow = nc, byrow = TRUE)) 234 | V.norm <- V / tmp 235 | V.norm.reduced <- V.norm[, 1:PCs] 236 | input_data_pcs <- (input_data - matrix(mu, nrow = nrow(input_data), ncol = nc)) %*% V.norm.reduced 237 | 238 | return(list(input_data_pcs = input_data_pcs, lam = lam, lam.perc.cum = lam.perc.cum)) 239 | } 
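The k-means in kmeans.R alternates between f_centroid (per-cluster means) and f_class (reassignment by L1, i.e. Manhattan, distance). As a cross-language illustration, the two core steps can be sketched in NumPy; `centroids` and `assign` are hypothetical helper names for this sketch, not functions from the repository.

```python
import numpy as np

def centroids(X, labels, ks):
    # Mean of the rows currently assigned to each cluster (mirrors f_centroid).
    return np.array([X[labels == k].mean(axis=0) for k in ks])

def assign(X, C):
    # L1 (Manhattan) distance of every row of X to every centroid in C (mirrors f_class).
    Z = np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2)  # shape (n, K)
    labels = Z.argmin(axis=1)   # each point goes to its closest centroid
    wss = Z.min(axis=1).sum()   # within-cluster sum of L1 distances
    return labels, wss

# Two well-separated pairs of points cluster cleanly in one pass.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
C = centroids(X, np.array([0, 0, 1, 1]), [0, 1])
labels, wss = assign(X, C)  # labels [0, 0, 1, 1], wss 2.0
```

The R code loops this assignment until no class changes, with f_NA_centroid_exception dropping clusters that lose all their members; the sketch shows only one iteration of the two steps.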
-------------------------------------------------------------------------------- /python/sgt-package/sgt/sgt.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from itertools import chain 4 | from itertools import product as iterproduct 5 | import warnings 6 | 7 | class SGT(): 8 | ''' 9 | Compute embedding of a single or a collection of discrete item 10 | sequences. A discrete item sequence is a sequence made from a set 11 | of discrete elements, also known as the alphabet set. For example, 12 | suppose the alphabet set is the set of roman letters, 13 | {A, B, ..., Z}. This set is made of discrete elements. Examples of 14 | sequences from such a set are AABADDSA, UADSFJPFFFOIHOUGD, etc. 15 | Such sequence datasets are commonly found in the online industry, 16 | for example, item purchase history, where the alphabet set is 17 | the set of all product items. Sequence datasets are abundant in 18 | bioinformatics as protein sequences. 19 | Using the embeddings created here, classification and clustering 20 | models can be built for sequence datasets. 21 | Read more in https://arxiv.org/pdf/1608.03533.pdf 22 | 23 | Parameters 24 | ---------- 25 | Input 26 | 27 | alphabets Optional, except if mode is 'spark'. 28 | The set of alphabets that make up all 29 | the sequences in the dataset. If not passed, the 30 | alphabet set is automatically computed as the 31 | unique set of elements that make up all the sequences. 32 | A list or 1d-array of the set of elements that make up the 33 | sequences. For example, np.array(["A", "B", "C"]). 34 | If mode is 'spark', the alphabets are necessary. 35 | 36 | kappa Tuning parameter, kappa > 0, to change the extraction of 37 | long-term dependency. The higher the value, the lesser 38 | the long-term dependency captured in the embedding. 39 | Typical values for kappa are 1, 5, 10. 40 | 41 | lengthsensitive Default False.
This is set to True if the embedding 42 | should have the information of the length of the sequence. 43 | If set to False, then the embedding of two sequences with 44 | similar patterns but different lengths will be the same. 45 | lengthsensitive = False is similar to length-normalization. 46 | 47 | flatten Default True. If True, the SGT embedding is flattened and returned as 48 | a vector. Otherwise, it is returned as a matrix with the row and column 49 | names the same as the alphabets. The matrix form is used for 50 | interpretation purposes, especially to understand how the alphabets 51 | are "related". Otherwise, for applying machine learning or deep 52 | learning algorithms, the embedding vectors are required. 53 | 54 | mode Choices in {'default', 'multiprocessing'}. 55 | 56 | processors Used if mode is 'multiprocessing'. By default, the 57 | number of processors used in multiprocessing is 58 | the number available - 1. 59 | ''' 60 | 61 | def __init__(self, 62 | alphabets=[], 63 | kappa=1, 64 | lengthsensitive=False, 65 | flatten=True, 66 | mode='default', 67 | processors=None, 68 | lazy=False): 69 | 70 | self.alphabets = alphabets 71 | 72 | if len(self.alphabets) != 0: 73 | self.feature_names = self.__set_feature_names(self.alphabets) 74 | 75 | self.kappa = kappa 76 | self.lengthsensitive = lengthsensitive 77 | self.flatten = flatten 78 | self.mode = mode 79 | self.processors = processors 80 | 81 | if self.processors is None: 82 | import os 83 | self.processors = os.cpu_count() - 1 84 | 85 | self.lazy = lazy 86 | 87 | def getpositions(self, sequence, alphabets): 88 | ''' 89 | Compute the index positions of elements in the sequence 90 | given the alphabet set. 91 | 92 | Returns a list of tuples [(value, positions)] 93 | ''' 94 | positions = [(v, np.where(sequence == v)) 95 | for v in alphabets if v in sequence] 96 | 97 | return positions 98 | 99 | def fit(self, sequence): 100 | ''' 101 | Extract Sequence Graph Transform features using Algorithm-2.
102 | 103 | sequence An array of discrete elements. For example, 104 | np.array(["B","B","A","C","A","C","A","A","B","A"]). 105 | 106 | return: sgt matrix or vector (depending on flatten being False or True) 107 | 108 | ''' 109 | 110 | sequence = np.array(sequence) 111 | 112 | if(len(self.alphabets) == 0): 113 | self.alphabets = self.estimate_alphabets(sequence) 114 | self.feature_names = self.__set_feature_names(self.alphabets) 115 | 116 | size = len(self.alphabets) 117 | l = 0 118 | W0, Wk = np.zeros((size, size)), np.zeros((size, size)) 119 | positions = self.getpositions(sequence, self.alphabets) 120 | 121 | alphabets_in_sequence = np.unique(sequence) 122 | 123 | for u in alphabets_in_sequence: 124 | index = [p[0] for p in positions].index(u) 125 | 126 | U = np.array(positions[index][1]).ravel() 127 | 128 | for v in alphabets_in_sequence: 129 | index = [p[0] for p in positions].index(v) 130 | 131 | V2 = np.array(positions[index][1]).ravel() 132 | 133 | C = [(i, j) for i in U for j in V2 if j > i] 134 | 135 | cu = np.array([ic[0] for ic in C]) 136 | cv = np.array([ic[1] for ic in C]) 137 | 138 | # Insertion positions 139 | pos_i = self.alphabets.index(u) 140 | pos_j = self.alphabets.index(v) 141 | 142 | W0[pos_i, pos_j] = len(C) 143 | 144 | Wk[pos_i, pos_j] = np.sum(np.exp(-self.kappa * np.abs(cu - cv))) 145 | 146 | l += U.shape[0] # running total of the sequence length 147 | 148 | if self.lengthsensitive: 149 | W0 /= l 150 | 151 | W0[np.where(W0 == 0)] = 1e7 # avoid divide-by-zero; Wk is also 0 at these entries, so sgt stays ~0 there 152 | 153 | sgt = np.power(np.divide(Wk, W0), 1/self.kappa) 154 | 155 | if(self.flatten): 156 | sgt = pd.Series(sgt.flatten(), index=self.feature_names) 157 | else: 158 | sgt = pd.DataFrame(sgt, 159 | columns=self.alphabets, 160 | index=self.alphabets) 161 | return sgt 162 | 163 | def __flatten(self, listOfLists): 164 | "Flatten one level of nesting" 165 | flat = [x for sublist in listOfLists for x in sublist] 166 | return flat 167 | 168 | def estimate_alphabets(self, corpus): 169 | if len(corpus) > 1e5:
170 | # sys was never imported in this module, so sys.exit(1) would itself fail; raise instead 171 | raise ValueError("Too many sequences. Pass the alphabet list as an input.") 172 | else: 173 | return np.unique(np.asarray(self.__flatten(corpus))).tolist() 174 | 175 | def set_alphabets(self, corpus): 176 | self.alphabets = self.estimate_alphabets(corpus) 177 | self.feature_names = self.__set_feature_names(self.alphabets) 178 | return self 179 | 180 | def get_alphabets(self): 181 | return self.alphabets 182 | 183 | def get_feature_names(self): 184 | return self.feature_names 185 | 186 | def __fit_to_list(self, sequence): 187 | return list(self.fit(sequence)) 188 | 189 | def __set_feature_names(self, alphabets): 190 | return list(iterproduct(alphabets, alphabets)) 191 | 192 | def fit_transform(self, corpus): 193 | ''' 194 | Inputs: 195 | corpus A pandas DataFrame with columns 'id' and 'sequence', where each sequence is a list of alphabets. 196 | ''' 197 | 198 | if(len(self.alphabets) == 0): 199 | self.alphabets = self.estimate_alphabets(corpus['sequence']) 200 | self.feature_names = self.__set_feature_names(self.alphabets) 201 | 202 | if self.mode=='default': 203 | sgt = corpus.apply(lambda x: [x['id']] + list(self.fit(x['sequence'])), 204 | axis=1, 205 | result_type='expand') 206 | sgt.columns = ['id'] + self.feature_names 207 | return sgt 208 | elif self.mode=='multiprocessing': 209 | # Import 210 | from pandarallel import pandarallel 211 | # Initialization 212 | pandarallel.initialize(nb_workers=self.processors) 213 | sgt = corpus.parallel_apply(lambda x: [x['id']] + 214 | list(self.fit(x['sequence'])), 215 | axis=1, 216 | result_type='expand') 217 | sgt.columns = ['id'] + self.feature_names 218 | return sgt 219 | 220 | def transform(self, corpus): 221 | ''' 222 | Inputs: 223 | corpus A pandas DataFrame with columns 'id' and 'sequence', where each sequence is a list of alphabets. 224 | ''' 225 | 226 | ''' 227 | The difference between fit_transform and transform is: 228 | in transform() the alphabets are already known; 229 | in fit_transform() the alphabets are not known, so they 230 | are computed.
231 | The computation in fit is essentially getting the 232 | alphabet set. 233 | ''' 234 | 235 | if self.mode=='default': 236 | sgt = corpus.apply(lambda x: [x['id']] + list(self.fit(x['sequence'])), 237 | axis=1, 238 | result_type='expand') 239 | sgt.columns = ['id'] + self.feature_names 240 | return sgt 241 | elif self.mode=='multiprocessing': 242 | # Import 243 | from pandarallel import pandarallel 244 | # Initialization 245 | pandarallel.initialize(nb_workers=self.processors) 246 | sgt = corpus.parallel_apply(lambda x: [x['id']] + 247 | list(self.fit(x['sequence'])), 248 | axis=1, 249 | result_type='expand') 250 | sgt.columns = ['id'] + self.feature_names 251 | return sgt -------------------------------------------------------------------------------- /python/sgt-package/sgt_cran2367_bugfix_1.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: sgt-cran2367-bugfix-1 3 | Version: 1.0.0 4 | Summary: Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. With SGT you can tune the amount of short- to long-term patterns extracted in the embeddings without any increase in computation. 5 | Home-page: https://github.com/cran2367/sgt 6 | Author: Chitta Ranjan 7 | Author-email: cran2367@gmail.com 8 | License: UNKNOWN 9 | Description: # Sequence Graph Transform (SGT) 10 | 11 | #### Maintained by: Chitta Ranjan, PhD (cran2367@gmail.com). 12 | 13 | 14 | This is the open-source code repository for SGT. Sequence Graph Transform extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. Importantly, SGT has a low computational cost and can extract any amount of short- to long-term patterns without any increase in computation.
These properties are proved theoretically and demonstrated on real data in this paper: https://arxiv.org/abs/1608.03533. 15 | 16 | If using this code or dataset, please cite the following: 17 | 18 | [1] Ranjan, Chitta, Samaneh Ebrahimi, and Kamran Paynabar. "Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining." arXiv preprint arXiv:1608.03533 (2016). 19 | 20 | @article{ranjan2016sequence, 21 | title={Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining}, 22 | author={Ranjan, Chitta and Ebrahimi, Samaneh and Paynabar, Kamran}, 23 | journal={arXiv preprint arXiv:1608.03533}, 24 | year={2016} 25 | } 26 | 27 | ## Quick validation of your code 28 | Apply the algorithm on a sequence `BBACACAABA`. The parts of SGT, W(0) and W(\kappa), in Algorithm 1 & 2 in [1], and the resulting SGT estimate will be (line-by-line execution of `main.R`): 29 | 30 | ``` 31 | alphabet_set <- c("A", "B", "C") # Denoted by variable V in [1] 32 | seq <- "BBACACAABA" 33 | 34 | kappa <- 5 35 | ###### Algorithm 1 ###### 36 | sgt_parts_alg1 <- f_sgt_parts(sequence = seq, kappa = kappa, alphabet_set_size = length(alphabet_set)) 37 | print(sgt_parts_alg1) 38 | ``` 39 | 40 | *Result* 41 | ``` 42 | $W0 43 | A B C 44 | A 10 4 3 45 | B 11 3 4 46 | C 7 2 1 47 | 48 | $W_kappa 49 | A B C 50 | A 0.006874761 6.783349e-03 1.347620e-02 51 | B 0.013521602 6.737947e-03 4.570791e-05 52 | C 0.013521604 3.059162e-07 4.539993e-05 53 | ``` 54 | 55 | ``` 56 | sgt <- f_SGT(W_kappa = sgt_parts_alg1$W_kappa, W0 = sgt_parts_alg1$W0, 57 | Len = sgt_parts_alg1$Len, kappa = kappa) # Set Len = NULL for length-sensitive SGT. 58 | print(sgt) 59 | ``` 60 | 61 | *Result* 62 | ``` 63 | A B C 64 | A 0.3693614 0.44246287 0.5376371 65 | B 0.4148844 0.46803816 0.1627745 66 | C 0.4541361 0.06869332 0.2144920 67 | 68 | ``` 69 | 70 | Similarly, the execution for Algorithm-2 is shown in `main.R`. 
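For readers who prefer Python, the W0, W_kappa, and SGT numbers above can be cross-checked with a short self-contained NumPy sketch. This is our own illustration of Algorithm 1 (not a file from this repository; the function and variable names are ours):

```python
import numpy as np

def sgt_parts(seq, kappa, alphabet):
    # W0[u, v]: number of pairs (i, j) with i < j, seq[i] = u, seq[j] = v
    # Wk[u, v]: the same pairs, weighted by exp(-kappa * (j - i))
    n = len(alphabet)
    idx = {a: k for k, a in enumerate(alphabet)}
    W0, Wk = np.zeros((n, n)), np.zeros((n, n))
    for i in range(len(seq) - 1):
        for j in range(i + 1, len(seq)):
            u, v = idx[seq[i]], idx[seq[j]]
            W0[u, v] += 1
            Wk[u, v] += np.exp(-kappa * (j - i))
    return W0, Wk

kappa = 5
W0, Wk = sgt_parts("BBACACAABA", kappa, alphabet=["A", "B", "C"])

# Length-normalized SGT: (W_kappa / (W0 / Len))^(1/kappa); entries with W0 = 0 stay 0
Len = len("BBACACAABA")
sgt = np.zeros_like(W0)
nz = W0 > 0
sgt[nz] = (Wk[nz] / (W0[nz] / Len)) ** (1 / kappa)
print(W0)   # matches the $W0 table above
print(sgt)  # matches the SGT table above, e.g. sgt[0, 0] ~ 0.3694
```

Comparing `W0` and `sgt` against the R output above is a quick sanity check that an independent implementation is correct.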
71 | 72 | ## Illustration and use of the code 73 | Open file `main.R` and execute it line by line to understand the process. In this sample execution, we present SGT estimation from either of the two algorithms presented in [1]. The first part is for understanding the SGT computation process. 74 | 75 | In the next part we demonstrate sequence clustering using SGT on a synthesized sample dataset. The sequence lengths in the dataset range between 45 and 711 with a uniform distribution (hence, the average length is ~378). Similar sequences in the dataset share similar patterns, i.e., common substrings. These common substrings can be of any length. Also, the order of the instances of these substrings is arbitrary and random in different sequences. For example, the following two sequences have common patterns. One common substring in both is `QZTA`, which occurs at arbitrary positions in both sequences. The two sequences have other common substrings as well. Besides these commonalities, there is a significant amount of noise present in the sequences. On average, about 40% of the letters in all sequences in the dataset are noise.
76 | 77 | ``` 78 | AKQZTAEEYTDZUXXIRZSTAYFUIXCPDZUXMCSMEMVDVGMTDRDDEJWNDGDPSVPKJHKQBRKMXHHNLUBXBMHISQ 79 | WEHGXGDDCADPVKESYQXGRLRZSTAYFUOQZTAWTBRKMXHHNWYRYBRKMXHHNPRNRBRKMXHHNPBMHIUSVXBMHI 80 | WXQRZSTAYFUCWRZSTAYFUJEJDZUXPUEMVDVGMTOHUDZUXLOQSKESYQXGRCTLBRKMXHHNNJZDZUXTFWZKES 81 | YQXGRUATSNDGDPWEBNIQZMBNIQKESYQXGRSZTTPTZWRMEMVDVGMTAPBNIRPSADZUXJTEDESOKPTLJEMZTD 82 | LUIPSMZTDLUIWYDELISBRKMXHHNMADEDXKESYQXGRWEFRZSTAYFUDNDGDPKYEKPTSXMKNDGDPUTIQJHKSD 83 | ZUXVMZTDLUINFNDGDPMQZTAPPKBMHIUQIUBMHIEKKJHK 84 | ``` 85 | 86 | ``` 87 | SDBRKMXHHNRATBMHIYDZUXMTRMZTDLUIEKDEIBQZTAZOAMZTDLUILHGXGDDCAZEXJHKTDOOHGXGDDCAKZH 88 | NEMVDVGMTIHZXDEROEQDEGZPPTDBCLBMHIJMMKESYQXGRGDPTNBRKMXHHNGCBYNDGDPKMWKBMHIDQDZUXI 89 | HKVBMHINQZTAHBRKMXHHNIRBRKMXHHNDISDZUXWBOYEMVDVGMTNTAQZTA 90 | ``` 91 | 92 | Identifying similar sequences with good accuracy, and also low false positives (calling sequences similar when they are not), is difficult in such situations due to: 93 | 94 | 1. _Different lengths of the sequences_: because of the differing lengths, figuring out that two sequences share the same inherent pattern is not straightforward. Normalizing the pattern features by the sequence length is a non-trivial problem. 95 | 96 | 2. _Commonalities are not in order_: as shown in the above example sequences, the common substrings can be anywhere. This makes methods such as alignment-based approaches infeasible. 97 | 98 | 3. _Significant amount of noise_: a good amount of noise is a nemesis to most sequence similarity algorithms. It often results in high false positives. 99 | 100 | ### SGT Clustering 101 | 102 | The dataset here is a good example of the above challenges. We run clustering on the dataset in `main.R`. The sequences in the dataset are from 5 (=K) clusters. We use this ground truth about the number of clusters as input to our execution below. Although, in reality, the true number of clusters is unknown for a dataset, here we are demonstrating the SGT implementation.
Regardless, using the _random search procedure_ discussed in Sec. SGT-ALGORITHM in [1], we could recover the number of clusters as 5. For simplicity, that step is kept out of this demonstration. 103 | 104 | > Other state-of-the-art sequence clustering methods had significantly poorer performance even with the true number of clusters (K=5). HMM had good performance but significantly higher computation time. 105 | 106 | 107 | ``` 108 | ## The dataset contains all Roman letters, A-Z. 109 | dataset <- read.csv("dataset.csv", header = T, stringsAsFactors = F) 110 | 111 | sgt_parts_sequences_in_dataset <- f_SGT_for_each_sequence_in_dataset(sequence_dataset = dataset, 112 | kappa = 5, alphabet_set = LETTERS, 113 | spp = NULL, sgt_using_alphabet_positions = T) 114 | 115 | 116 | input_data <- f_create_input_kmeans(all_seq_sgt_parts = sgt_parts_sequences_in_dataset, 117 | length_normalize = T, 118 | alphabet_set_size = 26, 119 | kappa = 5, trace = TRUE, 120 | inv.powered = T) 121 | K = 5 122 | clustering_output <- f_kmeans(input_data = input_data, K = K, alphabet_set_size = 26, trace = T) 123 | 124 | cc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output$class), K = K, type = "f1") 125 | print(cc) 126 | ``` 127 | *Result* 128 | ``` 129 | $cc 130 | Confusion Matrix and Statistics 131 | 132 | Reference 133 | Prediction a b c d e 134 | a 50 0 0 0 0 135 | b 0 66 0 0 0 136 | c 0 0 60 0 0 137 | d 0 0 0 55 0 138 | e 0 0 0 0 68 139 | 140 | Overall Statistics 141 | 142 | Accuracy : 1 143 | 95% CI : (0.9877, 1) 144 | No Information Rate : 0.2274 145 | P-Value [Acc > NIR] : < 2.2e-16 146 | 147 | Kappa : 1 148 | Mcnemar's Test P-Value : NA 149 | 150 | Statistics by Class: 151 | 152 | Class: a Class: b Class: c Class: d Class: e 153 | Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000 154 | Specificity 1.0000 1.0000 1.0000 1.0000 1.0000 155 | Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 156 | Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 157 |
Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 158 | Detection Rate 0.1672 0.2207 0.2007 0.1839 0.2274 159 | Detection Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 160 | Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000 161 | 162 | $F1 163 | F1 164 | 1 165 | ``` 166 | 167 | As we can see, the clustering result is accurate with no false positives. The F1 score is 1.0. 168 | 169 | > Note: Do not run the function `f_clustering_accuracy` when `K` is large (> 7), because it performs a permutation operation that becomes expensive. 170 | 171 | ### PCA on SGT & Clustering 172 | 173 | To demonstrate PCA on SGT for dimension reduction followed by clustering, we added another code snippet. PCA becomes more important on datasets where the SGTs are sparse. A sparse SGT arises when the alphabet set is large but the observed sequences contain only a few of those alphabets. For example, the alphabet set for a sequence dataset of music-listening histories will have thousands to millions of songs, but a single sequence will contain only a few of them. 174 | 175 | ``` 176 | ######## Clustering on Principal Components of SGT features ######## 177 | num_pcs <- 5 # Number of principal components we want 178 | input_data_pcs <- f_pcs(input_data = input_data, PCs = num_pcs)$input_data_pcs 179 | 180 | clustering_output_pcs <- f_kmeans(input_data = input_data_pcs, K = K, alphabet_set_size = sqrt(num_pcs), trace = F) 181 | 182 | cc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output_pcs$class), K = K, type = "f1") 183 | print(cc) 184 | ``` 185 | 186 | *Result* 187 | ``` 188 | $cc 189 | Confusion Matrix and Statistics 190 | 191 | Reference 192 | Prediction a b c d e 193 | a 50 0 0 0 0 194 | b 0 66 0 0 0 195 | c 0 0 60 0 0 196 | d 0 0 0 55 0 197 | e 0 0 0 0 68 198 | 199 | Overall Statistics 200 | 201 | Accuracy : 1 202 | 95% CI : (0.9877, 1) 203 | No Information Rate : 0.2274 204 | P-Value [Acc > NIR] : < 2.2e-16 205 | 206 | Kappa : 1 207 | Mcnemar's Test
P-Value : NA 208 | 209 | Statistics by Class: 210 | 211 | Class: a Class: b Class: c Class: d Class: e 212 | Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000 213 | Specificity 1.0000 1.0000 1.0000 1.0000 1.0000 214 | Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 215 | Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000 216 | Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 217 | Detection Rate 0.1672 0.2207 0.2007 0.1839 0.2274 218 | Detection Prevalence 0.1672 0.2207 0.2007 0.1839 0.2274 219 | Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000 220 | 221 | $F1 222 | F1 223 | 1 224 | ``` 225 | 226 | The clustering result remains accurate when clustering on the principal components of the sequences' SGTs. 227 | 228 | 229 | ----------------------- 230 | #### Comments: 231 | 1. Simplicity: SGT is simple to implement. No numerical optimization or other solution-search algorithm is required to estimate SGT. This makes it deterministic and powerful. 232 | 2. Length sensitive: The length-sensitive version of SGT can be easily tried by changing the marked arguments in `main.R`. 233 | 234 | #### Note: 235 | 1. Small alphabet set: If the alphabet set is small (< 4), SGT's performance may not be good. This is because the feature space becomes too small. 236 | 2. Faster implementation: The provided code is research-level code, not optimized for speed. Significant speed improvements can be made, e.g., by multithreading the SGT estimation across sequences in a dataset. 237 | 238 | #### Additional resource: 239 | Python implementation: Please refer to 240 | 241 | https://github.com/datashinobi/Sequence-Graph-transform 242 | 243 | Thanks to Yassine for providing the Python implementation.
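The macro-F1 computed by `f_get_f1` above is easy to port to other languages. A minimal Python sketch of the same computation (our own illustration, mirroring the R function, not part of this repository):

```python
import numpy as np

def macro_f1(confusion):
    # Per-class F1 from a K x K confusion matrix (Prediction rows, Reference
    # columns, as printed by caret's confusionMatrix), averaged over classes.
    # Note: per-class F1 = 2*tp / (2*tp + fp + fn) is unchanged if fp and fn
    # are swapped, so the row/column orientation does not affect the result.
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # counted as class k, but actually another
    fn = confusion.sum(axis=1) - tp  # actually class k, but counted as another
    return float(np.mean(2 * tp / (2 * tp + fp + fn)))

# The diagonal confusion matrix reported above gives a perfect score
print(macro_f1(np.diag([50, 66, 60, 55, 68])))  # 1.0
```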
244 | Platform: UNKNOWN 245 | Classifier: Programming Language :: Python :: 3 246 | Classifier: License :: OSI Approved :: MIT License 247 | Classifier: Operating System :: OS Independent 248 | Description-Content-Type: text/markdown 249 | -------------------------------------------------------------------------------- /R/sgt.R: -------------------------------------------------------------------------------- 1 | ## This lookup table will be needed throughout for converting an integer to its corresponding alphabet 2 | alphabet_lookup <- data.frame(Integer=1:26, Alphabet = LETTERS) 3 | 4 | #################################################################################### 5 | ############### Algorithm 1: Parsing a sequence to obtain its SGT. ################# 6 | #################################################################################### 7 | 8 | f_sgt_parts <- function(sequence, kappa = 3, alphabet_set_size = 26, lag = 0, skip_same_char = FALSE, long_seq = FALSE, long_seq_ele_limits = NULL, spp = NULL) 9 | { 10 | 11 | # The inputs 12 | # Sequence Any given sequence (with padding) 13 | # kappa The tuning param 14 | # alphabet_set_size The number of alphabets the sequences are made up of 15 | # skip_same_char Especially for element clustering it does not make sense to count repeated characters. For example, in AAABC... the actual transition from A to B should count as just one; if we don't skip the repetition it will count as 3 (unnecessarily inflating the transitions). Hence, skip repetition (value = TRUE) for element clustering. 16 | # long_seq TRUE if sequences have many more alphabets (not just A-Z). 17 | # long_seq_ele_limits The limits of the alphabets in long sequences. For example,
in grayscale images it will be c(0,255) 18 | 19 | 20 | if(length(sequence) == 1) 21 | { 22 | s_split <- f_seq_split(sequence, spp = spp) 23 | } else{ 24 | s_split <- sequence # already split 25 | } 26 | 27 | 28 | if(long_seq == FALSE) 29 | { 30 | rnames <- cnames <- c(levels(alphabet_lookup[,'Alphabet']))[1:alphabet_set_size] 31 | }else{ 32 | alphabet_set_size <- long_seq_ele_limits[2] - long_seq_ele_limits[1] + 1 33 | rnames <- cnames <- seq(long_seq_ele_limits[1], long_seq_ele_limits[2], 1) 34 | } 35 | 36 | Len <- length(s_split) 37 | 38 | Ls <- list() 39 | 40 | # Just the W0 corresponding to m = 0, and the M-th moment 41 | iter_set <- c(0, kappa) 42 | 43 | for(m in iter_set) # The m=0 corresponds to W0 44 | { 45 | mat_Lm <- matrix(rep(0,alphabet_set_size*alphabet_set_size), nrow=alphabet_set_size) 46 | rownames(mat_Lm) <- rnames 47 | colnames(mat_Lm) <- cnames 48 | 49 | for(i in 1:(Len-1)) 50 | { 51 | if(skip_same_char == TRUE && s_split[i] == s_split[i+1]) 52 | { 53 | # Skip this iteration when the next event in the sequence is the same as the current one 54 | next 55 | } 56 | 57 | 58 | for(j in (i+1):length(s_split)) 59 | { 60 | if(abs(j-i)>lag) 61 | { 62 | mat_Lm[s_split[i], s_split[j]] <- mat_Lm[s_split[i], s_split[j]] + exp(-1*((abs(j-i) - lag)*m)) 63 | 64 | } 65 | } 66 | } 67 | 68 | Ls[[length(Ls) + 1]] <- mat_Lm 69 | } 70 | 71 | output <- list(Len = Len, W0 = Ls[[1]], W_kappa = Ls[[2]]) 72 | return(output) 73 | } 74 | 75 | 76 | ############################################################################################################## 77 | ############### Algorithm 2: Extract SGT features by scanning alphabet positions of a sequence ############### 78 | ############################################################################################################## 79 | 80 | f_get_alphabet_positions <- function(sequence_split, alphabet_set) 81 | { 82 | # This function corresponds to the one defined in Line 1 in Algorithm 2 in [1] 83 | # Inputs 84 | # sequence_split A
sequence is passed as a vector of alphabets. It is called sequence split because a string sequence is split into its alphabets (in the same order) 85 | # alphabet_set The set of alphabets the sequence is made of. 86 | 87 | positions <- list() 88 | for(e in alphabet_set) 89 | { 90 | positions[[e]] <- which(sequence_split == e) 91 | } 92 | return(positions) 93 | } 94 | 95 | f_sgt_parts_using_alphabet_positions <- function(seq_alphabet_positions, alphabet_set, kappa = 12, lag = 0, skip_same_char = F) 96 | { 97 | ### See the comments for the input parameters in the f_seq_transform function. 98 | # seq_alphabet_positions A list of index positions of all alphabets in the sequence 99 | # alphabet_set The set of alphabets possible in the sequence. The long_seq and long_seq_ele_limits parameters are not needed here because the alphabets are given. 100 | 101 | Len <- sum(unlist(lapply(seq_alphabet_positions, function(x) length(x)))) # The sequence length 102 | 103 | Ls <- list() 104 | 105 | iter_set <- c(0, kappa) 106 | 107 | alphabet_set_size <- length(alphabet_set) 108 | for(m in iter_set) 109 | { 110 | mat_Lm <- matrix(rep(0,alphabet_set_size*alphabet_set_size), nrow=alphabet_set_size) 111 | rownames(mat_Lm) <- colnames(mat_Lm) <- alphabet_set 112 | 113 | for(i in alphabet_set) 114 | { 115 | for(j in alphabet_set) 116 | { 117 | enumerated_combos <- arrange(expand.grid(i = seq_alphabet_positions[[i]], 118 | j = seq_alphabet_positions[[j]]), 119 | i) # arrange() comes from the dplyr package 120 | x <- c(enumerated_combos[,"j"]-enumerated_combos[,"i"]) 121 | x.positives <- x[x>0] # Only the positive x's correspond to forward transitions in the sequence; the rest mean element j occurred before element i.
122 | mat_Lm[i,j] <- sum(exp(-1*m*x.positives)) # Line 15 in Algorithm 2 in [1] 123 | } 124 | } 125 | 126 | Ls[[length(Ls) + 1]] <- mat_Lm 127 | } 128 | 129 | output <- list(Len = Len, W0 = Ls[[1]], W_kappa = Ls[[2]]) 130 | return(output) 131 | } 132 | 133 | 134 | #################################################################################### 135 | ##### Yield SGT output from the SGT parts computed from either algorithm 1 or 2 #### 136 | #################################################################################### 137 | 138 | f_SGT <- function(W_kappa, W0, kappa, Len = NULL, inv.powered = T) 139 | { 140 | ## This function computes the resulting SGT from the SGT parts found in the function f_sgt_parts(). 141 | # Inputs 142 | # W_kappa See Algorithm 1 in [1] 143 | # W0 See Algorithm 1 in [1] 144 | # Len Length of the sequence 145 | # inv.powered TRUE if we want to take the kappa-th root of the SGT, as shown in Algorithm 1 [1]. 146 | 147 | if(!is.null(Len)) # Normalizing for the length 148 | { 149 | W0 <- W0/Len 150 | W0[W0 == 0] <- NA 151 | } 152 | 153 | tmp <- W_kappa/W0 154 | 155 | tmp[is.na(tmp)] <- 0 156 | 157 | SGT_mat <- tmp 158 | 159 | if(inv.powered){ 160 | SGT_mat <- Math.invpow(SGT_mat, pow = kappa) 161 | } 162 | 163 | return(SGT_mat) 164 | } 165 | 166 | 167 | f_SGT_for_each_sequence_in_dataset <- function(sequence_dataset, kappa = 3, alphabet_set = LETTERS, lag = 0, skip_same_char = FALSE, long_seq = FALSE, long_seq_ele_limits = NULL, spp = NULL, sgt_using_alphabet_positions = F, trace = T) 168 | { 169 | # The inputs 170 | # Sequence_dataset Either a vector with each element as a string (a sequence), or a dataframe with the sequences under column name 'seq'. 171 | # kappa The tuning param 172 | # alphabet_set_size The number of alphabets the sequences are made up of 173 | # skip_same_char Especially for element clustering it does not make sense to count repeated characters. For example, AAABC...
the actual transition from A to B should count as just one; if we don't skip the repetition it will count as 3 (unnecessarily inflating the transitions). Hence, skip repetition (value = TRUE) for element clustering. 174 | # long_seq TRUE if sequences have many more alphabets (not just A-Z). 175 | # long_seq_ele_limits The limits of the alphabets in long sequences. For example, in grayscale images it will be c(0,255) 176 | # sgt_using_alphabet_positions 177 | # If TRUE, then the alternate algorithm (Algorithm 2 in the paper) will be used. 178 | 179 | n.seq <- nrow(sequence_dataset) 180 | 181 | alphabet_set_size <- length(alphabet_set) 182 | 183 | Len_all <- array(rep(0,n.seq), dim = c(n.seq)) 184 | W0_all <- array(rep(0,n.seq*alphabet_set_size*alphabet_set_size), dim=c(alphabet_set_size,alphabet_set_size,n.seq)) 185 | 186 | W_kappa_all <- list() 187 | 188 | if(ncol(sequence_dataset) > 1) 189 | { 190 | sequences <- sequence_dataset[, 'seq'] 191 | }else{ 192 | sequences <- sequence_dataset 193 | } 194 | 195 | for(i in 1:n.seq) 196 | { 197 | if(trace){print(paste("Sequence",i,"of",n.seq))} 198 | 199 | if(!sgt_using_alphabet_positions) 200 | { 201 | sgt_parts <- f_sgt_parts(sequence = sequences[i], kappa = kappa, 202 | alphabet_set_size = alphabet_set_size, lag = lag, 203 | skip_same_char = skip_same_char, 204 | long_seq = long_seq, long_seq_ele_limits = long_seq_ele_limits) 205 | }else{ 206 | s_split <- f_seq_split(sequence = sequences[i], spp = spp) 207 | seq_alphabet_positions <- f_get_alphabet_positions(sequence_split = s_split, alphabet_set = alphabet_set) 208 | sgt_parts <- f_sgt_parts_using_alphabet_positions(seq_alphabet_positions = seq_alphabet_positions, 209 | alphabet_set = alphabet_set, 210 | kappa = kappa, 211 | lag = lag, skip_same_char = skip_same_char) 212 | } 213 | 214 | tmp <- sgt_parts$W0 215 | 216 | Len_all[i] <- sgt_parts$Len 217 | W0_all[, , i] <- tmp 218 | 219 | W_kappa_all[[length(W_kappa_all) + 1]] <- sgt_parts$W_kappa 220 | } 221 |
dimnames(W0_all) <- list(rownames(tmp), colnames(tmp), c(sequence_dataset[,1])) 222 | 223 | output <- list(Len_all = Len_all, W0_all = W0_all, W_kappa_all = W_kappa_all) 224 | return(output) 225 | } 226 | 227 | 228 | 229 | ################################################################################ 230 | ############################ Auxiliary functions ############################ 231 | ################################################################################ 232 | 233 | ## Get alphabet for an integer 234 | f_get_alphabet <- function(integer) 235 | { 236 | return(levels(factor(alphabet_lookup[alphabet_lookup[,'Integer']==integer, 'Alphabet']))) 237 | } 238 | 239 | 240 | f_seq_split <- function(sequence, spp = NULL) 241 | { 242 | ## Split a sequence into a vector of alphabets. The order of alphabets is retained. Usually we get a sequence as a long string. This function just splits it to be further processed for SGT. 243 | ## Inputs 244 | # sequence A sequence, e.g. "FSDFSFIFFSAOPDSA" 245 | # spp The separator of alphabets in the sequence. In the above example it is NULL. 246 | 247 | ## Output 248 | # s_split The input sequence returned as a vector of alphabets in the same order. 
249 | if(!is.null(spp)) 250 | { 251 | tmp <- strsplit(x = sequence, split = spp) 252 | s_split <- tmp 253 | }else{ 254 | countCharOccurrences <- function(char, s) { 255 | s2 <- gsub(char,"",s) 256 | return (nchar(s) - nchar(s2)) 257 | } 258 | 259 | if(countCharOccurrences("-",sequence) > 1) 260 | { 261 | tmp <- strsplit(sequence, "-") 262 | s_split <- tmp 263 | }else if(countCharOccurrences("-",sequence) == 1) 264 | { 265 | tmp <- strsplit(sequence, "-") 266 | tmp <- tmp[[1]][1] 267 | s_split <- strsplit(tmp,"") 268 | }else if(length(grep(" ",sequence))) 269 | { 270 | s_split <- strsplit(sequence," ") 271 | }else if(length(grep("~",sequence))) 272 | { 273 | s_split <- strsplit(sequence,"~") 274 | }else 275 | { 276 | s_split <- strsplit(sequence,"") 277 | } 278 | } 279 | 280 | s_split <- s_split[[1]] 281 | s_split <- s_split[s_split != ""] 282 | return(s_split) 283 | } 284 | 285 | 286 | Math.invpow <- function(x, pow) { 287 | sign(x) * abs(x)^(1/pow) 288 | } 289 | 290 | Math.pow <- function(x, pow) { 291 | out <- 1 292 | if(pow > 0){ 293 | for(p in 1:pow) 294 | { 295 | out <- out * x 296 | } 297 | }else if(pow == 0){ 298 | out <- 1 299 | } 300 | return(out) 301 | } 302 | 303 | Math.matrixpow <- function(x, pow) { 304 | out <- x 305 | if(pow > 0){ 306 | for(p in 1:pow) 307 | { 308 | out <- out %*% x 309 | } 310 | }else if(pow == 0){ 311 | out <- 1 312 | } 313 | return(out) 314 | } 315 | 316 | Math.matrix_norm <- function(mat, norm) 317 | { 318 | if(norm == 1) 319 | { 320 | out <- abs(mat) 321 | }else{ 322 | out <- mat^norm 323 | } 324 | return(out) 325 | } 326 | 327 | Math.standardize <- function(x, y) { 328 | y[y == 0] <- NA 329 | out <- x/y 330 | out[is.na(out)] <- 0 331 | return(out) 332 | } 333 | 334 | 335 | f_get_f1 <- function(confusion) 336 | { 337 | ## In this function we find F1 score from a confusion matrix. 
This will also be used to select a clustering model in the function f_clustering_accuracy() 338 | K <- ncol(confusion) 339 | f1 <- NULL 340 | for(k in 1:K) 341 | { 342 | tp <- confusion[k,k] # True pos 343 | fp <- sum(confusion[, k])-confusion[k, k] # False pos 344 | fn <- sum(confusion[k, ])-confusion[k, k] # False neg 345 | tmp <- 2*tp / (2 * tp + fn + fp) 346 | f1 <- c(f1, tmp) 347 | } 348 | return (mean(f1)) 349 | } 350 | 351 | f_clustering_accuracy <- function(actual, pred, K = 2, type = "f1", trace = F, do.permutation = T) 352 | { 353 | ### In this function we will find the accuracy of clustering from any clustering method. 354 | ## Inputs 355 | # actual A vector of actual clusters 356 | # pred A vector of estimated clusters 357 | # K Number of clusters or classes 358 | # type Best confusion selection method, type = c("accuracy", "f1") 359 | library(gtools); library(caret) # confusionMatrix() comes from the caret package 360 | x <- letters[actual] 361 | y <- letters[pred] 362 | out_cc <- confusionMatrix(x,y) 363 | out_f1 <- NA 364 | if(type == "f1") 365 | { 366 | # out_f1 <- 2*(out_cc$byClass["Pos Pred Value"]*out_cc$byClass["Sensitivity"]/(out_cc$byClass["Pos Pred Value"]+out_cc$byClass["Sensitivity"])) 367 | out_f1 <- f_get_f1(confusion = out_cc$table) 368 | } 369 | 370 | if(do.permutation) 371 | { 372 | possibilities <- permutations(K,K,letters[1:K]) # Depending on the version, one of these two lines (this and the one below) works 373 | # possibilities <- matrix(letters[permutations(K)], ncol = K) 374 | 375 | ## We are trying all possibilities because the digit of the assigned class does not matter; any naming is fine. Thus, at the end of the day, the one with the best accuracy is the right one.
376 | for(poss in 2:nrow(possibilities)){ # Number of other (hence, starting from 2) possibilities 377 | if(trace){print(paste("Trying possibility",poss,sep="-"))} 378 | for(k in 1:K){ 379 | y[pred==k] <- possibilities[poss,k] 380 | } 381 | 382 | tmp <- confusionMatrix(x,y) 383 | if(type == "f1") 384 | { 385 | tmp_f1 <- f_get_f1(confusion = tmp$table) 386 | flag <- (tmp_f1 > out_f1) 387 | }else if(type == "accuracy"){ 388 | flag <- (tmp$overall["Accuracy"] > out_cc$overall["Accuracy"]) 389 | } 390 | 391 | if(flag){ # Choosing based on F1 score instead of accuracy 392 | out_cc <- tmp 393 | if(type=="f1"){ 394 | out_f1 <- tmp_f1 395 | names(out_f1) <- "F1" 396 | } 397 | 398 | if(trace){print(paste("Selecting possibility",poss,sep="-"))} 399 | } 400 | } 401 | } 402 | 403 | return(list(cc = out_cc, F1=out_f1)) 404 | } 405 | 406 | f_reorder_class_assignment <- function(class) 407 | { 408 | ## In this function we reorder the assigned classes in clustering, such that they are ordered with consecutive class labels for easier clustering accuracy check 409 | conse_class <- 1 410 | class_map <- matrix(c(class[1], conse_class), nrow = 1) 411 | final_class <- matrix(c(class[1], conse_class), nrow = 1) 412 | 413 | for(i in 2:length(class)) 414 | { 415 | if(class[i] == class[i-1]) 416 | { 417 | final_class <- rbind(final_class, cbind(class[i], class_map[class_map[,1]==class[i], 2])) 418 | }else{ 419 | if(class[i] %in% class_map[,1]) 420 | { 421 | final_class <- rbind(final_class, cbind(class[i], class_map[class_map[,1]==class[i], 2])) 422 | }else{ 423 | conse_class <- conse_class + 1 424 | final_class <- rbind(final_class, cbind(class[i], conse_class)) 425 | class_map <- rbind(class_map, cbind(class[i], conse_class)) 426 | } 427 | } 428 | } 429 | 430 | out <- list(class_mapped = final_class, consecutive_class = final_class[,2]) 431 | return(out) 432 | } 433 | 434 | 435 | f_seq_len_mu_var <- function(sequences) 436 | { 437 | # A function that will find the mean and var of the 
sequence lengths 438 | seq.lens <- NULL 439 | for(i in 1:nrow(sequences)) 440 | { 441 | tmp <- f_seq_split(sequences[i,'seq']) 442 | seq.lens <- c(seq.lens, length(tmp)) 443 | } 444 | seq.lens.mu <- mean(seq.lens) 445 | seq.lens.var <- var(seq.lens) 446 | 447 | out <- list(seq.lens.mu = seq.lens.mu, seq.lens.var = seq.lens.var, seq.lens = seq.lens) 448 | return(out) 449 | } 450 | -------------------------------------------------------------------------------- /data/darpa_data.csv: -------------------------------------------------------------------------------- 1 | "timeduration","seqlen","seq","class" 2 | 552,575,"1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~4~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~4~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~4~1~2~3~3~3~3~3~3~1~4~5",0 3 | 
22,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 4 | 524,43,"19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~19",0 5 | 14,311,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 6 | 0,45,"5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~2~5~3~3~5~5~5~5~18",0 7 | 71,136,"15~3~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~2~16~3~14~16~14~2~14~30~31~5~32~30~33~2~3~3~5~5~30~33~2~3~3~2~3~5~16~3~3~3~16~16~16~16~14~30~31~5~32~30~33~16~17~34~5~33~35~36~30~33~14~37~30~5~33~5~5~5~5~18",0 8 | 
23,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 9 | 6,156,"5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~2~5~2~2~3~5~2~2~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~2~3~5~5~5~5~5~5~5~18",0 10 | 2,74,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~5~5~5~18",0 11 | 23,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 12 | 
19,311,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 13 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 14 | 49,632,"38~2~11~11~12~11~5~2~11~5~12~2~3~5~22~39~20~40~20~23~5~5~14~2~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~6~5~6~5~6~5~6~5~5~5~5~2~3~40~15~3~3~15~3~6~5~2~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~16~5~5~5~2~28~15~3~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~3~5~5~5~5~5~5~2~5~5~5~2~5~2~5~2~5~2~5~2~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~2~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2
~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~2~15~3~5~5~5~5~5~5~5~5~5~18",0 15 | 3,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 16 | 2,89,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~2~5~2~2~3~2~2~2~2~2~2~2~3~5~2~3~3~3~3~3~3~3~41~1~41~1~5~5~5~5~5~5~18",0 17 | 3,26,"5~3~6~5~3~3~4~5~5~2~5~17~6~5~6~5~5~39~5~5~5~5~5~5~5~18",0 18 | 3,67,"14~16~2~2~6~6~16~16~17~6~17~2~3~5~2~3~5~16~3~6~6~2~3~5~22~20~16~16~16~20~20~38~17~6~3~5~6~41~2~5~17~2~3~3~3~3~5~3~2~5~5~33~5~4~39~5~5~5~5~5~5~5~5~5~5~5~18",0 19 | 3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 20 | 5,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 21 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 22 | 1,56,"15~3~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~3~2~2~14~2~2~2~14~3~2~5~5~5~5~5~5~5~18",0 23 | 
3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 24 | 12,286,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~16~17~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~34~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 25 | 
69,1773,"15~3~5~5~5~5~5~5~5~5~5~10~2~11~2~2~11~11~12~11~5~2~2~2~11~11~12~11~5~2~2~2~11~11~12~11~11~5~2~2~2~11~11~12~11~5~2~2~2~11~11~12~11~5~2~2~2~11~11~12~11~11~5~2~11~11~12~11~5~2~2~11~5~2~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~16~16~16~16~16~16~14~16~16~42~16~16~14~14~14~14~3~2~5~3~3~3~3~3~3~3~3~3~3~3~3~3~3~2~3~5~2~16~16~2~3~5~30~30~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~2~16~16~3~5~3~3~16~42~42~42~30~42~42~42~30~16~17~35~5~2~33~16~16~16~16~16~16~16~14~30~16~16~3~3~3~3~16~16~16~16~16~16~42~3~3~3~3~3~3~16~17~35~5~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3
~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~3~16~16~14~16~16~17~2~5~43~31~5~16~17~35~5~2~5~33~33~3~30~3~3~3~3~3~31~5~3~3~3~3~3~3~3~3~3~3~3~15~3~15~3~6~3~3~6~3~3~31~5~33~5~5~5~5~18",0 26 | 
18,199,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~2~3~5~2~2~13~3~3~14~2~3~3~3~3~3~3~1~3~1~3~3~3~4~5~3~3~15~41~2~2~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 27 | 0,49,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~2~11~44~12~5~5~5~5~18",0 28 | 1,88,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~2~16~3~14~14~30~5~5~5~5~18",0 29 | 12,268,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~16~16~2~5~2~5~2~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 30 | 2,69,"14~16~2~2~6~6~16~16~17~6~17~2~3~5~2~3~5~2~3~5~3~6~6~2~3~5~22~20~16~16~16~20~20~38~17~6~3~5~6~41~2~5~17~2~3~3~3~3~5~3~2~5~5~33~5~4~39~5~5~5~5~5~5~5~5~5~5~5~18",0 31 | 
3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 32 | 3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 33 | 1,56,"15~3~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~5~3~3~2~2~5~5~5~5~5~18",0 34 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 35 | 1,91,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~2~11~11~12~11~5~12~2~3~5~12~12~2~5~2~16~16~16~5~3~3~2~5~16~16~16~5~5~5~18",0 36 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 37 | 
2,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 38 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 39 | 2,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 40 | 5,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 41 | 1,96,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~2~3~5~2~11~11~12~11~5~12~2~2~2~16~16~16~16~5~5~12~12~5~3~3~2~5~16~16~16~16~16~16~5~5~5~5~18",0 42 | 3,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 43 | 
1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 44 | 1,83,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~2~11~11~12~11~5~12~2~3~5~12~12~2~5~2~5~3~3~5~5~5~18",0 45 | 2,48,"5~5~5~14~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~15~15~39~38~3~3~4~3~15~3~5~5~5~5~18",0 46 | 2,229,"10~2~11~2~11~11~12~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~2~2~3~5~2~3~3~3~3~3~3~40~2~5~3~2~3~5~2~40~2~2~3~42~16~2~3~5~5~23~14~16~2~2~6~6~16~16~2~3~5~17~6~17~3~3~2~16~5~16~16~2~2~14~3~5~5~2~5~2~3~5~2~3~5~17~6~3~3~3~5~17~2~6~41~2~2~3~3~3~3~5~3~2~5~5~33~5~4~39~5~5~5~5~5~5~5~5~5~18",0 47 | 4,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 48 | 1,172,"2~11~11~12~11~5~12~2~3~5~22~38~39~6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~2~3~3~3~2~16~17~2~3~5~2~5~3~39~27~28~5~39~3~5~23~2~3~5~2~3~5~2~3~5~2~3~5~2~2~3~5~2~3~2~3~5~2~3~5~5~5~5~5~5~5~18",0 49 | 
3,189,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~35~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 50 | 1,59,"5~5~5~14~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~15~15~39~3~5~16~16~2~5~2~5~2~2~3~30~30~5~5~6~5~5~5~5~5~5~18",0 51 | 18,336,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 52 | 0,108,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~42~30~30~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~2~16~2~11~11~12~11~5~2~11~11~12~11~2~11~5~2~11~11~12~11~5~2~12~12~12~12~12~12~5~12~5~5~5~18",0 53 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 54 | 
0,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 55 | 1,74,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~2~16~5~5~5~18",0 56 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 57 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 58 | 40,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 59 | 1,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 60 | 3,12,"16~2~3~5~22~22~16~14~14~4~16~33",0 61 | 
42,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 62 | 11,289,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~16~16~2~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 63 | 0,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 64 | 
15,335,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~27~2~5~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 65 | 2,188,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~6~5~6~5~6~5~2~3~5~2~16~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~8~2~3~2~3~5~2~3~5~2~3~5~2~3~5~3~45~9~5~5~5~5~5~5~18",1 66 | 44,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 67 | 
21,353,"15~2~5~5~5~5~5~6~5~6~5~6~5~5~2~2~6~6~17~17~5~5~6~6~5~5~2~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~46~8~2~3~5~2~3~2~16~5~16~16~16~16~16~2~3~3~3~3~2~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~2~3~5~2~3~5~5~2~3~5~2~3~5~2~3~5~47~9~16~2~11~11~12~11~5~12~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~5~5~5~5~18",0 68 | 14,199,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~2~3~5~2~2~13~3~3~14~2~3~3~3~3~3~3~1~3~1~3~3~3~4~5~3~3~15~41~2~2~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~18",1 69 | 48,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 70 | 
0,89,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~2~5~2~2~3~2~2~2~2~2~2~2~3~5~2~3~3~3~3~3~3~3~41~1~41~1~5~5~5~5~5~5~18",0 71 | 2,202,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~3~5~2~3~3~3~3~3~3~3~3~6~29~1~1~6~5~6~5~6~5~2~16~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~8~2~3~2~3~5~2~3~5~2~3~5~2~3~5~3~45~9~5~5~5~5~5~5~5~18",1 72 | 1,49,"15~3~5~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~2~11~44~12~5~5~5~5~18",0 73 | 13,289,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~16~16~2~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 74 | 53,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 
75 | 0,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 76 | 20,336,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 77 | 12,335,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~27~2~5~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 78 | 2,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 79 | 
1,74,"6~5~6~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~5~3~3~2~2~2~11~11~12~11~5~12~16~2~5~16~16~2~2~16~2~5~16~16~2~5~2~16~5~5~5~18",0 80 | 2,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 81 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 82 | 60,40,"5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~5~5~5~5~18",0 83 | 1,89,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~2~3~5~2~11~11~12~11~5~12~2~2~2~16~16~16~5~5~12~12~5~3~3~2~5~5~5~5~5~18",0 84 | 11,268,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~23~16~16~2~5~2~5~2~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~2~16~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 85 | 
53,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 86 | 45,577,"15~2~5~5~5~5~5~6~5~6~5~6~5~5~2~2~6~6~17~17~5~5~6~6~5~5~2~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~3~5~46~8~2~3~5~2~3~2~16~5~16~16~16~16~16~2~3~3~3~3~2~16~2~11~11~12~11~5~12~2~3~5~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~2~3~5~2~3~5~5~2~3~5~2~3~5~2~3~5~47~9~16~2~11~11~12~11~5~12~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~3~5~2~3~5~2~3~5~47~9~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~2~2~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~3~3~3~5~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~23~2~5~16~2~11~11~12~11~5~12~2~11~11~12~11~5~12~2~5~12~12~2~11~11~12~11~5~12~2~2~3~5~5~12~12~2~5~2~3~5~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~5~5~17~17~5~5~35~34~2~3~5~2~3~5~2~3~5~26~2~3~5~2~3~5~2~3~5~47~9~4~2~3~5~2~3~5~2~3~5~48~9~5~5~5~5~18",0 87 | 
57,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 88 | 60,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 89 | 0,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 90 | 
46,632,"38~2~11~11~12~11~5~2~11~5~12~2~3~5~22~39~20~40~20~23~5~5~14~2~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~6~5~6~5~6~5~6~5~5~5~5~2~3~40~15~3~3~15~3~6~5~2~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~16~5~5~5~2~28~15~3~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~3~5~5~5~5~5~5~2~5~5~5~2~5~2~5~2~5~2~5~2~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~3~5~2~2~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~28~15~3~28~15~3~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~5~5~5~2~5~5~5~2~15~3~5~5~5~5~5~5~5~5~5~18",0 91 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 92 | 
71,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 93 | 0,61,"15~3~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~5~3~30~2~30~30~30~30~30~30~30~30~30~5~3~3~2~2~5~5~5~5~5~18",0 94 | 84,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 95 | 11,286,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~8~5~17~3~3~3~2~3~5~2~3~3~3~3~2~3~5~2~11~11~12~11~5~12~2~3~5~16~20~20~16~2~11~11~12~11~5~12~17~17~5~5~20~16~2~11~11~12~11~5~12~21~2~3~5~22~23~24~20~20~2~2~2~2~25~9~20~2~2~26~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~23~16~17~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~34~5~27~28~5~20~2~3~3~3~3~3~3~3~20~3~3~3~6~29~1~1~3~3~3~5~5~20~16~20~17~17~20~5~5~5~5~18",0 96 | 1,40,"27~4~6~5~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~3~3~5~5~5~18",0 97 | 
1,70,"5~6~5~5~10~2~11~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~5~3~16~2~2~14~2~2~14~3~2~5~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~16~5~5~5~5~5~5~5~18",0 98 | 86,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 99 | 100,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 100 | 5,163,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~17~2~5~2~6~41~5~5~5~5~5~5~5~5~18",1 101 | 0,22,"3~1~2~3~3~3~3~3~3~3~3~3~6~6~6~1~6~1~3~3~41~5",0 102 | 3,119,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~5~5~5~18",1 103 | 
4,136,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~8~5~17~2~5~2~5~5~5~5~18",1 104 | 5,149,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~8~49~2~2~5~2~5~3~3~2~3~5~2~6~6~5~6~5~5~5~5~5~5~18",1 105 | 4,124,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~12~5~2~12~5~2~12~5~2~12~5~2~5~12~3~3~3~3~2~3~5~2~5~5~5~5~18",1 106 | 4,129,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~12~5~2~12~5~2~12~5~2~12~5~2~5~12~3~3~3~3~3~17~2~5~2~6~41~5~5~5~5~5~5~18",1 107 | 100,205,"6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~3~2~3~5~2~3~5~9~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~2~13~3~3~14~6~2~3~3~3~3~3~3~3~2~3~5~2~3~3~3~3~1~2~2~3~3~3~3~4~3~3~15~3~3~3~3~3~3~16~16~2~11~11~12~11~5~12~2~2~14~3~2~5~2~5~2~5~17~17~5~5~5~5~5~5~5~5~5~5~5~5~5~5~18",0 108 | 0,28,"10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~34~3~14~35~5~5~5~5~5~18",0 109 | 
2,71,"5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~6~2~3~5~22~20~16~16~16~20~20~38~2~3~3~3~3~5~27~28~5~3~3~5~6~41~17~6~3~36~5~2~33~33~5~5~33~5~39~5~5~5~5~5~5~5~5~5~18",0 110 | 1,188,"6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~15~5~5~5~5~5~5~5~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~38~2~33~2~33~3~3~3~3~2~3~5~2~5~20~14~30~31~5~32~30~33~30~17~5~2~3~5~2~3~5~2~11~11~12~11~5~12~2~3~5~2~3~3~3~3~3~3~41~30~33~20~5~5~5~5~5~5~5~18",0 111 | 1,126,"6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~7~8~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~5~12~3~3~3~3~3~27~4~5~3~3~3~5~5~5~5~18",0 112 | 1,89,"5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~11~5~2~11~5~2~11~11~12~11~11~5~2~11~11~12~11~5~2~11~11~12~11~5~2~5~12~2~2~3~5~2~11~11~12~11~5~12~2~2~2~16~16~16~5~5~12~12~5~3~3~2~5~5~5~5~5~18",0 113 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sequence Graph Transform (SGT) — Sequence Embedding for Clustering, Classification, and Search 2 | 3 | #### Maintained by: Chitta Ranjan 4 | Email: 5 | | LinkedIn: [https://www.linkedin.com/in/chitta-ranjan-b0851911/](https://www.linkedin.com/in/chitta-ranjan-b0851911/) 6 | 7 | 8 | The following will cover, 9 | 10 | 1. [SGT Class Definition](#sgt-class-def) 11 | 2. [Installation](#install-sgt) 12 | 3. [Test Examples](#installation-test-examples) 13 | 4. [Sequence Clustering Example](#sequence-clustering) 14 | 5. [Sequence Classification Example](#sequence-classification) 15 | 6. [Sequence Search Example](#sequence-search) 16 | 7. 
[SGT - Spark for Distributed Computing](#sgt-spark)
8. [Datasets](#datasets)


## SGT Class Definition

Sequence Graph Transform (SGT) is a sequence embedding function. SGT extracts short- and long-term sequence features and embeds them in a finite-dimensional feature space. The long- and short-term patterns embedded by SGT can be tuned without any increase in computation.


```
class SGT():
    '''
    Compute the embedding of a single discrete item sequence, or of a
    collection of such sequences. A discrete item sequence is a sequence
    made from a set of discrete elements, also known as the alphabet set.
    For example, suppose the alphabet set is the set of roman letters,
    {A, B, ..., Z}. This set is made of discrete elements. Examples of
    sequences from such a set are AABADDSA, UADSFJPFFFOIHOUGD, etc.
    Such sequence datasets are commonly found in the online industry,
    for example, item purchase history, where the alphabet set is the
    set of all product items. Sequence datasets are also abundant in
    bioinformatics as protein sequences.
    Using the embeddings created here, classification and clustering
    models can be built for sequence datasets.
    Read more in https://arxiv.org/pdf/1608.03533.pdf

    Parameters
    ----------
    alphabets        Optional, except if mode is 'spark'.
                     The set of alphabets that make up all the sequences
                     in the dataset. If not passed, the alphabet set is
                     automatically computed as the unique set of elements
                     that make up all the sequences: a list or 1d-array
                     of the elements, for example,
                     np.array(["A", "B", "C"]).
                     If mode is 'spark', the alphabets are required.

    kappa            Tuning parameter, kappa > 0, that changes the
                     extraction of long-term dependency. The higher the
                     value, the less long-term dependency is captured in
                     the embedding. Typical values for kappa are 1, 5, 10.
    lengthsensitive  Default False. Set to True if the embedding should
                     carry information about the length of the sequence.
                     If set to False, the embeddings of two sequences
                     with a similar pattern but different lengths will
                     be the same. lengthsensitive=False is similar to
                     length normalization.

    flatten          Default True. If True, the SGT embedding is
                     flattened and returned as a vector. Otherwise, it
                     is returned as a matrix whose row and column names
                     are the alphabets. The matrix form is useful for
                     interpretation, especially to understand how the
                     alphabets are "related". For applying machine
                     learning or deep learning algorithms, the embedding
                     vectors are required.

    mode             Choices in {'default', 'multiprocessing'}.
                     Note: 'multiprocessing' mode requires the
                     pandas>=1.0.3 and pandarallel libraries.

    processors       Used if mode is 'multiprocessing'. By default, the
                     number of processors used is the number available
                     minus one.
    '''


Methods
-------
def fit(sequence)

    Extract Sequence Graph Transform features using Algorithm-2 in
    https://arxiv.org/abs/1608.03533.

    Input:
    sequence         An array of discrete elements. For example,
                     np.array(["B","B","A","C","A","C","A","A","B","A"]).

    Output:
    sgt embedding    SGT matrix or vector (depending on flatten=False
                     or True) of the sequence.


--
def fit_transform(corpus)

    Extract SGT embeddings for all sequences in a corpus. It finds the
    alphabets encompassing all the sequences in the corpus, if not
    provided. However, if the mode is 'spark', the alphabet list has to
    be given explicitly in the SGT object declaration.

    Input:
    corpus           A list of sequences. Each sequence is a list of
                     alphabets.

    Output:
    sgt embedding of all sequences in the corpus.
--
def transform(corpus)

    Find the SGT embeddings of a new data sample belonging to the same
    population as the corpus that was fitted initially.
```

## Install SGT

Install SGT in Python by running:

```$ pip install sgt```


```python
import sgt
sgt.__version__
from sgt import SGT
```

    '2.0.0'


```python
# -*- coding: utf-8 -*-
# Authors: Chitta Ranjan
#
# License: BSD 3 clause
```


## Installation Test Examples

The following are a few test examples to verify the installation.


```python
# Learning an SGT embedding as a matrix with
# rows and columns as the sequence alphabets.
# This embedding shows the relationship between
# the alphabets. The higher the value, the
# stronger the relationship.
import numpy as np

sgt = SGT(flatten=False)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)
```
|   | A | B | C |
|---|---|---|---|
| A | 0.090616 | 0.131002 | 0.261849 |
| B | 0.086569 | 0.123042 | 0.052544 |
| C | 0.137142 | 0.028263 | 0.135335 |
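The flattened form of this embedding carries the same nine values in row-major, (row, column)-pair order. A quick sanity check using the matrix values above:

```python
import numpy as np

# Matrix form of the embedding (values from the table above)
M = np.array([[0.090616, 0.131002, 0.261849],
              [0.086569, 0.123042, 0.052544],
              [0.137142, 0.028263, 0.135335]])

# flatten=True returns the same values in row-major (C) order,
# indexed by (row_alphabet, col_alphabet) pairs
v = M.flatten()
print(v[:3])  # the (A, A), (A, B), (A, C) entries
```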
```python
# SGT embedding to a vector. The vector
# embedding is useful for directly applying
# a machine learning algorithm.

sgt = SGT(flatten=True)
sequence = np.array(["B","B","A","C","A","C","A","A","B","A"])
sgt.fit(sequence)
```

    (A, A)    0.090616
    (A, B)    0.131002
    (A, C)    0.261849
    (B, A)    0.086569
    (B, B)    0.123042
    (B, C)    0.052544
    (C, A)    0.137142
    (C, B)    0.028263
    (C, C)    0.135335
    dtype: float64


```python
'''
SGT embedding on a corpus of sequences.
Test the two processing modes within the
SGT class: 'default' and 'multiprocessing'.
'''

# A sample corpus of two sequences.
import pandas as pd

corpus = pd.DataFrame([[1, ["B","B","A","C","A","C","A","A","B","A"]],
                       [2, ["C", "Z", "Z", "Z", "D"]]],
                      columns=['id', 'sequence'])
corpus
```
|   | id | sequence |
|---|----|----------|
| 0 | 1 | [B, B, A, C, A, C, A, A, B, A] |
| 1 | 2 | [C, Z, Z, Z, D] |
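`fit_transform` expects a DataFrame with an `id` column and a `sequence` column whose entries are lists of alphabets. If the sequences start out as raw strings, they can be split with `map(list)`, as the protein example later in this README does:

```python
import pandas as pd

# Raw string sequences, as they often arrive from a CSV file
raw = pd.DataFrame({'id': [1, 2],
                    'sequence': ['BBACACAABA', 'CZZZD']})

# Split each string into a list of single-character alphabets
raw['sequence'] = raw['sequence'].map(list)
print(raw['sequence'][1])  # ['C', 'Z', 'Z', 'Z', 'D']
```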
```python
# Learning the sgt embeddings as vectors for
# all sequences in a corpus.
# mode: 'default'
sgt = SGT(kappa=1,
          flatten=True,
          lengthsensitive=False,
          mode='default')
sgt.fit_transform(corpus)
```
|   | id | (A, A) | (A, B) | (A, C) | (A, D) | (A, Z) | (B, A) | (B, B) | (B, C) | (B, D) | ... | (D, A) | (D, B) | (D, C) | (D, D) | (D, Z) | (Z, A) | (Z, B) | (Z, C) | (Z, D) | (Z, Z) |
|---|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | 1.0 | 0.090616 | 0.131002 | 0.261849 | 0.0 | 0.0 | 0.086569 | 0.123042 | 0.052544 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 1 | 2.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.184334 | 0.290365 |

2 rows × 26 columns
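Because `fit_transform` returns one fixed-length vector per sequence, standard vector tools apply directly to its output. A small sketch of computing pairwise distances between sequence embeddings, using a truncated stand-in for the frame above (only three of its 26 columns, for brevity):

```python
import numpy as np
import pandas as pd

# Stand-in for the fit_transform output: one row per sequence id,
# with a small subset of the (u, v) feature columns.
emb = pd.DataFrame({'id': [1.0, 2.0],
                    '(A, A)': [0.090616, 0.0],
                    '(Z, Z)': [0.0, 0.290365]})

X = emb.set_index('id').values  # one embedding vector per sequence

# Pairwise Euclidean distances between the sequence embeddings
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(D.shape)  # one distance per pair of sequences
```

The resulting distance matrix can feed nearest-neighbor search or clustering directly.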
```python
# Learning the sgt embeddings as vectors for
# all sequences in a corpus.
# mode: 'multiprocessing'

import pandarallel  # required library for multiprocessing

sgt = SGT(kappa=1,
          flatten=True,
          lengthsensitive=False,
          mode='multiprocessing')

sgt.fit_transform(corpus)
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
|   | id | (A, A) | (A, B) | (A, C) | (A, D) | (A, Z) | (B, A) | (B, B) | (B, C) | (B, D) | ... | (D, A) | (D, B) | (D, C) | (D, D) | (D, Z) | (Z, A) | (Z, B) | (Z, C) | (Z, D) | (Z, Z) |
|---|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | 1.0 | 0.090616 | 0.131002 | 0.261849 | 0.0 | 0.0 | 0.086569 | 0.123042 | 0.052544 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 1 | 2.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.184334 | 0.290365 |

2 rows × 26 columns
## Load Libraries for Illustrative Examples


```python
from sgt import SGT

import numpy as np
import pandas as pd
from itertools import chain
from itertools import product as iterproduct
import warnings

import pickle

########
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
import sklearn.metrics
import time

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(7)  # fix the random seed for reproducibility
```


## Sequence Clustering

Clustering is a form of unsupervised learning from sequences. For example, in

- user weblog sequences: clustering weblogs segments users into groups with similar browsing behavior. This helps in targeted marketing, anomaly detection, and other web customizations.

- protein sequences: clustering proteins with similar structures helps researchers study the commonalities between species. It also enables faster lookups in some search algorithms.

In the following, clustering on a protein sequence dataset is shown.
### Protein Sequence Clustering

The data used here is taken from www.uniprot.org, a public database of proteins. The data contains protein sequences and their functions.


```python
# Loading data
corpus = pd.read_csv('data/protein_classification.csv')

# Data preprocessing
corpus = corpus.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
corpus
```
|   | id | sequence |
|---|----|----------|
| 0 | M7MCX3 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| 1 | K6PL84 | [M, E, I, E, K, N, Y, R, M, N, S, L, F, E, F, ... |
| 2 | R4W5V3 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| 3 | T2A126 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| 4 | L0SHD5 | [M, E, I, E, K, T, N, R, M, N, A, L, F, E, F, ... |
| ... | ... | ... |
| 2107 | A0A081R612 | [M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ... |
| 2108 | A0A081QQM2 | [M, M, N, M, Q, N, M, M, R, Q, A, Q, K, L, Q, ... |
| 2109 | J1A517 | [M, M, R, Q, A, Q, K, L, Q, K, Q, M, E, Q, S, ... |
| 2110 | F5U1T6 | [M, M, N, M, Q, S, M, M, K, Q, A, Q, K, L, Q, ... |
| 2111 | J3A2T7 | [M, M, N, M, Q, N, M, M, K, Q, A, Q, K, L, Q, ... |

2112 rows × 2 columns
```python
%%time
# Compute SGT embeddings
sgt_ = SGT(kappa=1,
           lengthsensitive=False,
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
    CPU times: user 324 ms, sys: 68 ms, total: 392 ms
    Wall time: 9.02 s


```python
sgtembedding_df
```
|   | id | (A, A) | (A, C) | (A, D) | (A, E) | (A, F) | (A, G) | (A, H) | (A, I) | (A, K) | ... | (Y, M) | (Y, N) | (Y, P) | (Y, Q) | (Y, R) | (Y, S) | (Y, T) | (Y, V) | (Y, W) | (Y, Y) |
|---|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | M7MCX3 | 0.020180 | 0.0 | 0.009635 | 0.013529 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| 1 | K6PL84 | 0.001604 | 0.0 | 0.012637 | 0.006323 | 0.006224 | 0.004819 | 3.560677e-03 | 0.001124 | 0.012136 | ... | 0.135335 | 0.006568 | 0.038901 | 0.011298 | 0.012578 | 0.009913 | 0.001079 | 0.000023 | 0.0 | 0.007728 |
| 2 | R4W5V3 | 0.012448 | 0.0 | 0.008408 | 0.016363 | 0.027469 | 0.003205 | 2.944887e-10 | 0.004249 | 0.013013 | ... | 0.008114 | 0.007128 | 0.000000 | 0.000203 | 0.001757 | 0.022736 | 0.000249 | 0.012652 | 0.0 | 0.008533 |
| 3 | T2A126 | 0.010545 | 0.0 | 0.012560 | 0.014212 | 0.013728 | 0.000000 | 2.944887e-10 | 0.007223 | 0.000309 | ... | 0.000325 | 0.009669 | 0.000000 | 0.003182 | 0.001904 | 0.015607 | 0.000577 | 0.007479 | 0.0 | 0.008648 |
| 4 | L0SHD5 | 0.020180 | 0.0 | 0.008628 | 0.015033 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2107 | A0A081R612 | 0.014805 | 0.0 | 0.004159 | 0.017541 | 0.012701 | 0.013099 | 0.000000e+00 | 0.017043 | 0.004732 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2108 | A0A081QQM2 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.016806 | 0.005406 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2109 | J1A517 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.014500 | 0.005406 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2110 | F5U1T6 | 0.015209 | 0.0 | 0.005175 | 0.023888 | 0.011410 | 0.011510 | 0.000000e+00 | 0.021145 | 0.009280 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| 2111 | J3A2T7 | 0.005240 | 0.0 | 0.012301 | 0.013178 | 0.014744 | 0.014705 | 0.000000e+00 | 0.000981 | 0.007957 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |

2112 rows × 401 columns
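The column count follows directly from the protein alphabet: 20 amino acids give 20 × 20 = 400 ordered-pair features, plus the `id` column. A quick check:

```python
# 20 amino acids -> one feature per ordered (u, v) alphabet pair,
# plus the id column, giving the 401 columns seen above
n_alphabets = 20
n_features = n_alphabets ** 2
print(n_features + 1)  # 401
```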
```python
# Set the id column as the dataframe index
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df
```
| id | (A, A) | (A, C) | (A, D) | (A, E) | (A, F) | (A, G) | (A, H) | (A, I) | (A, K) | (A, L) | ... | (Y, M) | (Y, N) | (Y, P) | (Y, Q) | (Y, R) | (Y, S) | (Y, T) | (Y, V) | (Y, W) | (Y, Y) |
|----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| M7MCX3 | 0.020180 | 0.0 | 0.009635 | 0.013529 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | 0.021703 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| K6PL84 | 0.001604 | 0.0 | 0.012637 | 0.006323 | 0.006224 | 0.004819 | 3.560677e-03 | 0.001124 | 0.012136 | 0.018427 | ... | 0.135335 | 0.006568 | 0.038901 | 0.011298 | 0.012578 | 0.009913 | 0.001079 | 0.000023 | 0.0 | 0.007728 |
| R4W5V3 | 0.012448 | 0.0 | 0.008408 | 0.016363 | 0.027469 | 0.003205 | 2.944887e-10 | 0.004249 | 0.013013 | 0.031118 | ... | 0.008114 | 0.007128 | 0.000000 | 0.000203 | 0.001757 | 0.022736 | 0.000249 | 0.012652 | 0.0 | 0.008533 |
| T2A126 | 0.010545 | 0.0 | 0.012560 | 0.014212 | 0.013728 | 0.000000 | 2.944887e-10 | 0.007223 | 0.000309 | 0.028531 | ... | 0.000325 | 0.009669 | 0.000000 | 0.003182 | 0.001904 | 0.015607 | 0.000577 | 0.007479 | 0.0 | 0.008648 |
| L0SHD5 | 0.020180 | 0.0 | 0.008628 | 0.015033 | 0.009360 | 0.003205 | 2.944887e-10 | 0.002226 | 0.000379 | 0.021703 | ... | 0.009196 | 0.007964 | 0.036788 | 0.000195 | 0.001513 | 0.020665 | 0.000542 | 0.007479 | 0.0 | 0.010419 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| A0A081R612 | 0.014805 | 0.0 | 0.004159 | 0.017541 | 0.012701 | 0.013099 | 0.000000e+00 | 0.017043 | 0.004732 | 0.014904 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| A0A081QQM2 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.016806 | 0.005406 | 0.014083 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| J1A517 | 0.010774 | 0.0 | 0.004283 | 0.014732 | 0.014340 | 0.014846 | 0.000000e+00 | 0.014500 | 0.005406 | 0.014083 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| F5U1T6 | 0.015209 | 0.0 | 0.005175 | 0.023888 | 0.011410 | 0.011510 | 0.000000e+00 | 0.021145 | 0.009280 | 0.017466 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
| J3A2T7 | 0.005240 | 0.0 | 0.012301 | 0.013178 | 0.014744 | 0.014705 | 0.000000e+00 | 0.000981 | 0.007957 | 0.017112 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
1281 |

2112 rows × 400 columns

1282 |
We perform PCA on the sequence embeddings and then do k-means clustering.

```python
pca = PCA(n_components=2)
pca.fit(sgtembedding_df)

X = pca.transform(sgtembedding_df)

print(np.sum(pca.explained_variance_ratio_))
df = pd.DataFrame(data=X, columns=['x1', 'x2'])
df.head()
```

    0.6432744907364981
| | x1 | x2 |
|---|---|---|
| 0 | 0.384913 | -0.269873 |
| 1 | 0.022764 | 0.135995 |
| 2 | 0.177792 | -0.172454 |
| 3 | 0.168074 | -0.147334 |
| 4 | 0.383616 | -0.271163 |
```python
kmeans = KMeans(n_clusters=3, max_iter=300)
kmeans.fit(df)

labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_

fig = plt.figure(figsize=(5, 5))
colmap = {1: 'r', 2: 'g', 3: 'b'}
colors = list(map(lambda x: colmap[x+1], labels))
plt.scatter(df['x1'], df['x2'], color=colors, alpha=0.5, edgecolor=colors)
```

![png](output_23_1.png)

## Sequence Classification using Deep Learning in TensorFlow

The protein data set used above is also labeled; the labels represent the protein functions. There are other labeled sequence data sets as well. For example, DARPA shared an intrusion weblog data set containing weblog sequences, with a positive label if a log represents a network intrusion.

Such problems call for supervised learning. Classification is the supervised learning task we demonstrate here.

### Protein Sequence Classification

The data set is taken from https://www.uniprot.org . The protein sequences in the data set have one of two functions,
- Binds to DNA and alters its conformation. May be involved in regulation of gene expression, nucleoid organization and DNA protection.
- Might take part in the signal recognition particle (SRP) pathway. This is inferred from the conservation of its genetic proximity to ftsY/ffh. May be a regulatory protein.

There are a total of 2113 samples. The sequence lengths vary between 80 and 130.
```python
# Loading data
data = pd.read_csv('data/protein_classification.csv')

# Data preprocessing
y = data['Function [CC]']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

#### Sequence embeddings

```python
# Sequence embedding
sgt_ = SGT(kappa=1,
           lengthsensitive=False,
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
X = sgtembedding_df.set_index('id')
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

We will perform a 10-fold cross-validation to measure the performance of the classification model.
```python
kfold = 10
X = X
y = encoded_y

random_state = 1

test_F1 = np.zeros(kfold)
skf = KFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0
epochs = 50
batch_size = 128

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(32))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0)

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)

    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k += 1

print('Average f1 score', np.mean(test_F1))
```

    Average f1 score 1.0

### Weblog Classification for Intrusion Detection

This data sample is taken from https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset.
It is network intrusion data containing audit logs, with any attack marked as a positive label. Since network intrusion is a rare event, the data is unbalanced. Here we will,
- build a sequence classification model to predict a network intrusion.

Each sequence in the data is a series of activities, for example, {login, password}. The _alphabets_ in the input data sequences are already encoded into integers.
The original sequence data file is also present in the `/data` directory.

```python
# Loading data
data = pd.read_csv('data/darpa_data.csv')
data.columns
```

    Index(['timeduration', 'seqlen', 'seq', 'class'], dtype='object')

```python
data['id'] = data.index
```

```python
# Data preprocessing
y = data['class']
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)

corpus = data.loc[:, ['id', 'seq']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

#### Sequence embeddings
In this data, the sequence embeddings should be **length-sensitive**.

The lengths are important here because sequences with similar patterns but different lengths can have different labels. Consider a simple example of two sessions: `{login, pswd, login, pswd,...}` and `{login, pswd,...(repeated several times)..., login, pswd}`.

While the first session could be a regular user mistyping the password once, the second is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

Therefore, `lengthsensitive=True` is used here.

```python
# Sequence embedding
sgt_ = SGT(kappa=5,
           lengthsensitive=True,
           mode='multiprocessing')
sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')
sgtembedding_df
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
| id | (0, 0) | (0, 1) | (0, 2) | (0, 3) | (0, 4) | (0, 5) | (0, 6) | (0, 7) | (0, 8) | (0, 9) | ... | (~, 1) | (~, 2) | (~, 3) | (~, 4) | (~, 5) | (~, 6) | (~, 7) | (~, 8) | (~, 9) | (~, ~) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.485034 | 0.486999 | 0.485802 | 0.483097 | 0.483956 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.178609 |
| 1.0 | 0.000000 | 0.025622 | 0.228156 | 0.000000e+00 | 0.000000e+00 | 1.310714e-09 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.447620 | 0.452097 | 0.464568 | 0.367296 | 0.525141 | 0.455018 | 0.374364 | 0.414081 | 0.549981 | 0.172479 |
| 2.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.525605 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.193359 | 0.071469 |
| 3.0 | 0.077999 | 0.208974 | 0.230338 | 1.830519e-01 | 1.200926e-17 | 1.696880e-01 | 0.093646 | 7.985870e-02 | 2.896813e-05 | 3.701710e-05 | ... | 0.474072 | 0.468353 | 0.463594 | 0.177507 | 0.551270 | 0.418652 | 0.309652 | 0.384657 | 0.378225 | 0.170362 |
| 4.0 | 0.000000 | 0.023695 | 0.217819 | 2.188276e-33 | 0.000000e+00 | 6.075992e-11 | 0.000000 | 0.000000e+00 | 5.681668e-39 | 0.000000e+00 | ... | 0.464120 | 0.468229 | 0.452170 | 0.000000 | 0.501242 | 0.000000 | 0.300534 | 0.161961 | 0.000000 | 0.167082 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 106.0 | 0.000000 | 0.024495 | 0.219929 | 2.035190e-17 | 1.073271e-18 | 5.656994e-11 | 0.000000 | 0.000000e+00 | 5.047380e-29 | 0.000000e+00 | ... | 0.502213 | 0.544343 | 0.477281 | 0.175901 | 0.461103 | 0.000000 | 0.000000 | 0.162796 | 0.000000 | 0.167687 |
| 107.0 | 0.110422 | 0.227478 | 0.217549 | 1.723963e-01 | 1.033292e-14 | 3.896725e-07 | 0.083685 | 2.940589e-08 | 8.864072e-02 | 4.813990e-29 | ... | 0.490398 | 0.522016 | 0.466808 | 0.470603 | 0.479795 | 0.480057 | 0.194888 | 0.172397 | 0.164873 | 0.172271 |
| 108.0 | 0.005646 | 0.202424 | 0.196786 | 2.281242e-01 | 1.133936e-01 | 1.862098e-01 | 0.000000 | 1.212869e-01 | 9.180520e-08 | 0.000000e+00 | ... | 0.432834 | 0.434953 | 0.439615 | 0.390864 | 0.481764 | 0.600875 | 0.166766 | 0.165368 | 0.000000 | 0.171729 |
| 109.0 | 0.000000 | 0.025616 | 0.238176 | 3.889176e-55 | 1.332427e-60 | 1.408003e-09 | 0.000000 | 9.845377e-60 | 0.000000e+00 | 0.000000e+00 | ... | 0.421318 | 0.439985 | 0.467953 | 0.440951 | 0.527165 | 0.864717 | 0.407155 | 0.399335 | 0.251304 | 0.171885 |
| 110.0 | 0.000000 | 0.022868 | 0.203513 | 9.273472e-64 | 0.000000e+00 | 1.240870e-09 | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | ... | 0.478090 | 0.454871 | 0.459109 | 0.000000 | 0.490534 | 0.370357 | 0.000000 | 0.162997 | 0.000000 | 0.162089 |

111 rows × 121 columns
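The length-sensitivity argument above can be made concrete with a small, hypothetical sketch (unrelated to the SGT internals): the two example sessions share exactly the same set of transitions, so a purely pattern-based view cannot separate them, while their lengths differ sharply.

```python
# Hypothetical sketch: two sessions with identical transition
# patterns but very different lengths.
normal = ['login', 'pswd', 'login', 'pswd']
attack = ['login', 'pswd'] * 50  # repeated password guessing

def transitions(seq):
    """Set of (current, next) event pairs in a sequence."""
    return set(zip(seq, seq[1:]))

# The transition sets are identical...
print(transitions(normal) == transitions(attack))  # True
# ...so only a length-sensitive embedding can tell them apart.
print(len(normal), len(attack))  # 4 100
```

This is why a length-insensitive embedding would map both sessions to nearly the same point, whereas `lengthsensitive=True` keeps them distinguishable.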
#### Applying PCA on the embeddings
The embeddings are sparse and high-dimensional. PCA is, therefore, applied for dimension reduction.

```python
from sklearn.decomposition import PCA
pca = PCA(n_components=35)
pca.fit(sgtembedding_df)
X = pca.transform(sgtembedding_df)
print(np.sum(pca.explained_variance_ratio_))
```

    0.9962446146783123

#### Building a Multi-Layer Perceptron Classifier
The PCA transforms of the embeddings are used directly as inputs to an MLP classifier.

```python
kfold = 3
random_state = 11

X = X
y = encoded_y

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0
epochs = 300
batch_size = 15

# class_weight = {0 : 1., 1: 1.,}  # The weights can be changed and made
# inversely proportional to the class size to improve the accuracy.
class_weight = {0 : 0.12, 1: 0.88,}

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = Sequential()
    model.add(Dense(128, input_shape=(X_train.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.summary()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    start_time = time.time()
    model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, class_weight=class_weight)
    end_time = time.time()
    time_k[k] = end_time - start_time

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)
    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k += 1
```

    Model: "sequential_10"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense_30 (Dense)             (None, 128)               4608
    _________________________________________________________________
    activation_30 (Activation)   (None, 128)               0
    _________________________________________________________________
    dropout_20 (Dropout)         (None, 128)               0
    _________________________________________________________________
    dense_31 (Dense)             (None, 1)                 129
    _________________________________________________________________
    activation_31 (Activation)   (None, 1)                 0
    =================================================================
    Total params: 4,737
    Trainable params: 4,737
    Non-trainable params: 0
    _________________________________________________________________
    WARNING:tensorflow:sample_weight modes were coerced from
      ...
        to
      ['...']
    Train on 74 samples
    Epoch 1/300
    74/74 [==============================] - 0s 7ms/sample - loss: 0.1487 - accuracy: 0.5270
    Epoch 2/300
    74/74 [==============================] - 0s 120us/sample - loss: 0.1421 - accuracy: 0.5000
    ...
    74/74 [==============================] - 0s 118us/sample - loss: 0.0299 - accuracy: 0.8784
    Epoch 300/300
    74/74 [==============================] - 0s 133us/sample - loss: 0.0296 - accuracy: 0.8649

```python
print('Average f1 score', np.mean(test_F1))
print('Average Run time', np.mean(time_k))
```

    Average f1 score 0.6341880341880342
    Average Run time 3.880180994669596

#### Building an LSTM Classifier on the sequences for comparison
We build an LSTM classifier on the raw sequences to compare the accuracy.

```python
X = data['seq']
encoded_X = np.ndarray(shape=(len(X),), dtype=list)
for i in range(0, len(X)):
    encoded_X[i] = X.iloc[i].split("~")
X
```

    0      1~2~3~3~3~3~3~3~1~4~5~1~2~3~3~3~3~3~3~1~4~5~1~...
    1      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
    2      19~19~19~19~19~19~19~19~19~19~19~19~19~19~19~1...
    3      6~5~5~6~5~6~5~2~5~5~5~5~5~5~5~5~5~5~5~5~5~5~5~...
    4      5~5~17~5~5~5~5~5~10~2~11~2~11~11~12~11~11~5~2~...
                                 ...
    106    10~2~11~2~11~11~12~11~11~5~2~11~5~2~5~2~3~14~3...
    107    5~5~2~5~17~6~5~6~5~5~2~6~17~3~2~2~3~5~2~3~5~6~...
    108    6~5~6~5~5~6~5~5~6~6~6~6~6~6~6~6~6~6~6~6~6~6~6~...
    109    6~5~5~6~5~6~5~2~38~2~3~5~22~39~5~5~5~5~5~5~5~5...
    110    5~6~5~5~10~2~11~2~11~11~12~11~5~2~11~11~12~11~...
    Name: seq, Length: 111, dtype: object

```python
max_seq_length = np.max(data['seqlen'])
encoded_X = tf.keras.preprocessing.sequence.pad_sequences(encoded_X, maxlen=max_seq_length)
```

```python
kfold = 3
random_state = 11

test_F1 = np.zeros(kfold)
time_k = np.zeros(kfold)

epochs = 50
batch_size = 15
skf = StratifiedKFold(n_splits=kfold, shuffle=True, random_state=random_state)
k = 0

for train_index, test_index in skf.split(encoded_X, y):
    X_train, X_test = encoded_X[train_index], encoded_X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    embedding_vector_length = 32
    top_words = 50
    model = Sequential()
    model.add(Embedding(top_words, embedding_vector_length, input_length=max_seq_length))
    model.add(LSTM(32))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()

    start_time = time.time()
    model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1)
    end_time = time.time()
    time_k[k] = end_time - start_time

    y_pred = model.predict_proba(X_test).round().astype(int)
    y_train_pred = model.predict_proba(X_train).round().astype(int)
    test_F1[k] = sklearn.metrics.f1_score(y_test, y_pred)
    k += 1
```

    Model: "sequential_13"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding (Embedding)        (None, 1773, 32)          1600
    _________________________________________________________________
    lstm (LSTM)                  (None, 32)                8320
    _________________________________________________________________
    dense_36 (Dense)             (None, 1)                 33
    _________________________________________________________________
    activation_36 (Activation)   (None, 1)                 0
    =================================================================
    Total params: 9,953
    Trainable params: 9,953
    Non-trainable params: 0
    _________________________________________________________________
    Train on 74 samples
    Epoch 1/50
    74/74 [==============================] - 5s 72ms/sample - loss: 0.6894 - accuracy: 0.5676
    Epoch 2/50
    74/74 [==============================] - 4s 48ms/sample - loss: 0.6590 - accuracy: 0.8784
    ...
    Epoch 50/50
    74/74 [==============================] - 4s 51ms/sample - loss: 0.1596 - accuracy: 0.9324

```python
print('Average f1 score', np.mean(test_F1))
print('Average Run time', np.mean(time_k))
```

    Average f1 score 0.36111111111111116
    Average Run time 192.46954011917114

We find that the LSTM classifier gives a significantly lower F1 score. This may be improved by changing the model. However, the SGT embedding works with a small, unbalanced data set without the need of a complicated classifier model.

LSTM models typically require more data for training and also have significantly higher computation time. In the runs above, the LSTM model averaged about 192 secs per fold while the MLP model averaged under 4 secs.

## Sequence Search

Sequence data sets are generally large. For example, the listening histories of more than 70M users of music streaming services, such as Pandora, form huge sequence data sets. Protein databases can be even larger. For instance, the Uniprot data repository has more than 177M sequences.

Searching for similar sequences in such large databases is challenging. SGT embedding provides a simple solution.
In the following, it is shown on a protein data set that SGT embedding can be used to compute the similarity between a query sequence and the sequence corpus with a dot product. The sequences with the highest dot products are returned as the most similar to the query.

### Protein Sequence Search

In the following, a sample of 10k protein sequences is used for illustration. The data is taken from https://www.uniprot.org .

```python
# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
corpus.head(3)
```
| | id | sequence |
|---|---|---|
| 0 | I2WKR6 | [M, V, H, K, S, D, S, D, E, L, A, A, L, R, A, ... |
| 1 | A0A2A6M8K9 | [M, Q, E, S, L, V, V, R, R, E, T, H, I, A, A, ... |
| 2 | A0A3G5KEC3 | [M, A, S, G, A, Y, S, K, Y, L, F, Q, I, I, G, ... |
```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids

# Alphabets are known and inputted
# as arguments for faster computation
sgt_ = SGT(alphabets=alphabets,
           lengthsensitive=True,
           kappa=1,
           flatten=True,
           mode='multiprocessing')

sgtembedding_df = sgt_.fit_transform(corpus)
sgtembedding_df = sgtembedding_df.set_index('id')
```

    INFO: Pandarallel will run on 7 workers.
    INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

```python
'''
Search proteins similar to a query protein.
The approach is to find the SGT embedding of the
query protein and find its similarity with the
embeddings of the protein database.
'''

query_protein = 'MSHVFPIVIDDNFLSPQDLVSAARSGCSLRLHTGVVDKIDRAHRFVLEIAGAEALHYGINTGFGSLCTTHIDPADLSTLQHNLLKSHACGVGPTVSEEVSRVVTLIKLLTFRTGNSGVSLSTVNRIIDLWNHGVVGAIAQKGTVGASGDLAPLAHLFLPLIGLGQVWHRGVLRPSREVMDELKLAPLTLQPKDGLCLTNGVQYLNAWGALSTVRAKRLVALADLCAAMSMMGFSAARSFIEAQIHQTSLHPERGHVALHLRTLTHGSNHADLPHCNPAMEDPYSFRCAPQVHGAARQVVGYLETVIGNECNSVSDNPLVFPDTRQILTCGNLHGQSTAFALDFAAIGITDLSNISERRTYQLLSGQNGLPGFLVAKPGLNSGFMVVQYTSAALLNENKVLSNPASVDTIPTCHLQEDHVSMGGTSAYKLQTILDNCETILAIELMTACQAIDMNPGLQLSERGRAIYEAVREEIPFVKEDHLMAGLISKSRDLCQHSTVIAQQLAEMQAQ'

# Step 1. Compute sgt embedding for the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Compute the dot product of query embedding
# with the protein embedding database.
similarity = sgtembedding_df.dot(query_protein_sgt_embedding)

# Step 3. Return the top k protein names based on similarity.
similarity.sort_values(ascending=False)
```

    id
    K0ZGN5        2773.749663
    A0A0Y1CPH7    1617.451379
    A0A5R8LCJ1    1566.833152
    A0A290WY40    1448.772820
    A0A073K6N6    1392.267250
                     ...
    A0A1S7UBK4     160.074989
    A0A2S7T1R9     156.580584
    A0A0E0UQV6     155.834932
    A0A1Y5Y0S0     148.862049
    B0NRP3         117.656497
    Length: 10063, dtype: float64

## SGT - Spark for Distributed Computing

As mentioned in the previous section, sequence data sets can be large. SGT complexity is linear in the number of sequences in a data set. Still, if the data is large, the computation time becomes high. For example, for a set of 1M protein sequences the default SGT mode takes over 24 hours.

Using distributed computing with Spark, the runtime can be significantly reduced. For instance, SGT-Spark on the same 1M protein data set took less than 29 minutes.

In the following, the Spark implementation of SGT is shown. First, it is applied to a smaller 10k data set for comparison. Then it is applied to the 1M data set without any syntactical change.

```python
'''
Load the data and remove header.
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-10k.tab')

header = data.first()  # extract header
data = data.filter(lambda row: row != header)  # filter out header
data.take(1)  # See one sample
```

```
['I2WKR6\tI2WKR6_ECOLX\tunreviewed\tType III restriction enzyme, res subunit (EC 3.1.21.5)\tEC90111_4246\tEscherichia coli 9.0111\t786\tMVHKSDSDELAALRAENVRLVSLLEAHGIEWRRKPQSPVPRVSVLSTNEKVALFRRLFRGRDDVWALRWESKTSGKSGYSPACANEWQLGICGKPRIKCGDCAHRQLIPVSDLVIYHHLAGTHTAGMYPLLEDDSCYFLAVDFDEAEWQKDASAFMRSCDELGVPAALEISRSRQGAHVWIFFASRVSAREARRLGTAIISYTCSRTRQLRLGSYDRLFPNQDTMPKGGFGNLIALPLQKRPRELGGSVFVDMNLQPYPDQWAFLVSVIPMNVQDIEPTILRATGSIHPLDVNFINEEDLGTPWEEKKSSGNRLNIAVTEPLIITLANQIYFEKAQLPQALVNRLIRLAAFPNPEFYKAQAMRMSVWNKPRVIGCAENYPQHIALPRGCLDSALSFLRYNNIAAELIDKRFAGTECNAVFTGNLRAEQEEAVSALLRYDTGVLCAPTAFGKTVTAAAVIARRKVNTLILVHRTELLKQWQERLAVFLQVGDSIGIIGGGKHKPCGNIDIAVVQSISRHGEVEPLVRNYGQIIVDECHHIGAVSFSAILKETNARYLLGLTATPIRRDGLHPIIFMYCGAIRHTAARPKESLHNLEVLTRSRFTSGHLPSDARIQDIFREIALDHDRTVAIAEEAMKAFGQGRKVLVLTERTDHLDDIASVMNTLKLSPFVLHSRLSKKKRTMLISGLNALPPDSPRILLSTGRLIGEGFDHPPLDTLILAMPVSWKGTLQQYAGRLHREHTGKSDVRIIDFVDTAYPVLLRMWDKRQRGYKAMGYRIVADGEGLSF']
```

```python
# Repartition for increasing the parallel processes
data = data.repartition(500)
```

```python
def preprocessing(line):
    '''
    Original data are lines where each line has \t
    separated values. We are interested in preserving
    the first value (entry id), tmp[0], and the last value
    (the sequence), tmp[-1].
    '''
    tmp = line.split('\t')
    id = tmp[0]
    sequence = list(tmp[-1])
    return (id, sequence)

processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample
```

    Out[5]: [('A0A2E9WIJ1',
      ['M',
       'Y',
       'I',
       'F',
       'L',
       'T',
       'L',
       ...
       'A',
       'K',
       'L',
       'D',
       'K',
       'N',
       'D'])]
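Outside Spark, the same tab-splitting logic in `preprocessing` can be sanity-checked on a single line (the entry id and fields below are made up for illustration):

```python
def preprocessing(line):
    # Keep the first field (entry id) and the last field (the sequence),
    # mirroring the Spark map function above.
    tmp = line.split('\t')
    return (tmp[0], list(tmp[-1]))

# Hypothetical record with the same tab-separated layout as the .tab file.
sample = 'Q0TEST1\tQ0TEST1_ECOLX\tunreviewed\tSome protein\tMVHK'
entry_id, sequence = preprocessing(sample)
print(entry_id)   # Q0TEST1
print(sequence)   # ['M', 'V', 'H', 'K']
```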
```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids
```

```python
'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets,
               kappa=1,
               lengthsensitive=True,
               flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 29.66 seconds -- by cranjan@processminer.com at 4/22/2020, 12:31:23 PM on databricks
```

### Compare with the default SGT mode

```python
# Loading data
data = pd.read_csv('data/protein-uniprot-reviewed-Ano-10k.tab', sep='\t')

# Data preprocessing
corpus = data.loc[:, ['Entry', 'Sequence']]
corpus.columns = ['id', 'sequence']
corpus['sequence'] = corpus['sequence'].map(list)
```

```python
sgt_ = sgt.SGT(alphabets=alphabets,
               lengthsensitive=True,
               kappa=1,
               flatten=True,
               mode='default')

sgtembedding_df = sgt_.fit_transform(corpus)
# Command took 13.08 minutes -- by cranjan@processminer.com at 4/22/2020, 1:48:02 PM on databricks
```

### 1M Protein Database

The 1M protein sequence data set is embedded here. The data set is available [here](https://mega.nz/file/1qAXhSAS#l7E60cLJzMGtFQzeHZL9PI8yX4tRQcAMFRN2xeHK81w).

```python
'''
Load the data and remove header.
'''
data = sc.textFile('data/protein-uniprot-reviewed-Ano-1M.tab')

header = data.first()  # extract header
data = data.filter(lambda row: row != header)  # filter out header
data.take(1)  # See one sample
```

```python
# Repartition for increasing the parallel processes
data = data.repartition(10000)
```

```python
processeddata = data.map(lambda line: preprocessing(line))
processeddata.take(1)  # See one sample

# [('A0A2E9WIJ1',
#   ['M','Y','I','F','L','T','L','A','L','F','S',...,'F','S','I','F','A','K','L','D','K','N','D'])]
```

```python
# Protein sequence alphabets
alphabets = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K',
             'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V',
             'W', 'X', 'Y', 'U', 'O']  # List of amino acids
```

```python
'''
Spark approach.
In this approach the alphabets argument has to
be passed to the SGT class definition.
The SGT.fit() is then called in parallel.
'''
sgt_ = sgt.SGT(alphabets=alphabets,
               kappa=1,
               lengthsensitive=True,
               flatten=True)
rdd = processeddata.map(lambda x: (x[0], list(sgt_.fit(x[1]))))
sgtembeddings = rdd.collect()
# Command took 28.98 minutes -- by cranjan@processminer.com at 4/22/2020, 3:16:41 PM on databricks
```

```python
'''OPTIONAL.
Save the embeddings for future use or
production deployment.
'''
# Save for deployment
# pickle.dump(sgtembeddings,
#             open("data/protein-sgt-embeddings-1M.pkl", "wb"))

# Reload when needed
# sgtembeddings = pickle.load(open("data/protein-sgt-embeddings-1M.pkl", "rb"))
```

The pickle dump is shared [here](https://mega.nz/file/hiAxAAoI#SStAIn_FZjAHvXSpXfdy8VpISG6rusHRf9HlUSqwcsw).

### Sequence Search using SGT - Spark

Since `sgtembeddings` on the 1M data set is large, it is recommended to use distributed computing to find similar proteins during a search.

```python
sgtembeddings_rdd = sc.parallelize(list(dict(sgtembeddings).items()))
sgtembeddings_rdd = sgtembeddings_rdd.repartition(10000)
```

```python
'''
Search for proteins similar to a query protein.
The approach is to compute the SGT embedding of the
query protein and measure its similarity with the
embeddings of the proteins in the database.
'''

query_protein = 'MSHVFPIVIDDNFLSPQDLVSAARSGCSLRLHTGVVDKIDRAHRFVLEIAGAEALHYGINTGFGSLCTTHIDPADLSTLQHNLLKSHACGVGPTVSEEVSRVVTLIKLLTFRTGNSGVSLSTVNRIIDLWNHGVVGAIAQKGTVGASGDLAPLAHLFLPLIGLGQVWHRGVLRPSREVMDELKLAPLTLQPKDGLCLTNGVQYLNAWGALSTVRAKRLVALADLCAAMSMMGFSAARSFIEAQIHQTSLHPERGHVALHLRTLTHGSNHADLPHCNPAMEDPYSFRCAPQVHGAARQVVGYLETVIGNECNSVSDNPLVFPDTRQILTCGNLHGQSTAFALDFAAIGITDLSNISERRTYQLLSGQNGLPGFLVAKPGLNSGFMVVQYTSAALLNENKVLSNPASVDTIPTCHLQEDHVSMGGTSAYKLQTILDNCETILAIELMTACQAIDMNPGLQLSERGRAIYEAVREEIPFVKEDHLMAGLISKSRDLCQHSTVIAQQLAEMQAQ'

# Step 1. Compute the SGT embedding of the query protein.
query_protein_sgt_embedding = sgt_.fit(list(query_protein))

# Step 2. Broadcast the embedding to the cluster.
query_protein_sgt_embedding_broadcasted = sc.broadcast(list(query_protein_sgt_embedding))

# Step 3. Compute the similarity between each sequence embedding and the query.
similarity = sgtembeddings_rdd.map(lambda x: (x[0],
                                              np.dot(query_protein_sgt_embedding_broadcasted.value,
                                                     x[1]))).collect()

# Step 4. Show the sequences most similar to the query.
pd.DataFrame(similarity).sort_values(by=1, ascending=False)
```

## Datasets

The data sets provided with this release are:

### Simulated Sequence Dataset

A benchmark simulated sequence data set with labels is provided. There are 5 labels and a total of 300 samples. The sequence lengths range from 50 to 800.

Location:

`data/simulated-sequence-dataset.csv`

### Protein Dataset - 2k

A protein sequence data set taken from https://www.uniprot.org. The data set contains reviewed and annotated proteins. The fields in the data set are:

- Entry
- Entry name
- Status
- Protein names
- Gene names
- Organism
- Length
- Sequence
- Function [CC]
- Features
- Taxonomic lineage (all)
- Protein families

There are a total of 2113 samples (protein sequences). Each protein has one of the following two functions:

- Binds to DNA and alters its conformation. May be involved in regulation of gene expression, nucleoid organization and DNA protection.
- Might take part in the signal recognition particle (SRP) pathway. This is inferred from the conservation of its genetic proximity to ftsY/ffh. May be a regulatory protein.

The data set has about a 40:60 class distribution.
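The class split can be verified with pandas once the file is loaded; a minimal sketch on a toy frame (the `function` column and its two label values are illustrative stand-ins — in the actual file the protein function appears in the `Function [CC]` field):

```python
import pandas as pd

# Toy stand-in for data/protein_classification.csv: the 'function'
# column and its values are illustrative, mimicking the ~40:60 split.
df = pd.DataFrame({'function': ['DNA-binding'] * 4 + ['SRP-pathway'] * 6})

# Normalized value counts give the class distribution directly.
distribution = df['function'].value_counts(normalize=True)
print(distribution.to_dict())  # {'SRP-pathway': 0.6, 'DNA-binding': 0.4}
```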
Location:

`data/protein_classification.csv`

### Darpa Weblog Network Intrusion Dataset

This is processed weblog data provided by DARPA in the DARPA Intrusion Detection Evaluation Dataset. The link to it is shared by MIT at https://www.ll.mit.edu/r-d/datasets/1998-darpa-intrusion-detection-evaluation-dataset.

The available data set is a weblog dump with timestamps. The logs are converted to sequences and shared here. A sequence is labeled 1 if it was a potential intrusion, and 0 otherwise.

The data has 112 samples with an imbalanced class distribution of about 10% positively labeled samples.

The available fields are:

- timeduration
- seqlen
- seq
- class

Location:

`data/darpa_data.csv`

### Protein Sequence - 10k, 1M and 3M

Three protein sequence data sets of size 10k, 1 million, and 3 million are provided. The 10k data set is available in GitHub, while the latter two are available publicly [here](https://mega.nz/folder/MqAzmKqS#2jqJKJifOgnFACP9GqX6QQ).

The fields in these data sets are:

- Entry
- Entry name
- Protein names
- Gene names
- Organism
- Length
- Sequence

The source of these data sets is https://www.uniprot.org.

Location:

10k: `data/protein-uniprot-reviewed-Ano-10k.tab`

1M: `https://mega.nz/folder/MqAzmKqS#2jqJKJifOgnFACP9GqX6QQ/file/t7YlUQTK`

3M: `https://mega.nz/folder/MqAzmKqS#2jqJKJifOgnFACP9GqX6QQ/file/InAzwYDa`
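The Spark-based similarity search described earlier reduces to a plain dot-product ranking over the embedding database. A minimal local sketch of that logic, without Spark, using fixed 3-dimensional vectors as illustrative stand-ins for real SGT embeddings (the IDs and values below are synthetic):

```python
import numpy as np

# Synthetic embedding database: keys mimic protein IDs, vectors mimic
# SGT embeddings. These values are illustrative, not real SGT output.
db = {
    'P1': np.array([1.0, 0.0, 0.0]),
    'P2': np.array([0.0, 1.0, 0.0]),
    'P3': np.array([0.8, 0.6, 0.0]),
}
query = np.array([0.7, 0.7, 0.1])  # query embedding, closest to 'P3'

# Dot-product similarity against every database embedding, then rank,
# mirroring Steps 3 and 4 of the Spark version.
similarity = sorted(((pid, float(np.dot(query, emb))) for pid, emb in db.items()),
                    key=lambda t: t[1], reverse=True)
best_id = similarity[0][0]
print(best_id)  # 'P3'
```

The same ranking scales to the 1M database by swapping the local dictionary for the broadcast-and-`map` pattern shown in the Spark section.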