├── images
    ├── 4.png
    ├── 5-2.png
    ├── .DS_Store
    └── blast-result.png
├── LICENSE
├── README.md
└── blast.py


/images/4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JiaShun-Xiao/BLAST-bioinfor-tool/HEAD/images/4.png


--------------------------------------------------------------------------------
/images/5-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JiaShun-Xiao/BLAST-bioinfor-tool/HEAD/images/5-2.png


--------------------------------------------------------------------------------
/images/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JiaShun-Xiao/BLAST-bioinfor-tool/HEAD/images/.DS_Store


--------------------------------------------------------------------------------
/images/blast-result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JiaShun-Xiao/BLAST-bioinfor-tool/HEAD/images/blast-result.png


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2016 JiaShun Xiao
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Python implementation of fast BLAST
 2 | Python implementation of the Basic Local Alignment Search Tool (BLAST), a core algorithm for aligning sequences against genomes. It needs only about 2 seconds to output the location and the Smith-Waterman alignment result.
 3 | 
 4 | **Table of Contents**
 5 | 
 6 | - [Introduction to BLAST](#introduction-to-blast)
 7 | - [BLAST implementation in python: For human genome](#blast-implementation-in-python-for-human-genome)
 8 |   - [Construct library](#construct-library)
 9 |   - [Alignment algorithms](#alignment-algorithms)
10 | 
11 | ## Introduction to BLAST
12 | 
13 | > In bioinformatics, BLAST for Basic Local Alignment Search Tool is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.
14 | 
15 | ## BLAST implementation in python: For human genome
16 | 
17 | ### Construct library
18 | > Construct the library for the human genome. Break the whole genome sequence into overlapping words of **11** bases, then record the locations of each word in a library file (**separately for each chromosome**).
19 | <img src="./images/4.png" width=600  />
20 | 
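
As a rough illustration of the idea (this is not the repository's `build_library.py`, which is not shown here), the sketch below indexes every overlapping word of a toy sequence in memory. The function name `build_word_index`, the in-memory dictionary, and the 0-based positions are assumptions made for this example; the real library is written to per-chromosome files.

```python
from collections import defaultdict


def build_word_index(sequence, word_len=11):
    """Map each overlapping word of length word_len to its start positions (0-based)."""
    index = defaultdict(list)
    # Slide a window of word_len bases over the sequence, one base at a time.
    for start in range(len(sequence) - word_len + 1):
        index[sequence[start:start + word_len]].append(start)
    return dict(index)


# Toy example with 4-base words so the result is easy to check by hand.
toy_index = build_word_index("ACGTACGTAC", word_len=4)
print(toy_index["ACGT"])  # [0, 4]
```

A dictionary is enough for a toy sequence; for a whole chromosome the positions have to live on disk, which is why the library is stored in files.
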
21 | #### Programming details
22 | 
23 | ```bash
24 | python build_library.py
25 | # It takes several hours, even with multiprocessing
26 | # Set the GRCh37 path and the output library file path before running
27 | ```
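
To find a word's record in the library, `blast.py` first converts the word to an integer (`WordToIndex`). The sketch below is an equivalent way to write that conversion, reading the word as a base-4 number. The name `word_to_index` is hypothetical, and, like the original, it assumes the word contains only A, C, G and T.

```python
def word_to_index(word):
    """Read a DNA word as a base-4 number (A=0, C=1, G=2, T=3), first base most significant."""
    digits = {"A": 0, "C": 1, "G": 2, "T": 3}
    index = 0
    for base in word:
        # Shift the digits seen so far one place to the left, then add the new digit.
        index = index * 4 + digits[base]
    return index


print(word_to_index("AAAAAAAAAAA"))  # 0
print(word_to_index("TTTTTTTTTTT"))  # 4194303, i.e. 4**11 - 1
```

With 11-base words there are 4**11 possible indices, so each word maps to exactly one row of a seek table of that size.
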
28 | 
29 | ### Alignment algorithms
30 | > Align the query sequence with the genome. After cutting the query sequence into words of 11 bases, find all locations of each word in the library files built above.
31 | 
32 | <img src="./images/5-2.png" width=700 />
33 | 
34 | > Compare the locations of all 11-base words of the query sequence. Each word usually has many locations on each chromosome, but only one of them is the right location for the query sequence. For example, as shown in the figure above, word x has locations a, b, **c**, d, ..., where c is the right one; word x+1 has locations e, f, **c+1**, g, ..., where c+1 is the right one. Thus, if the query sequence has no mutations or gaps, the right locations of consecutive words are **c, c+1, c+2, c+3, ...**. If the query sequence has a mutation or gap, every word that contains it has misleading positions: **c, c+1, c+2, ..., c+i-11, misleading positions, c+i+11, ...**. Next, for each word from first to last, add length(query) - i to all of its locations, where i is the index of the word. The right locations then become **c+length(query), c+1+length(query)-1, c+2+length(query)-2, ...**, which all equal **c+length(query)**, so the right location is the most frequently repeated value. We can raise the threshold on the repeat count (default: 5) to keep only highly similar sequences, just as in NCBI BLAST. The result is shown in the figure below. Finding the location and finishing the sequence alignment takes about 2 seconds.
35 | 
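
The counting step described above can be sketched with `collections.Counter`. This is a minimal example under stated assumptions: `vote_for_location` and the toy hit lists are made up for illustration, and the shift follows the one used in `blast.py` (number of words - 1 - i); any shift that decreases by one per word collapses the right locations in the same way.

```python
from collections import Counter


def vote_for_location(word_hits, min_votes=5):
    """word_hits[i] lists the genome positions where the i-th query word occurs.

    Shifting each hit by (number of words - 1 - i) makes hits that lie on the
    same diagonal collapse to a single value, which is then counted.
    """
    n_words = len(word_hits)
    votes = Counter()
    for i, hits in enumerate(word_hits):
        for pos in hits:
            votes[pos + n_words - 1 - i] += 1
    # Keep only values repeated more often than the threshold.
    return [(loc, n) for loc, n in votes.most_common() if n > min_votes]


# Toy data: 8 words whose right locations are 100, 101, ..., 107,
# plus one unrelated (misleading) hit per word.
word_hits = [[100 + i, 7 * i + 3] for i in range(8)]
print(vote_for_location(word_hits))  # [(107, 8)]
```

Subtracting the shift of the first word (here 7) from the winning value recovers the start of the match, 100 in this toy case.
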
36 | ```bash
37 | python blast.py
38 | # Before running, set the query sequence (query_sequence) in blast.py
39 | # This will output the location and the Smith-Waterman alignment result.
40 | ```
41 | 
42 | <img src="./images/blast-result.png" width=700>
43 | 
44 | 
45 | 
46 | 


--------------------------------------------------------------------------------
/blast.py:
--------------------------------------------------------------------------------
  1 | __author__ = 'Jiashun'
  2 | import re
  3 | import numpy as np
  4 | from collections import Counter
  5 | from math import ceil
  6 | from math import floor
  7 | 
  8 | # compare single base
  9 | def SingleBaseCompare(seq1,seq2,i,j):
 10 |     if seq1[i] == seq2[j]:
 11 |         return 2
 12 |     else:
 13 |         return -1
 14 |     
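 15 | # Smith–Waterman-style alignment (match +2, mismatch -1, gap -3). Note: cells are not floored at 0, so the scoring is global rather than strictly local.
 16 | def SMalignment(seq1, seq2):
 17 |     m = len(seq1)
 18 |     n = len(seq2)
 19 |     g = -3  # gap penalty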
 20 |     matrix = []
 21 |     for i in range(0, m):
 22 |         tmp = []
 23 |         for j in range(0, n):
 24 |             tmp.append(0)
 25 |         matrix.append(tmp)
 26 |     for sii in range(0, m):
 27 |         matrix[sii][0] = sii*g
 28 |     for sjj in range(0, n):
 29 |         matrix[0][sjj] = sjj*g
 30 |     for siii in range(1, m):
 31 |         for sjjj in range(1, n):
 32 |             matrix[siii][sjjj] = max(matrix[siii-1][sjjj] + g, matrix[siii - 1][sjjj - 1] + SingleBaseCompare(seq1,seq2,siii, sjjj), matrix[siii][sjjj-1] + g)
 33 |     sequ1 = [seq1[m-1]]
 34 |     sequ2 = [seq2[n-1]]
 35 |     while m > 1 and n > 1:
 36 |         if max(matrix[m-1][n-2], matrix[m-2][n-2], matrix[m-2][n-1]) == matrix[m-2][n-2]:
 37 |             m -= 1
 38 |             n -= 1
 39 |             sequ1.append(seq1[m-1])
 40 |             sequ2.append(seq2[n-1])
 41 |         elif max(matrix[m-1][n-2], matrix[m-2][n-2], matrix[m-2][n-1]) == matrix[m-1][n-2]:
 42 |             n -= 1
 43 |             sequ1.append('-')
 44 |             sequ2.append(seq2[n-1])
 45 |         else:
 46 |             m -= 1
 47 |             sequ1.append(seq1[m-1])
 48 |             sequ2.append('-')
 49 |     sequ1.reverse()
 50 |     sequ2.reverse()
 51 |     align_seq1 = ''.join(sequ1)
 52 |     align_seq2 = ''.join(sequ2)
 53 |     align_score = 0.
 54 |     for k in range(0, len(align_seq1)):
 55 |         if align_seq1[k] == align_seq2[k]:
 56 |             align_score += 1
 57 |     align_score = float(align_score)/len(align_seq1)
 58 |     return align_seq1, align_seq2, align_score
 59 | 
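 60 | # Display BLAST result
 61 | def Display(seque1, seque2):
 62 |     le = 40  # chunk width; starting at 40 makes the first printed chunk [0:40] instead of skipping the first 20 characters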
 63 |     while len(seque1)-le >= 0:
 64 |         print('sequence1: ',end='')
 65 |         for a in list(seque1)[le-40:le]:
 66 |             print(a,end='')
 67 |         print("\n")
 68 |         print('           ',end='')
 69 |         for k in range(le-40, le):
 70 |             if seque1[k] == seque2[k]:
 71 |                 print('|',end='')
 72 |             else:
 73 |                 print(' ',end='')
 74 |         print("\n")
 75 |         print('sequence2: ',end='')
 76 |         for b in list(seque2)[le-40:le]:
 77 |             print(b,end='')
 78 |         print("\n")
 79 |         le += 40
 80 |     if len(seque1) > le-40:
 81 |         print('sequence1: ',end='')
 82 |         for a in list(seque1)[le-40:len(seque1)]:
 83 |             print(a,end='')
 84 |         print("\n")
 85 |         print('           ',end='')
 86 |         for k in range(le-40, len(seque1)):
 87 |             if seque1[k] == seque2[k]:
 88 |                 print('|',end='')
 89 |             else:
 90 |                 print(' ',end='')
 91 |         print("\n")
 92 |         print('sequence2: ',end='')
 93 |         for b in list(seque2)[le-40:len(seque2)]:
 94 |             print(b,end='')
 95 |         print("\n")
 96 | 
 97 | # transform base to numeric value
 98 | def WordToNum(word):
 99 |     tmp = []
100 |     trans = {'A':1,'C':2,'G':3,'T':4}
101 |     for w in word:
102 |         tmp.append(trans[w])
103 |     return tmp
104 | 
105 | # transform word with 11 bases to its index
106 | def WordToIndex(word,word_len):
107 |     tmp = 0
108 |     word_num = WordToNum(word)
109 |     for i,v in enumerate(word_num):
110 |         tmp += (v-1)*4**(word_len-i)
111 |     return tmp   
112 | 
113 | # Get word's positions in genome from library
114 | def GetWordPos(word):
115 |     assert len(word)== 11
116 |     seek_index = WordToIndex(word,11-1)
117 |     positions = []
118 |     for chr_name in chr_names:
119 |         seeks = np.load("chromosome_{}_library_seeks.npy".format(chr_name))
120 |         with open('/home/jxiaoae/class/blast/chromosome_{}_library.txt'.format(chr_name),'r') as chr_seq:  # close the file after reading
121 |             chr_seq.seek(seeks[seek_index,0])
122 |             position = chr_seq.read(seeks[seek_index,1])
123 |         try:
124 |             positions.append(list(map(int, position[:-1].split(","))))
125 |         except ValueError:  # e.g. an empty record, which cannot be parsed as integers
126 |             positions.append([])
127 |     return positions
128 | 
129 | # Extract subsequence from GRCh37 file
130 | def ExtractSeq(chr_index,pos,length):
131 |     pos = pos+floor(pos/60)
132 |     hg19.seek(chrom_seek_index[chr_index,1]+pos-1)
133 |     return re.sub(r'\n', '', hg19.read(length))
134 | 
135 | # main blast function
136 | def Blast(query_seq):
137 |     i = 0
138 |     query_words = []
139 |     query_seq_length = len(query_seq)
140 |     words_length = query_seq_length-11+1
141 |     while i < words_length:
142 |         query_words.append(query_seq[i:i+11])
143 |         i += 1
144 |     words_positions = []
145 |     for word in query_words:
146 |         words_positions.append(GetWordPos(word))
147 |     for chr_index in range(24):
148 |         for word_index in range(words_length):
149 |             for pos in range(len(words_positions[word_index][chr_index])):
150 |                 words_positions[word_index][chr_index][pos] += words_length - word_index - 1
151 |         
152 |         words_positions_corrects = []
153 |         for word_index in range(words_length):
154 |             words_positions_corrects += words_positions[word_index][chr_index]
155 |         
156 |         words_positions_corrects_count = Counter(words_positions_corrects)
157 |         finded_postions = []
158 |         for count_ in words_positions_corrects_count:
159 |             # Raise this threshold on words_positions_corrects_count[count_] to keep only
160 |             # highly similar sequences, as in NCBI BLAST
161 |             if words_positions_corrects_count[count_] > 5:
162 |                 finded_postions.append(count_)
163 |         if finded_postions:
164 |             for finded_postion in finded_postions:
165 |                 candidate_seq_pos = finded_postion - query_seq_length + 11 - 5
166 |                 candidate_seq_length = query_seq_length + 11
167 |                 candidate_sequence = ExtractSeq(chr_index,candidate_seq_pos,candidate_seq_length)
168 |                 i_start_indexs = []
169 |                 for i_start in range(15):
170 |                     _,_,score = SMalignment(candidate_sequence[i_start:],query_seq)
171 |                     i_start_indexs.append(score)
172 |                 i_start = np.array(i_start_indexs).argmax()
173 |                 i_end_indexs = []
174 |                 for i_end in range(1,16):
175 |                     _,_,score = SMalignment(candidate_sequence[:-i_end],query_seq)
176 |                     i_end_indexs.append(score)
177 |                 i_end = np.array(i_end_indexs).argmax()+1
178 |                 candidate_sequence = candidate_sequence[i_start:-i_end]
179 |                 align_seq1,align_seq2,align_score = SMalignment(candidate_sequence,query_seq)
180 |                 if align_score>0.7:
181 |                     print("find in chromosome "+chr_names[chr_index]+": "+str(candidate_seq_pos+i_start)+' ---> '+str(candidate_seq_pos+i_start+len(candidate_sequence)-1)+", align score: "+str(align_score))
182 |                     Display(align_seq1, align_seq2)
183 |     return None
184 | 
185 | if __name__ == "__main__":
186 |     chr_names = np.load('/home/jxiaoae/class/blast/GRCh37_chr_names.npy')
187 |     chrom_seek_index = np.load('/home/jxiaoae/class/blast/GRCh37_chrom_seek_index.npy')
188 |     hg19 = open("/home/share/GRCh37/human_g1k_v37.fasta")
189 |     query_sequence = 'GTATCGGAACTTCCAACTTGTAGGCAAAATAGATATGCTTCATATTCTTAAAAACCACAAGAAA'
190 |     Blast(query_sequence)


--------------------------------------------------------------------------------