├── .DS_Store
├── LICENSE
├── README.md
├── model
│   ├── pytorch_HindIII_model_40000
│   └── pytorch_model_12000
└── src
    ├── HindIII_train.txt
    ├── model.py
    ├── runHiCPlus.py
    ├── testConvNet.py
    ├── trainConvNet.py
    └── utils.py

/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangyan32/HiCPlus_pytorch/62c3cd674a32d8f2f8ecd296da7ca7bdd3ba087d/.DS_Store
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2017 Yan Zhang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
HiCPlus
Implemented in PyTorch

## Dependency

* [Python](https://www.python.org) (2.7) with NumPy and SciPy. We recommend using the [Anaconda](https://www.continuum.io) distribution to install Python.

* [PyTorch](http://pytorch.org/) (0.1.12_2). GPU acceleration is not required but strongly recommended.

## Installation
Clone the repo to your local folder.

```
$ git clone https://github.com/zhangyan32/HiCPlus_pytorch.git
```

## Usage

### Prediction
If the user does not want to train a model, just use [runHiCPlus.py](https://github.com/zhangyan32/HiCPlus_pytorch/blob/master/src/runHiCPlus.py) with one of the pre-trained models in `model/` to generate the enhanced Hi-C interaction matrix.


### Training
In the training stage, both high-resolution and low-resolution Hi-C samples are needed. The two sample sets should have the same shape (N, 1, n, n), where N is the number of samples and n is the sample size. Samples with the same index should come from the same genomic location in the two input data sets.

### Prediction
Only low-resolution Hi-C samples are needed. The shape of the samples should be the same as in the training stage. The prediction generates the enhanced Hi-C data, and the user should recombine the output to obtain the entire Hi-C matrix.

### Suggested way to generate samples
We suggest writing out a file that records the genomic location of each sample at the time the n x n samples are generated. Then, after obtaining the enhanced high-resolution samples, it is easy to recombine all of them into the full high-resolution Hi-C matrix.
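
For example, a minimal sketch of this bookkeeping (the 40 x 40 window, step of 25 bins, and diagonal cutoff mirror `src/utils.py`; the file names are placeholders):

```
import numpy as np

matrix = np.load('chr21_10kb.npy')  # placeholder: a dense intra-chromosomal contact matrix
window, step, chrN = 40, 25, 21

samples, index = [], []
for i in range(0, matrix.shape[0] - window, step):
    for j in range(0, matrix.shape[1] - window, step):
        if abs(i - j) > 201:  # keep only samples near the diagonal, as utils.divide does
            continue
        samples.append([matrix[i:i + window, j:j + window]])
        index.append((chrN, i, j))  # remember where each sample came from

np.save('chr21_samples.npy', np.array(samples))  # shape (N, 1, 40, 40)
np.save('chr21_index.npy', np.array(index))
```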

### Normalization and experimental condition
Hi-C experiments use several different cutting enzymes as well as different normalization methods. Our model can handle all of these conditions, as long as training and testing are performed under the same condition. For example, if KR-normalized samples are used in the training stage, the trained model only works on KR-normalized low-resolution samples.

--------------------------------------------------------------------------------
/model/pytorch_HindIII_model_40000:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangyan32/HiCPlus_pytorch/62c3cd674a32d8f2f8ecd296da7ca7bdd3ba087d/model/pytorch_HindIII_model_40000
--------------------------------------------------------------------------------
/model/pytorch_model_12000:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhangyan32/HiCPlus_pytorch/62c3cd674a32d8f2f8ecd296da7ca7bdd3ba087d/model/pytorch_model_12000
--------------------------------------------------------------------------------
/src/model.py:
--------------------------------------------------------------------------------
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils import data
import gzip
import sys
import torch.optim as optim

conv2d1_filters_numbers = 8
conv2d1_filters_size = 9
conv2d2_filters_numbers = 8
conv2d2_filters_size = 1
conv2d3_filters_numbers = 1
conv2d3_filters_size = 5


class Net(nn.Module):
    def __init__(self, D_in, D_out):
        # D_in and D_out are unused; they are kept so that existing callers
        # such as Net(40, 28) continue to work.
        super(Net, self).__init__()
        # Three valid convolutions: 1 input channel -> 8 feature maps (9x9),
        # 8 -> 8 (1x1), and 8 -> 1 (5x5).
        self.conv1 = nn.Conv2d(1, conv2d1_filters_numbers, conv2d1_filters_size)
        self.conv2 = nn.Conv2d(conv2d1_filters_numbers, conv2d2_filters_numbers, conv2d2_filters_size)
        self.conv3 = nn.Conv2d(conv2d2_filters_numbers, 1, conv2d3_filters_size)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.conv3(x)
        x = F.relu(x)
        return x
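
# A quick shape check (a sketch, assuming the default filter sizes above): each
# valid convolution trims kernel_size - 1 bins, so a 40x40 input becomes
# 40 - 8 = 32 after conv1, stays 32 after the 1x1 conv2, and 32 - 4 = 28 after
# conv3. Net therefore maps a (N, 1, 40, 40) batch to (N, 1, 28, 28), e.g.:
#
#   net = Net(40, 28)
#   out = net(Variable(torch.zeros(1, 1, 40, 40)))  # out.size() == (1, 1, 28, 28)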

'''
def num_flat_features(self, x):
    size = x.size()[1:]  # all dimensions except the batch dimension
    num_features = 1
    for s in size:
        num_features *= s
    return num_features
'''

'''
net = Net(40, 24)


#sys.exit()
#low_resolution_samples = low_resolution_samples.reshape((low_resolution_samples.shape[0], 40, 40))
#print low_resolution_samples[0:1, :,: ,: ].shape
#low_resolution_samples = torch.from_numpy(low_resolution_samples[0:1, :,: ,: ])
#X = Variable(low_resolution_samples)
#print X
#Y = Variable(torch.from_numpy(Y[0]))
#X = Variable(torch.randn(1, 1, 40, 40))
#print X
optimizer = optim.SGD(net.parameters(), lr=0.0001, momentum=0.9)
criterion = nn.MSELoss()
for epoch in range(2):  # loop over the dataset multiple times
    print "epoch", epoch

    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs, labels = data
        #print(inputs.size())
        #print(labels.size())
        #print type(inputs)

        # wrap them in Variable
        inputs, labels = Variable(inputs), Variable(labels)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        #print outputs
        loss = criterion(outputs, labels)

        loss.backward()
        optimizer.step()
        print i
        # print statistics
        #print type(loss)
        #print loss.data[0]


print('Finished Training')


output = net(X)
print(output)

loss = criterion(output, Y)


net.zero_grad()  # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.weight.grad)
'''
--------------------------------------------------------------------------------
/src/runHiCPlus.py:
--------------------------------------------------------------------------------
# Author: Yan Zhang
# Email: zhangyan.cse (@) gmail.com

import sys
import numpy as np
import matplotlib.pyplot as plt
import pickle
import os
import gzip
import model
from torch.utils import data
import torch
import torch.optim as optim
from torch.autograd import Variable
from time import gmtime, strftime
import torch.nn as nn
import utils
import math

use_gpu = 1

conv2d1_filters_numbers = 8
conv2d1_filters_size = 9
conv2d2_filters_numbers = 8
conv2d2_filters_size = 1
conv2d3_filters_numbers = 1
conv2d3_filters_size = 5


down_sample_ratio = 16
epochs = 10
HiC_max_value = 100


# Input Hi-C matrix to enhance. Replace the path with your own contact matrix.
input_file = '/home/zhangyan/Desktop/chr21.10kb.matrix'
low_resolution_samples, index = utils.divide(input_file)

low_resolution_samples = np.minimum(HiC_max_value, low_resolution_samples)

batch_size = low_resolution_samples.shape[0]

# Compute the padding lost to the valid convolutions, so that the predictions
# can later be placed back into the full matrix.
sample_size = low_resolution_samples.shape[-1]
padding = conv2d1_filters_size + conv2d2_filters_size + conv2d3_filters_size - 3
half_padding = padding / 2
output_length = sample_size - padding
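
# Sanity check on the arithmetic above: the three valid convolutions in
# model.Net trim (9 - 1) + (1 - 1) + (5 - 1) = 12 bins in total, so
# padding = 12, half_padding = 6 and output_length = 40 - 12 = 28; every
# 40x40 input sample yields a 28x28 prediction.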


print low_resolution_samples.shape

lowres_set = data.TensorDataset(torch.from_numpy(low_resolution_samples), torch.from_numpy(np.zeros(low_resolution_samples.shape[0])))
lowres_loader = torch.utils.data.DataLoader(lowres_set, batch_size=batch_size, shuffle=False)

hires_loader = lowres_loader

net = model.Net(40, 28)  # use a name other than `model` so the imported module is not shadowed
net.load_state_dict(torch.load('../model/pytorch_model_12000'))
if use_gpu:
    net = net.cuda()

_loss = nn.MSELoss()


running_loss = 0.0
running_loss_validate = 0.0
reg_loss = 0.0


for i, (v1, v2) in enumerate(zip(lowres_loader, hires_loader)):
    _lowRes, _ = v1
    _highRes, _ = v2

    _lowRes = Variable(_lowRes).float()
    _highRes = Variable(_highRes).float()

    if use_gpu:
        _lowRes = _lowRes.cuda()
        _highRes = _highRes.cuda()
    y_prediction = net(_lowRes)

    print '-------', i, running_loss, strftime("%Y-%m-%d %H:%M:%S", gmtime())


y_predict = y_prediction.data.cpu().numpy()

print y_predict.shape

# recombine samples

length = int(y_predict.shape[2])
y_predict = np.reshape(y_predict, (y_predict.shape[0], length, length))


chrs_length = [249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566]

chrN = 21

length = chrs_length[chrN - 1] / 10000

prediction_1 = np.zeros((length, length))


print 'predicted sample: ', y_predict.shape, '; index shape is: ', index.shape
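
# Each row of index holds (tag, chromosome, row offset, column offset), as
# written by utils.divide. Every 28x28 prediction is pasted back half_padding
# (6) bins inside its original 40x40 window, i.e. into rows x+6:x+34 and
# columns y+6:y+34, so the enhanced values line up with the input matrix.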
for i in range(0, y_predict.shape[0]):
    if int(index[i][1]) != chrN:
        continue
    x = int(index[i][2])
    y = int(index[i][3])
    prediction_1[x + 6:x + 34, y + 6:y + 34] = y_predict[i]

np.save(input_file + 'enhanced.npy', prediction_1)
--------------------------------------------------------------------------------
/src/testConvNet.py:
--------------------------------------------------------------------------------
# Author: Yan Zhang
# Email: zhangyan.cse (@) gmail.com

import sys
import numpy as np
import matplotlib.pyplot as plt
import pickle
import os
import gzip
import model
from torch.utils import data
import torch
import torch.optim as optim
from torch.autograd import Variable
from time import gmtime, strftime
import torch.nn as nn

use_gpu = 1

conv2d1_filters_numbers = 8
conv2d1_filters_size = 9
conv2d2_filters_numbers = 8
conv2d2_filters_size = 1
conv2d3_filters_numbers = 1
conv2d3_filters_size = 5


down_sample_ratio = 16
epochs = 10
HiC_max_value = 100


# This block is the actual data used in the paper. It is too large to put on
# GitHub, so only toy data is used below.
# cell = "GM12878_replicate"
# chrN_range1 = '1_8'
# chrN_range = '1_8'

# low_resolution_samples = np.load(gzip.GzipFile('/home/zhangyan/SRHiC_samples/'+cell+'down16_chr'+chrN_range+'.npy.gz', "r")).astype(np.float32) * down_sample_ratio
# high_resolution_samples = np.load(gzip.GzipFile('/home/zhangyan/SRHiC_samples/original10k/'+cell+'_original_chr'+chrN_range+'.npy.gz', "r")).astype(np.float32)

# low_resolution_samples = np.minimum(HiC_max_value, low_resolution_samples)
# high_resolution_samples = np.minimum(HiC_max_value, high_resolution_samples)


low_resolution_samples = np.load(gzip.GzipFile('../../data/GM12878_replicate_down16_chr19_22.npy.gz', "r")).astype(np.float32) * down_sample_ratio

low_resolution_samples = np.minimum(HiC_max_value, low_resolution_samples)

batch_size = low_resolution_samples.shape[0]

# Compute the crop and padding sizes used to build the target values.
sample_size = low_resolution_samples.shape[-1]
padding = conv2d1_filters_size + conv2d2_filters_size + conv2d3_filters_size - 3
half_padding = padding / 2
output_length = sample_size - padding


print low_resolution_samples.shape

lowres_set = data.TensorDataset(torch.from_numpy(low_resolution_samples), torch.from_numpy(np.zeros(low_resolution_samples.shape[0])))
lowres_loader = torch.utils.data.DataLoader(lowres_set, batch_size=batch_size, shuffle=False)

production = False
try:
    high_resolution_samples = np.load(gzip.GzipFile('../../data/GM12878_replicate_original_chr19_22.npy.gz', "r")).astype(np.float32)
    high_resolution_samples = np.minimum(HiC_max_value, high_resolution_samples)
    Y = []
    for i in range(high_resolution_samples.shape[0]):
        no_padding_sample = high_resolution_samples[i][0][half_padding:(sample_size - half_padding), half_padding:(sample_size - half_padding)]
        Y.append(no_padding_sample)
    Y = np.array(Y).astype(np.float32)
    hires_set = data.TensorDataset(torch.from_numpy(Y), torch.from_numpy(np.zeros(Y.shape[0])))
    hires_loader = torch.utils.data.DataLoader(hires_set, batch_size=batch_size, shuffle=False)
except IOError:  # the ground-truth samples are not available
    production = True
    hires_loader = lowres_loader
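
# In production mode there is no ground truth: hires_loader is just an alias
# for lowres_loader, and the MSE below is computed only when production is False.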

Net = model.Net(40, 28)
Net.load_state_dict(torch.load('../model/pytorch_model_12000'))
if use_gpu:
    Net = Net.cuda()

_loss = nn.MSELoss()


running_loss = 0.0
running_loss_validate = 0.0
reg_loss = 0.0


for i, (v1, v2) in enumerate(zip(lowres_loader, hires_loader)):
    _lowRes, _ = v1
    _highRes, _ = v2

    _lowRes = Variable(_lowRes)
    _highRes = Variable(_highRes)

    if use_gpu:
        _lowRes = _lowRes.cuda()
        _highRes = _highRes.cuda()
    y_prediction = Net(_lowRes)
    if not production:
        loss = _loss(y_prediction, _highRes)
        running_loss += loss.data[0]

    print '-------', i, running_loss, strftime("%Y-%m-%d %H:%M:%S", gmtime())


y_prediction = y_prediction.data.cpu().numpy()

print y_prediction.shape
--------------------------------------------------------------------------------
/src/trainConvNet.py:
--------------------------------------------------------------------------------
# Author: Yan Zhang
# Email: zhangyan.cse (@) gmail.com

import sys
import numpy as np
import matplotlib.pyplot as plt
import pickle
import os
import gzip
import model
from torch.utils import data
import torch
import torch.optim as optim
from torch.autograd import Variable
from time import gmtime, strftime
import torch.nn as nn

use_gpu = 1

conv2d1_filters_numbers = 8
conv2d1_filters_size = 9
conv2d2_filters_numbers = 8
conv2d2_filters_size = 1
conv2d3_filters_numbers = 1
conv2d3_filters_size = 5


down_sample_ratio = 16
epochs = 10
HiC_max_value = 100
batch_size = 256


# This block is the actual training data used in the paper. The training data
# is too large to put on GitHub, so only toy data is used here.
# cell = "GM12878_replicate"
# chrN_range1 = '1_8'
# chrN_range = '1_8'

# low_resolution_samples = np.load(gzip.GzipFile('/home/zhangyan/SRHiC_samples/'+cell+'down16_chr'+chrN_range+'.npy.gz', "r")).astype(np.float32) * down_sample_ratio
# high_resolution_samples = np.load(gzip.GzipFile('/home/zhangyan/SRHiC_samples/original10k/'+cell+'_original_chr'+chrN_range+'.npy.gz', "r")).astype(np.float32)

# low_resolution_samples = np.minimum(HiC_max_value, low_resolution_samples)
# high_resolution_samples = np.minimum(HiC_max_value, high_resolution_samples)


#low_resolution_samples = np.load(gzip.GzipFile('../../data/GM12878_replicate_down16_chr19_22.npy.gz', "r")).astype(np.float32) * down_sample_ratio
#high_resolution_samples = np.load(gzip.GzipFile('../../data/GM12878_replicate_original_chr19_22.npy.gz', "r")).astype(np.float32)

low_resolution_samples = np.load(gzip.GzipFile('/home/zhangyan/SRHiC_samples/IMR90_down_HINDIII16_chr1_8.npy.gz', "r")).astype(np.float32) * down_sample_ratio
high_resolution_samples = np.load(gzip.GzipFile('/home/zhangyan/SRHiC_samples/original10k/_IMR90_HindIII_original_chr1_8.npy.gz', "r")).astype(np.float32)


low_resolution_samples = np.minimum(HiC_max_value, low_resolution_samples)
high_resolution_samples = np.minimum(HiC_max_value, high_resolution_samples)


# Crop the high-quality Hi-C samples for use as the target values of the training.
sample_size = low_resolution_samples.shape[-1]
padding = conv2d1_filters_size + conv2d2_filters_size + conv2d3_filters_size - 3
half_padding = padding / 2
output_length = sample_size - padding
Y = []
for i in range(high_resolution_samples.shape[0]):
    no_padding_sample = high_resolution_samples[i][0][half_padding:(sample_size - half_padding), half_padding:(sample_size - half_padding)]
    Y.append(no_padding_sample)
Y = np.array(Y).astype(np.float32)
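
# The 40x40 high-resolution samples are center-cropped by half_padding (6)
# bins on each side, giving 28x28 targets that match the output size of the
# network's valid convolutions.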

print low_resolution_samples.shape, Y.shape

lowres_set = data.TensorDataset(torch.from_numpy(low_resolution_samples), torch.from_numpy(np.zeros(low_resolution_samples.shape[0])))
lowres_loader = torch.utils.data.DataLoader(lowres_set, batch_size=batch_size, shuffle=False)

hires_set = data.TensorDataset(torch.from_numpy(Y), torch.from_numpy(np.zeros(Y.shape[0])))
hires_loader = torch.utils.data.DataLoader(hires_set, batch_size=batch_size, shuffle=False)


Net = model.Net(40, 28)

if use_gpu:
    Net = Net.cuda()

optimizer = optim.SGD(Net.parameters(), lr=0.00001)
_loss = nn.MSELoss()
Net.train()

running_loss = 0.0
running_loss_validate = 0.0
reg_loss = 0.0

# write the log file to record the training process
log = open('HindIII_train.txt', 'w')
for epoch in range(0, 100000):
    for i, (v1, v2) in enumerate(zip(lowres_loader, hires_loader)):
        if i == len(lowres_loader) - 1:
            continue  # skip the last, possibly smaller, batch
        _lowRes, _ = v1
        _highRes, _ = v2

        _lowRes = Variable(_lowRes)
        _highRes = Variable(_highRes)

        if use_gpu:
            _lowRes = _lowRes.cuda()
            _highRes = _highRes.cuda()
        optimizer.zero_grad()
        y_prediction = Net(_lowRes)

        loss = _loss(y_prediction, _highRes)

        loss.backward()
        optimizer.step()

        running_loss += loss.data[0]

    # i batches (indices 0 .. i-1) were accumulated, so running_loss/i is the epoch mean
    print '-------', i, epoch, running_loss / i, strftime("%Y-%m-%d %H:%M:%S", gmtime())

    log.write(str(epoch) + ', ' + str(running_loss / i) + '\n')
    running_loss = 0.0
    running_loss_validate = 0.0
    # save the model every 100 epochs
    if epoch % 100 == 0:
        torch.save(Net.state_dict(), '/home/zhangyan/pytorch_models/pytorch_HindIII_model_' + str(epoch))
--------------------------------------------------------------------------------
/src/utils.py:
--------------------------------------------------------------------------------
import numpy as np
import matplotlib.pyplot as plt
import os


def readSparseMatrix(filename, total_length):
    # Read a sparse (bin, bin, count) contact list into a symmetric dense matrix.
    print "reading Rao's HiC "
    infile = open(filename).readlines()
    print len(infile)
    HiC = np.zeros((total_length, total_length)).astype(np.int16)
    percentage_finish = 0
    for i in range(0, len(infile)):
        if i % (len(infile) / 10) == 0:
            print 'finish ', percentage_finish, '%'
            percentage_finish += 10
        nums = infile[i].split('\t')
        x = int(nums[0])
        y = int(nums[1])
        val = int(float(nums[2]))

        HiC[x][y] = val
        HiC[y][x] = val
    return HiC


def readSquareMatrix(filename, total_length):
    print "reading Rao's HiC "
    infile = open(filename).readlines()
    print('size of matrix is ' + str(len(infile)))
    print('number of bins based on the length of the chromosome is ' + str(total_length))
    result = []
    for line in infile:
        tokens = line.split('\t')
        line_int = list(map(int, tokens))
        result.append(line_int)
    result = np.array(result)
    print(result.shape)
    return result
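
# divide() slides a 40x40 window over the intra-chromosomal matrix with a step
# of 25 bins, keeps only windows within 201 bins (about 2 Mb at 10 kb
# resolution) of the diagonal, and records (tag, chrN, i, j) for each sample
# so that the predictions can later be stitched back into a full matrix.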

def divide(HiCfile):
    subImage_size = 40
    step = 25
    chrs_length = [249250621, 243199373, 198022430, 191154276, 180915260, 171115067, 159138663, 146364022, 141213431, 135534747, 135006516, 133851895, 115169878, 107349540, 102531392, 90354753, 81195210, 78077248, 59128983, 63025520, 48129895, 51304566]
    input_resolution = 10000
    result = []
    index = []
    chrN = 21
    matrix_name = HiCfile + '_npy_form_tmp.npy'
    if os.path.exists(matrix_name):
        print 'loading ', matrix_name
        HiCsample = np.load(matrix_name)
    else:
        print matrix_name, 'does not exist, creating'
        print HiCfile
        HiCsample = readSquareMatrix(HiCfile, (chrs_length[chrN - 1] / input_resolution + 1))
        #HiCsample = np.loadtxt('/home/zhangyan/private_data/IMR90.nodup.bam.chr'+str(chrN)+'.10000.matrix', dtype=np.int16)
        print HiCsample.shape
        np.save(matrix_name, HiCsample)
    print HiCsample.shape
    path = '/home/zhangyan/HiCPlus_pytorch_production/'
    if not os.path.exists(path):
        os.makedirs(path)
    total_loci = HiCsample.shape[0]
    for i in range(0, total_loci, step):
        for j in range(0, total_loci, step):
            if abs(i - j) > 201 or i + subImage_size >= total_loci or j + subImage_size >= total_loci:
                continue
            subImage = HiCsample[i:i + subImage_size, j:j + subImage_size]

            result.append([subImage, ])
            tag = 'test'
            index.append((tag, chrN, i, j))
    result = np.array(result)
    print result.shape
    result = result.astype(np.double)
    index = np.array(index)
    return result, index
--------------------------------------------------------------------------------