├── README.md
├── RepEnrich.py
└── RepEnrich_setup.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

## There is currently a newer version of RepEnrich located here: https://github.com/nerettilab/RepEnrich2

# RepEnrich
## Tutorial by Steven Criscione
Email: [steven_criscione@brown.edu](mailto:steven_criscione@brown.edu)

### Dependencies
This example is for the mouse genome **mm9**. Before getting started you
should make sure you have installed the dependencies for RepEnrich.
RepEnrich requires Python version 2.7.3.
RepEnrich also requires: [Bowtie 1](http://bowtie-bio.sourceforge.net/index.shtml),
[bedtools](http://bedtools.readthedocs.org/en/latest/),
and [samtools](http://www.htslib.org/). I am using bedtools 2.20.1,
bowtie 1 0.12.9, and samtools 0.1.19.
RepEnrich also requires a bowtie1-indexed genome in fasta format
(example: `mm9.fa`).
The RepEnrich python scripts also use [BioPython](http://biopython.org), which
can be installed with the following command:

    pip install BioPython

IMPORTANT: bedtools versions 2.24.0 and greater yield an error due to altered functionality of [coverageBed](http://bedtools.readthedocs.org/en/latest/content/tools/coverage.html).
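Before moving on, it can be worth confirming that the versions on your `PATH` match the ones above. A quick check (version-string formats vary slightly between releases, so treat this as a sketch):

    # Verify tool versions (bedtools must be older than 2.24.0; see note above)
    bowtie --version | head -n 1
    bedtools --version
    samtools 2>&1 | grep -i 'version'
    python --version                                 # should report 2.7.x
    python -c "import Bio; print(Bio.__version__)"   # confirms BioPython imports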
### Step 1) Obtain repetitive element annotation
I have temporarily provided the setup for the human genome (builds hg19 and hg38) and the mouse genome (build mm9) [here](https://www.dropbox.com/sh/xpd68jhw7bwd9ie/AAA_IJzhTn1GhawoGnXgtH6Ca?dl=0). After downloading you can extract the files using:

    gunzip hg19_repeatmasker_clean.txt.gz
    tar -zxvf RepEnrich_setup_hg19.tar.gz

This yields the `hg19_repeatmasker_clean.txt` annotation file and the `RepEnrich_setup_hg19` setup folder. The annotation files I am using are repeatmasker files with simple and low-complexity repeats removed (satellite repeats and transposons are still present). If you choose to use these files for the setup you can skip ahead to step 2.

The RepEnrich setup script will build the annotation
required by RepEnrich. The default is a repeatmasker file, which can be
downloaded from [repeatmasker.org](http://www.repeatmasker.org/genomicDatasets/RMGenomicDatasets.html)
(for instance, find the `mm9.fa.out.gz` download
[here](http://www.repeatmasker.org/genomes/mm9/RepeatMasker-rm328-db20090604/mm9.fa.out.gz)).
Once you have downloaded the file you can unzip it and rename it:

    gunzip mm9.fa.out.gz
    mv mm9.fa.out mm9_repeatmasker.txt

This is what the file looks like:

       SW   perc perc perc  query    position in query             matching  repeat        position in repeat
    score   div. del. ins.  sequence begin    end        (left)    repeat    class/family  begin  end (left)  ID
      687   17.4  0.0  0.0  chr1     3000002  3000156 (194195276)  C L1_Mur2 LINE/L1       (4310) 1567  1413   1
      917   21.4 11.4  4.5  chr1     3000238  3000733 (194194699)  C L1_Mur2 LINE/L1       (4488) 1389   913   1
      845   23.3  7.6 11.4  chr1     3000767  3000792 (194194640)  C L1_Mur2 LINE/L1       (6816)  912   887   1
      621   25.0  6.5  3.7  chr1     3001288  3001583 (194193849)  C Lx9     LINE/L1       (1596) 6048  5742   3

The RepEnrich setup script will also allow you to build the annotation
required by RepEnrich for a custom set of elements using a bed file.
For example, if you want to examine mm9 LTR repetitive elements, you can
build this file using the repeatmasker track from the
[UCSC genome table browser](http://genome.ucsc.edu/cgi-bin/hgTables).

To do this, select genome `mm9`, click the edit box next to _Filter_, fill
in `LTR` for "repClass does match", then click submit. Back at the table
browser, select the option `Selected fields from primary and related tables`,
name the output file something like `mm9_LTR_repeatmasker.bed`, and click
`Get output`. On the next page select `genoName`, `genoStart`, `genoEnd`,
`repName`, `repClass`, `repFamily`, then download the file.

UCSC puts a header on the file that needs to be removed:

    tail -n +3 mm9_LTR_repeatmasker.bed | head -n -4 > mm9_LTR_repeatmasker_fix.bed
    mv mm9_LTR_repeatmasker_fix.bed mm9_LTR_repeatmasker.bed

This is what our custom mm9 LTR retrotransposon bed file looks like:

    $ head mm9_LTR_repeatmasker.bed

    chr1    3001722    3002005    RLTR25A       LTR    ERVK
    chr1    3002051    3002615    RLTR25A       LTR    ERVK
    chr1    3016886    3017193    RLTRETN_Mm    LTR    ERVK
    chr1    3018338    3018653    RLTR14        LTR    ERV1

Note: It is important to get the column format right:

* Column 1: Chromosome
* Column 2: Start
* Column 3: End
* Column 4: Repeat_name
* Column 5: Class
* Column 6: Family

The file should be tab-delimited. If there is no information on class
or family, you can replace these columns with the repeat name or an
arbitrary label such as `group1`.
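One optional sanity check before running the setup: the setup script reads all six tab-separated columns, so a malformed row is easier to catch now than mid-run. A minimal check (the filename is the one used above):

    # Print any rows that do NOT have exactly 6 tab-separated fields
    awk -F'\t' 'NF != 6' mm9_LTR_repeatmasker.bed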
### Step 2) Run the setup for RepEnrich

Now that we have our annotation files we can move on to running the setup
for RepEnrich. First load the dependencies (if you use Environment
Modules - otherwise just make sure these programs are available in your
`PATH`):

    module load bowtie
    module load bedtools
    module load samtools

Next run the setup using the type of annotation you have selected. For the
default annotation:

    python RepEnrich_setup.py /data/mm9_repeatmasker.txt /data/mm9.fa /data/setup_folder_mm9

For a custom bed file:

    python RepEnrich_setup.py /data/mm9_LTR_repeatmasker.bed /data/mm9.fa /data/setup_folder_mm9 --is_bed TRUE

The previous commands set up the RepEnrich annotation that is used in
downstream analysis of the data. You only have to do the setup step once for
an organism of interest. One cautionary note is that RepEnrich is only
as reliable as the genome annotation of repetitive elements for your
organism of interest. Therefore, RepEnrich performance may not be
optimal for poorly annotated genomes.

### Step 3) Map the data to the genome using bowtie1

After the RepEnrich setup we now have to map our data uniquely to the
genome before running RepEnrich. This is because RepEnrich treats uniquely
mapping and multi-mapping reads separately, which requires specific bowtie
options. The bowtie command below is recommended for RepEnrich:

    bowtie /data/mm9 -p 16 -t -m 1 -S --max /data/sampleA_multimap.fastq sample_A.fastq /data/sampleA_unique.sam

An explanation of the bowtie options:

* `bowtie /data/mm9` - the bowtie1 index of the genome
* `-p 16` - use 16 cpus
* `-t` - print time
* `-m 1` - only report uniquely mapping reads
* `-S` - output SAM
* `--max /data/sampleA_multimap.fastq` - output multi-mapping reads to `sampleA_multimap.fastq`
* `/data/sampleA_unique.sam` - output file for the uniquely mapping reads

For paired-end reads the bowtie command is:

    bowtie /data/mm9 -p 16 -t -m 1 -S --max /data/sampleA_multimap.fastq -1 sample_A_1.fastq -2 sample_A_2.fastq /data/sampleA_unique.sam

The SAM file should be converted to a BAM file with samtools:

    samtools view -bS sampleA_unique.sam > sampleA_unique.bam
    samtools sort sampleA_unique.bam sampleA_unique_sorted
    mv sampleA_unique_sorted.bam sampleA_unique.bam
    samtools index sampleA_unique.bam
    rm sampleA_unique.sam

You should now compute the total mapping reads for your alignment. This
includes the reads that mapped uniquely (`sampleA_unique.bam`)
and those that mapped more than once (`sampleA_multimap.fastq`). The `.out`
file from your bowtie batch script contains this information (or `stdout`
from an interactive job). It should look like this:

    Seeded quality full-index search: 00:32:26
    # reads processed: 92084909
    # reads with at least one reported alignment: 48299773 (52.45%)
    # reads that failed to align: 17061693 (18.53%)
    # reads with alignments suppressed due to -m: 26723443 (29.02%)
    Reported 48299773 alignments to 1 output stream(s)

The total mapping reads is `# reads processed` minus
`# reads that failed to align`. Here our total mapping reads are:
`92084909 - 17061693 = 75023216`
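If you prefer to compute this number automatically, a small shell snippet along these lines will pull the two figures out of the bowtie log (here assumed to be saved as `bowtie_sampleA.out`; the field positions match the log format shown above):

    # total mapping reads = (# reads processed) - (# reads that failed to align)
    processed=$(awk '/^# reads processed/ {print $4}' bowtie_sampleA.out)
    failed=$(awk '/^# reads that failed to align/ {print $7}' bowtie_sampleA.out)
    echo $((processed - failed))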
### Step 4) Run RepEnrich on the data

Now we have all the information we need to run RepEnrich.
Here is an example for the default annotation:

    python RepEnrich.py /data/mm9_repeatmasker.txt /data/sample_A sample_A /data/setup_folder_mm9 sampleA_multimap.fastq sampleA_unique.bam --cpus 16

For custom bed file annotation:

    python RepEnrich.py /data/mm9_LTR_repeatmasker.bed /data/sample_A sample_A /data/setup_folder_mm9 sampleA_multimap.fastq sampleA_unique.bam --is_bed TRUE --cpus 16

An explanation of the RepEnrich command:

    python RepEnrich.py
      <annotation_file>
      <outputfolder>
      <outputprefix>
      <setup_folder>
      <fastqfile>
      <alignment_bam>
      (--is_bed TRUE)
      (--cpus 16)

If you have paired-end data the command is very similar. Instead of a single
`sampleA_multimap.fastq` there will be two files, `sampleA_multimap_1.fastq`
and `sampleA_multimap_2.fastq`, from the bowtie step.

The command for running RepEnrich in this case is (for the default annotation):

    python RepEnrich.py /data/mm9_repeatmasker.txt /data/sample_A sample_A /data/setup_folder_mm9 sampleA_multimap_1.fastq --fastqfile2 sampleA_multimap_2.fastq sampleA_unique.bam --cpus 16 --pairedend TRUE

For custom bed file annotation:

    python RepEnrich.py /data/mm9_LTR_repeatmasker.bed /data/sample_A sample_A /data/setup_folder_mm9 sampleA_multimap_1.fastq --fastqfile2 sampleA_multimap_2.fastq sampleA_unique.bam --is_bed TRUE --cpus 16 --pairedend TRUE

### Step 5) Processing the output of RepEnrich

The final outputs will be in the path `/data/sample_A`. This will include a
few files, the most important of which is the `sample_A_fraction_counts.txt`
file. This contains the estimated counts for the repeats. I use this file to
build a table of counts for all my conditions (by pasting the individual
`*_fraction_counts.txt` files together for my complete experiment).

You can use the compiled counts file to do differential expression analysis
similar to what is done for genes. We use
[EdgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
or [DESeq](http://bioconductor.org/packages/release/bioc/html/DESeq.html)
to do the differential expression analysis. These are R packages that you can
download from [bioconductor](http://bioconductor.org/).

When running the EdgeR differential expression analysis
you can follow the examples in the EdgeR manual. I manually input the
library sizes (the total mapping reads we obtained in the tutorial).
Some of the downstream analysis, though, is left to your discretion.
There are multiple ways you can do the differential expression analysis.
I use the `GLM` method within the EdgeR package, although DESeq has
similar methods and EdgeR also has a more straightforward approach
called `exactTest`. Below is a sample EdgeR script used to do the
differential analysis of repeats for young, old, and very old mice. The
file `counts.csv` contains the output from RepEnrich that was made by
pasting the individual `*_fraction_counts.txt` files together for my
complete experiment.
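As a sketch of that pasting step (sample names follow the EdgeR example below; the files must list the repeats in the same row order, which is the case when they come from the same setup folder), producing a tab-separated `counts.txt`. The EdgeR script below can equally load the individual files directly, as its first lines show:

    # Keep the repeat name/class/family columns from the first sample, then
    # append the count column (column 4) from each additional sample.
    # Process substitution <( ) requires bash.
    paste young_r1_fraction_counts.txt \
          <(cut -f4 young_r2_fraction_counts.txt) \
          <(cut -f4 young_r3_fraction_counts.txt) \
          <(cut -f4 old_r1_fraction_counts.txt) \
          > counts.txt      # ...and so on for the remaining samples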
## Example script for EdgeR differential enrichment analysis

```r
# EdgeR example

# Setup - install and load edgeR
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library('edgeR')

# In the case of a pre-assembled file of the fraction count output do the following:
# counts <- read.csv(file = "counts.csv")

# In the case of separate outputs, load the RepEnrich results - fraction counts
young_r1 <- read.delim('young_r1_fraction_counts.txt', header=FALSE)
young_r2 <- read.delim('young_r2_fraction_counts.txt', header=FALSE)
young_r3 <- read.delim('young_r3_fraction_counts.txt', header=FALSE)
old_r1 <- read.delim('old_r1_fraction_counts.txt', header=FALSE)
old_r2 <- read.delim('old_r2_fraction_counts.txt', header=FALSE)
old_r3 <- read.delim('old_r3_fraction_counts.txt', header=FALSE)
v_old_r1 <- read.delim('veryold_r1_fraction_counts.txt', header=FALSE)
v_old_r2 <- read.delim('veryold_r2_fraction_counts.txt', header=FALSE)
v_old_r3 <- read.delim('veryold_r3_fraction_counts.txt', header=FALSE)

# Build a counts table
counts <- data.frame(
  row.names = young_r1[,1],
  young_r1 = young_r1[,4], young_r2 = young_r2[,4], young_r3 = young_r3[,4],
  old_r1 = old_r1[,4], old_r2 = old_r2[,4], old_r3 = old_r3[,4],
  v_old_r1 = v_old_r1[,4], v_old_r2 = v_old_r2[,4], v_old_r3 = v_old_r3[,4]
)

# Build a metadata object. I am comparing young, old, and veryold mice.
# I manually input the total mapping reads for each sample.
# The total mapping reads are calculated from the bowtie logs:
# (# reads processed) - (# reads that failed to align)
meta <- data.frame(
  row.names=colnames(counts),
  condition=c("young","young","young","old","old","old","veryold","veryold","veryold"),
  libsize=c(24923593,28340805,21743712,16385707,26573335,28131649,34751164,37371774,28236419)
)

# Define the library size and conditions for the GLM
libsize <- meta$libsize
condition <- factor(meta$condition)
design <- model.matrix(~0+condition)
colnames(design) <- levels(meta$condition)

# Build a DGE object for the GLM
y <- DGEList(counts=counts, lib.size=libsize)

# Normalize the data
y <- calcNormFactors(y)
y$samples
plotMDS(y)

# Estimate the variance
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
plotBCV(y)

# Build an object to contain the normalized read abundance
logcpm <- cpm(y, log=TRUE, lib.size=libsize)
logcpm <- as.data.frame(logcpm)
colnames(logcpm) <- factor(meta$condition)

# Conduct fitting of the GLM
yfit <- glmFit(y, design)

# Initialize result matrices to contain the results of the GLM
results <- matrix(nrow=dim(counts)[1], ncol=0)
logfc <- matrix(nrow=dim(counts)[1], ncol=0)

# Make the comparisons for the GLM
my.contrasts <- makeContrasts(
  veryold_old = veryold - old,
  veryold_young = veryold - young,
  old_young = old - young,
  levels = design
)

# Define the contrasts used in the comparisons
allcontrasts = c(
  "veryold_old",
  "veryold_young",
  "old_young"
)
# Conduct a for loop that will do the fitting of the GLM for each comparison
# Put the results into the results objects
for(current_contrast in allcontrasts) {
  lrt <- glmLRT(yfit, contrast=my.contrasts[,current_contrast])
  plotSmear(lrt, de.tags=rownames(y))
  title(current_contrast)
  res <- topTags(lrt, n=dim(counts)[1], sort.by="none")$table
  colnames(res) <- paste(colnames(res), current_contrast, sep=".")
  results <- cbind(results, res[,c(1,5)])
  logfc <- cbind(logfc, res[c(1)])
}

# Add the repeat types back into the results.
# We should still have the same order as the input data
results$class <- young_r1[,2]
results$type <- young_r1[,3]

# Sort the results table by the logFC
results <- results[with(results, order(-abs(logFC.old_young))), ]

# Save the results
write.table(results, 'results.txt', quote=FALSE, sep="\t")

# Plot fold changes for repeat classes and types
for(current_contrast in allcontrasts) {
  logFC <- results[, paste0("logFC.", current_contrast)]
  # Plot the repeat classes
  classes <- with(results, reorder(class, -logFC, median))
  par(mar=c(6,10,4,1))
  boxplot(logFC ~ classes, data=results, outline=FALSE, horizontal=TRUE,
          las=2, xlab="log(Fold Change)", main=current_contrast)
  abline(v=0)
  # Plot the repeat types
  types <- with(results, reorder(type, -logFC, median))
  boxplot(logFC ~ types, data=results, outline=FALSE, horizontal=TRUE,
          las=2, xlab="log(Fold Change)", main=current_contrast)
  abline(v=0)
}
```

Note that the object `logfc` contains the differential expression (log fold
change) for each contrast, `logcpm` contains the normalized read abundances,
and `results` contains both the differential expression and the false
discovery rate for each experimental comparison. I recommend reading more
about these in the
[EdgeR manual](http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).

--------------------------------------------------------------------------------
/RepEnrich.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
import argparse
import csv
import numpy
import os
import shlex
import shutil
import subprocess
import sys

parser = argparse.ArgumentParser(description='Part II: Conducting the alignments to the pseudogenomes. Before doing this step you will require 1) a bamfile of the unique alignments with index 2) a fastq file of the reads mapping to more than one location. These files can be obtained using the following bowtie options [EXAMPLE: bowtie -S -m 1 --max multimap.fastq mm9 mate1_reads.fastq] Once you have the unique alignment bamfile and the reads mapping to more than one location in a fastq file you can run this step. EXAMPLE: python master_output.py /users/nneretti/data/annotation/hg19/hg19_repeatmasker.txt /users/nneretti/datasets/repeatmapping/POL3/Pol3_human/HeLa_InputChIPseq_Rep1 HeLa_InputChIPseq_Rep1 /users/nneretti/data/annotation/hg19/setup_folder HeLa_InputChIPseq_Rep1_multimap.fastq HeLa_InputChIPseq_Rep1.bam')
parser.add_argument('--version', action='version', version='%(prog)s 0.1')
parser.add_argument('annotation_file', action='store', metavar='annotation_file', help='List RepeatMasker.org annotation file for your organism. The file may be downloaded from the RepeatMasker.org website. Example: /data/annotation/hg19/hg19_repeatmasker.txt')
parser.add_argument('outputfolder', action='store', metavar='outputfolder', help='List folder to contain results. Example: /outputfolder')
parser.add_argument('outputprefix', action='store', metavar='outputprefix', help='Enter prefix name for data. Example: HeLa_InputChIPseq_Rep1')
parser.add_argument('setup_folder', action='store', metavar='setup_folder', help='List folder that contains the repeat element pseudogenomes. Example: /data/annotation/hg19/setup_folder')
parser.add_argument('fastqfile', action='store', metavar='fastqfile', help='Enter file for the fastq reads that map to multiple locations. Example: /data/multimap.fastq')
parser.add_argument('alignment_bam', action='store', metavar='alignment_bam', help='Enter bamfile output for reads that map uniquely. Example: /bamfiles/old.bam')
parser.add_argument('--pairedend', action='store', dest='pairedend', default='FALSE', help='Designate this option for paired-end sequencing. Default FALSE, change to TRUE')
parser.add_argument('--collapserepeat', action='store', dest='collapserepeat', metavar='collapserepeat', default='Simple_repeat', help='Designate this option to generate a collapsed repeat type. Uncollapsed output is generated in addition to the collapsed repeat type. Simple_repeat is the default, to simplify downstream analysis. You can change the default to another repeat name to collapse a different specific repeat instead, or if the name of Simple_repeat differs for your organism. Default Simple_repeat')
parser.add_argument('--fastqfile2', action='store', dest='fastqfile2', metavar='fastqfile2', default='none', help='Enter fastqfile2 when using the paired-end option. Default none')
parser.add_argument('--cpus', action='store', dest='cpus', metavar='cpus', default="1", type=int, help='Enter available cpus per node. The more cpus, the faster RepEnrich performs. RepEnrich is designed to only work on one node. Default: "1"')
parser.add_argument('--allcountmethod', action='store', dest='allcountmethod', metavar='allcountmethod', default="FALSE", help='By default the pipeline only outputs the fraction count method, considered to be the best way to count multimapped reads. Changing this option will also include the unique count method (a conservative count) and the total count method (a liberal counting strategy). Our evaluation of simulated data indicated fraction counting is best. Default FALSE, change to TRUE')
parser.add_argument('--is_bed', action='store', dest='is_bed', metavar='is_bed', default='FALSE', help='Is the annotation file a bed file? This is also a compatible format. The file needs to be a tab-separated bed with optional fields. Ex. format: chr\tstart\tend\tName_element\tclass\tfamily. The class and family should be identical to name_element if not applicable. Default FALSE, change to TRUE')
args = parser.parse_args()

# parameters
annotation_file = args.annotation_file
outputfolder = args.outputfolder
outputfile_prefix = args.outputprefix
setup_folder = args.setup_folder
repeat_bed = setup_folder + os.path.sep + 'repnames.bed'
unique_mapper_bam = args.alignment_bam
fastqfile_1 = args.fastqfile
fastqfile_2 = args.fastqfile2
cpus = args.cpus
b_opt = "-k1 -p 1 --quiet"
simple_repeat = args.collapserepeat
paired_end = args.pairedend
allcountmethod = args.allcountmethod
is_bed = args.is_bed

################################################################################
# check that the programs we need are available
try:
    subprocess.call(shlex.split("coverageBed -h"), stdout=open(os.devnull, 'wb'), stderr=open(os.devnull, 'wb'))
    subprocess.call(shlex.split("bowtie --version"), stdout=open(os.devnull, 'wb'), stderr=open(os.devnull, 'wb'))
except OSError:
    print "Error: Bowtie or BEDTools not loaded"
    raise

################################################################################
# define a csv reader that reads space-delimited files
print 'Preparing for analysis using RepEnrich...'
csv.field_size_limit(sys.maxsize)
def import_text(filename, separator):
    for line in csv.reader(open(filename), delimiter=separator,
                           skipinitialspace=True):
        if line:
            yield line

################################################################################
# build dictionaries to convert repeat classes and repeat families
if is_bed == "FALSE":
    repeatclass = {}
    repeatfamily = {}
    fin = import_text(annotation_file, ' ')
    x = 0
    for line in fin:
        if x > 2:
            classfamily = line[10].split(os.path.sep)
            line9 = line[9].replace("(", "_").replace(")", "_").replace("/", "_")
            repeatclass[line9] = classfamily[0]
            if len(classfamily) == 2:
                repeatfamily[line9] = classfamily[1]
            else:
                repeatfamily[line9] = classfamily[0]
        x += 1
if is_bed == "TRUE":
    repeatclass = {}
    repeatfamily = {}
    fin = open(annotation_file, 'r')
    for line in fin:
        line = line.strip('\n')
        line = line.split('\t')
        theclass = line[4]
        thefamily = line[5]
        line3 = line[3].replace("(", "_").replace(")", "_").replace("/", "_")
        repeatclass[line3] = theclass
        repeatfamily[line3] = thefamily
    fin.close()

################################################################################
# build list of repeats, initializing dictionaries for downstream analysis
fin = import_text(setup_folder + os.path.sep + 'repgenomes_key.txt', '\t')
repeat_key = {}
rev_repeat_key = {}
repeat_list = []
reptotalcounts = {}
classfractionalcounts = {}
familyfractionalcounts = {}
classtotalcounts = {}
familytotalcounts = {}
reptotalcounts_simple = {}
fractionalcounts = {}
i = 0
for line in fin:
    reptotalcounts[line[0]] = 0
    fractionalcounts[line[0]] = 0
    if repeatclass.has_key(line[0]):
        classtotalcounts[repeatclass[line[0]]] = 0
        classfractionalcounts[repeatclass[line[0]]] = 0
    if repeatfamily.has_key(line[0]):
        familytotalcounts[repeatfamily[line[0]]] = 0
        familyfractionalcounts[repeatfamily[line[0]]] = 0
    if repeatfamily.has_key(line[0]):
        if repeatfamily[line[0]] == simple_repeat:
            reptotalcounts_simple[simple_repeat] = 0
    else:
        reptotalcounts_simple[line[0]] = 0
    repeat_list.append(line[0])
    repeat_key[line[0]] = int(line[1])
    rev_repeat_key[int(line[1])] = line[0]
fin.close()
################################################################################
# map the repeats to the pseudogenomes:
if not os.path.exists(outputfolder):
    os.mkdir(outputfolder)
################################################################################
# Conduct the region sorting
print 'Conducting region sorting on unique mapping reads....'
fileout = outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt'
with open(fileout, 'w') as stdout:
    command = shlex.split("coverageBed -abam " + unique_mapper_bam + " -b " + setup_folder + os.path.sep + 'repnames.bed')
    p = subprocess.Popen(command, stdout=stdout)
    p.communicate()
stdout.close()
filein = open(outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt', 'r')
counts = {}
sumofrepeatreads = 0
for line in filein:
    line = line.split('\t')
    if not counts.has_key(str(repeat_key[line[3]])):
        counts[str(repeat_key[line[3]])] = 0
    counts[str(repeat_key[line[3]])] += int(line[4])
    sumofrepeatreads += int(line[4])
print 'Identified ' + str(sumofrepeatreads) + ' unique reads that mapped to repeats.'
################################################################################
if paired_end == 'TRUE':
    if not os.path.exists(outputfolder + os.path.sep + 'pair1_bowtie'):
        os.mkdir(outputfolder + os.path.sep + 'pair1_bowtie')
    if not os.path.exists(outputfolder + os.path.sep + 'pair2_bowtie'):
        os.mkdir(outputfolder + os.path.sep + 'pair2_bowtie')
    folder_pair1 = outputfolder + os.path.sep + 'pair1_bowtie'
    folder_pair2 = outputfolder + os.path.sep + 'pair2_bowtie'
################################################################################
    print "Processing repeat pseudogenomes..."
    ps = []
    psb = []
    ticker = 0
    for metagenome in repeat_list:
        metagenomepath = setup_folder + os.path.sep + metagenome
        file1 = folder_pair1 + os.path.sep + metagenome + '.bowtie'
        file2 = folder_pair2 + os.path.sep + metagenome + '.bowtie'
        with open(file1, 'w') as stdout:
            command = shlex.split("bowtie " + b_opt + " " + metagenomepath + " " + fastqfile_1)
            p = subprocess.Popen(command, stdout=stdout)
        with open(file2, 'w') as stdout:
            command = shlex.split("bowtie " + b_opt + " " + metagenomepath + " " + fastqfile_2)
            pp = subprocess.Popen(command, stdout=stdout)
        ps.append(p)
        ticker += 1
        psb.append(pp)
        ticker += 1
        if ticker == cpus:
            for p in ps:
                p.communicate()
            for p in psb:
                p.communicate()
            ticker = 0
            psb = []
            ps = []
    if len(ps) > 0:
        for p in ps:
            p.communicate()
    # also wait on any remaining second-mate alignments (the original only
    # flushed the first-mate list here, which could leave pair-2 jobs running)
    if len(psb) > 0:
        for p in psb:
            p.communicate()
    stdout.close()

################################################################################
    # combine the output from both read pairs:
    print 'sorting and combining the output for both read pairs...'
    if not os.path.exists(outputfolder + os.path.sep + 'sorted_bowtie'):
        os.mkdir(outputfolder + os.path.sep + 'sorted_bowtie')
    sorted_bowtie = outputfolder + os.path.sep + 'sorted_bowtie'
    for metagenome in repeat_list:
        file1 = folder_pair1 + os.path.sep + metagenome + '.bowtie'
        file2 = folder_pair2 + os.path.sep + metagenome + '.bowtie'
        fileout = sorted_bowtie + os.path.sep + metagenome + '.bowtie'
        with open(fileout, 'w') as stdout:
            # reduce each alignment to a de-duplicated list of read ids:
            # strip anything after a space, strip the /1 or /2 mate suffix,
            # then sort | uniq
            p1 = subprocess.Popen(['cat', file1, file2], stdout=subprocess.PIPE)
            p2 = subprocess.Popen(['cut', '-f1', "-d "], stdin=p1.stdout, stdout=subprocess.PIPE)
            p3 = subprocess.Popen(['cut', '-f1', "-d/"], stdin=p2.stdout, stdout=subprocess.PIPE)
            p4 = subprocess.Popen(['sort'], stdin=p3.stdout, stdout=subprocess.PIPE)
            p5 = subprocess.Popen(['uniq'], stdin=p4.stdout, stdout=stdout)
            p5.communicate()
        stdout.close()
    print 'completed ...'
################################################################################
if paired_end == 'FALSE':
    if not os.path.exists(outputfolder + os.path.sep + 'pair1_bowtie'):
        os.mkdir(outputfolder + os.path.sep + 'pair1_bowtie')
    folder_pair1 = outputfolder + os.path.sep + 'pair1_bowtie'
################################################################################
    ps = []
    ticker = 0
    print "Processing repeat pseudogenomes..."
    for metagenome in repeat_list:
        metagenomepath = setup_folder + os.path.sep + metagenome
        file1 = folder_pair1 + os.path.sep + metagenome + '.bowtie'
        with open(file1, 'w') as stdout:
            command = shlex.split("bowtie " + b_opt + " " + metagenomepath + " " + fastqfile_1)
            p = subprocess.Popen(command, stdout=stdout)
        ps.append(p)
        ticker += 1
        if ticker == cpus:
            for p in ps:
                p.communicate()
            ticker = 0
            ps = []
    if len(ps) > 0:
        for p in ps:
            p.communicate()
    stdout.close()

################################################################################
    # sort the bowtie output (single-end data, so only one read file)
    print 'Sorting the bowtie output....'
    if not os.path.exists(outputfolder + os.path.sep + 'sorted_bowtie'):
        os.mkdir(outputfolder + os.path.sep + 'sorted_bowtie')
    sorted_bowtie = outputfolder + os.path.sep + 'sorted_bowtie'
    for metagenome in repeat_list:
        file1 = folder_pair1 + os.path.sep + metagenome + '.bowtie'
        fileout = sorted_bowtie + os.path.sep + metagenome + '.bowtie'
        with open(fileout, 'w') as stdout:
            p1 = subprocess.Popen(['cat', file1], stdout=subprocess.PIPE)
            p2 = subprocess.Popen(['cut', '-f1'], stdin=p1.stdout, stdout=subprocess.PIPE)
            p3 = subprocess.Popen(['cut', '-f1', "-d/"], stdin=p2.stdout, stdout=subprocess.PIPE)
            p4 = subprocess.Popen(['sort'], stdin=p3.stdout, stdout=subprocess.PIPE)
            p5 = subprocess.Popen(['uniq'], stdin=p4.stdout, stdout=stdout)
            p5.communicate()
        stdout.close()
    print 'completed ...'

################################################################################
# build a file of repeat keys for all reads
print 'Writing and processing intermediate files...'
sorted_bowtie = outputfolder + os.path.sep + 'sorted_bowtie'
readid = {}
sumofrepeatreads = 0
for rep in repeat_list:
    for data in import_text(sorted_bowtie + os.path.sep + rep + '.bowtie', '\t'):
        readid[data[0]] = ''
for rep in repeat_list:
    for data in import_text(sorted_bowtie + os.path.sep + rep + '.bowtie', '\t'):
        readid[data[0]] += str(repeat_key[rep]) + str(',')
for subfamilies in readid.values():
    if not counts.has_key(subfamilies):
        counts[subfamilies] = 0
    counts[subfamilies] += 1
    sumofrepeatreads += 1
del readid
print 'Identified ' + str(sumofrepeatreads) + ' reads that mapped to repeats for unique and multimappers.'

################################################################################
print "Conducting final calculations..."
# build a converter to a numeric label for each repeat and yield a combined
# list of repnames separated by the path separator
def convert(x):
    x = x.strip(',')
    x = x.split(',')
    global repname
    repname = ""
    for i in x:
        repname = repname + os.path.sep + rev_repeat_key[int(i)]
# building the total counts for repeat element enrichment...
for x in counts.keys():
    count = counts[x]
    x = x.strip(',')
    x = x.split(',')
    for i in x:
        reptotalcounts[rev_repeat_key[int(i)]] += int(count)
# building the fractional counts for repeat element enrichment...
# (each multi-mapping read contributes count/splits to every repeat it hit)
for x in counts.keys():
    count = counts[x]
    x = x.strip(',')
    x = x.split(',')
    splits = len(x)
    for i in x:
        fractionalcounts[rev_repeat_key[int(i)]] += float(numpy.divide(float(count), float(splits)))
# building categorized table of repeat element enrichment...
repcounts = {}
repcounts['other'] = 0
for key in counts.keys():
    convert(key)
    repcounts[repname] = counts[key]
# building the total counts for class enrichment...
for key in reptotalcounts.keys():
    classtotalcounts[repeatclass[key]] += reptotalcounts[key]
# building total counts for family enrichment...
for key in reptotalcounts.keys():
    familytotalcounts[repeatfamily[key]] += reptotalcounts[key]
# building unique counts table
repcounts2 = {}
for rep in repeat_list:
    if repcounts.has_key("/" + rep):
        repcounts2[rep] = repcounts["/" + rep]
    else:
        repcounts2[rep] = 0
# building the fractional counts for class enrichment...
for key in fractionalcounts.keys():
    classfractionalcounts[repeatclass[key]] += fractionalcounts[key]
# building fractional counts for family enrichment...
for key in fractionalcounts.keys():
    familyfractionalcounts[repeatfamily[key]] += fractionalcounts[key]

################################################################################
print 'Writing final output and removing intermediate files...'
# print output to file of the categorized counts and total overlapping counts:
if allcountmethod == "TRUE":
    fout1 = open(outputfolder + os.path.sep + outputfile_prefix + '_total_counts.txt', 'w')
    for key in reptotalcounts.keys():
        print >> fout1, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(reptotalcounts[key])
    fout2 = open(outputfolder + os.path.sep + outputfile_prefix + '_class_total_counts.txt', 'w')
    for key in classtotalcounts.keys():
        print >> fout2, str(key) + '\t' + str(classtotalcounts[key])
    fout3 = open(outputfolder + os.path.sep + outputfile_prefix + '_family_total_counts.txt', 'w')
    for key in familytotalcounts.keys():
        print >> fout3, str(key) + '\t' + str(familytotalcounts[key])
    fout4 = open(outputfolder + os.path.sep + outputfile_prefix + '_unique_counts.txt', 'w')
    for key in repcounts2.keys():
        print >> fout4, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(repcounts2[key])
    fout5 = open(outputfolder + os.path.sep + outputfile_prefix + '_class_fraction_counts.txt', 'w')
    for key in classfractionalcounts.keys():
        print >> fout5, str(key) + '\t' + str(classfractionalcounts[key])
    fout6 = open(outputfolder + os.path.sep + outputfile_prefix + '_family_fraction_counts.txt', 'w')
    for key in familyfractionalcounts.keys():
        print >> fout6, str(key) + '\t' + str(familyfractionalcounts[key])
    fout7 = open(outputfolder + os.path.sep + outputfile_prefix + '_fraction_counts.txt', 'w')
    for key in fractionalcounts.keys():
        print >> fout7, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(int(fractionalcounts[key]))
    fout1.close()
    fout2.close()
    fout3.close()
    fout4.close()
    fout5.close()
    fout6.close()
    fout7.close()
else:
    fout1 = open(outputfolder + os.path.sep + outputfile_prefix + '_class_fraction_counts.txt', 'w')
    for key in classfractionalcounts.keys():
        print >> fout1, str(key) + '\t' + str(classfractionalcounts[key])
    fout2 = open(outputfolder + os.path.sep + outputfile_prefix + '_family_fraction_counts.txt', 'w')
    for key in familyfractionalcounts.keys():
        print >> fout2, str(key) + '\t' + str(familyfractionalcounts[key])
    fout3 = open(outputfolder + os.path.sep + outputfile_prefix + '_fraction_counts.txt', 'w')
    for key in fractionalcounts.keys():
        print >> fout3, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(int(fractionalcounts[key]))
    fout1.close()
    fout2.close()
    fout3.close()

################################################################################
# Remove large intermediate files
if os.path.exists(outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt'):
    os.remove(outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt')
if os.path.exists(outputfolder + os.path.sep + 'pair1_bowtie'):
    shutil.rmtree(outputfolder + os.path.sep + 'pair1_bowtie')
if os.path.exists(outputfolder + os.path.sep + 'pair2_bowtie'):
    shutil.rmtree(outputfolder + os.path.sep + 'pair2_bowtie')
if os.path.exists(outputfolder + os.path.sep + 'sorted_bowtie'):
    shutil.rmtree(outputfolder + os.path.sep + 'sorted_bowtie')
Done" 383 | -------------------------------------------------------------------------------- /RepEnrich_setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import argparse 3 | import csv 4 | import os 5 | import shlex 6 | import subprocess 7 | import sys 8 | from Bio import SeqIO 9 | from Bio.Seq import Seq 10 | from Bio.SeqRecord import SeqRecord 11 | from Bio.Alphabet import IUPAC 12 | 13 | parser = argparse.ArgumentParser(description='Part I: Prepartion of repetive element psuedogenomes and repetive element bamfiles. This script prepares the annotation used by downstream applications to analyze for repetitive element enrichment. For this script to run properly bowtie must be loaded. The repeat element psuedogenomes are prepared in order to analyze reads that map to multiple locations of the genome. The repeat element bamfiles are prepared in order to use a region sorter to analyze reads that map to a single location of the genome.You will 1) annotation_file: The repetitive element annotation file downloaded from RepeatMasker.org database for your organism of interest. 2) genomefasta: Your genome of interest in fasta format, 3)setup_folder: a folder to contain repeat element setup files command-line usage EXAMPLE: python master_setup.py /users/nneretti/data/annotation/mm9/mm9_repeatmasker.txt /users/nneretti/data/annotation/mm9/mm9.fa /users/nneretti/data/annotation/mm9/setup_folder', prog='getargs_genome_maker.py') 14 | parser.add_argument('--version', action='version', version='%(prog)s 0.1') 15 | parser.add_argument('annotation_file', action= 'store', metavar='annotation_file', help='List annotation file. The annotation file contains the repeat masker annotation for the genome of interest and may be downloaded at RepeatMasker.org Example /data/annotation/mm9/mm9.fa.out') 16 | parser.add_argument('genomefasta', action= 'store', metavar='genomefasta', help='File name and path for genome of interest in fasta format. Example /data/annotation/mm9/mm9.fa') 17 | parser.add_argument('setup_folder', action= 'store', metavar='setup_folder', help='List folder to contain bamfiles for repeats and repeat element psuedogenomes. Example /data/annotation/mm9/setup') 18 | parser.add_argument('--nfragmentsfile1', action= 'store', dest='nfragmentsfile1', metavar='nfragmentsfile1', default='./repnames_nfragments.txt', help='Output location of a description file that saves the number of fragments processed per repname. Default ./repnames_nfragments.txt') 19 | parser.add_argument('--gaplength', action= 'store', dest='gaplength', metavar='gaplength', default= '200', type=int, help='Length of the spacer used to build repeat psuedogeneomes. Default 200') 20 | parser.add_argument('--flankinglength', action= 'store', dest='flankinglength', metavar='flankinglength', default= '25', type=int, help='Length of the flanking region adjacent to the repeat element that is used to build repeat psuedogeneomes. The flanking length should be set according to the length of your reads. Default 25') 21 | parser.add_argument('--is_bed', action= 'store', dest='is_bed', metavar='is_bed', default= 'FALSE', help='Is the annotation file a bed file. This is also a compatible format. The file needs to be a tab seperated bed with optional fields. Ex. format chr\tstart\tend\tName_element\tclass\tfamily. The class and family should identical to name_element if not applicable. 
args = parser.parse_args()

# parameters and paths specified in args_parse
gapl = args.gaplength
flankingl = args.flankinglength
annotation_file = args.annotation_file
genomefasta = args.genomefasta
setup_folder = args.setup_folder
nfragmentsfile1 = args.nfragmentsfile1
is_bed = args.is_bed

################################################################################
# check that the programs we need are available
try:
    subprocess.call(shlex.split("bowtie --version"), stdout=open(os.devnull, 'wb'), stderr=open(os.devnull, 'wb'))
except OSError:
    print "Error: Bowtie or BEDTools not loaded"
    raise

################################################################################
# Define a text importer
csv.field_size_limit(sys.maxsize)
def import_text(filename, separator):
    for line in csv.reader(open(os.path.realpath(filename)), delimiter=separator,
                           skipinitialspace=True):
        if line:
            yield line
# Make a setup folder
if not os.path.exists(setup_folder):
    os.makedirs(setup_folder)

################################################################################
# load genome into dictionary
print "loading genome..."
g = SeqIO.to_dict(SeqIO.parse(genomefasta, "fasta"))

print "Precomputing length of all chromosomes..."
idxgenome = {}
lgenome = {}
genome = {}
allchrs = g.keys()
k = 0
for chr in allchrs:
    genome[chr] = str(g[chr].seq)
    del g[chr]
    lgenome[chr] = len(genome[chr])
    idxgenome[chr] = k
    k = k + 1
del g

################################################################################
# Build a bedfile of repeat coordinates to be used by the RepEnrich region_sorter
if is_bed == "FALSE":
    repeat_elements = []
    fout = open(os.path.realpath(setup_folder + os.path.sep + 'repnames.bed'), 'w')
    fin = import_text(annotation_file, ' ')
    rep_chr = {}
    rep_start = {}
    rep_end = {}
    x = 0
    for line in fin:
        if x > 2:
            line9 = line[9].replace("(", "_").replace(")", "_").replace("/", "_")
            repname = line9
            if not repname in repeat_elements:
                repeat_elements.append(repname)
            repchr = line[4]
            repstart = int(line[5])
            repend = int(line[6])
            print >> fout, str(repchr) + '\t' + str(repstart) + '\t' + str(repend) + '\t' + str(repname)
            if rep_chr.has_key(repname):
                rep_chr[repname].append(repchr)
                rep_start[repname].append(int(repstart))
                rep_end[repname].append(int(repend))
            else:
                rep_chr[repname] = [repchr]
                rep_start[repname] = [int(repstart)]
                rep_end[repname] = [int(repend)]
        x += 1
if is_bed == "TRUE":
    repeat_elements = []
    fout = open(os.path.realpath(setup_folder + os.path.sep + 'repnames.bed'), 'w')
    fin = open(os.path.realpath(annotation_file), 'r')
    rep_chr = {}
    rep_start = {}
    rep_end = {}
    for line in fin:
        line = line.strip('\n')
        line = line.split('\t')
        line3 = line[3].replace("(", "_").replace(")", "_").replace("/", "_")
        repname = line3
        if not repname in repeat_elements:
            repeat_elements.append(repname)
        repchr = line[0]
        repstart = int(line[1])
        repend = int(line[2])
        print >> fout, str(repchr) + '\t' + str(repstart) + '\t' + str(repend) + '\t' + str(repname)
        if rep_chr.has_key(repname):
            rep_chr[repname].append(repchr)
            rep_start[repname].append(int(repstart))
            rep_end[repname].append(int(repend))
        else:
            rep_chr[repname] = [repchr]
            rep_start[repname] = [int(repstart)]
            rep_end[repname] = [int(repend)]

fin.close()
fout.close()
repeat_elements = sorted(repeat_elements)
print "Writing a key for all repeats..."
# print to fout the key that maps each repeat type to its numeric id; sorted above:
fout = open(os.path.realpath(setup_folder + os.path.sep + 'repgenomes_key.txt'), 'w')
x = 0
for repeat in repeat_elements:
    print >> fout, str(repeat) + '\t' + str(x)
    x += 1
fout.close()
################################################################################
# generate spacer for pseudogenomes
spacer = "N" * gapl

# save file with number of fragments processed per repname
print "Saving number of fragments processed per repname to " + nfragmentsfile1
fout1 = open(os.path.realpath(nfragmentsfile1), "w")
for repname in rep_chr.keys():
    print >> fout1, str(len(rep_chr[repname])) + "\t" + repname
fout1.close()

# generate metagenomes and save them to FASTA files
k = 1
nrepgenomes = len(rep_chr.keys())
for repname in rep_chr.keys():
    metagenome = ""
    newname = repname.replace("(", "_").replace(")", "_").replace("/", "_")
    print "processing repgenome " + newname + ".fa" + " (" + str(k) + " of " + str(nrepgenomes) + ")"
    rep_chr_current = rep_chr[repname]
    rep_start_current = rep_start[repname]
    rep_end_current = rep_end[repname]
    print "-------> " + str(len(rep_chr[repname])) + " fragments"
    for i in range(len(rep_chr[repname])):
        try:
            chr = rep_chr_current[i]
            rstart = max(rep_start_current[i] - flankingl, 0)
            rend = min(rep_end_current[i] + flankingl, lgenome[chr] - 1)
            metagenome = metagenome + spacer + genome[chr][rstart:(rend + 1)]
        except KeyError:
            print "Unrecognised Chromosome: " + chr
            pass

    # Convert metagenome to a SeqRecord object (required by SeqIO.write).
    # The record id carries the repeat's name; the original passed the literal
    # string "repname" here, so every FASTA record had the same id.
    record = SeqRecord(Seq(metagenome, IUPAC.unambiguous_dna), id=repname, name="", description="")
    print "saving repgenome " + newname + ".fa" + " (" + str(k) + " of " + str(nrepgenomes) + ")"
    fastafilename = os.path.realpath(setup_folder + os.path.sep + newname + ".fa")
    SeqIO.write(record, fastafilename, "fasta")
    print "indexing repgenome " + newname + ".fa" + " (" + str(k) + " of " + str(nrepgenomes) + ")"
    command = shlex.split('bowtie-build -f ' + fastafilename + ' ' + setup_folder + os.path.sep + newname)
    p = subprocess.Popen(command).communicate()
    k += 1

print "... Done"

--------------------------------------------------------------------------------