├── README.md
├── RepEnrich.py
└── RepEnrich_setup.py


/README.md:
--------------------------------------------------------------------------------
  1 | ##There is currently a newer version of RepEnrich located here: https://github.com/nerettilab/RepEnrich2
  2 | 
  3 | #RepEnrich
  4 | ## Tutorial By Steven Criscione
  5 | Email: [steven_criscione@brown.edu](mailto:steven_criscione@brown.edu)
  6 | 
  7 | ### Dependencies
  8 | This example is for mouse genome **mm9**. Before
  9 | getting started you should make sure you have installed the dependencies
 10 | for RepEnrich. RepEnrich requires python version 2.7.3.
 11 | RepEnrich requires: [Bowtie 1](http://bowtie-bio.sourceforge.net/index.shtml),
 12 | [bedtools](http://bedtools.readthedocs.org/en/latest/), 
 13 | and [samtools](http://www.htslib.org/). I am using bedtools version 2.20.1, bowtie 1 version 0.12.9, samtools version 0.1.19.
 14 | RepEnrich also requires a bowtie1 indexed genome in fasta format
 15 | available. (Example `mm9.fa`) 
 16 | The RepEnrich python scripts also use [BioPython](http://biopython.org) which
 17 | can be installed with the following command:
 18 | 
 19 |     pip install BioPython
 20 | 
 21 | IMPORTANT: bedtools version 2.24.0 and greater yield an error due to altered functionality of [coverageBed](http://bedtools.readthedocs.org/en/latest/content/tools/coverage.html).
 22 | 
 23 | ### Step 1) Attain repetitive element annotation
 24 | I have temporarily provided the setup for the human genome (build hg19 and hg38) and the mouse genome (build mm9) available [here](https://www.dropbox.com/sh/xpd68jhw7bwd9ie/AAA_IJzhTn1GhawoGnXgtH6Ca?dl=0). After downloading you can extract the files using:
 25 | 
 26 |     gunzip hg19_repeatmasker_clean.txt.gz
 27 |     tar -zxvf RepEnrich_setup_hg19.tar.gz
 28 |   
 29 | To yield hg19_repeatmasker_clean.txt annotation file and RepEnrich_setup_hg19 setup folder.  The annotation files I am using are repeatmasker files with simple and low-complexity repeats removed (satellite repeats and transposons are still present).  If you choose to use these files for the set-up you can skip ahead to step 2.
 30 | 
 31 | The RepEnrich setup script will build the annotation
 32 | required by RepEnrich. The default is a repeatmasker file which can be
 33 | downloaded from [repeatmasker.org](http://www.repeatmasker.org/genomicDatasets/RMGenomicDatasets.html),
 34 | (for instance, find the `mm9.fa.out.gz` download
 35 | [here](http://www.repeatmasker.org/genomes/mm9/RepeatMasker-rm328-db20090604/mm9.fa.out.gz).
 36 | Once you have downloaded the file you can unzip it and rename it:
 37 | 
 38 |     gunzip mm9.fa.out.gz
 39 |     mv mm9.fa.out mm9_repeatmasker.txt
 40 | 
 41 | This is what the file looks like:
 42 | 
 43 | 	SW perc perc perc query position in query matching repeat position in repeat score div. del. ins. sequence begin end (left) repeat class/family begin end  (left) ID
 44 | 	687 17.4 0.0 0.0 chr1 3000002 3000156 (194195276) C L1_Mur2 LINE/L1 (4310) 1567 1413 1
 45 | 	917 21.4 11.4 4.5 chr1 3000238 3000733 (194194699) C L1_Mur2 LINE/L1 (4488) 1389 913 1
 46 | 	845 23.3 7.6 11.4 chr1 3000767 3000792 (194194640) C L1_Mur2 LINE/L1 (6816) 912 887 1
 47 | 	621 25.0 6.5 3.7 chr1 3001288 3001583 (194193849) C Lx9 LINE/L1 (1596) 6048 5742 3
 48 | 
 49 | The RepEnrich setup script will also allow
 50 | you to build the annotation required by RepEnrich for a custom set of
 51 | elements using a bed file. So if you want to examine mm9 LTR repetitive
 52 | elements; you can build this file using the the repeatmasker track from
 53 | [UCSC genome table browser](http://genome.ucsc.edu/cgi-bin/hgTables).
 54 | 
 55 | To do this, select genome `mm9`, click the edit box next to _Filter_, fill
 56 | in the repclass does match with `LTR`, then click submit. Back at the table
 57 | browser select option `Selected fields from primary and related tables`, 
 58 | name the output file something like `mm9_LTR_repeatmasker.bed`, and click
 59 | `Get output`. On the next page select `genoName`, `genoStart`, `genoEnd`,
 60 | `repName`, `repClass`, `repFamily` then download the file.
 61 | 
 62 | The UCSC puts a header on the file that needs to be removed: 
 63 | 
 64 |     tail -n +3 mm9_LTR_repeatmasker.bed | head -n -4 > mm9_LTR_repeatmasker_fix.bed
 65 |     mv mm9_LTR_repeatmasker_fix.bed mm9_LTR_repeatmasker.bed
 66 | 
 67 | This is what our custom mm9 LTR retrotransposon bed file looks like:
 68 | 
 69 | 	$ head mm9_LTR_repeatmasker.bed
 70 | 
 71 | 	chr1 3001722 3002005 RLTR25A LTR ERVK
 72 | 	chr1 3002051 3002615 RLTR25A LTR ERVK
 73 | 	chr1 3016886 3017193 RLTRETN_Mm LTR ERVK
 74 | 	chr1 3018338 3018653 RLTR14 LTR ERV1
 75 | 
 76 | Note: It is important to get the column format right: 
 77 | 
 78 | * Column 1: Chromosome
 79 | * Column 2: Start
 80 | * Column 3: End
 81 | * Column 4: Repeat_name
 82 | * Column 5: Class
 83 | * Column 6: Family
 84 | 
 85 | The file should be tab delimited. If there is no information on class
 86 | or family, you can replace these columns with the repeat name or an
 87 | arbitrary label such as `group1`.
 88 | 
 89 | ### Step 2) Run the setup for RepEnrich
 90 | 
 91 | Now that we have our annotation files we can move
 92 | on to running the setup for RepEnrich. First load the dependencies
 93 | (if you use Environment Modules - otherwise just make sure that these
 94 | programs are available in your `PATH`).
 95 | 
 96 |     module load bowtie
 97 |     module load bedtools
 98 |     module load samtools
 99 | 
100 | Next run the setup using the type of annotation you have selected (default):
101 | 
102 |     python RepEnrich_setup.py /data/mm9_repeatmasker.txt /data/mm9.fa /data/setup_folder_mm9
103 | 
104 | custom bed file:
105 | 
106 |     python RepEnrich_setup.py /data/mm9_LTR_repeatmasker.bed /data/mm9.fa /data/setup_folder_mm9 --is_bed TRUE
107 | 
108 | The previous commands have setup RepEnrich annotation that is used in
109 | downstream analysis of data. You only have to do the setup step once for
110 | an organism of interest. One cautionary note is that RepEnrich is only
111 | as reliable as the genome annotation of repetitive elements for your
112 | organism of interest. Therefore, RepEnrich performance may not be
113 | optimal for poorly annotated genomes. 
114 | 
115 | ### Step 3) Map the data to the genome using bowtie1
116 | 
117 | After the setup of the RepEnrich we now have to
118 | map our data uniquely to the genome before running RepEnrich. This is
119 | because RepEnrich treats unique mapping and multi-mapping reads
120 | separately. This requires use of specific bowtie options. The bowtie
121 | command below is recommended for RepEnrich:
122 | 
123 |     bowtie /data/mm9 -p 16 -t -m 1 -S --max /data/sampleA_multimap.fastq sample_A.fastq /data/sampleA_unique.sam
124 | 
125 | An explanation of bowtie options:
126 | 
127 | * `bowtie <bowtie_index>`
128 | * `-p 16` - 16 cpus
129 | * `-t` - print time
130 | * `-m 1` - only allow unique mapping
131 | * `-S` - output SAM
132 | * `--max multimapping.fastq` - output multimapping reads to `multimapping.fastq`
133 | * `unique_mapping.sam` - uniquely mapping reads
134 | 
135 | For paired-end reads the bowtie command is:
136 | 
137 |     bowtie /data/mm9 -p 16 -t -m 1 -S --max /data/sampleA_multimap.fastq -1 sample_A_1.fastq -2 sample_A_2.fastq /data/sampleA_unique.sam
138 | 
139 | The Sam file should be converted to a bam file with samtools:
140 | 
141 |     samtools view -bS sampleA_unique.sam > sampleA_unique.bam
142 |     samtools sort sampleA_unique.bam sampleA_unique_sorted
143 |     mv sampleA_unique_sorted.bam sampleA_unique.bam
144 |     samtools index sampleA_unique.bam
145 |     rm sampleA_unique.sam
146 | 
147 | You should now compute the total mapping reads for your alignment. This
148 | includes the reads that mapped uniquely (`sampleA_unique.bam`)
149 | and more than once (`sample_A_multimap.fastq`). The `.out` file
150 | from your bowtie batch script contains this information (or `stdout` 
151 | from an interactive job). 
152 | 
153 | It should looks like this:
154 | 
155 | 	Seeded quality full-index
156 | 	search: 00:32:26
157 | 		# reads processed: 92084909
158 | 		# reads with at least one reported alignment: 48299773 (52.45%)
159 | 		# reads that failed to align: 17061693 (18.53%)
160 | 		# reads with alignments suppressed due to -m: 26723443 (29.02%)
161 | 	Reported 48299773 alignments to 1 output stream(s)
162 | 
163 | The total mapping reads is the `# of reads processed` - 
164 | `# reads that failed to align`. Here our total mapping reads are:
165 | `92084909 - 17061693 = 75023216` 
166 | 
167 | 
168 | ### Step 4) Run RepEnrich on the data
169 | 
170 | Now we have all the information we need to run RepEnrich.
171 | Here is an example (for default annotation):
172 | 
173 |     python RepEnrich.py /data/mm9_repeatmasker.txt /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap.fastq sampleA_unique.bam --cpus 16
174 | 
175 | for custom bed file annotation:
176 | 
177 |     python RepEnrich.py /data/mm9_LTR_repeatmasker.bed /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap.fastq sampleA_unique.bam --is_bed TRUE --cpus 16
178 | 
179 | An explanation of the RepEnrich command:
180 | 
181 | 	python RepEnrich.py 
182 | 		<repeat_annotation>
183 | 		<output_folder>
184 | 		<output_prefix>
185 | 		<RepEnrich_setup_folder>
186 | 		<multimapping_reads.fastq>
187 | 		<unique_mapping_reads.bam>
188 | 		(--is_bed TRUE)
189 | 		(--cpus 16)
190 | 
191 | If you have paired-end data the command is very similar. There will be two `sampleA_multimap.fastq` and `sampleA_multimap_1.fastq` and
192 | `sampleA_multimap_2.fastq` from the bowtie step.
193 | 
194 | The command for running
195 | RepEnrich in this case is (for default annotation):
196 | 
197 |     python RepEnrich.py /data/mm9_repeatmasker.txt /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap_1.fastq --fastqfile2 sampleA_multimap_2.fastq sampleA_unique.bam --cpus 16 --pairedend TRUE
198 | 
199 | for custom bed file annotation:
200 | 
201 |     python RepEnrich.py /data/mm9_LTR_repeatmasker.bed /data/sample_A sample_A /data/hg19_setup_folder sampleA_multimap_1.fastq --fastqfile2 sampleA_multimap_2.fastq sampleA_unique.bam --is_bed TRUE --cpus 16 --pairedend TRUE
202 | 
203 | 
204 | ### Step 5) Processing the output of RepEnrich
205 | 
206 | The final outputs will be
207 | in the path `/data/sample_A`. This will include a few files. The most
208 | important of which is the `sampleA_fraction_counts.txt` file. This is
209 | the estimated counts for the repeats. I use this file to build a table
210 | of counts for all my conditions (by pasting the individual
211 | `*_fraction_counts.txt` files together for my complete experiment).
212 | 
213 | You can use the compiled counts file to do differential expression analysis
214 | similar to what is done for genes. We use
215 | [EdgeR](http://www.bioconductor.org/packages/release/bioc/html/edgeR.html)
216 | or [DESEQ](http://bioconductor.org/packages/release/bioc/html/DESeq.html)
217 | to do the differential expression analysis. These are R packages that you can
218 | download from [bioconductor](http://bioconductor.org/).
219 | 
220 | When running the EdgeR differential expression analysis
221 | you can follow the examples in the EdgeR manual. I manually input the
222 | library sizes (the total mapping reads we obtained in the tutorial).
223 | Some of the downstream analysis, though, is left to your discretion.
224 | There are multiple ways you can do the differential expression analysis.
225 | I use the `GLM` method within the EdgeR packgage, although DESeq has
226 | similar methods and EdgeR also has a more straightforward approach
227 | called `exactTest`. Below is a sample EdgeR script used to do the
228 | differential analysis of repeats for young, old, and very old mice. The
229 | file `counts.csv` contains the ouput from RepEnrich that was made by
230 | pasting the individual `*_fraction_counts.txt` files together for my
231 | complete experiment. 
232 | 
233 | ## Example Script for EdgeR differential enrichment analysis
234 | 
235 | ```r
236 | # EdgeR example
237 | 
238 | # Setup - Install and load edgeR
239 | source("http://bioconductor.org/biocLite.R")
240 | biocLite("edgeR")
241 | library('edgeR')
242 | 
243 | # In the case of a pre-assembled file of the fraction count output do the following:
244 | # counts <- read.csv(file = "counts.csv")
245 | 
246 | # In the case of seperate outputs, load the RepEnrich results - fraction counts
247 | young_r1 <- read.delim('young_r1_fraction_counts.txt', header=FALSE)
248 | young_r2 <- read.delim('young_r2_fraction_counts.txt', header=FALSE)
249 | young_r3 <- read.delim('young_r3_fraction_counts.txt', header=FALSE)
250 | old_r1 <- read.delim('old_r1_fraction_counts.txt', header=FALSE)
251 | old_r2 <- read.delim('old_r2_fraction_counts.txt', header=FALSE)
252 | old_r3 <- read.delim('old_r3_fraction_counts.txt', header=FALSE)
253 | v_old_r1 <- read.delim('veryold_r1_fraction_counts.txt', header=FALSE)
254 | v_old_r2 <- read.delim('veryold_r2_fraction_counts.txt', header=FALSE)
255 | v_old_r3 <- read.delim('veryold_r3_fraction_counts.txt', header=FALSE)
256 | 
257 | #' Build a counts table
258 | counts <- data.frame(
259 |   row.names = young_r1[,1],
260 |   young_r1 = young_r1[,4], young_r2 = young_r2[,4], young_r3 = young_r3[,4],
261 |   old_r1 = old_r1[,4], old_r2 = old_r2[,4], old_r3 = old_r3[,4],
262 |   v_old_r1 = v_old_r1[,4], v_old_r2 = v_old_r2[,4], v_old_r3 = v_old_r3[,4]
263 | )
264 | 
265 | # Build a meta data object. I am comparing young, old, and veryold mice.
266 | # I manually input the total mapping reads for each sample.
267 | # The total mapping reads are calculated using the bowtie logs:
268 | # # of reads processed - # reads that failed to align
269 | meta <- data.frame(
270 | 	row.names=colnames(counts),
271 | 	condition=c("young","young","young","old","old","old","veryold","veryold","veryold"),
272 | 	libsize=c(24923593,28340805,21743712,16385707,26573335,28131649,34751164,37371774,28236419)
273 | )
274 | 
275 | # Define the library size and conditions for the GLM
276 | libsize <- meta$libsize
277 | condition <- factor(meta$condition)
278 | design <- model.matrix(~0+condition)
279 | colnames(design) <- levels(meta$condition)
280 | 
281 | # Build a DGE object for the GLM
282 | y <- DGEList(counts=counts, lib.size=libsize)
283 | 
284 | # Normalize the data
285 | y <- calcNormFactors(y)
286 | y$samples
287 | plotMDS(y)
288 | 
289 | # Estimate the variance
290 | y <- estimateGLMCommonDisp(y, design)
291 | y <- estimateGLMTrendedDisp(y, design)
292 | y <- estimateGLMTagwiseDisp(y, design)
293 | plotBCV(y)
294 | 
295 | # Build an object to contain the normalized read abundance
296 | logcpm <- cpm(y, log=TRUE, lib.size=libsize)
297 | logcpm <- as.data.frame(logcpm)
298 | colnames(logcpm) <- factor(meta$condition)
299 | 
300 | # Conduct fitting of the GLM
301 | yfit <- glmFit(y, design)
302 | 
303 | # Initialize result matrices to contain the results of the GLM
304 | results <- matrix(nrow=dim(counts)[1],ncol=0)
305 | logfc <- matrix(nrow=dim(counts)[1],ncol=0)
306 | 
307 | # Make the comparisons for the GLM
308 | my.contrasts <- makeContrasts(
309 | 	veryold_old = veryold – old,
310 | 	veryold_young = veryold – young,
311 | 	old_young = old – young,
312 | 	levels = design
313 | )
314 | 
315 | # Define the contrasts used in the comparisons
316 | allcontrasts = c(
317 | 	"veryold_old",
318 | 	"veryold_young",
319 | 	"old_young"
320 | )
321 | 
322 | # Conduct a for loop that will do the fitting of the GLM for each comparison
323 | # Put the results into the results objects
324 | for(current_contrast in allcontrasts) {
325 | 	lrt <- glmLRT(yfit, contrast=my.contrasts[,current_contrast])
326 | 	plotSmear(lrt, de.tags=rownames(y))
327 | 	title(current_contrast)
328 | 	res <- topTags(lrt,n=dim(c)[1],sort.by="none")$table
329 | 	colnames(res) <- paste(colnames(res),current_contrast,sep=".")
330 | 	results <- cbind(results,res[,c(1,5)])
331 | 	logfc <- cbind(logfc,res[c(1)])
332 | }
333 | 
334 | # Add the repeat types back into the results.
335 | # We should still have the same order as the input data
336 | results$class <- young_r1[,2]
337 | results$type <- young_r1[,3]
338 | 
339 | # Sort the results table by the logFC
340 | results <- results[with(results, order(-abs(logFC.old_young))), ]
341 | 
342 | # Save the results
343 | write.table(results, 'results.txt', quote=FALSE, sep="\t")
344 | 
345 | # Plot Fold Changes for repeat classes and types
346 | for(current_contrast in allcontrasts) {
347 |   logFC <- results[, paste0("logFC.", current_contrast)]
348 |   # Plot the repeat classes
349 |   classes <- with(results, reorder(class, -logFC, median))
350 |   par(mar=c(6,10,4,1))
351 |   boxplot(logFC ~ classes, data=results, outline=FALSE, horizontal=TRUE,
352 |           las=2, xlab="log(Fold Change)", main=current_contrast)
353 |   abline(v=0)
354 |   # Plot the repeat types
355 |   types <- with(results, reorder(type, -logFC, median))
356 |   boxplot(logFC ~ types, data=results, outline=FALSE, horizontal=TRUE,
357 |           las=2, xlab="log(Fold Change)", main=current_contrast)
358 |   abline(v=0)
359 | }
360 | 
361 | ```
362 | 
363 | 
364 | Note that the objects `logfc` contains the differential expression for the
365 | contrast, `logcpm` contains the normalized read abundance, and `result`
366 | contains both the differential expression and the false discovery rate for
367 | the experimental comparison. I recommended reading more about these in the
368 | [EdgeR manual](http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).
369 | 
370 | 


--------------------------------------------------------------------------------
/RepEnrich.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | import argparse
  3 | import csv
  4 | import numpy
  5 | import os
  6 | import shlex
  7 | import shutil
  8 | import subprocess
  9 | import sys
 10 | 
 11 | parser = argparse.ArgumentParser(description='Part II: Conducting the alignments to the psuedogenomes.  Before doing this step you will require 1) a bamfile of the unique alignments with index 2) a fastq file of the reads mapping to more than one location.  These files can be obtained using the following bowtie options [EXAMPLE: bowtie -S -m 1 --max multimap.fastq mm9 mate1_reads.fastq]  Once you have the unique alignment bamfile and the reads mapping to more than one location in a fastq file you can run this step.  EXAMPLE: python master_output.py /users/nneretti/data/annotation/hg19/hg19_repeatmasker.txt /users/nneretti/datasets/repeatmapping/POL3/Pol3_human/HeLa_InputChIPseq_Rep1 HeLa_InputChIPseq_Rep1 /users/nneretti/data/annotation/hg19/setup_folder HeLa_InputChIPseq_Rep1_multimap.fastq HeLa_InputChIPseq_Rep1.bam')
 12 | parser.add_argument('--version', action='version', version='%(prog)s 0.1')
 13 | parser.add_argument('annotation_file', action= 'store', metavar='annotation_file', help='List RepeatMasker.org annotation file for your organism.  The file may be downloaded from the RepeatMasker.org website.  Example: /data/annotation/hg19/hg19_repeatmasker.txt')
 14 | parser.add_argument('outputfolder', action= 'store', metavar='outputfolder', help='List folder to contain results. Example: /outputfolder')
 15 | parser.add_argument('outputprefix', action= 'store', metavar='outputprefix', help='Enter prefix name for data. Example: HeLa_InputChIPseq_Rep1')
 16 | parser.add_argument('setup_folder', action= 'store', metavar='setup_folder', help='List folder that contains the repeat element psuedogenomes. Example /data/annotation/hg19/setup_folder')
 17 | parser.add_argument('fastqfile', action= 'store', metavar='fastqfile', help='Enter file for the fastq reads that map to multiple locations. Example /data/multimap.fastq')
 18 | parser.add_argument('alignment_bam', action= 'store', metavar='alignment_bam', help='Enter bamfile output for reads that map uniquely. Example /bamfiles/old.bam')
 19 | parser.add_argument('--pairedend', action= 'store', dest='pairedend', default= 'FALSE', help='Designate this option for paired-end sequencing. Default FALSE change to TRUE')
 20 | parser.add_argument('--collapserepeat', action= 'store', dest='collapserepeat', metavar='collapserepeat', default= 'Simple_repeat', help='Designate this option to generate a collapsed repeat type.  Uncollapsed output is generated in addition to collapsed repeat type.  Simple_repeat is default to simplify downstream analysis.  You can change the default to another repeat name to collapse a seperate specific repeat instead or if the name of Simple_repeat is different for your organism.  Default Simple_repeat')
 21 | parser.add_argument('--fastqfile2', action= 'store', dest='fastqfile2', metavar='fastqfile2', default= 'none', help='Enter fastqfile2 when using paired-end option.  Default none')
 22 | parser.add_argument('--cpus', action= 'store', dest='cpus', metavar='cpus', default= "1", type=int, help='Enter available cpus per node.  The more cpus the faster RepEnrich performs. RepEnrich is designed to only work on one node. Default: "1"')
 23 | parser.add_argument('--allcountmethod', action= 'store', dest='allcountmethod', metavar='allcountmethod', default= "FALSE", help='By default the pipeline only outputs the fraction count method.  Consdidered to be the best way to count multimapped reads.  Changing this option will include the unique count method, a conservative count, and the total count method, a liberal counting strategy. Our evaluation of simulated data indicated fraction counting is best. Default = FALSE, change to TRUE')
 24 | parser.add_argument('--is_bed', action= 'store', dest='is_bed', metavar='is_bed', default= 'FALSE', help='Is the annotation file a bed file.  This is also a compatible format.  The file needs to be a tab seperated bed with optional fields.  Ex. format chr\tstart\tend\tName_element\tclass\tfamily.  The class and family should identical to name_element if not applicable.  Default FALSE change to TRUE')
 25 | args = parser.parse_args()
 26 | 
 27 | # parameters
 28 | annotation_file = args.annotation_file
 29 | outputfolder = args.outputfolder
 30 | outputfile_prefix = args.outputprefix
 31 | setup_folder = args.setup_folder
 32 | repeat_bed = setup_folder + os.path.sep + 'repnames.bed' 
 33 | unique_mapper_bam = args.alignment_bam
 34 | fastqfile_1 = args.fastqfile
 35 | fastqfile_2 = args.fastqfile2
 36 | cpus = args.cpus
 37 | b_opt = "-k1 -p " +str(1) +" --quiet"
 38 | simple_repeat = args.collapserepeat
 39 | paired_end = args.pairedend
 40 | allcountmethod = args.allcountmethod
 41 | is_bed = args.is_bed
 42 | 
 43 | ################################################################################
 44 | # check that the programs we need are available
 45 | try:
 46 |     subprocess.call(shlex.split("coverageBed -h"), stdout=open(os.devnull, 'wb'), stderr=open(os.devnull, 'wb'))
 47 |     subprocess.call(shlex.split("bowtie --version"), stdout=open(os.devnull, 'wb'), stderr=open(os.devnull, 'wb'))
 48 | except OSError:
 49 |     print "Error: Bowtie or BEDTools not loaded"
 50 |     raise
 51 | 
 52 | ################################################################################
 53 | # define a csv reader that reads space deliminated files
 54 | print 'Preparing for analysis using RepEnrich...'
 55 | csv.field_size_limit(sys.maxsize)
 56 | def import_text(filename, separator):
 57 |     for line in csv.reader(open(filename), delimiter=separator, 
 58 |                            skipinitialspace=True):
 59 |         if line:
 60 |             yield line
 61 |             
 62 | ################################################################################
 63 | # build dictionaries to convert repclass and rep families'
 64 | if is_bed == "FALSE":
 65 | 	repeatclass = {}
 66 | 	repeatfamily = {}
 67 | 	fin = import_text(annotation_file, ' ')
 68 | 	x = 0
 69 | 	for line in fin:
 70 | 	    if x>2:
 71 | 	        classfamily =[]
 72 | 	        classfamily = line[10].split(os.path.sep)
 73 | 	        line9 = line[9].replace("(","_").replace(")","_").replace("/","_")
 74 | 	        repeatclass[line9] = classfamily[0]
 75 | 	        if len(classfamily) == 2:
 76 | 	            repeatfamily[line9] = classfamily[1]
 77 | 	        else:
 78 | 	            repeatfamily[line9] = classfamily[0]
 79 | 	    x +=1
 80 | if is_bed == "TRUE":
 81 | 	repeatclass = {}
 82 | 	repeatfamily = {}
 83 | 	fin = open(annotation_file, 'r')
 84 | 	for line in fin:
 85 | 		line=line.strip('\n')
 86 | 		line=line.split('\t')
 87 | 	        theclass =line[4]
 88 | 	        thefamily = line[5]
 89 | 	        line3 = line[3].replace("(","_").replace(")","_").replace("/","_")
 90 | 	        repeatclass[line3] = theclass 
 91 |                 repeatfamily[line3] = thefamily
 92 | fin.close()
 93 | 
 94 | ################################################################################
 95 | # build list of repeats initializing dictionaries for downstream analysis'
 96 | fin = import_text(setup_folder + os.path.sep + 'repgenomes_key.txt', '\t')
 97 | repeat_key ={}
 98 | rev_repeat_key ={}
 99 | repeat_list = []
100 | reptotalcounts = {}
101 | classfractionalcounts = {}
102 | familyfractionalcounts = {}
103 | classtotalcounts = {}
104 | familytotalcounts = {}
105 | reptotalcounts_simple = {}
106 | fractionalcounts = {}
107 | i = 0
108 | for line in fin:
109 |     reptotalcounts[line[0]] = 0
110 |     fractionalcounts[line[0]] = 0
111 |     if repeatclass.has_key(line[0]):
112 | 	classtotalcounts[repeatclass[line[0]]] = 0
113 | 	classfractionalcounts[repeatclass[line[0]]] = 0
114 |     if repeatfamily.has_key(line[0]):
115 | 	familytotalcounts[repeatfamily[line[0]]] = 0
116 | 	familyfractionalcounts[repeatfamily[line[0]]] = 0
117 |     if repeatfamily.has_key(line[0]):
118 |         if repeatfamily[line[0]] == simple_repeat:
119 |             reptotalcounts_simple[simple_repeat] = 0
120 |     else:
121 |         reptotalcounts_simple[line[0]] = 0
122 |     repeat_list.append(line[0])
123 |     repeat_key[line[0]] = int(line[1])
124 |     rev_repeat_key[int(line[1])] = line[0]
125 | fin.close()
126 | ################################################################################
127 | # map the repeats to the psuedogenomes:
128 | if not os.path.exists(outputfolder):
129 | 	os.mkdir(outputfolder)
130 | ################################################################################
131 | # Conduct the regions sorting
132 | print 'Conducting region sorting on unique mapping reads....'
133 | fileout= outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt'
134 | with open(fileout, 'w') as stdout:
135 |    command = shlex.split("coverageBed -abam " +unique_mapper_bam+" -b " +setup_folder + os.path.sep + 'repnames.bed')
136 |    p = subprocess.Popen(command, stdout=stdout)
137 | p.communicate()
138 | stdout.close()
139 | filein = open(outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt','r')
140 | counts = {}
141 | sumofrepeatreads=0
142 | for line in filein:
143 | 	line= line.split('\t')
144 | 	if not counts.has_key(str(repeat_key[line[3]])):
145 | 		counts[str(repeat_key[line[3]])]=0
146 | 	counts[str(repeat_key[line[3]])]+=int(line[4])
147 | 	sumofrepeatreads+=int(line[4])
148 | print 'Identified ' + str(sumofrepeatreads) + 'unique reads that mapped to repeats.'
149 | ################################################################################
150 | if paired_end == 'TRUE':
151 | 	if not os.path.exists(outputfolder + os.path.sep + 'pair1_bowtie'):
152 | 		os.mkdir(outputfolder + os.path.sep + 'pair1_bowtie')
153 | 	if not os.path.exists(outputfolder + os.path.sep + 'pair2_bowtie'):
154 | 		os.mkdir(outputfolder + os.path.sep + 'pair2_bowtie')
155 | 	folder_pair1 = outputfolder + os.path.sep + 'pair1_bowtie'
156 | 	folder_pair2 = outputfolder + os.path.sep + 'pair2_bowtie'
157 | ################################################################################
158 | 	print "Processing repeat psuedogenomes..."
159 | 	ps = []
160 | 	psb= []
161 | 	ticker= 0
162 | 	for metagenome in repeat_list:
163 |    		metagenomepath = setup_folder + os.path.sep + metagenome
164 | 		file1=folder_pair1 + os.path.sep + metagenome + '.bowtie'
165 | 		file2 =folder_pair2 + os.path.sep + metagenome + '.bowtie'
166 | 		with open(file1, 'w') as stdout:
167 | 			command = shlex.split("bowtie " + b_opt + " " + metagenomepath + " " + fastqfile_1)
168 | 			p = subprocess.Popen(command,stdout=stdout)
169 | 		with open(file2, 'w') as stdout:
170 | 			command = shlex.split("bowtie " + b_opt + " " + metagenomepath + " " + fastqfile_2)
171 | 			pp = subprocess.Popen(command,stdout=stdout)
172 | 		ps.append(p)
173 | 		ticker +=1
174 | 		psb.append(pp)
175 | 		ticker +=1
176 | 		if ticker == cpus:
177 | 			for p in ps:
178 | 				p.communicate()
179 | 			for p in psb:
180 | 				p.communicate()
181 | 			ticker = 0
182 | 			psb =[]
183 | 			ps = []
184 | 	if len(ps) > 0:
185 | 		for p in ps:
186 | 			p.communicate()
187 | 	stdout.close()
188 |     
189 | ################################################################################
190 | # combine the output from both read pairs:
191 | 	print 'sorting and combining the output for both read pairs...'
192 | 	if not os.path.exists(outputfolder + os.path.sep + 'sorted_bowtie'):
193 | 		os.mkdir(outputfolder + os.path.sep + 'sorted_bowtie')
194 | 	sorted_bowtie = outputfolder + os.path.sep + 'sorted_bowtie'
195 | 	for metagenome in repeat_list:
196 | 		file1 = folder_pair1 + os.path.sep + metagenome + '.bowtie'
197 | 		file2 = folder_pair2 + os.path.sep + metagenome + '.bowtie'
198 | 		fileout= sorted_bowtie + os.path.sep + metagenome + '.bowtie'
199 | 		with open(fileout, 'w') as stdout:
200 | 			p1 = subprocess.Popen(['cat',file1,file2], stdout = subprocess.PIPE)
201 | 			p2 = subprocess.Popen(['cut', '-f1',"-d "], stdin = p1.stdout, stdout = subprocess.PIPE)
202 | 			p3 = subprocess.Popen(['cut', '-f1', "-d/"], stdin = p2.stdout, stdout = subprocess.PIPE)
203 | 			p4 = subprocess.Popen(['sort'], stdin=p3.stdout, stdout = subprocess.PIPE)
204 | 			p5 = subprocess.Popen(['uniq'], stdin=p4.stdout, stdout = stdout)
205 | 			p5.communicate()
206 | 		stdout.close()
207 | 	print 'completed ...'
208 | ################################################################################
209 | if paired_end == 'FALSE':
210 | 	if not os.path.exists(outputfolder + os.path.sep + 'pair1_bowtie'):
211 | 		os.mkdir(outputfolder + os.path.sep + 'pair1_bowtie')
212 | 	folder_pair1 = outputfolder + os.path.sep + 'pair1_bowtie'
213 | ################################################################################
214 | 	ps = []
215 | 	ticker= 0
216 | 	print "Processing repeat psuedogenomes..."
217 | 	for metagenome in repeat_list:
218 |     		metagenomepath = setup_folder + os.path.sep + metagenome
219 | 		file1=folder_pair1 + os.path.sep + metagenome + '.bowtie'
220 | 		with open(file1, 'w') as stdout:
221 | 			command = shlex.split("bowtie " + b_opt + " " + metagenomepath + " " + fastqfile_1)
222 | 			p = subprocess.Popen(command,stdout=stdout)
223 | 		ps.append(p)
224 | 		ticker +=1
225 | 		if ticker == cpus:
226 | 			for p in ps:
227 | 				p.communicate()
228 | 			ticker = 0
229 | 			ps = []
230 | 	if len(ps) > 0:
231 | 		for p in ps:
232 | 			p.communicate()
233 | 	stdout.close()
234 |     
235 | ################################################################################
236 | # combine the output from both read pairs:
237 | 	print 'Sorting and combining the output for both read pairs....'
238 | 	if not os.path.exists(outputfolder + os.path.sep + 'sorted_bowtie'):
239 | 		os.mkdir(outputfolder + os.path.sep + 'sorted_bowtie')
240 | 	sorted_bowtie = outputfolder + os.path.sep + 'sorted_bowtie'
241 | 	for metagenome in repeat_list:
242 | 		file1 = folder_pair1 + os.path.sep + metagenome + '.bowtie'
243 | 		fileout= sorted_bowtie + os.path.sep + metagenome + '.bowtie'
244 | 		with open(fileout, 'w') as stdout:
245 | 			p1 = subprocess.Popen(['cat',file1], stdout = subprocess.PIPE)
246 | 			p2 = subprocess.Popen(['cut', '-f1'], stdin = p1.stdout, stdout = subprocess.PIPE)
247 | 			p3 = subprocess.Popen(['cut', '-f1', "-d/"], stdin = p2.stdout, stdout = subprocess.PIPE)
248 | 			p4 = subprocess.Popen(['sort'], stdin = p3.stdout,stdout = subprocess.PIPE)
249 | 			p5 = subprocess.Popen(['uniq'], stdin = p4.stdout,stdout = stdout)
250 | 			p5.communicate()
251 | 		stdout.close()
252 | 	print 'completed ...'
253 |     
254 | ################################################################################
255 | # build a file of repeat keys for all reads
256 | print 'Writing and processing intermediate files...'
257 | sorted_bowtie = outputfolder + os.path.sep + 'sorted_bowtie'
258 | readid = {}
259 | sumofrepeatreads=0
260 | for rep in repeat_list: 
261 |     for data in import_text(sorted_bowtie + os.path.sep + rep + '.bowtie', '\t'):
262 | 	readid[data[0]] = ''
263 | for rep in repeat_list: 
264 |     for data in import_text(sorted_bowtie + os.path.sep + rep + '.bowtie', '\t'):	
265 | 	readid[data[0]]+=str(repeat_key[rep]) + str(',')
266 | for subfamilies in readid.values():
267 |     if not counts.has_key(subfamilies):
268 | 	counts[subfamilies]=0
269 |     counts[subfamilies] +=1
270 |     sumofrepeatreads+=1
271 | del readid
272 | print 'Identified ' + str(sumofrepeatreads) + ' reads that mapped to repeats for unique and multimappers.'
273 | 
274 | ################################################################################
275 | print "Conducting final calculations..."
276 | # build a converter to numeric label for repeat and yield a combined list of repnames seperated by backslash
277 | def convert(x):
278 |     x = x.strip(',')
279 |     x = x.split(',')
280 |     global repname
281 |     repname = ""
282 |     for i in x:
283 |         repname = repname + os.path.sep + rev_repeat_key[int(i)] 
284 | # building the total counts for repeat element enrichment...
285 | for x in counts.keys():
286 |     count= counts[x]
287 |     x = x.strip(',')
288 |     x = x.split(',')
289 |     for i in x:
290 |         reptotalcounts[rev_repeat_key[int(i)]] += int(count)
291 | # building the fractional counts for repeat element enrichment...
292 | for x in counts.keys():
293 |     count= counts[x]
294 |     x = x.strip(',')
295 |     x = x.split(',')
296 |     splits = len(x)
297 |     for i in x:
298 |         fractionalcounts[rev_repeat_key[int(i)]] += float(numpy.divide(float(count),float(splits)))
299 | # building categorized table of repeat element enrichment... 
300 | repcounts = {}
301 | repcounts['other'] = 0
302 | for key in counts.keys():
303 |         convert(key)
304 |         repcounts[repname] = counts[key]
305 | # building the total counts for class enrichment...
306 | for key in reptotalcounts.keys():
307 | 	classtotalcounts[repeatclass[key]] += reptotalcounts[key]
308 | # building total counts for family enrichment...
309 | for key in reptotalcounts.keys():
310 | 	familytotalcounts[repeatfamily[key]] += reptotalcounts[key]
311 | # building unique counts table'
312 | repcounts2 = {}
313 | for rep in repeat_list:
314 |     if repcounts.has_key("/" +rep):
315 |         repcounts2[rep] = repcounts["/" +rep]
316 |     else:
317 |         repcounts2[rep] = 0
318 | # building the fractionalcounts counts for class enrichment...
319 | for key in fractionalcounts.keys():
320 | 	classfractionalcounts[repeatclass[key]] += fractionalcounts[key]
321 | # building fractional counts for family enrichment...
322 | for key in fractionalcounts.keys():
323 | 	familyfractionalcounts[repeatfamily[key]] += fractionalcounts[key]   
324 |     
325 | ################################################################################
326 | print 'Writing final output and removing intermediate files...'
327 | # print output to file of the categorized counts and total overlapping counts:
328 | if allcountmethod  == "TRUE":
329 | 	fout1 = open(outputfolder + os.path.sep + outputfile_prefix + '_total_counts.txt' , 'w')
330 | 	for key in reptotalcounts.keys():
331 | 	    print >> fout1, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(reptotalcounts[key])
332 | 	fout2 = open(outputfolder + os.path.sep + outputfile_prefix + '_class_total_counts.txt' , 'w')
333 | 	for key in classtotalcounts.keys():
334 | 	    print >> fout2, str(key) + '\t' + str(classtotalcounts[key])  
335 | 	fout3 = open(outputfolder + os.path.sep + outputfile_prefix + '_family_total_counts.txt' , 'w')
336 | 	for key in familytotalcounts.keys():
337 | 	    print >> fout3, str(key) + '\t' + str(familytotalcounts[key])  
338 | 	fout4 = open(outputfolder + os.path.sep + outputfile_prefix + '_unique_counts.txt' , 'w')
339 | 	for key in repcounts2.keys():
340 | 	    print >> fout4, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(repcounts2[key])     
341 |     	fout5 = open(outputfolder + os.path.sep + outputfile_prefix + '_class_fraction_counts.txt' , 'w')
342 | 	for key in classfractionalcounts.keys():
343 | 	    print >> fout5, str(key) + '\t' + str(classfractionalcounts[key])  
344 | 	fout6 = open(outputfolder + os.path.sep + outputfile_prefix + '_family_fraction_counts.txt' , 'w')
345 | 	for key in familyfractionalcounts.keys():
346 | 	    print >> fout6, str(key) + '\t' + str(familyfractionalcounts[key])
347 | 	fout7 = open(outputfolder + os.path.sep + outputfile_prefix + '_fraction_counts.txt' , 'w')
348 | 	for key in fractionalcounts.keys():
349 | 	    print >> fout7, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(int(fractionalcounts[key]))
350 |     	fout1.close()
351 | 	fout2.close()
352 | 	fout3.close()
353 | 	fout4.close()
354 | 	fout5.close()
355 | 	fout6.close()
356 | 	fout7.close()
357 | else:
358 | 	fout1 = open(outputfolder + os.path.sep + outputfile_prefix + '_class_fraction_counts.txt' , 'w')
359 | 	for key in classfractionalcounts.keys():
360 | 	    print >> fout1, str(key) + '\t' + str(classfractionalcounts[key])  
361 | 	fout2 = open(outputfolder + os.path.sep + outputfile_prefix + '_family_fraction_counts.txt' , 'w')
362 | 	for key in familyfractionalcounts.keys():
363 | 	    print >> fout2, str(key) + '\t' + str(familyfractionalcounts[key])
364 | 	fout3 = open(outputfolder + os.path.sep + outputfile_prefix + '_fraction_counts.txt' , 'w')
365 | 	for key in fractionalcounts.keys():
366 | 	    print >> fout3, str(key) + '\t' + repeatclass[key] + '\t' + repeatfamily[key] + '\t' + str(int(fractionalcounts[key]))
367 | 	fout1.close()
368 | 	fout2.close()
369 | 	fout3.close()
370 |     
371 | ################################################################################
372 | # Remove Large intermediate files
373 | if os.path.exists(outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt'):
374 | 	os.remove(outputfolder + os.path.sep + outputfile_prefix + '_regionsorter.txt')
375 | if os.path.exists(outputfolder + os.path.sep + 'pair1_bowtie'):
376 | 	shutil.rmtree(outputfolder + os.path.sep + 'pair1_bowtie')
377 | if os.path.exists(outputfolder + os.path.sep + 'pair2_bowtie'):
378 | 	shutil.rmtree(outputfolder + os.path.sep + 'pair2_bowtie')
379 | if os.path.exists(outputfolder + os.path.sep + 'sorted_bowtie'):
380 | 	shutil.rmtree(outputfolder + os.path.sep + 'sorted_bowtie')
381 | 
382 | print "... Done"
383 | 


--------------------------------------------------------------------------------
/RepEnrich_setup.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | import argparse
  3 | import csv
  4 | import os
  5 | import shlex
  6 | import subprocess
  7 | import sys
  8 | from Bio import SeqIO
  9 | from Bio.Seq import Seq
 10 | from Bio.SeqRecord import SeqRecord
 11 | from Bio.Alphabet import IUPAC
 12 | 
 13 | parser = argparse.ArgumentParser(description='Part I: Prepartion of repetive element psuedogenomes and repetive element bamfiles.  This script prepares the annotation used by downstream applications to analyze for repetitive element enrichment. For this script to run properly bowtie must be loaded.  The repeat element psuedogenomes are prepared in order to analyze reads that map to multiple locations of the genome.  The repeat element bamfiles are prepared in order to use a region sorter to analyze reads that map to a single location of the genome.You will 1) annotation_file: The repetitive element annotation file downloaded from RepeatMasker.org database for your organism of interest. 2) genomefasta: Your genome of interest in fasta format, 3)setup_folder: a folder to contain repeat element setup files  command-line usage EXAMPLE: python master_setup.py /users/nneretti/data/annotation/mm9/mm9_repeatmasker.txt /users/nneretti/data/annotation/mm9/mm9.fa /users/nneretti/data/annotation/mm9/setup_folder', prog='getargs_genome_maker.py')
 14 | parser.add_argument('--version', action='version', version='%(prog)s 0.1')
 15 | parser.add_argument('annotation_file', action= 'store', metavar='annotation_file', help='List annotation file. The annotation file contains the repeat masker annotation for the genome of interest and may be downloaded at RepeatMasker.org  Example /data/annotation/mm9/mm9.fa.out')
 16 | parser.add_argument('genomefasta', action= 'store', metavar='genomefasta', help='File name and path for genome of interest in fasta format.  Example /data/annotation/mm9/mm9.fa')
 17 | parser.add_argument('setup_folder', action= 'store', metavar='setup_folder', help='List folder to contain bamfiles for repeats and repeat element psuedogenomes. Example /data/annotation/mm9/setup')
 18 | parser.add_argument('--nfragmentsfile1', action= 'store', dest='nfragmentsfile1', metavar='nfragmentsfile1', default='./repnames_nfragments.txt', help='Output location of a description file that saves the number of fragments processed per repname.  Default ./repnames_nfragments.txt')
 19 | parser.add_argument('--gaplength', action= 'store', dest='gaplength', metavar='gaplength', default= '200', type=int, help='Length of the spacer used to build repeat psuedogeneomes.  Default 200')
 20 | parser.add_argument('--flankinglength', action= 'store', dest='flankinglength', metavar='flankinglength', default= '25', type=int, help='Length of the flanking region adjacent to the repeat element that is used to build repeat psuedogeneomes.  The flanking length should be set according to the length of your reads.  Default 25')
 21 | parser.add_argument('--is_bed', action= 'store', dest='is_bed', metavar='is_bed', default= 'FALSE', help='Is the annotation file a bed file.  This is also a compatible format.  The file needs to be a tab seperated bed with optional fields.  Ex. format chr\tstart\tend\tName_element\tclass\tfamily.  The class and family should identical to name_element if not applicable.  Default FALSE change to TRUE')
 22 | args = parser.parse_args()
 23 | 
 24 | # parameters and paths specified in args_parse
 25 | gapl = args.gaplength
 26 | flankingl = args.flankinglength 
 27 | annotation_file = args.annotation_file
 28 | genomefasta = args.genomefasta
 29 | setup_folder = args.setup_folder
 30 | nfragmentsfile1 = args.nfragmentsfile1
 31 | is_bed = args.is_bed
 32 | 
 33 | ################################################################################
 34 | # check that the programs we need are available
 35 | try:
 36 |     subprocess.call(shlex.split("bowtie --version"), stdout=open(os.devnull, 'wb'), stderr=open(os.devnull, 'wb'))
 37 | except OSError:
 38 |     print "Error: Bowtie or BEDTools not loaded"
 39 |     raise
 40 | 
 41 | ################################################################################
 42 | # Define a text importer
 43 | csv.field_size_limit(sys.maxsize)
 44 | def import_text(filename, separator):
 45 |     for line in csv.reader(open(os.path.realpath(filename)), delimiter=separator, 
 46 |                            skipinitialspace=True):
 47 |         if line:
 48 |             yield line
 49 | # Make a setup folder
 50 | if not os.path.exists(setup_folder):
 51 | 	os.makedirs(setup_folder)
 52 | 
 53 | ################################################################################
 54 | # load genome into dictionary
 55 | print "loading genome..."
 56 | g = SeqIO.to_dict(SeqIO.parse(genomefasta, "fasta"))
 57 | 
 58 | print "Precomputing length of all chromosomes..."
 59 | idxgenome = {}
 60 | lgenome = {}
 61 | genome = {}
 62 | allchrs = g.keys()
 63 | k = 0
 64 | for chr in allchrs:
 65 |     genome[chr] = str(g[chr].seq)
 66 |     del g[chr]
 67 |     lgenome[chr] = len(genome[chr])
 68 |     idxgenome[chr] = k
 69 |     k = k + 1
 70 | del g
 71 | 
 72 | ################################################################################
 73 | # Build a bedfile of repeatcoordinates to use by RepEnrich region_sorter
 74 | if is_bed == "FALSE":
 75 | 	repeat_elements= []
 76 | 	fout = open(os.path.realpath(setup_folder + os.path.sep + 'repnames.bed'), 'w')
 77 | 	fin = import_text(annotation_file, ' ')
 78 | 	x = 0
 79 | 	rep_chr = {}
 80 | 	rep_start = {}
 81 | 	rep_end = {}
 82 | 	x = 0
 83 | 	for line in fin:
 84 | 	    if x>2:
 85 | 	        line9 = line[9].replace("(","_").replace(")","_").replace("/","_")
 86 | 	        repname = line9
 87 | 		if not repname in repeat_elements:
 88 | 			repeat_elements.append(repname)
 89 | 	        repchr = line[4]
 90 | 	        repstart = int(line[5])
 91 | 	        repend = int(line[6])
 92 | 		print >> fout, str(repchr) + '\t'+str(repstart)+ '\t'+str(repend)+ '\t'+str(repname)
 93 | 	        if rep_chr.has_key(repname):
 94 | 	            rep_chr[repname].append(repchr)
 95 | 	            rep_start[repname].append(int(repstart))
 96 | 	            rep_end[repname].append(int(repend))
 97 | 	        else:
 98 | 	            rep_chr[repname] = [repchr]
 99 | 	            rep_start[repname] = [int(repstart)]
100 | 	            rep_end[repname] = [int(repend)]
101 | 	    x +=1
102 | if is_bed == "TRUE":
103 | 	repeat_elements= []
104 | 	fout = open(os.path.realpath(setup_folder + os.path.sep + 'repnames.bed'), 'w')
105 | 	fin = open(os.path.realpath(annotation_file), 'r')
106 | 	x =0
107 | 	rep_chr = {}
108 | 	rep_start = {}
109 | 	rep_end = {}
110 | 	x =0
111 | 	for line in fin:
112 | 		line=line.strip('\n')
113 | 		line=line.split('\t')
114 | 	        line3 = line[3].replace("(","_").replace(")","_").replace("/","_")
115 | 	        repname = line3
116 | 		if not repname in repeat_elements:
117 | 			repeat_elements.append(repname)
118 | 	        repchr = line[0]
119 | 	        repstart = int(line[1])
120 | 	        repend = int(line[2])
121 | 		print >> fout, str(repchr) + '\t'+str(repstart)+ '\t'+str(repend)+ '\t'+str(repname)
122 | 	        if rep_chr.has_key(repname):
123 | 	            rep_chr[repname].append(repchr)
124 | 	            rep_start[repname].append(int(repstart))
125 | 	            rep_end[repname].append(int(repend))
126 | 	        else:
127 | 	            rep_chr[repname] = [repchr]
128 | 	            rep_start[repname] = [int(repstart)]
129 | 	            rep_end[repname] = [int(repend)]
130 | 
131 | fin.close()
132 | fout.close()
133 | repeat_elements = sorted(repeat_elements)
134 | print "Writing a key for all repeats..."
135 | #print to fout the binary key that contains each repeat type with the associated binary number; sort the binary key:
136 | fout = open(os.path.realpath(setup_folder + os.path.sep + 'repgenomes_key.txt'), 'w')
137 | x = 0
138 | for repeat in repeat_elements:
139 |     print >> fout, str(repeat) + '\t' + str(x)
140 |     x +=1
141 | fout.close()
142 | ################################################################################
143 | # generate spacer for psuedogenomes
144 | spacer = ""
145 | for i in range(gapl):
146 |     spacer = spacer + "N"
147 | 
148 | # save file with number of fragments processed per repname
149 | print "Saving number of fragments processed per repname to " + nfragmentsfile1
150 | fout1 = open(os.path.realpath(nfragmentsfile1),"w")
151 | for repname in rep_chr.keys():
152 | 	rep_chr_current = rep_chr[repname]
153 | 	print >>fout1, str(len(rep_chr[repname])) + "\t" + repname
154 | fout1.close()
155 | 
156 | # generate metagenomes and save them to FASTA files
157 | k = 1
158 | nrepgenomes = len(rep_chr.keys())
159 | for repname in rep_chr.keys():
160 |     metagenome = ""
161 |     newname = repname.replace("(","_").replace(")","_").replace("/","_")
162 |     print "processing repgenome " + newname + ".fa" + " (" + str(k) + " of " + str(nrepgenomes) + ")"
163 |     rep_chr_current = rep_chr[repname]
164 |     rep_start_current = rep_start[repname]
165 |     rep_end_current = rep_end[repname]
166 |     print "-------> " + str(len(rep_chr[repname])) + " fragments"
167 |     for i in range(len(rep_chr[repname])):
168 |         try:
169 |             chr = rep_chr_current[i]
170 |             rstart = max(rep_start_current[i] - flankingl, 0)
171 |             rend = min(rep_end_current[i] + flankingl, lgenome[chr]-1)
172 |             metagenome = metagenome + spacer + genome[chr][rstart:(rend+1)]
173 |         except KeyError:
174 |             print "Unrecognised Chromosome: "+chr
175 |             pass
176 |     
177 | 	# Convert metagenome to SeqRecord object (required by SeqIO.write)
178 |     record = SeqRecord(Seq(metagenome, IUPAC.unambiguous_dna), id = "repname", name = "", description = "")
179 |     print "saving repgenome " + newname + ".fa" + " (" + str(k) + " of " + str(nrepgenomes) + ")"
180 |     fastafilename = os.path.realpath(setup_folder + os.path.sep + newname + ".fa")
181 |     SeqIO.write(record, fastafilename, "fasta")
182 |     print "indexing repgenome " + newname + ".fa" + " (" + str(k) + " of " + str(nrepgenomes) + ")"
183 |     command = shlex.split('bowtie-build -f ' + fastafilename + ' ' + setup_folder + os.path.sep + newname)
184 |     p = subprocess.Popen(command).communicate()
185 |     k += 1
186 | 
187 | print "... Done"
188 | 
189 | 


--------------------------------------------------------------------------------