9 |
10 | The following image explains the high-level architecture.
11 |
12 |
13 |
14 |
15 |
16 | This version of REDItools shows an average 8x speed improvement over the previous version, even when running in serial mode alone:
17 |
18 |
19 |
20 |
21 |
22 | # Index
23 |
24 | - [1. Python setup](#1-python-setup)
25 | - [2. Environment setup](#2-environment-setup)
26 | - [3. Cloning / downloading](#3-cloning--downloading)
27 | - [4. Installing](#4-installing)
28 | - [4.1 Spack](#41-spack)
29 | - [5. The two versions of REDItools 2.0](#5-the-two-versions-of-reditools-20)
30 | - [5.1 Serial version](#51-serial-version-reditoolspy)
31 | - [5.2 Parallel version](#52-parallel-version--parallel_reditoolspy)
32 | - [6. Running REDItools 2.0 on your own data](#6-running-reditools-20-on-your-own-data)
33 | - [7. REDItools 2.0 options](#7-reditools-20-options)
34 | - [8. DNA-Seq annotation with REDItools 2.0](#8-dna-seq-annotation-with-reditools-20)
35 | - [9. Running REDItools 2.0 in multisample mode](#9-running-reditools-20-in-multisample-mode)
36 | - [10. Displaying benchmarks in HTML with REDItools 2.0 (parallel version only)](#10-displaying-benchmarks-with-reditools-20-parallel-version-only)
37 |
38 |
39 | ## Installation
40 |
41 | ### 1. Python setup
42 | ---
43 | This guide assumes you have Python 2.7 (or an earlier version) installed on your system. If you do not have Python, please refer to the [official Python webpage](https://www.python.org/).
44 |
45 | Make sure to have the following packages installed:
46 |
47 | > sudo apt-get install python-dev build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev python-pip libbz2-dev libncurses5-dev libncursesw5-dev liblzma-dev
48 |
49 | Make sure your preferred Python version is the one loaded. If you have a single Python version installed on your system, no action is needed. If you have multiple versions, be sure to point to the desired one; to do so, check your environment variables (e.g., PATH).
50 |
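For example, you can check which interpreter is currently active with:

> which python
>
> python --version
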
51 | If you are running on a cluster (where several versions are usually available), make sure to load the desired Python version. For example, on the CINECA Marconi supercomputer the following command would load Python 2.7.12:
52 | > module load autoload python/2.7.12
53 |
54 | Note: REDItools2.0 has been tested with Python 2.7.12. The software comes with no guarantee of compatibility with other versions of Python (e.g., Python >= 3).
55 |
56 | ### 2. Environment setup
57 | ---
58 | Make sure the following libraries are installed:
59 |
60 | - htslib (see http://www.htslib.org/download/ and https://www.biostars.org/p/328831/ for instructions)
61 | - samtools:
62 |
63 | > sudo apt-get install samtools
64 |
65 | - tabix:
66 |
67 | > sudo apt-get install tabix
68 |
69 | - an MPI implementation. We suggest OpenMPI, but any implementation should work. To install OpenMPI, try the following command:
70 | > sudo apt-get install openmpi-common libopenmpi-dev
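
A quick sanity check that these tools are available on your PATH (recent versions of tabix support --version; older ones just print a usage message):

> samtools --version
>
> tabix --version
>
> mpirun --version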
71 |
72 | ### 3. Cloning / Downloading
73 | ---
74 |
75 | The first step is to clone this repository (this assumes *git* is installed on your system; see the [Git official page](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) otherwise):
76 | > git clone https://github.com/tflati/reditools2.0.git
77 |
78 | (alternatively, you can download REDItools2.0 as a ZIP package from [here](https://github.com/tflati/reditools2.0/archive/master.zip) and uncompress the archive).
79 |
80 | Move into the project main directory:
81 | > cd reditools2.0
82 |
83 |
84 | ### 4. Installing
85 | ---
86 |
87 | REDItools 2.0 requires a few Python modules (e.g., pysam, sortedcontainers and mpi4py). These can be installed in three ways:
88 |
89 | - **System-level**: the dependencies will be installed system-wide and the changes will be visible to all users. This type of installation requires administrator rights.
90 | To install REDItools2.0 in this modality, just run the following command:
91 | > sudo pip install -r requirements.txt
92 |
93 | - **User-level**: the dependencies will be installed only for your current user, usually in your home directory. You only need to be logged in as a normal user. Note that this type of installation will place additional software in your local Python directory (usually $HOME/.local/lib/python2.7/site-packages/, but this depends on your operating system and distribution).
94 | This is the recommended modality if you do not mind altering your user environment. Note, however, that altering your user environment might break existing software. For example, assume you already have the *pysam* package installed (version 0.6); since REDItools 2.0 requires *pysam* >= 0.9, the installation would replace the existing version with version 0.9, thus altering the state of your environment. Any existing software relying on pysam 0.6 might break and stop working. In short, choose this modality at your own risk.
95 | To install REDItools2.0 in this modality, just run the following command:
96 | > pip install -r requirements.txt --user
97 |
98 | - **Environment-level**: this type of installation creates an isolated virtual environment (initially empty) which will contain all the newly required software, without conflicting with any existing environment or requiring any particular privileges.
99 | This modality works regardless of the packages already installed on your system (at both user and system level) and thus gives the end user the maximum possible freedom.
100 | This is the recommended modality.
101 | The downside of this modality is a potential duplication of code with respect to other existing environments. For example, assume you already have a given version of *sortedcontainers*; installing REDItools2.0 at environment level will download and install a *new* copy of *sortedcontainers* into a new isolated environment (leaving two copies of the same software on the system, one inside and one outside the virtual environment).
102 | To install REDItools2.0 in this modality, run the following commands:
103 |
104 | > virtualenv ENV
105 | >
106 | > source ENV/bin/activate
107 | >
108 | > pip install -r requirements.txt
109 | >
110 | > deactivate
111 |
112 | These commands create a new environment called *ENV* (you can choose any name you like) and install into it all the dependencies listed in the file *requirements.txt*. The commands *activate* and *deactivate* respectively open and close the virtual environment.
113 | When running the real commands, remember to wrap them between the activate and deactivate commands:
114 |
115 | >source ENV/bin/activate
116 | >
117 | >command...
118 | >
119 | >command...
120 | >
121 | >command...
122 | >
123 | >command...
124 | >
125 | >deactivate
126 |
127 | #### 4.1 Spack
128 | (Thanks to Silvia Gioiosa, PhD, CINECA Rome)
129 |
130 |
131 | - Spack module loading
132 | >module load autoload spack
133 |
134 | - Installation of the required Python version (when prompted with 'Do you want to proceed?', always answer y):
135 |
136 | >spack install python@2.7.16 # '@' selects a specific version of Python; if you want more verbosity, use -d
137 |
138 | >spack module tcl refresh python@2.7.16
139 |
140 | >spack install py-mpi4py^python@2.7.16
141 |
142 | >spack module tcl refresh py-mpi4py^python@2.7.16
143 |
144 | >spack install py-virtualenv^python@2.7.16
145 |
146 | >spack module tcl refresh py-virtualenv^python@2.7.16
147 |
148 | - Installation of the modules required by REDItools 2.0
149 |
150 | >module load python/2.7.16--gcc--8.4.0-bgv
151 |
152 | >module load autoload py-mpi4py/3.0.3--gcc--8.4.0-spectrmpi-ac2
153 |
154 | >module load py-virtualenv/16.7.6--gcc--8.4.0-4ut
155 |
156 | >module load profile/global
157 |
158 | >module load samtools/1.12
159 |
160 | - Download of REDItools 2.0 from this repo
161 |
162 | > git clone https://github.com/BioinfoUNIBA/REDItools2.git
163 |
164 | > cd REDItools2
165 |
166 | - Virtualenv activation and dependency installation
167 |
168 | >virtualenv ENV
169 |
170 | >source ENV/bin/activate
171 |
172 | >pip install pysam
173 |
174 | >pip install sortedcontainers
175 |
176 | >pip install psutil
177 |
178 | >pip install netifaces
179 |
180 | - Test data preparation:
181 |
182 | >./prepare_test.sh
183 |
184 | - With a text editor, modify the two SLURM directives (queue -p and account) in serial_test_slurm.sh:
185 |
186 | >#SBATCH --account= (insert your account here)
187 | >#SBATCH -p m100_all_serial
188 |
189 | - Launch the test run:
190 |
191 | >sbatch serial_test_slurm.sh
192 |
193 | ### 5. The two versions of REDItools 2.0
194 | ---
195 |
196 | This repo includes test data and a test script for checking that the dependencies have been installed properly and that the basic REDItools command works.
197 |
198 | In order to have all the data you need, run the following commands:
199 |
200 | > cd test
201 | >
202 | > ./prepare_test.sh
203 |
204 | This will download and index chromosome 21 of the hg19 human genome (from http://hgdownload.cse.ucsc.edu/downloads.html).
205 | Once the script has finished running, you have all you need to perform the tests.
206 |
207 | The software comes in two modalities. Feel free to choose the one which best fits your needs.
208 |
209 | #### 5.1 Serial version (reditools.py)
210 |
211 | In this modality you benefit only from the optimizations introduced after the first version. While significantly faster (by about an 8x factor), it does not exploit the computational power of multiple cores. On the other hand, the setup and launch of REDItools is much easier.
212 | This is the modality you may want to try first when using REDItools2.0 for the first time.
213 |
214 | The serial version of REDItools2.0 can be tested by issuing the following command:
215 |
216 | > ./serial_test.sh
217 |
218 | or, if you are on a SLURM-based cluster:
219 |
220 | > sbatch serial_test_slurm.sh
221 |
222 | #### 5.2 Parallel version (parallel_reditools.py)
223 |
224 | In this modality you benefit both from the serial optimizations and from the parallel computation introduced in this brand-new version, which exploits multiple cores, possibly across multiple nodes, making it well suited to High Performance Computing facilities.
225 | Using this modality requires a little more system setup, but it definitely pays off.
226 |
227 | The parallel version leverages coverage information, which reports, for each position, the number of supporting reads (a sketch of how this is produced is shown below).
228 |
229 | We assume you already have installed and correctly configured the following tools:
230 |
231 | - **samtools** (http://www.htslib.org/)
232 | - **htslib** (http://www.htslib.org/)
233 |
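Coverage data is produced by the repository's *extract_coverage.sh* / *extract_coverage_dynamic.sh* scripts; at their core, they run *samtools depth* per chromosome and drop zero-coverage positions, roughly as follows (file names are illustrative):

> samtools depth sample.bam -r chr21 | grep -vP "\t0$" > coverage/chr21
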
234 | If you can use *mpi* on your machine (e.g., you are not on a multi-user system and there are no limitations on the jobs you can submit), you can try launching the parallel version of REDItools 2.0 as follows:
235 |
236 | > ./parallel_test.sh
237 |
238 | If you are running on a SLURM-based cluster, instead, run the following command:
239 |
240 | > sbatch ./parallel_test_slurm.sh
241 |
242 | This script:
243 | - first defines a set of variables which point to input, output and accessory files; then
244 | - launches the production of coverage data; then
245 | - launches REDItools 2.0 in parallel, using the specified number of cores; finally
246 | - gathers the results and writes them into a single table (the *-o* parameter provided on the command line); a sketch of this last step is shown below.
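
The gathering step corresponds to the repository's *merge.sh* (arguments: TABLE_DIR FINAL_TABLE THREADS), which concatenates the gzipped partial tables listed in TABLE_DIR/files.txt, recompresses them with *bgzip* and indexes the result with *tabix*, essentially:

> zcat $(cat $TABLE_DIR/files.txt) | bgzip -c -@ $THREADS > $FINAL_TABLE
>
> tabix -s 1 -b 2 -e 2 -c Region $FINAL_TABLE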
247 |
248 | ## Running
249 |
250 | ### 6. Running REDItools 2.0 on your own data
251 | ---
252 | You can now customize the test scripts with your own input, output and ad-hoc options.
253 |
254 | ### 7. REDItools 2.0 options
255 | ---
256 | #### 7.1 Basic options
257 | In its most basic form, REDItools 2.0 can be invoked with an input BAM file, a reference genome and an output file:
258 | > python src/cineca/reditools.py -f $INPUT_BAM_FILE -r $REFERENCE -o $OUTPUT_FILE
259 |
260 | If you want, you can restrict the analysis to a certain region (e.g., only chr1) by means of the **-g** option:
261 | > python src/cineca/reditools.py -f $INPUT_BAM_FILE -r $REFERENCE -o $OUTPUT_FILE -g chr1
262 | 
263 | or to a specific interval:
264 | > python src/cineca/reditools.py -f $INPUT_BAM_FILE -r $REFERENCE -o $OUTPUT_FILE -g chr1:1000-2000
265 |
266 | For a complete list of options and their usage and meaning, please type:
267 |
268 | > python src/cineca/reditools.py -h
269 |
270 | #### 7.2 Other options
271 |
272 | Here we report the principal options with a detailed explanation for each of them.
273 | The following are the options accepted by the serial version of REDItools:
274 |
275 | > reditools.py [-h] [-f FILE] [-o OUTPUT_FILE] [-S] [-s STRAND] [-a]
276 | [-r REFERENCE] [-g REGION] [-m OMOPOLYMERIC_FILE] [-c]
277 | [-os OMOPOLYMERIC_SPAN] [-sf SPLICING_FILE]
278 | [-ss SPLICING_SPAN] [-mrl MIN_READ_LENGTH]
279 | [-q MIN_READ_QUALITY] [-bq MIN_BASE_QUALITY]
280 | [-mbp MIN_BASE_POSITION] [-Mbp MAX_BASE_POSITION]
281 | [-l MIN_COLUMN_LENGTH] [-men MIN_EDITS_PER_NUCLEOTIDE]
282 | [-me MIN_EDITS] [-Men MAX_EDITING_NUCLEOTIDES] [-d]
283 | [-T STRAND_CONFIDENCE] [-C] [-Tv STRAND_CONFIDENCE_VALUE]
284 | [-V] [-H] [-D] [-B BED_FILE]
285 | >
286 | > **-h**, --help
287 | > show this help message and exit
288 | >
289 | >**-f** FILE, --file FILE
290 | >The bam file to be analyzed
291 | >
292 | >**-o** OUTPUT_FILE, --output-file OUTPUT_FILE
293 | >The output statistics file
294 | >
295 | >**-S**, --strict
296 | > Activate strict mode: only sites with edits will be included in the output
297 | >
298 | >**-s** STRAND, --strand STRAND
299 | >Strand: this can be 0 (unstranded), 1 (secondstrand oriented) or 2 (firststrand oriented)
300 | >
301 | >**-a**, --append-file
302 | >Appends results to file (and creates if not existing)
303 | >
304 | >**-r** REFERENCE, --reference REFERENCE
305 | >The reference FASTA file
306 | >
307 | >**-g** REGION, --region REGION
308 | >The region of the bam file to be analyzed
309 | >
310 | >**-m** OMOPOLYMERIC_FILE, --omopolymeric-file OMOPOLYMERIC_FILE
311 | >The file containing the omopolymeric positions
312 | >
313 | >**-c**, --create-omopolymeric-file
314 | >Whether to create the omopolymeric file
315 | >
316 | >**-os** OMOPOLYMERIC_SPAN, --omopolymeric-span OMOPOLYMERIC_SPAN
317 | >The omopolymeric span
318 | >
319 | >**-sf** SPLICING_FILE, --splicing-file SPLICING_FILE
320 | >The file containing the splicing sites positions
321 | >
322 | >**-ss** SPLICING_SPAN, --splicing-span SPLICING_SPAN
323 | >The splicing span
324 | >
325 | >**-mrl** MIN_READ_LENGTH, --min-read-length MIN_READ_LENGTH
326 | >The minimum read length. Reads whose length is below this value will be discarded.
327 | >
328 | >**-q** MIN_READ_QUALITY, --min-read-quality MIN_READ_QUALITY
329 | >The minimum read quality. Reads whose mapping quality is below this value will be discarded.
330 | >
331 | >**-bq** MIN_BASE_QUALITY, --min-base-quality MIN_BASE_QUALITY
332 | >The minimum base quality. Bases whose quality is below this value will not be included in the analysis.
333 | >
334 | >**-mbp** MIN_BASE_POSITION, --min-base-position MIN_BASE_POSITION
335 | >The minimum base position. Bases residing in an earlier position within the read will not be included in the analysis.
336 | >
337 | >**-Mbp** MAX_BASE_POSITION, --max-base-position MAX_BASE_POSITION
338 | >The maximum base position. Bases residing in a later position within the read will not be included in the analysis.
339 | >
340 | >**-l** MIN_COLUMN_LENGTH, --min-column-length MIN_COLUMN_LENGTH
341 | >The minimum length of editing column (per position). Positions whose columns have length below this value will not be included in the analysis.
342 | >
343 | >**-men** MIN_EDITS_PER_NUCLEOTIDE, --min-edits-per-nucleotide MIN_EDITS_PER_NUCLEOTIDE
344 | >The minimum number of editing events for each nucleotide (per position). Positions whose columns have bases with fewer than 'min-edits-per-nucleotide' edits will not be included in the analysis.
345 | >
346 | >**-me** MIN_EDITS, --min-edits MIN_EDITS
347 | > The minimum number of editing events (per position). Positions with fewer than 'min-edits' editing events will not be included in the analysis.
348 | >
349 | >**-Men** MAX_EDITING_NUCLEOTIDES, --max-editing-nucleotides MAX_EDITING_NUCLEOTIDES
350 | > The maximum number of editing nucleotides, from 0 to 4 (per position). Positions whose columns have more than 'max-editing-nucleotides' editing nucleotides will not be included in the analysis.
351 | >
352 | >**-d**, --debug
353 | >REDItools is run in DEBUG mode.
354 | >
355 | >**-T** STRAND_CONFIDENCE, --strand-confidence STRAND_CONFIDENCE
356 | > Strand inference type
357 | > 1:maxValue
358 | > 2:useConfidence [1];
359 | > maxValue: the most prominent strand count will be used;
360 | > useConfidence: strand is assigned if it exceeds a predefined frequency confidence (-Tv option)
361 | >
362 | >**-C**, --strand-correction
363 | > Strand correction. Once the strand has been inferred, only bases matching this strand will be selected.
364 | >
365 | >**-Tv** STRAND_CONFIDENCE_VALUE, --strand-confidence-value STRAND_CONFIDENCE_VALUE
366 | > Strand confidence [0.70]
367 | >
368 | >**-V**, --verbose
369 | > Verbose information in stderr
370 | >
371 | >**-H**, --remove-header
372 | >Do not include header in output file
373 | >
374 | >**-N**, --dna
375 | >Run REDItools 2.0 on DNA-Seq data
376 | >
377 | >**-B** BED_FILE, --bed_file BED_FILE
378 | > Path of BED file containing target regions
379 |
380 | The parallel version of REDItools 2.0 also accepts 4 additional parameters, namely:
381 | >**-G** --coverage-file The coverage file of the sample to analyze
382 | >
383 | >**-D** --coverage-dir The coverage directory containing the per-chromosome coverage files of the sample to analyze
384 | >
385 | >**-t** --temp-dir The temp directory where to store temporary data for this sample
386 | >
387 | >**-Z** --chromosome-sizes The file with the chromosome sizes
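
Putting these together, a parallel invocation might look like the following sketch (the variable values are illustrative; the coverage data and the .fai index must exist beforehand):

> mpirun python src/cineca/parallel_reditools.py -f $INPUT_BAM_FILE -r $REFERENCE -o $OUTPUT_FILE -G $COVERAGE_FILE -D $COVERAGE_DIR -t $TEMP_DIR -Z $REFERENCE.fai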
388 |
389 | ### 8. DNA-Seq annotation with REDItools 2.0
390 |
391 | - Analyze your RNA-Seq data (e.g., file *rna.bam*) with any version of REDItools and obtain the corresponding output table (e.g., *rna_table.txt* or *rna_table.txt.gz*);
392 | - Analyze your DNA-Seq data (e.g., *dna.bam*) with REDItools 2.0, providing as input:
393 | 1. The DNA-Seq file (*dna.bam*) (e.g., option *-f* *dna.bam*);
394 | 2. The RNA table produced in the first step (e.g., option *-B* *rna_table.txt*).
395 | This step will produce the DNA output table (e.g., *dna_table.txt*);
396 | - Annotate the RNA-Seq table with the DNA-Seq table by running the REDItools2.0 annotator (script *src/cineca/annotate_with_DNA.py*) with the two tables as input (e.g., *rna_table.txt* and *dna_table.txt*); this will produce the final annotated table (e.g., *final_table.txt*). A sketch of these steps is shown right below.
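
For example, the three steps could look like the following sketch (file names are illustrative):

> python src/cineca/reditools.py -f rna.bam -r $REFERENCE -o rna_table.txt
>
> python src/cineca/reditools.py -f dna.bam -r $REFERENCE -o dna_table.txt -N -B rna_table.txt
>
> python src/cineca/annotate_with_DNA.py -r rna_table.txt -d dna_table.txt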
397 |
398 |
399 |
400 |
401 |
402 | When RNA-editing tables are big (e.g., greater than 1GB in gzipped form), reading the full table in parallel mode can be a really time-consuming task. In order to optimize the loading of target positions, we provide a script to convert RNA-editing tables into BED files:
403 |
404 | > python src/cineca/reditools_table_to_bed.py -i RNA_TABLE -o BED_FILE
405 |
406 | This can be further optimized by creating the final BED file in parallel:
407 | 
408 | > ./extract_bed_dynamic.sh RNA_TABLE TEMP_DIR SIZE_FILE
409 |
410 | where
411 | - RNA_TABLE is the input RNA-editing table;
412 | - TEMP_DIR is the directory that will contain the output BED file;
413 | - SIZE_FILE is the file containing the chromosome information (e.g., the .fai file of your reference genome).
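
For example (illustrative paths; SIZE_FILE is typically the .fai index of your reference):

> ./extract_bed_dynamic.sh rna_table.txt.gz temp/ hg19.fa.fai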
414 |
415 | Finally, run the script *src/cineca/annotate_with_DNA.py*:
416 |
417 | > python src/cineca/annotate_with_DNA.py -r RNA_TABLE -d DNA_TABLE [-Z]
418 |
419 | The option -Z (optional and without arguments) will exclude positions with multiple changes in DNA-Seq.
420 |
421 | #### 8.1 Useful scripts
422 |
423 | In order to ease the annotation of RNA-Seq tables with DNA-Seq information, we also provide two sample scripts that you can customize with your own data:
424 |
425 | - [**WORK IN PROGRESS**] serial_dna_test.sh
426 | - [**WORK IN PROGRESS**] parallel_dna_test.sh
427 |
428 | ### 9. [**WORK IN PROGRESS**] Running REDItools 2.0 in multisample mode
429 | REDItools also supports launching on multiple samples at the same time. This modality is extremely useful if you have a dataset (i.e., a group of homogeneous samples) and wish to run the same analysis on all of them (i.e., with the same options).
430 |
431 | In order to do this, we provide a second script analogous to parallel_reditools.py, called *reditools2_multisample.py*, which accepts an additional option -F SAMPLE_FILE, where SAMPLE_FILE is a file containing the (absolute) paths of the samples to be analyzed, one per line.
432 | It can be launched in the following manner:
433 |
434 | > mpirun src/cineca/reditools2_multisample.py -F $SAMPLE_FILE [OPTIONS]
435 |
436 | where OPTIONS are the same options accepted by the parallel version of REDItools 2.0.
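
The sample file is a plain-text file listing one BAM path per line, e.g.:

> /storage/project/sample1.bam
>
> /storage/project/sample2.bam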
437 |
438 | #### 9.1 Running in multisample mode on a SLURM-based cluster
439 | If you wish to run REDItools 2.0 in multisample mode on a SLURM-based cluster, we provide two scripts that will help you:
440 |
441 | - [**WORK IN PROGRESS**] *extract_coverage_slurm_multisample.sh*: will calculate the coverage data for all the samples in parallel (by using the script *extract_coverage_dynamic.sh*);
442 | - [**WORK IN PROGRESS**] *multisample_test.sh*: will calculate the RNA-editing events tables for all the samples in parallel using MPI.
443 |
444 | First run *extract_coverage_slurm_multisample.sh* and then *multisample_test.sh*.
445 |
446 | ### 10. Displaying benchmarks with REDItools 2.0 (parallel version only)
447 | We also released simple scripts to generate HTML pages containing a snapshot of the time REDItools 2.0 (parallel version) spends in each part of the overall computation, for each process (e.g., coverage computation, DIA algorithm, interval analysis, partial-results recombination, etc.).
448 |
449 | **Note**: this command will work only when launched *after* the parallel computation has completed.
450 |
451 | All you have to do to create the HTML page is launch the following command:
452 | > ./create_html.sh TEMP_DIR
453 |
454 | where TEMP_DIR is the directory you specified with the -t option; in fact, this directory should contain some auxiliary files (e.g., intervals.txt, progress.txt, times.txt and groups.txt) which serve exactly this purpose.
455 | Once created, the HTML page should display time information similar to the following:
456 |
457 |
458 |
459 |
460 |
461 | In this visualization you can *hover* over slices to see the statistics for each interval computation in more detail, as well as *zoom in* and *zoom out* using your mouse's scroll wheel.
462 |
463 | Issues
464 | ---
465 | No issues are known so far. For any problem, write to t.flati@cineca.it.
466 |
473 |
--------------------------------------------------------------------------------
/bower.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "REDItools 2.0",
3 | "authors": [
4 | "tflati "
5 | ],
6 | "description": "Simple visualization tool",
7 | "main": "template.html",
8 | "license": "MIT",
9 | "homepage": "",
10 | "ignore": [
11 | "**/.*",
12 | "node_modules",
13 | "bower_components",
14 | "test",
15 | "tests"
16 | ],
17 | "dependencies": {
18 | "vis": "^4.21.0"
19 | }
20 | }
21 |
--------------------------------------------------------------------------------
/create_html.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | if [ $# -eq 0 ]; then
4 | echo "[ERROR] Please, remember to provide the temporary directory of interest."
5 | echo -e "Usage:\n\t$0 TEMPORARY_DIR"
6 | exit 1
7 | fi
8 |
9 | TEMPDIR=$1
10 | # Inject the run's timing data into the template: sed's 'r' command reads times.txt and groups.txt in place of the EVENTS_DATA and GROUPS_DATA placeholders
11 | cat template.html | sed "/EVENTS_DATA/{s/EVENTS_DATA//g
12 | r "$TEMPDIR"/times.txt
13 | }" | sed "/GROUPS_DATA/{s/GROUPS_DATA//g
14 | r "$TEMPDIR"/groups.txt
15 | }" > reditools.html
16 |
--------------------------------------------------------------------------------
/extract_bed_dynamic.sh:
--------------------------------------------------------------------------------
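# Usage: extract_bed_dynamic.sh RNA_TABLE TEMP_DIR SIZE_FILE
# Splits a (gzipped) RNA-editing table per chromosome, converts each piece to BED in parallel, then concatenates the pieces into a single BED file.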
1 | INPUT=$1
2 | TEMP_DIR=$2
3 | SIZE_FILE=$3
4 |
5 | FILENAME=$(basename $INPUT)
6 | FILE_ID=${FILENAME%%.*}
7 |
8 | if [ ! -d $TEMP_DIR ]
9 | then
10 | mkdir -p $TEMP_DIR
11 | fi
12 |
13 | echo "INPUT=$INPUT"
14 | echo "TEMP=$TEMP_DIR"
15 | echo "CHROMOSOMES=$SIZE_FILE"
16 | echo "FILE_ID=$FILE_ID"
17 |
18 | t1=$(date +%s)
19 | t1_human=$(date)
20 |
21 | echo "[STATS] Dividing input table into pieces ["`date`"]"
22 | zcat $INPUT | cut -f 1,2 | awk '{print $0 >> "'$TEMP_DIR'/"$1".table"}'
23 | # read -n 1 -s -r -p "Press any key to continue"
24 |
25 | echo "[STATS] Creating single chromosome bed files ["`date`"]"
26 | CHROMOSOMES=()
27 | for chrom in $(cat $SIZE_FILE | cut -f 1)
28 | do
29 | CHROMOSOMES[${#CHROMOSOMES[@]}]=$chrom
30 | done
31 |
32 | NUM_CHROMS=$(cat $SIZE_FILE | cut -f 1 | wc -l)
33 | AVAILABLE_CPUS=$(nproc)
34 | CHUNK_SIZE=$(($NUM_CHROMS>$AVAILABLE_CPUS?$AVAILABLE_CPUS:$NUM_CHROMS))
35 | echo "CHROMOSOMES="$NUM_CHROMS
36 | echo "CHUNK SIZE="$CHUNK_SIZE
37 | start=0
38 | while [ $start -lt $NUM_CHROMS ]
39 | do
40 | echo "NEW BATCH [$(expr $start + 1)-$(expr $start + $CHUNK_SIZE)]"
41 | for i in $(seq $start $(expr $start + $CHUNK_SIZE - 1))
42 | do
43 | if [ $i -ge $NUM_CHROMS ]; then break; fi
44 |
45 | chrom=${CHROMOSOMES[$i]}
46 | if [ -s $TEMP_DIR/$chrom.table ]
47 | then
48 | echo "Calculating bed file for chromosome $chrom = $TEMP_DIR$chrom"
49 | python src/cineca/reditools_table_to_bed.py -i $TEMP_DIR/$chrom.table -o $TEMP_DIR/$chrom.bed &
50 | fi
51 | done
52 | wait
53 | start=$(expr $start + $CHUNK_SIZE)
54 | done
55 | # read -n 1 -s -r -p "Press any key to continue"
56 |
57 | t2=$(date +%s)
58 | t2_human=$(date)
59 | elapsed_time=$(($t2-$t1))
60 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
61 | echo "[STATS] [BED CHR] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2
62 |
63 | tmid=$(date +%s)
64 | tmid_human=$(date)
65 |
66 | echo "[STATS] Creating complete BED file $TEMP_DIR$FILE_ID.bed ["`date`"]"
67 | rm $TEMP_DIR$FILE_ID".bed"
68 | for chrom in `cat $SIZE_FILE | cut -f 1`
69 | do
70 | if [ -s $TEMP_DIR$chrom".bed" ]
71 | then
72 | cat $TEMP_DIR$chrom".bed" >> $TEMP_DIR$FILE_ID".bed"
73 | fi
74 |
75 | rm $TEMP_DIR$chrom".table"
76 | rm $TEMP_DIR$chrom".bed"
77 | done
78 |
79 | t2=$(date +%s)
80 | t2_human=$(date)
81 | elapsed_time_mid=$(($t2-$tmid))
82 | elapsed_time_mid_human=$(date -d@$elapsed_time_mid -u +%H:%M:%S)
83 | echo "[STATS] [BED GLOBAL] [$FILE_ID] START="$tmid_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time_mid" HUMAN="$elapsed_time_mid_human 1>&2
84 | # read -n 1 -s -r -p "Press any key to continue"
85 |
86 | elapsed_time=$(($t2-$t1))
87 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
88 | echo "[STATS] [BED] [$FILE_ID] START="$t1_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2
89 |
90 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $TEMP_DIR/$FILE_ID-bed-chronometer.txt
91 |
92 | echo "[STATS] Finished creating bed data ["`date`"]"
93 |
--------------------------------------------------------------------------------
/extract_coverage.sh:
--------------------------------------------------------------------------------
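# Usage: extract_coverage.sh BAM_FILE COVERAGE_DIR SIZE_FILE
# Computes per-chromosome read coverage with samtools depth (dropping zero-coverage positions) and concatenates the results into a single .cov file.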
1 | FILENAME=`basename $1`
2 | COVERAGE_DIR=$2
3 | SIZE_FILE=$3
4 |
5 | FILE_ID=${FILENAME%.*}
6 |
7 | if [ ! -d $COVERAGE_DIR ]
8 | then
9 | mkdir -p $COVERAGE_DIR
10 | fi
11 |
12 | t1=$(date +%s)
13 | t1_human=$(date)
14 |
15 | #samtools depth $1 | grep -vP "\t0$" | tee $COVERAGE_DIR$FILE_ID".cov" | awk '{print($0) > "'$COVERAGE_DIR'"$1}'
16 | echo "[STATS] Creating single chromosome coverage files ["`date`"]"
17 | for chrom in `cat $SIZE_FILE | cut -f 1`
18 | do
19 | echo "Calculating coverage file for chromosome $chrom = $COVERAGE_DIR$chrom"
20 |
21 | if samtools view -H $1 | grep -qw "SN:$chrom" # does the BAM header use the same chromosome name as the size file?
22 | then
23 | samtools depth $1 -r $chrom | grep -vP "\t0$" > $COVERAGE_DIR$chrom &
24 | else
25 | samtools depth $1 -r ${chrom#chr} | grep -vP "\t0$" > $COVERAGE_DIR$chrom & # fall back to the name without the "chr" prefix
26 | fi
27 |
28 | done
29 | wait
30 |
31 | t2=$(date +%s)
32 | t2_human=$(date)
33 | elapsed_time=$(($t2-$t1))
34 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
35 | echo "[STATS] [COVERAGE CHR] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2
36 |
37 | tmid=$(date +%s)
38 | tmid_human=$(date)
39 |
40 | echo "[STATS] Creating complete file $COVERAGE_DIR$FILE_ID.cov ["`date`"]"
41 | if [ -s $COVERAGE_DIR$FILE_ID".cov" ]
42 | then
43 | rm $COVERAGE_DIR$FILE_ID".cov"
44 | fi
45 |
46 | for chrom in `cat $SIZE_FILE | cut -f 1`
47 | do
48 | cat $COVERAGE_DIR$chrom >> $COVERAGE_DIR$FILE_ID".cov"
49 | done
50 |
51 | t2=$(date +%s)
52 | t2_human=$(date)
53 | elapsed_time_mid=$(($t2-$tmid))
54 | elapsed_time_mid_human=$(date -d@$elapsed_time_mid -u +%H:%M:%S)
55 | echo "[STATS] [COVERAGE GLOBAL] [$FILE_ID] START="$tmid_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time_mid" HUMAN="$elapsed_time_mid_human 1>&2
56 |
57 | elapsed_time=$(($t2-$t1))
58 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
59 | echo "[STATS] [COVERAGE] [$FILE_ID] START="$t1_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2
60 |
61 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $COVERAGE_DIR/$FILE_ID-coverage-chronometer.txt
62 |
63 | echo "[STATS] Finished creating coverage data ["`date`"]"
64 |
--------------------------------------------------------------------------------
/extract_coverage_dynamic.sh:
--------------------------------------------------------------------------------
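# Usage: extract_coverage_dynamic.sh BAM_FILE COVERAGE_DIR SIZE_FILE
# Like extract_coverage.sh, but limits the number of concurrent per-chromosome jobs to the number of available CPUs.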
1 | FILENAME=`basename $1`
2 | COVERAGE_DIR=$2
3 | SIZE_FILE=$3
4 |
5 | FILE_ID=${FILENAME%.*}
6 |
7 | if [ ! -d $COVERAGE_DIR ]
8 | then
9 | mkdir -p $COVERAGE_DIR
10 | fi
11 |
12 | t1=$(date +%s)
13 | t1_human=$(date)
14 |
15 |
16 | echo "[STATS] Creating single chromosome coverage files ["`date`"]"
17 | CHROMOSOMES=()
18 | for chrom in $(cat $SIZE_FILE | cut -f 1)
19 | do
20 | CHROMOSOMES[${#CHROMOSOMES[@]}]=$chrom
21 | done
22 |
23 | ###############################
24 | ### PER-CHROMOSOME COVERAGE ###
25 | ###############################
26 | NUM_CHROMS=$(cat $SIZE_FILE | cut -f 1 | wc -l)
27 | AVAILABLE_CPUS=$(nproc)
28 | CHUNK_SIZE=$(($NUM_CHROMS>$AVAILABLE_CPUS?$AVAILABLE_CPUS:$NUM_CHROMS))
29 | echo "CHROMOSOMES="$NUM_CHROMS
30 | echo "CHUNK SIZE="$CHUNK_SIZE
31 | start=0
32 | while [ $start -lt $NUM_CHROMS ]
33 | do
34 | echo "NEW BATCH [$(expr $start + 1)-$(expr $start + $CHUNK_SIZE)]"
35 | for i in $(seq $start $(expr $start + $CHUNK_SIZE - 1))
36 | do
37 | if [ $i -ge $NUM_CHROMS ]; then break; fi
38 |
39 | chrom=${CHROMOSOMES[$i]}
40 |
41 | echo "Calculating coverage file for chromosome $chrom = $COVERAGE_DIR$chrom"
42 |
43 | if samtools view -H $1 | grep -qw "SN:$chrom" # does the BAM header use the same chromosome name as the size file?
44 | then
45 | samtools depth $1 -r $chrom | grep -vP "\t0$" > $COVERAGE_DIR$chrom &
46 | else
47 | samtools depth $1 -r ${chrom#chr} | grep -vP "\t0$" > $COVERAGE_DIR$chrom & # fall back to the name without the "chr" prefix
48 | fi
49 |
50 | done
51 | wait
52 | start=$(expr $start + $CHUNK_SIZE)
53 | done
54 |
55 | t2=$(date +%s)
56 | t2_human=$(date)
57 | elapsed_time=$(($t2-$t1))
58 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
59 | echo "[STATS] [COVERAGE CHR] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2
60 |
61 | tmid=$(date +%s)
62 | tmid_human=$(date)
63 |
64 | ############################
65 | ### SINGLE COVERAGE FILE ###
66 | ############################
67 | echo "[STATS] Creating complete file $COVERAGE_DIR$FILE_ID.cov ["`date`"]"
68 | rm $COVERAGE_DIR$FILE_ID".cov"
69 | for chrom in `cat $SIZE_FILE | cut -f 1`
70 | do
71 | cat $COVERAGE_DIR$chrom >> $COVERAGE_DIR$FILE_ID".cov"
72 | done
73 |
74 | t2=$(date +%s)
75 | t2_human=$(date)
76 | elapsed_time_mid=$(($t2-$tmid))
77 | elapsed_time_mid_human=$(date -d@$elapsed_time_mid -u +%H:%M:%S)
78 | echo "[STATS] [COVERAGE GLOBAL] [$FILE_ID] START="$tmid_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time_mid" HUMAN="$elapsed_time_mid_human 1>&2
79 |
80 | elapsed_time=$(($t2-$t1))
81 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
82 | echo "[STATS] [COVERAGE] [$FILE_ID] START="$t1_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2
83 |
84 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $COVERAGE_DIR/$FILE_ID-coverage-chronometer.txt
85 |
86 | echo "[STATS] Finished creating coverage data ["`date`"]"
87 |
--------------------------------------------------------------------------------
/extract_coverage_slurm.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --ntasks=25
3 | #SBATCH --ntasks-per-node=25
4 | #SBATCH --time=02:00:00
5 | ##SBATCH --account=Pra15_3924
6 | #SBATCH --account=cin_staff
7 | #SBATCH -p knl_usr_prod
8 |
9 | # SAMPLE_ID
10 | # SOURCE_BAM_FILE
11 | # COV
12 | # SIZE_FILE
13 |
14 | echo "Launching REDItool COVERAGE on $SAMPLE_ID";
15 |
16 | module load autoload profile/global
17 | module load autoload samtools
18 |
19 | t1=$(date +%s)
20 | t1_human=$(date)
21 | echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"]"
22 | time ./extract_coverage_dynamic.sh $SOURCE_BAM_FILE $COV $SIZE_FILE
23 | t2=$(date +%s)
24 | t2_human=$(date)
25 | elapsed_time=$(($t2-$t1))
26 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
27 | echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
28 |
--------------------------------------------------------------------------------
/extract_coverage_slurm_multisample.sh:
--------------------------------------------------------------------------------
1 | module load autoload profile/global
2 | module load ig_homo_sapiens/hg19
3 | REFERENCE=$IG_HG19_GENOME"/genome.fa"
4 | export SIZE_FILE=$REFERENCE".fai"
5 |
6 | BASE_DIR="/marconi_scratch/userexternal/tflati00/reditools_paper/"
7 | SAMPLE_FILE=$BASE_DIR"samples-10.txt"
8 | N=$(cat $SAMPLE_FILE | wc -l | cut -d' ' -f 1)
9 |
10 | COVERAGE_DIR=$BASE_DIR"/cov-multisample/"
11 |
12 | for SOURCE_BAM_FILE in $(cat $SAMPLE_FILE | head -n 1) # note: head -n 1 restricts this run to the first sample in the list
13 | do
14 | if [ ! -s $SOURCE_BAM_FILE ]
15 | then
16 | echo "File $SOURCE_BAM_FILE does not exist. Skipping."
17 | continue
18 | fi
19 |
20 | export SOURCE_BAM_FILE
21 | export SAMPLE_ID=$(basename $SOURCE_BAM_FILE | sed 's/\.bam//g')
22 | export COV=$COVERAGE_DIR$SAMPLE_ID"/"
23 | export COV_FILE=$COV$SAMPLE_ID".cov"
24 |
25 | mkdir -p $COV
26 |
27 | if [ ! -f $COV_FILE ]
28 | then
29 | echo "[STATS] [COVERAGE] [$SAMPLE_ID]"
30 | sbatch --export=ALL -J cov-$SAMPLE_ID -o $COV/output.txt -e $COV/error.txt ./extract_coverage_slurm.sh &
31 | fi
32 | done
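Note: extract_coverage_slurm.sh takes no positional arguments; it reads SAMPLE_ID, SOURCE_BAM_FILE, COV and SIZE_FILE from the environment (see the comment block at its top). This wrapper therefore exports those variables and submits each job with sbatch --export=ALL, which propagates the caller's environment into the SLURM job.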
--------------------------------------------------------------------------------
/install.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | virtualenv ENV
4 |
5 | source ENV/bin/activate
6 |
7 | pip install -r requirements.txt
8 |
9 | deactivate
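Note: install.sh creates a Python virtual environment named ENV inside the project directory and installs the dependencies listed in requirements.txt into it. All the test scripts in this repository assume ENV exists and activate it with:

> source ENV/bin/activate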
--------------------------------------------------------------------------------
/merge.sh:
--------------------------------------------------------------------------------
1 | TABLE_DIR=$1
2 | FINAL_TABLE=$2
3 | THREADS=$3
4 |
5 | echo "Merging files in $TABLE_DIR using $THREADS threads and writing to output=$FINAL_TABLE"
6 |
7 | t1=$(date +%s)
8 | t1_human=$(date)
9 |
10 | if [ ! -s $TABLE_DIR/files.txt ]
11 | then
12 | echo "FILE LIST MISSING OR EMPTY: "$TABLE_DIR/files.txt
13 | else
14 | OUTPUT_DIR=`dirname $FINAL_TABLE`
15 | if [ ! -d $OUTPUT_DIR ]
16 | then
17 | mkdir -p $OUTPUT_DIR
18 | fi
19 |
20 | zcat $(cat $TABLE_DIR/files.txt) | bgzip -c -@ $THREADS > $FINAL_TABLE
21 | echo "Finished creating final table $FINAL_TABLE ["`date`"]"
22 |
23 | tabix -s 1 -b 2 -e 2 -c Region $FINAL_TABLE
24 | echo "Finished creating index file for file $FINAL_TABLE ["`date`"]"
25 | fi
26 |
27 | t2=$(date +%s)
28 | t2_human=$(date)
29 | elapsed_time=$(($t2-$t1))
30 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
31 |
32 | FILE_ID=`basename $TABLE_DIR`
33 |
34 | echo "[STATS] [MERGE] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2
35 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $TABLE_DIR/merge-chronometer.txt
36 |
37 | # echo "Starting creating final table "`date`
38 | #
39 | # i=0; for file in $(ls $TABLE_DIR/*.gz)
40 | # do
41 | # i=$((i + 1))
42 | # echo $i". "$file" "`date`; zcat $file >> final_file.txt
43 | # done
44 | #
45 | # echo "Compressing final_file.txt "`date`
46 | # time /marconi/home/userexternal/tflati00/pigz-2.4/pigz -c final_file.txt > final_file.gz
47 |
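Usage sketch (the same arguments parallel_test.sh passes): merge.sh expects the directory holding the per-interval temporary tables, the path of the final bgzip-compressed table and the number of compression threads; the table directory must already contain the files.txt list written by parallel_reditools.py:

> ./merge.sh test_results/temp/ test_results/output/parallel_table.txt.gz 2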
--------------------------------------------------------------------------------
/multisample_test.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --ntasks=136
3 | #SBATCH --ntasks-per-node=68
4 | #SBATCH --time=4:00:00
5 | #SBATCH --account=Pra15_3924
6 | #SBATCH -p knl_usr_prod
7 | #SBATCH -e para-RT-MT.e
8 | #SBATCH -o para-RT-MT.o
9 |
10 | cd $SLURM_SUBMIT_DIR
11 |
12 | BASE_DIR="/marconi_scratch/userexternal/tflati00/reditools_paper/"
13 |
14 | OUTPUT_DIR=$BASE_DIR"/output-multisample-2/"
15 | TEMP_DIR=$BASE_DIR"/tmp-multisample-2/"
16 | COVERAGE_DIR=$BASE_DIR"/cov-multisample-2/"
17 |
18 | #DATA_DIR="/home/flati/data/reditools/input/"
19 | DATA_DIR="$CINECA_SCRATCH/public/"
20 |
21 | module load autoload profile/global
22 | module load ig_homo_sapiens/hg19
23 | REFERENCE=$IG_HG19_GENOME"/genome.fa"
24 | #REFERENCE=$DATA_DIR"hg19m.fa"
25 |
26 | OMOPOLYMER_FILE=$DATA_DIR"omopolymeric_positions.txt"
27 | SIZE_FILE=$REFERENCE".fai"
28 |
29 | SAMPLE_FILE=$BASE_DIR"samples.txt"
30 |
31 | # NUM_CORES=68
32 |
33 | if [ ! -s $SAMPLE_FILE ]
34 | then
35 | echo "File $SAMPLE_FILE does not exist. Please provide an existing file."
36 | exit
37 | fi
38 |
39 | # Environment setup
40 | module load python/2.7.12
41 | source ENV/bin/activate
42 | module load autoload openmpi/1-10.3--gnu--6.1.0
43 | # module load autoload samtools
44 | module load autoload htslib
45 |
46 | # for SOURCE_BAM_FILE in $(cat $SAMPLE_FILE)
47 | # do
48 | # if [ ! -s $SOURCE_BAM_FILE ]
49 | # then
50 | # echo "File $SOURCE_BAM_FILE does not exist. Skipping."
51 | # continue
52 | # fi
53 | #
54 | # SAMPLE_ID=$(basename $SOURCE_BAM_FILE | sed 's/\.bam//g')
55 | # COV=$COVERAGE_DIR$SAMPLE_ID"/"
56 | # COV_FILE=$COV$SAMPLE_ID".cov"
57 | #
58 | # date
59 | #
60 | # if [ ! -f $COV_FILE ]
61 | # then
62 | # echo "Launching REDItool COVERAGE on $SAMPLE_ID (output_dir=$COV)";
63 | #
64 | # t1=$(date +%s)
65 | # t1_human=$(date)
66 | # echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"]"
67 | # time ./extract_coverage.sh $SOURCE_BAM_FILE $COV $SIZE_FILE &
68 | # t2=$(date +%s)
69 | # t2_human=$(date)
70 | # elapsed_time=$(($t2-$t1))
71 | # elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
72 | # echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
73 | # fi
74 | # done
75 | # wait
76 |
77 | # strand=0
78 | # options=""
79 | # if [ $strand != 0 ]
80 | # then
81 | # options="-C -T 2 -s $strand"
82 | # fi
83 | options=""
84 |
85 | COV_FILE=$COV$SAMPLE_ID".cov"
86 | TEMP=$TEMP_DIR$SAMPLE_ID"/"
87 | OUTPUT=$OUTPUT_DIR/$SAMPLE_ID/table.gz
88 |
89 | # Program launch
90 | echo "START:"`date`
91 | t1=$(date +%s)
92 | t1_human=$(date)
93 | # time mpirun src/cineca/reditools2_multisample.py -F $SAMPLE_FILE -r $REFERENCE -m $OMOPOLYMER_FILE -D $COVERAGE_DIR -t $TEMP_DIR -Z $SIZE_FILE $options 2>&1 | tee MULTI_SAMPLES.log
94 | time mpirun src/cineca/reditools2_multisample.py -F $SAMPLE_FILE -r $REFERENCE -D $COVERAGE_DIR -t $TEMP_DIR -Z $SIZE_FILE $options 2>&1 | tee MULTI_SAMPLES.log
95 | t2=$(date +%s)
96 | t2_human=$(date)
97 | elapsed_time=$(($t2-$t1))
98 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
99 | echo "[STATS] [PARALLEL] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
100 |
101 | # export PATH=$HTSLIB_HOME/bin/:$PATH
102 | # for SOURCE_BAM_FILE in $(cat $SAMPLE_FILE)
103 | # do
104 | # t1=$(date +%s)
105 | # t1_human=$(date)
106 | #
107 | # SAMPLE_ID=$(basename $SOURCE_BAM_FILE | sed 's/\.bam//g')
108 | #
109 | # COV=$COVERAGE_DIR$SAMPLE_ID"/"
110 | # COV_FILE=$COV$SAMPLE_ID".cov"
111 | # TEMP=$TEMP_DIR$SAMPLE_ID"/"
112 | # OUTPUT=$OUTPUT_DIR/$SAMPLE_ID/table.gz
113 | #
114 | # time ./merge.sh $TEMP $OUTPUT $NUM_CORES &
115 | # t2=$(date +%s)
116 | # t2_human=$(date)
117 | # elapsed_time=$(($t2-$t1))
118 | # elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
119 | # echo "[STATS] [MERGE] [$SAMPLE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
120 | #
121 | # echo "[$SAMPLE_ID] END:"`date`
122 | # echo "OK" > $TEMP/status.txt
123 | # done
124 | # wait
125 |
--------------------------------------------------------------------------------
/parallel_test.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # Parallel test #
4 | source ENV/bin/activate
5 |
6 | SOURCE_BAM_FILE="test/SRR2135332.bam"
7 | REFERENCE="test/chr21.fa"
8 | SIZE_FILE="test/chr21.fa.fai"
9 |
10 | NUM_CORES=2
11 | OUTPUT_FILE="test_results/output/parallel_table.txt.gz"
12 | TEMP_DIR="test_results/temp/"
13 | COVERAGE_FILE="test_results/coverage/SRR2135332.chr21.cov"
14 | COVERAGE_DIR="test_results/coverage/"
15 |
16 | ./extract_coverage.sh $SOURCE_BAM_FILE $COVERAGE_DIR $SIZE_FILE
17 | mpirun -np $NUM_CORES src/cineca/parallel_reditools.py -g "chr21" -f $SOURCE_BAM_FILE -o $OUTPUT_FILE -r $REFERENCE -t $TEMP_DIR -Z $SIZE_FILE -G $COVERAGE_FILE -D $COVERAGE_DIR
18 | ./merge.sh $TEMP_DIR $OUTPUT_FILE $NUM_CORES
19 |
20 | deactivate
21 |
--------------------------------------------------------------------------------
/parallel_test_slurm.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --job-name=REDItools2Job
3 | #SBATCH -N 3
4 | #SBATCH -n 12
5 | #SBATCH -p m100_usr_prod
6 | #SBATCH --time 02:00:00
7 | #SBATCH --account cin_staff
8 | #SBATCH --error REDItools2Job.err
9 | #SBATCH --output REDItools2Job.out
10 |
11 | SAMPLE_ID="SRR2135332"
12 | SOURCE_BAM_FILE="test/SRR2135332.bam"
13 | REFERENCE="test/chr21.fa"
14 | REFERENCE_DNA=$(basename "$REFERENCE" .fa) # reference name without path and .fa extension
15 | SIZE_FILE="test/chr21.fa.fai"
16 | NUM_CORES=12
17 | OUTPUT_FILE="test_results/output/parallel_table.txt.gz"
18 | TEMP_DIR="test_results/temp/"
19 | COVERAGE_FILE="test_results/coverage/SRR2135332.cov"
20 | COVERAGE_DIR="test_results/coverage/"
21 | OUTPUT_DIR=$(dirname "$OUTPUT_FILE") # directory that will hold the final table
22 |
23 |
24 | module load spack
25 | module load python/2.7.16--gcc--8.4.0-bgv
26 | module load autoload py-mpi4py/3.0.3--gcc--8.4.0-spectrmpi-ac2
27 | module load py-virtualenv/16.7.6--gcc--8.4.0-4ut
28 | module load profile/global
29 | module load samtools/1.12
30 |
31 | source ENV/bin/activate
32 |
33 | if [ ! -f $COVERAGE_FILE ]
34 | then
35 | t1=$(date +%s)
36 | t1_human=$(date)
37 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"]"
38 | ./extract_coverage_dynamic.sh $SOURCE_BAM_FILE $COVERAGE_DIR $SIZE_FILE
39 | t2=$(date +%s)
40 | t2_human=$(date)
41 | elapsed_time=$(($t2-$t1))
42 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
43 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
44 | fi
45 |
46 |
47 |
48 |
49 | strand=0
50 | options=""
51 | if [ $strand != 0 ]
52 | then
53 | options="-C -T 2 -s $strand"
54 | fi
55 |
56 | # Program launch
57 | echo "START:"`date`
58 | t1=$(date +%s)
59 | t1_human=$(date)
60 |
61 | time mpirun -np $NUM_CORES src/cineca/parallel_reditools.py -f $SOURCE_BAM_FILE -o $OUTPUT_FILE -r $REFERENCE -t $TEMP_DIR -Z $SIZE_FILE -G $COVERAGE_FILE -D $COVERAGE_DIR $options 2>&1 | tee $SAMPLE_ID.log
62 |
63 | t2=$(date +%s)
64 | t2_human=$(date)
65 | elapsed_time=$(($t2-$t1))
66 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
67 | echo "[STATS] [PARALLEL] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
68 |
69 | t1=$(date +%s)
70 | t1_human=$(date)
71 | export PATH=$HTSLIB_HOME/bin/:$PATH
72 | time ./merge.sh $TEMP_DIR $OUTPUT_FILE $NUM_CORES
73 | t2=$(date +%s)
74 | t2_human=$(date)
75 | elapsed_time=$(($t2-$t1))
76 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
77 | echo "[STATS] [MERGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
78 |
79 | echo "END:"`date`
80 | echo "OK" > $TEMP_DIR/status.txt
81 |
82 | deactivate
83 |
--------------------------------------------------------------------------------
/parallel_test_slurm_DEPRECATED.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | #SBATCH --job-name=REDItools2Job
4 | #SBATCH -N 1
5 | #SBATCH -n 36
6 | #SBATCH -p gll_usr_prod
7 | #SBATCH --mem=115GB
8 | #SBATCH --time 05:00:00
9 | #SBATCH --account ELIX4_manniron
10 | #SBATCH --error REDItools2Job.err
11 | #SBATCH --output REDItools2Job.out
12 |
13 |
14 | #########################################################
15 | ######## Parameters setting
16 | #########################################################
17 |
18 | ##SAMPLE_ID is the basename of the sample of interest
19 | SAMPLE_ID="SRR2135332"
20 |
21 | ##bam file to be analysed
22 | SOURCE_BAM_FILE="test/SRR2135332.bam"
23 |
24 | ##reference chromosome or genome
25 | REFERENCE="test/chr21.fa"
26 | REFERENCE_DNA=$(basename "$REFERENCE" .fa) # e.g., "chr21"; used below as the -g region name
27 |
28 | ##fasta index file created by samtools
29 | SIZE_FILE="test/chr21.fa.fai"
30 |
31 | ##number of utilized cores
32 | NUM_CORES=2
33 |
34 | ##setting output file
35 | OUTPUT_FILE="test_results/output/parallel_table.txt.gz"
36 | TEMP_DIR="test_results/temp/"
37 |
38 | ##setting the coverage file
39 | COVERAGE_FILE="test_results/coverage/SRR2135332.cov"
40 |
41 | ##setting coverage directory
42 | COVERAGE_DIR="test_results/coverage/"
43 |
44 | ##setting output directory
45 | OUTPUT_DIR=$(dirname "$OUTPUT_FILE")
46 |
47 | #########################################################
48 | ######## Modules loading
49 | #########################################################
50 |
51 | module load profile/bioinf
52 | module load python/2.7.12
53 | module load autoload samtools/1.9
54 | module load autoload profile/global
55 | module load autoload openmpi/3.1.4--gnu--7.3.0
56 | module load autoload samtools
57 |
58 | echo "Launching REDItool on $SAMPLE_ID (output_file=$OUTPUT_FILE)";
59 |
60 | #########################################################
61 | ######## Coverage
62 | #########################################################
63 |
64 | ## If the coverage file does not exist, the script calculates it.
65 |
66 | if [ ! -f $COVERAGE_FILE ]
67 | then
68 | t1=$(date +%s)
69 | t1_human=$(date)
70 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"]"
71 | ./extract_coverage_dynamic.sh $SOURCE_BAM_FILE $COVERAGE_DIR $SIZE_FILE
72 | t2=$(date +%s)
73 | t2_human=$(date)
74 | elapsed_time=$(($t2-$t1))
75 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
76 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
77 | fi
78 |
79 | #########################################################
80 | ######## Parallel Computation
81 | #########################################################
82 |
83 | strand=0
84 | options=""
85 | if [ $strand != 0 ]
86 | then
87 | options="-C -T 2 -s $strand"
88 | fi
89 |
90 | # Program launch
91 | echo "START:"`date`
92 | t1=$(date +%s)
93 | t1_human=$(date)
94 |
95 | time mpirun src/cineca/parallel_reditools.py -g $REFERENCE_DNA -f $SOURCE_BAM_FILE -r $REFERENCE -G $COVERAGE_FILE -D $COVERAGE_DIR -t $TEMP_DIR -Z $SIZE_FILE $options 2>&1 | tee $SAMPLE_ID.log
96 | t2=$(date +%s)
97 | t2_human=$(date)
98 | elapsed_time=$(($t2-$t1))
99 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
100 | echo "[STATS] [PARALLEL] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
101 |
102 | #########################################################
103 | ######## Merging
104 | #########################################################
105 |
106 | t1=$(date +%s)
107 | t1_human=$(date)
108 | export PATH=$HTSLIB_HOME/bin/:$PATH
109 | time ./merge.sh $TEMP_DIR $OUTPUT_FILE $NUM_CORES
110 | t2=$(date +%s)
111 | t2_human=$(date)
112 | elapsed_time=$(($t2-$t1))
113 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S)
114 | echo "[STATS] [MERGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human
115 |
116 | echo "END:"`date`
117 | echo "OK" > $TEMP_DIR/status.txt
118 |
--------------------------------------------------------------------------------
/prepare_test.sh:
--------------------------------------------------------------------------------
1 | cd test
2 |
3 | if [ ! -s chr21.fa ]
4 | then
5 | echo "Reference chromosome 21 (Homo sapiens) not found. Downloading..."
6 | wget -O chr21.fa.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr21.fa.gz
7 |
8 | echo "Extracting chr21.fa.gz archive"
9 | gzip -d chr21.fa.gz
10 | fi
11 |
12 | if [ ! -s chr21.fa.fai ]
13 | then
14 | echo "Index .fai not found. Indexing chr21.fa"
15 | samtools faidx chr21.fa
16 | fi
17 |
18 | echo "Test(s) ready!"
19 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pysam
2 | sortedcontainers
3 | psutil
4 | netifaces
5 | mpi4py
6 |
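Note: these dependencies are installed into the ENV virtual environment by install.sh (via pip install -r requirements.txt). Building mpi4py additionally requires a working MPI implementation, as described in the Environment setup section.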
--------------------------------------------------------------------------------
/serial_test.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | source ENV/bin/activate
4 |
5 | # Serial test #
6 | python src/cineca/reditools.py -f test/SRR2135332.bam -r test/chr21.fa -g chr21 -o serial_table.txt
7 |
8 | deactivate
9 |
--------------------------------------------------------------------------------
/serial_test_slurm.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #SBATCH --ntasks=1
3 | #SBATCH --ntasks-per-node=1
4 | #SBATCH --time=00:10:00
5 | #SBATCH --account=cin_staff
6 | #SBATCH -p knl_usr_prod
7 | #SBATCH -e serial-RT.e
8 | #SBATCH -o serial-RT.o
9 |
10 | # Serial test (SLURM)#
11 | module load python/2.7.12
12 |
13 | source ENV/bin/activate
14 |
15 | python src/cineca/reditools.py -f test/SRR2135332.bam -g chr21 -r test/chr21.fa -o serial_table_slurm.txt
16 |
17 | deactivate
18 |
--------------------------------------------------------------------------------
/src/cineca/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BioinfoUNIBA/REDItools2/17e932fa225477effced64ad5342e7cfd2b7d87b/src/cineca/__init__.py
--------------------------------------------------------------------------------
/src/cineca/annotate_with_DNA.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | import gzip
4 |
5 | columns = {
6 | "Region": 0,
7 | "Position": 1,
8 | "Reference": 2,
9 | "Strand": 3,
10 | "Coverage": 4,
11 | "MeanQ": 5,
12 | "BaseCount": 6,
13 | "AllSubs": 7,
14 | "Frequency": 8,
15 | "gCoverage": 9,
16 | "gMeanQ": 10,
17 | "gBaseCount": 11,
18 | "gAllSubs": 12,
19 | "gFrequency": 13
20 | }
21 |
22 | def read_line(fd):
23 | line = next(fd, None)
24 | return line.strip().split("\t") if line is not None else []
25 |
26 | def is_smaller_than_or_equal_to(fields1, fields2):
27 | # if not fields1 and not fields2: return True
28 | # if fields1 and not fields2: return True
29 | if not fields1 and fields2: return False
30 |
31 | if fields1 and fields2:
32 | region1 = get(fields1, "Region")
33 | region2 = get(fields2, "Region")
34 |
35 | index1 = chromosomes.index(region1) if region1 in chromosomes else chromosomes.index("chr" + region1)
36 | index2 = chromosomes.index(region2) if region2 in chromosomes else chromosomes.index("chr" + region2)
37 |
38 | # sys.stderr.write(" ".join([str(x) for x in [region1, region2, index1, index2]]) + "\n")
39 |
40 | if index1 < index2:
41 | return True
42 |
43 | if index1 > index2:
44 | return False
45 |
46 | return index1 == index2 and int(get(fields1, "Position")) <= int(get(fields2, "Position"))
47 |
48 | return True
49 |
50 | def get(fields, column):
51 | value = None
52 |
53 | index = columns[column]
54 | if len(fields) > index: # index is 0-based: need strictly more fields than the index
55 | value = fields[index]
56 |
57 | return value
58 |
59 | comp = {'A':'T','T':'A','C':'G','G':'C'}
60 | indexes = {v: k for k, v in dict(enumerate('ACGT')).iteritems()}
61 |
62 | def annotate(fields1, fields2):
63 |
64 | strand = get(fields1, "Strand")
65 |
66 | if strand == '0':
67 | base_count = eval(get(fields2, "BaseCount")) # BaseCount[A,C,G,T]
68 | fields2[columns["BaseCount"]] = str([base_count[indexes[comp[b]]] for b in 'ACGT'])
69 |
70 | subs = get(fields2, "AllSubs").split(" ")
71 | fields2[columns["AllSubs"]] = " ".join([''.join([comp[b] if b != "-" else b for b in sub]) for sub in subs])
72 |
73 | for field in ["Coverage", "MeanQ", "BaseCount", "AllSubs", "Frequency"]:
74 | annotation = get(fields2, field)
75 | # if annotation is None:
76 | # print(fields1)
77 | # print(fields2)
78 | # print(field, annotation)
79 |
80 | fields1[columns["g" + field]] = annotation
81 |
82 | chromosomes = []
83 | def load_chromosomes(fai):
84 | with open(fai, "r") as reader:
85 | for line in reader:
86 | chromosome = line.strip().split("\t")[0]
87 | if chromosome in chromosomes: continue
88 | chromosomes.append(chromosome)
89 |
90 | LOG_INTERVAL = 1000000
91 |
92 | import argparse
93 | if __name__ == '__main__':
94 |
95 | parser = argparse.ArgumentParser(description='REDItools 2.0 annotator')
96 | parser.add_argument('-r', '--rna-file', required=True, help='The RNA-editing events table to be annotated')
97 | parser.add_argument('-d', '--dna-file', required=True, help='The RNA-editing events table as obtained from DNA-Seq data')
98 | parser.add_argument('-R', '--reference', required=True, help='The .fai file of the reference genome containing the ordered chromosomes')
99 | parser.add_argument('-Z', '--only-omozygotes', default=False, action='store_true', help='Exclude positions with multiple changes in DNA-Seq')
100 | args = parser.parse_known_args()[0]
101 |
102 | file1 = args.rna_file
103 | file2 = args.dna_file
104 | fai_file = args.reference
105 | load_chromosomes(fai_file)
106 | only_omozygotes = args.only_omozygotes
107 |
108 | sys.stderr.write("[INFO] {} CHROMOSOMES LOADED\n".format(len(chromosomes)))
109 |
110 | file1root, ext1 = os.path.splitext(file1)
111 | file2root, ext2 = os.path.splitext(file2)
112 |
113 | fd1 = gzip.open(file1, "r") if ext1 == ".gz" else open(file1, "r")
114 | fd2 = gzip.open(file2, "r") if ext2 == ".gz" else open(file2, "r")
115 | fd3 = sys.stdout
116 |
117 | total1 = 0
118 | total2 = 0
119 | last_chr = None
120 | with fd1, fd2, fd3:
121 |
122 | fields1 = read_line(fd1)
123 | total1 += 1
124 | if fields1 and fields1[0] == "Region":
125 | fields1 = read_line(fd1)
126 | total1 += 1
127 |
128 | fields2 = read_line(fd2)
129 | total2 += 1
130 | if fields2 and fields2[0] == "Region":
131 | fields2 = read_line(fd2)
132 | total2 += 1
133 |
134 | while fields1 or fields2:
135 |
136 | if fields1 and fields1[0] != last_chr: # fields1 may be empty once the RNA table is exhausted
137 | last_chr = fields1[0]
138 | sys.stderr.write("ANALYZING CHROMOSOME " + last_chr + "\n")
139 |
140 | f1_less_than_f2 = is_smaller_than_or_equal_to(fields1, fields2)
141 | f2_less_than_f1 = is_smaller_than_or_equal_to(fields2, fields1)
142 | are_equal = f1_less_than_f2 and f2_less_than_f1
143 |
144 | # sys.stderr.write(str(fields1) + "\n")
145 | # sys.stderr.write(str(fields2) + "\n")
146 | # sys.stderr.write(str(f1_less_than_f2) + " " + str(f2_less_than_f1) + " " + str(are_equal) + "\n")
147 | # raw_input()
148 |
149 | omozigote = True if not fields2 else not are_equal or fields2[columns["AllSubs"]] == "-"
150 |
151 | if are_equal:
152 | annotate(fields1, fields2)
153 |
154 | if fields1:
155 | if not only_omozygotes or omozigote:
156 | fd3.write("\t".join(fields1) + "\n")
157 | else:
158 | sys.stderr.write("[INFO] [{}] Discarding {}:{} because DNA data is not omozygote from {}\n".format(last_chr, fields1[0], fields1[1], file1))
159 |
160 | if f1_less_than_f2:
161 | fields1 = read_line(fd1)
162 | total1 += 1
163 |
164 | if f2_less_than_f1:
165 | fields2 = read_line(fd2)
166 | total2 += 1
167 |
168 | if total1 % LOG_INTERVAL == 0:
169 | sys.stderr.write("[INFO] [{}] {} lines read from {}\n".format(last_chr, total1, file1))
170 |
171 | if total2 % LOG_INTERVAL == 0:
172 | sys.stderr.write("[INFO] [{}] {} lines read from {}\n".format(last_chr, total2, file2))
173 |
174 | sys.stderr.write("[INFO] {} lines read from {}\n".format(total1, file1))
175 | sys.stderr.write("[INFO] {} lines read from {}\n".format(total2, file2))
176 |
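Usage sketch, based on the argparse options above (the table names are hypothetical); the annotated table is written to standard output:

> python src/cineca/annotate_with_DNA.py -r rna_table.txt.gz -d dna_table.txt.gz -R genome.fa.fai > annotated_table.txt

Both tables may be plain text or gzip-compressed (detected via the .gz extension) and must be sorted according to the chromosome order of the .fai file.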
--------------------------------------------------------------------------------
/src/cineca/parallel_reditools.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import os
4 | import glob
5 | import sys
6 | import re
7 | import time
8 | from mpi4py import MPI
9 | from datetime import datetime
10 | from collections import OrderedDict
11 | import reditools
12 | import argparse
13 | import gc
14 | import socket
15 | import netifaces
16 | import json
17 |
18 | ALIGN_CHUNK = 0
19 | STOP_WORKING = 1
20 | IM_FREE = 2
21 | CALCULATE_COVERAGE = 3
22 |
23 | STEP = 10000000
24 |
25 | TIME_STATS = {}
26 |
27 | def get_intervals(intervals, num_intervals):
28 | homeworks = []
29 | for chromosome in chromosomes.keys():
30 | print("chromosome:" + chromosome)
31 | chromosome_length = chromosomes[chromosome]
32 | print("len:"+ str(chromosome_length))
33 | step = STEP
34 | if chromosome == "chrM":
35 | step = int(chromosome_length / 100)
36 | chromosome_slices = list(range(1, chromosome_length, step)) + [chromosome_length+1]
37 | print(chromosome_slices)
38 | print(len(chromosome_slices))
39 | print("#slices:" + str(len(chromosome_slices)))
40 |
41 | for i in range(0, len(chromosome_slices)-1):
42 | homeworks.append((chromosome, chromosome_slices[i], chromosome_slices[i+1]-1))
43 | return homeworks
44 |
45 | def weight_function(x):
46 | # x = math.log(1+x)
47 | # return 2.748*10**(-3)*x**3 -0.056*x**2 + 0.376*x + 2.093
48 | return x**3 # cubic weighting: high-coverage positions count much more, producing narrower intervals in deep regions
49 |
50 | def get_coverage(coverage_file, region = None):
51 |
52 | # Open the file and read i-th section (jump to the next '\n' character)
53 | n = float(size)
54 | file_size = os.path.getsize(coverage_file)
55 | print("[{}] SIZE OF FILE {}: {} bytes".format(rank, coverage_file, file_size))
56 | start = int(rank*(file_size/n))
57 | end = int((rank+1)*(file_size/n))
58 | print("[{}] [DEBUG] START={} END={}".format(rank, start, end))
59 |
60 | f = open(coverage_file, "r")
61 | f.seek(start)
62 | loaded = start
63 | coverage_partial = 0
64 | with f as lines:
65 | line_no = 0
66 | for line in lines:
67 | if loaded >= end: break # this rank's slice of the file ends here
68 | loaded += len(line)
69 |
70 | line_no += 1
71 | if line_no == 1:
72 | if not line.startswith("chr"):
73 | continue
74 |
75 | triple = line.rstrip().split("\t")
76 |
77 | if region is not None:
78 | if triple[0] != region[0]: continue
79 | if len(region) >= 2 and int(triple[1]) < region[1]: continue
80 | if len(region) >= 2 and int(triple[1]) > region[2]: continue
81 |
82 | #if line_no % 10000000 == 0:
83 | # print("[{}] [DEBUG] Read {} lines so far".format(rank, line_no))
84 | cov = int(triple[2])
85 | coverage_partial += weight_function(cov)
86 |
87 | print("[{}] START={} END={} PARTIAL_COVERAGE={}".format(rank, start, end, coverage_partial))
88 |
89 | # Reduce
90 | coverage = None
91 |
92 | coverages = comm.gather(coverage_partial)
93 | if rank == 0:
94 | print("COVERAGES:", str(coverages))
95 | coverage = reduce(lambda x,y: x+y, coverages)
96 |
97 | coverage = comm.bcast(coverage, root=0)
98 |
99 | # Return the total
100 | return coverage
101 |
102 | def calculate_intervals(total_coverage, coverage_file, region):
103 | print("[SYSTEM] [{}] Opening coverage file={}".format(rank, coverage_file))
104 | f = open(coverage_file, "r")
105 |
106 | chr = None
107 | start = None
108 | end = None
109 | C = 0
110 | max_interval_width = min(3000000, 3000000000 / size)
111 |
112 | subintervals = []
113 | subtotal = total_coverage / size
114 | print("[SYSTEM] TOTAL={} SUBTOTAL={} MAX_INTERVAL_WIDTH={}".format(total_coverage, subtotal, max_interval_width))
115 |
116 | line_no = 0
117 | with f as lines:
118 | for line in lines:
119 | line_no += 1
120 | if line_no % 1000000 == 0:
121 | print("[SYSTEM] [{}] Time: {} - {} lines loaded.".format(rank, time.time(), line_no))
122 |
123 | fields = line.rstrip().split("\t")
124 |
125 | if region is not None:
126 | if fields[0] != region[0]: continue
127 | if len(region) >= 2 and int(fields[1]) < region[1]: continue
128 | if len(region) >= 3 and int(fields[1]) > region[2]: continue
129 |
130 | # If the interval has become either i) too large or ii) too heavy or iii) spans across two different chromosomes
131 | if C >= subtotal or (chr is not None and fields[0] != chr) or (end is not None and start is not None and (end-start) > max_interval_width):
132 | reason = None
133 | if C >= subtotal: reason = "WEIGHT"
134 | elif chr is not None and fields[0] != chr: reason = "END_OF_CHROMOSOME"
135 | elif end is not None and start is not None and (end-start) > max_interval_width: reason = "MAX_WIDTH"
136 |
137 | interval = (chr, start, end, C, end-start, reason)
138 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(rank, time.time(), interval))
139 | subintervals.append(interval)
140 | chr = None
141 | start = None
142 | end = None
143 | C = 0
144 | if len(fields) < 3: continue
145 |
146 | if chr is None: chr = fields[0]
147 | if start is None: start = int(fields[1])
148 | end = int(fields[1])
149 | # C += math.pow(int(fields[2]), 2)
150 | C += weight_function(int(fields[2]))
151 |
152 | if C > 0:
153 | reason = "END_OF_CHROMOSOME"
154 | interval = (chr, start, end, C, end-start, reason)
155 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(rank, time.time(), interval))
156 | subintervals.append(interval)
157 |
158 | return subintervals
159 |
160 | if __name__ == '__main__':
161 |
162 | # MPI init
163 | comm = MPI.COMM_WORLD
164 | rank = comm.Get_rank()
165 | size = comm.Get_size()
166 |
167 | options = reditools.parse_options()
168 | options["remove_header"] = True
169 |
170 | parser = argparse.ArgumentParser(description='REDItools 2.0')
171 | parser.add_argument('-G', '--coverage-file', help='The coverage file of the sample to analyze')
172 | parser.add_argument('-D', '--coverage-dir', help='The coverage directory containing the coverage file of the sample to analyze divided by chromosome')
173 | parser.add_argument('-t', '--temp-dir', help='The temp directory where to store temporary data for this sample')
174 | parser.add_argument('-Z', '--chromosome-sizes', help='The file with the chromosome sizes')
175 | parser.add_argument('-g', '--region', help='The region of the bam file to be analyzed')
176 | args = parser.parse_known_args()[0]
177 |
178 | coverage_file = args.coverage_file
179 | coverage_dir = args.coverage_dir
180 | temp_dir = args.temp_dir
181 | size_file = args.chromosome_sizes
182 |
183 | if not os.path.isfile(coverage_file):
184 | print("[ERROR] Coverage file {} not existing!".format(coverage_file))
185 | exit(1)
186 |
187 | # output = options["output"]
188 | # format = output.split(".")[-1]
189 | # hostname = socket.gethostname()
190 | # host = socket.gethostbyname(hostname)
191 | # fqdn = socket.getfqdn()
192 | interface = 'ib0' if 'ib0' in netifaces.interfaces() else netifaces.interfaces()[0]
193 | hostname = socket.gethostbyaddr(netifaces.ifaddresses(interface)[netifaces.AF_INET][0]['addr'])
194 | pid = os.getpid()
195 | print("[SYSTEM] [TECH] [NODE] RANK:{} HOSTNAME:{} PID:{}".format(rank, hostname, pid))
196 |
197 | if rank == 0:
198 | print("[SYSTEM] LAUNCHED PARALLEL REDITOOLS WITH THE FOLLOWING OPTIONS:", options, args)
199 |
200 | region = None
201 | if args.region:
202 | region = re.split("[:-]", args.region)
203 | if not region or len(region) == 2 or (len(region) == 3 and region[1] == region[2]):
204 | sys.stderr.write("[ERROR] Please provide a region of the form chrom:start-end (with end > start). Region provided: {}".format(region))
205 | exit(1)
206 | if len(region) >= 2:
207 | region[1] = int(region[1])
208 | region[2] = int(region[2])
209 |
210 | t1 = time.time()
211 |
212 | print("I am rank #"+str(rank))
213 |
214 | time_data = {}
215 | time_data["periods"] = []
216 | time_data["groups"] = []
217 | for i in range(0, size):
218 | time_data["groups"].append([])
219 |
220 | if rank == 0:
221 | time_data["periods"].append({"id": "INTERVALS", "content": "Intervals", "start": str(datetime.now()), "type": "background"})
222 |
223 | # COVERAGE SECTION
224 | try:
225 | if not os.path.exists(temp_dir):
226 | os.makedirs(temp_dir)
227 | except Exception as e:
228 | print("[WARN] {}".format(e))
229 |
230 | interval_file = temp_dir + "/intervals.txt"
231 | homeworks = []
232 | if os.path.isfile(interval_file) and os.stat(interval_file).st_size > 0:
233 | if rank == 0:
234 | print("[0] [RESTART] FOUND INTERVAL FILE {} ".format(interval_file))
235 | expected_total = 0
236 | for line in open(interval_file, "r"):
237 | line = line.strip()
238 |
239 | if expected_total == 0:
240 | expected_total = int(line)
241 | continue
242 |
243 | # Interval format: (chr, start, end, C, end-start, reason)
244 | fields = line.split("\t")
245 | for i in range(1, 5):
246 | fields[i] = int(fields[i])
247 | homeworks.append(tuple(fields)) # tuples are hashable, as required by the restart filtering below
248 | else:
249 | if rank == 0:
250 | time_data["periods"].append({"id": "COVERAGE", "content": "Total coverage", "start": str(datetime.now()), "type": "background"})
251 | print("[0] PRE-COVERAGE TIME " + str(datetime.now().time()))
252 |
253 | total_coverage = get_coverage(coverage_file, region)
254 | # print("TOTAL COVERAGE", str(total_coverage))
255 |
256 | if rank == 0:
257 | time_data["periods"][-1]["end"] = str(datetime.now())
258 | now = datetime.now().time()
259 | elapsed = time.time() - t1
260 | print("[SYSTEM] [TIME] [MPI] [0] MIDDLE-COVERAGE [now:{}] [elapsed: {}]".format(now, elapsed))
261 |
262 | # Collect all the files with the coverage
263 | files = []
264 | for file in os.listdir(coverage_dir):
265 | if region is not None and file != region[0]: continue
266 | if file.startswith("."): continue
267 | if file.endswith(".cov"): continue
268 | if file == "chrM": continue
269 | if file.endswith("chronometer.txt"): continue
270 |
271 | files.append(file)
272 | files.sort()
273 |
274 | if rank == 0:
275 | print("[0] " + str(len(files)) + " FILES => " + str(files))
276 |
277 | '''
278 | # Assign interval calculation to slaves
279 | fps = int(len(files) / size)
280 | if fps == 0: fps = 1
281 | print("Files per mpi process: " + str(fps))
282 | subintervals = []
283 | for i in range(0, size):
284 | if rank == i:
285 | from_file = i*fps
286 | to_file = i*fps+fps if i < size-1 else len(files)
287 | if from_file > len(files): continue
288 | if to_file > len(files): continue
289 |
290 | print("[{}] Processing from file {} to file {} = {}".format(rank, from_file, to_file, files[from_file:to_file]))
291 |
292 | for file in files[from_file:to_file]:
293 | file_intervals = calculate_intervals(total_coverage, "pieces/" + file)
294 | for interv in file_intervals:
295 | subintervals.append(interv)
296 |
297 | # Gather all the intervals calculated from the slaves
298 | all_subintervals = []
299 | if rank == 0:
300 | intervals = None
301 | all_subintervals = comm.gather(subintervals)
302 | print("[0] {} total intervals received.".format(len(all_subintervals)))
303 | homeworks = reduce(lambda x,y: x+y, all_subintervals)
304 | print("[0] {} total intervals aggregated.".format(len(homeworks)))
305 | for interval in homeworks:
306 | print(interval)
307 | '''
308 |
309 | # Master: dispatches the work to the other slaves
310 | if rank == 0:
311 | start_intervals = t1
312 | print("[0] Start time: {}".format(start_intervals))
313 |
314 | done = 0
315 | total = len(files)
316 |
317 | queue = set()
318 | for i in range(1, min(size, total+1)):
319 | file = files.pop()
320 | print("[SYSTEM] [MPI] [0] Sending coverage data "+ str(file) +" to rank " + str(i))
321 | comm.send(file, dest=i, tag=CALCULATE_COVERAGE)
322 | queue.add(i)
323 |
324 | while len(files) > 0:
325 | status = MPI.Status()
326 | subintervals = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
327 | for subinterval in subintervals:
328 | homeworks.append(subinterval)
329 |
330 | done += 1
331 | who = status.Get_source()
332 | queue.remove(who)
333 | now = datetime.now().time()
334 | elapsed = time.time() - start_intervals
335 | print("[SYSTEM] [TIME] [MPI] [0] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue))
336 |
337 | file = files.pop()
338 | print("[SYSTEM] [MPI] [0] Sending coverage data "+ str(file) +" to rank " + str(who))
339 | comm.send(file, dest=who, tag=CALCULATE_COVERAGE)
340 | queue.add(who)
341 |
342 | while len(queue) > 0:
343 | status = MPI.Status()
344 | print("[SYSTEM] [MPI] [0] Going to receive data from slaves.")
345 | subintervals = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
346 | for subinterval in subintervals:
347 | homeworks.append(subinterval)
348 |
349 | done += 1
350 | who = status.Get_source()
351 | queue.remove(who)
352 | now = datetime.now().time()
353 | elapsed = time.time() - start_intervals
354 | print("[SYSTEM] [TIME] [MPI] [0] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue))
355 |
356 | now = datetime.now().time()
357 | elapsed = time.time() - start_intervals
358 |
359 | interval_file = temp_dir + "/intervals.txt"
360 | print("[SYSTEM] [TIME] [MPI] [0] SAVING INTERVALS TO {} [now:{}] [elapsed: {}]".format(interval_file, now, elapsed))
361 | writer = open(interval_file, "w")
362 | writer.write(str(len(homeworks)) + "\n")
363 | for homework in homeworks:
364 | writer.write("\t".join([str(x) for x in homework]) + "\n")
365 | writer.close()
366 |
367 | now = datetime.now().time()
368 | elapsed = time.time() - start_intervals
369 | print("[SYSTEM] [TIME] [MPI] [0] INTERVALS SAVED TO {} [now:{}] [elapsed: {}]".format(interval_file, now, elapsed))
370 |
371 | print("[SYSTEM] [TIME] [MPI] [0] FINISHED CALCULATING INTERVALS [now:{}] [elapsed: {}]".format(now, elapsed))
372 |
373 | TIME_STATS["COVERAGE"] = {
374 | "start": start_intervals,
375 | "end": time.time(),
376 | "elapsed": elapsed
377 | }
378 |
379 | if rank == 0:
380 |
381 | time_data["periods"][0]["end"] = str(datetime.now())
382 |
383 | ###########################################################
384 | ######### COMPUTATION SECTION #############################
385 | ###########################################################
386 | done = 0
387 | parallel_time_section_data = {"id": "ANALYSIS", "content": "Parallel", "start": str(datetime.now()), "type": "background"}
388 | time_data["periods"].append(parallel_time_section_data)
389 | print("[SYSTEM] [TIME] [MPI] [0] REDItools STARTED. MPI SIZE (PROCS): {} [now: {}]".format(size, datetime.now().time()))
390 |
391 | intervals_done = set()
392 | progress_file = temp_dir + "/progress.txt"
393 | if os.path.exists(progress_file):
394 | with open(progress_file, "r") as file:
395 | for line in file:
396 | pieces = line.strip().split()
397 | chromosome = pieces[1].split(":")[0]
398 | start, end = pieces[1].split(":")[1].split("-")
399 | interval_done = (chromosome, int(start), int(end)) # parsed chromosome name (not the chr builtin), with int coordinates to match the interval tuples
400 | intervals_done.add(interval_done)
401 |
402 | t1 = time.time()
403 |
404 | print("Loading chromosomes' sizes!")
405 | chromosomes = OrderedDict()
406 | for line in open(size_file):
407 | (key, val) = line.split()[0:2]
408 | chromosomes[key] = int(val)
409 | print("Sizes:")
410 | print(chromosomes)
411 |
412 | homeworks_to_remove = set()
413 | for hw in homeworks:
414 | interval = (hw[0], hw[1], hw[2])
415 | if interval in intervals_done:
416 | homeworks_to_remove.add(hw)
417 | for hw in homeworks_to_remove:
418 | homeworks.remove(hw)
419 |
420 | something_to_analyze = True
421 | if len(homeworks) == 0:
422 | something_to_analyze = False
423 |
424 | if something_to_analyze:
425 | intervals_done_writer = open(progress_file, "w")
426 |
427 | total = len(homeworks)
428 | print("[SYSTEM] [MPI] [0] HOMEWORKS", total, homeworks)
429 | #shuffle(homeworks)
430 |
431 | start = time.time()
432 |
433 | print("[SYSTEM] [TIME] [MPI] [0] REDItools PILEUP START: [now: {}]".format(datetime.now().time()))
434 |
435 | queue = set()
436 | for i in range(1, min(size, total)):
437 | interval = homeworks.pop()
438 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(i))
439 | id_event = str(i)+"#"+str(len(time_data["groups"][i]))
440 | time_data["groups"][i].append({"id": id_event, "content": id_event, "start": str(datetime.now()), "group": i,
441 | "extra": {
442 | "interval": "{}:{}-{}".format(interval[0], interval[1], interval[2]),
443 | "weight": str(interval[3]),
444 | "width": str(interval[4]),
445 | "reason": str(interval[5])
446 | }})
447 | comm.send(interval, dest=i, tag=ALIGN_CHUNK)
448 | queue.add(i)
449 |
450 | while len(homeworks) > 0:
451 | status = MPI.Status()
452 | comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
453 | done += 1
454 | who = status.Get_source()
455 | queue.remove(who)
456 | now = datetime.now().time()
457 | elapsed = time.time() - start
458 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue))
459 | time_data["groups"][who][-1]["end"] = str(datetime.now())
460 | time_data["groups"][who][-1]["extra"]["duration"] = str(datetime.strptime(time_data["groups"][who][-1]["end"], '%Y-%m-%d %H:%M:%S.%f') - datetime.strptime(time_data["groups"][who][-1]["start"], '%Y-%m-%d %H:%M:%S.%f'))
461 | time_data["groups"][who][-1]["extra"]["done"] = done
462 | time_data["groups"][who][-1]["extra"]["total"] = total
463 | time_data["groups"][who][-1]["extra"]["total (%)"] = "{:.2f}%".format(100 * float(done)/total)
464 |
465 | interval = time_data["groups"][who][-1]["extra"]["interval"]
466 | intervals_done_writer.write("{}\t{}\t{}\n".format(who, interval, temp_dir + "/" + interval.replace(":", "#") + ".gz"))
467 | intervals_done_writer.flush()
468 |
469 | interval = homeworks.pop()
470 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(who))
471 | id_event = str(who)+"#"+str(len(time_data["groups"][who]))
472 |
473 | time_data["groups"][who].append({"id": id_event, "content": id_event, "start": str(datetime.now()), "group": who,
474 | "extra": {
475 | "interval": "{}:{}-{}".format(interval[0], interval[1], interval[2]),
476 | "weight": str(interval[3]),
477 | "width": str(interval[4]),
478 | "reason": str(interval[5])
479 | }})
480 | comm.send(interval, dest=who, tag=ALIGN_CHUNK)
481 | queue.add(who)
482 |
483 | while len(queue) > 0:
484 | status = MPI.Status()
485 | comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
486 | done += 1
487 | who = status.Get_source()
488 | queue.remove(who)
489 | now = datetime.now().time()
490 | elapsed = time.time() - start
491 | time_data["groups"][who][-1]["end"] = str(datetime.now())
492 | time_data["groups"][who][-1]["extra"]["duration"] = str(datetime.strptime(time_data["groups"][who][-1]["end"], '%Y-%m-%d %H:%M:%S.%f') - datetime.strptime(time_data["groups"][who][-1]["start"], '%Y-%m-%d %H:%M:%S.%f'))
493 | time_data["groups"][who][-1]["extra"]["done"] = done
494 | time_data["groups"][who][-1]["extra"]["total"] = total
495 | time_data["groups"][who][-1]["extra"]["total (%)"] = "{:.2f}%".format(100 * float(done)/total)
496 |
497 | interval = time_data["groups"][who][-1]["extra"]["interval"]
498 | intervals_done_writer.write("{}\t{}\t{}\n".format(who, interval, temp_dir + "/" + interval.replace(":", "#") + ".gz"))
499 | intervals_done_writer.flush()
500 |
501 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue))
502 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending DIE SIGNAL TO RANK " + str(who))
503 | comm.send(None, dest=who, tag=STOP_WORKING)
504 |
505 | parallel_time_section_data["end"] = str(datetime.now())
506 | if something_to_analyze:
507 | intervals_done_writer.close()
508 |
509 | #################################################
510 | ########### WRITE TIME DATA #####################
511 | #################################################
512 | events = []
513 | for period in time_data["periods"]:
514 | events.append(period)
515 |
516 | for group in time_data["groups"]:
517 | for event in group:
518 | extras = []
519 | for key, value in event["extra"].items():
520 | extras.append("{}: {}".format(key, value))
521 |
522 | event["title"] = " ".join(extras)
523 | events.append(event)
524 |
525 | groups = []
526 | for i in range(0, size):
527 | groups.append({"id": i, "content": "MPI Proc. #"+str(i)})
528 |
529 |
530 | time_file = temp_dir + "/times.txt"
531 | f = open(time_file, "w")
532 | json.dump(events, f)
533 | f.close()
534 |
535 | group_file = temp_dir + "/groups.txt"
536 | f = open(group_file, "w")
537 | json.dump(groups, f)
538 | f.close()
539 |
540 | # We have finished processing all the chunks. Let's notify this to slaves
541 | # for i in range(1, size):
542 | # print("[SYSTEM] [MPI] [0] Sending DIE SIGNAL TO RANK " + str(i))
543 | # comm.send(None, dest=i, tag=STOP_WORKING)
544 |
545 | #####################################################################
546 | ######### RECOMBINATION OF SINGLE FILES #############################
547 | #####################################################################
548 | t2 = time.time()
549 | elapsed = t2-t1
550 | print("[SYSTEM] [TIME] [MPI] [0] WHOLE PARALLEL ANALYSIS FINISHED. CREATING SETUP FOR MERGING PARTIAL FILES - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(elapsed, t2, datetime.now().time()))
551 | TIME_STATS["COMPUTATION"] = {
552 | "start": t1,
553 | "end": t2,
554 | "elapsed": elapsed
555 | }
556 |
557 | little_files = []
558 | print("Scanning all files in "+temp_dir+" matching " + ".*")
559 | for little_file in glob.glob(temp_dir + "/*"):
560 | if little_file.endswith("chronometer.txt"): continue
561 | if little_file.endswith("files.txt"): continue
562 | if little_file.endswith("intervals.txt"): continue
563 | if little_file.endswith("status.txt"): continue
564 | if little_file.endswith("progress.txt"): continue
565 | if little_file.endswith("times.txt"): continue
566 | if little_file.endswith("groups.txt"): continue
567 |
568 | print(little_file)
569 | pieces = re.sub("\..*", "", os.path.basename(little_file)).split("#")
570 | pieces.insert(0, little_file)
571 | little_files.append(pieces)
572 |
573 | # Sort the output files
574 | keys = chromosomes.keys()
575 | print("[SYSTEM] "+str(len(little_files))+" FILES TO MERGE: ", little_files)
576 | little_files = sorted(little_files, key = lambda x: (keys.index(x[1]) if x[1] in keys else keys.index("chr"+x[1]), int(x[2])))
577 | print("[SYSTEM] "+str(len(little_files))+" FILES TO MERGE (SORTED): ", little_files)
578 |
579 | smallfiles_list_filename = temp_dir + "/files.txt"
580 | f = open(smallfiles_list_filename, "w")
581 | for little_file in little_files:
582 | f.write(little_file[0] + "\n")
583 | f.close()
584 |
585 | # Open the final output file
586 | # output_dir = os.path.dirname(output)
587 | # if not os.path.exists(output_dir):
588 | # os.makedirs(output_dir)
589 | # final_file = gzip.open(output, "w")
590 |
591 | # final_file.write("\t".join(reditools.get_header()) + "\n")
592 |
593 | # total = len(little_files)
594 | # done = 0
595 | # for little_file in little_files:
596 | # print("Writing ", little_file)
597 | # file = little_file[0]
598 | #
599 | # f = gzip.open(file)
600 | # final_file.write(f.read())
601 | # f.close()
602 | #
603 | # done = done + 1
604 | # print(file + "\t["+str(done)+"/"+str(total)+" - {:.2%}]".format(done/float(total)))
605 | #
606 | # final_file.close()
607 |
608 | t2 = time.time()
609 | print("[SYSTEM] [TIME] [MPI] [0] [END] - WHOLE ANALYSIS FINISHED - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(t2-t1, t2, datetime.now().time()))
610 |
611 | if "COVERAGE" in TIME_STATS:
612 | print("[STATS] [COVERAGE] START={} END={} ELAPSED={}".format(TIME_STATS["COVERAGE"]["start"], TIME_STATS["COVERAGE"]["end"], TIME_STATS["COVERAGE"]["elapsed"]))
613 |
614 | if "COMPUTATION" in TIME_STATS:
615 | print("[STATS] [COMPUTATION] START={} END={} ELAPSED={}".format(TIME_STATS["COMPUTATION"]["start"], TIME_STATS["COMPUTATION"]["end"], TIME_STATS["COMPUTATION"]["elapsed"]))
616 |
617 | # Slave processes
618 | if rank > 0:
619 |
620 | while True:
621 | # Execute bowtie, view and sort
622 | status = MPI.Status()
623 | data = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
624 |
625 | tag = status.Get_tag()
626 | if tag == CALCULATE_COVERAGE:
627 | intervals = calculate_intervals(total_coverage, coverage_dir + data, region)
628 | comm.send(intervals, dest=0, tag=IM_FREE)
629 | if tag == ALIGN_CHUNK:
630 |
631 | # Process it
632 | time_start = time.time()
633 | time_s = datetime.now().time()
634 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] REDItools: STARTED {} from rank 0 [{}]".format(str(rank), str(data), time_s))
635 |
636 | # Command: python REDItoolDnaRna_1.04_n.py -i $INPUT -o editing -f hg19.fa -t $THREADS
637 | # -c 1,1 -m 20,20 -v 0 -q 30,30 -s 2 -g 2 -S -e -n 0.0 -N 0.0 -u -l -H -Y $CHR:$LEFT-$RIGHT -F $CHR_$LEFT_$RIGHT
638 | # Command REDItools2.0: reditools2.0/src/cineca/reditools.py -f /gss/gss_work/DRES_HAIdA/gtex/SRR1413602/SRR1413602.bam
639 | # -r ../../hg19.fa -g chr18:14237-14238
640 |
641 | id = data[0] + "#" + str(data[1]) + "#" + str(data[2])
642 |
643 | options["region"] = [data[0], data[1], data[2]]
644 | options["output"] = temp_dir + "/" + id + ".gz"
645 |
646 | print("[MPI] [" + str(rank) + "] COMMAND-LINE:", options)
647 |
648 | gc.collect()
649 | reditools.analyze(options)
650 |
651 | time_end = time.time()
652 | print("[SYSTEM] [TIME] [MPI] [{}] REDItools: FINISHED {} [{}][{}] [TOTAL:{:5.2f}]".format(str(rank), str(data), time_s, datetime.now().time(), time_end - time_start))
653 |
654 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [SEND] [{}] SENDING IM_FREE tag TO RANK 0 [{}]".format(str(rank), datetime.now().time()))
655 | comm.send(None, dest=0, tag=IM_FREE)
656 | elif tag == STOP_WORKING:
657 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] received DIE SIGNAL FROM RANK 0 [{}]".format(str(rank), datetime.now().time()))
658 | break
659 |
660 | print("[{}] EXITING [now:{}]".format(rank, time.time()))
661 |
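Note: the parallel driver is restartable. If <temp_dir>/intervals.txt already exists, the coverage-based interval computation is skipped and the intervals are reloaded from that file; intervals already recorded in <temp_dir>/progress.txt are removed from the work list, so an interrupted run can be resubmitted with the same temporary directory and resumes where it stopped.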
--------------------------------------------------------------------------------
/src/cineca/reditools.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | '''
4 | Created on 09 gen 2017
5 |
6 | @author: flati
7 | '''
8 |
9 | import pysam
10 | import sys
11 | import datetime
12 | from collections import defaultdict
13 | import gzip
14 | from sortedcontainers import SortedSet
15 | import os
16 | import argparse
17 | import re
18 | import psutil
19 | import socket
20 | import netifaces
21 |
22 | DEBUG = False
23 |
24 | def delta(t2, t1):
25 | delta = t2 - t1
26 | hours, remainder = divmod(delta.seconds, 3600)
27 | minutes, seconds = divmod(remainder, 60)
28 |
29 | return "%02d:%02d:%02d" % (hours, minutes, seconds)
30 |
31 | def print_reads(reads, i):
32 | total = 0
33 | for key in reads:
34 | total += len(reads[key])
35 | print("[INFO] E[i="+str(key)+"]["+str(len(reads[key]))+"] strand=" + str(strand))
36 | for read in reads[key]:
37 | # index = read["index"]
38 | index = read["alignment_index"]
39 |
40 | print("[INFO] \tR:" + str(read["reference"]) + " [r1="+str(read["object"].is_read1)+", r2="+str(read["object"].is_read2) +", reverse="+str(read["object"].is_reverse) +", pos="+str(read["pos"])+", alignment_index=" + str(index) + ", reference_start="+str(read["object"].reference_start)+" , align_start="+str(read["object"].query_alignment_start) + ", cigar=" + str(read["cigar"])+ ", cigar_list=" + str(read["cigar_list"]) + ", "+ str(len(read["query_qualities"]))+ ", " + str(read["query_qualities"]) + "]")
41 | print("[INFO] \tQ:" + str(read["sequence"]))
42 | print("READS[i="+str(i)+"] = " + str(total))
43 |
44 | def update_reads(reads, i):
45 | if DEBUG:
46 | print("[INFO] UPDATING READS IN POSITION " + str(i))
47 |
48 | pos_based_read_dictionary = {}
49 |
50 | total = 0
51 |
52 | for ending_position in reads:
53 | for read in reads[ending_position]:
54 |
55 | cigar_list = read["cigar_list"]
56 | if len(cigar_list) == 0:
57 | # print("EXCEPTION: CIGAR LIST IS EMPTY")
58 | continue
59 |
60 | if read["pos"] >= i:
61 | #print("READ POSITION " + str(read["pos"]) + " IS GREATER THAN i=" + str(i))
62 | continue
63 |
64 | total += 1
65 |
66 | block = cigar_list[0]
67 | op = block[1]
68 |
69 | if op == "S":
70 |
71 | del cigar_list[0]
72 |
73 | if not cigar_list:
74 | block = None
75 | else:
76 | block = cigar_list[0]
77 | op = block[1]
78 |
79 | elif op == "N":
80 | # if read["sequence"] == "ATTTTTCTGTTTCTCCCTCAATATCCACCTCATGGAAGTAGATATTCACTAGGTGATATTTTCTAGGCTCTCTTAA":
81 | # print("[NNNN i="+str(i)+"] N=" + str(block[0])+ " Updating pos from " + str(read["pos"])+ " to " + str(read["pos"] + (block[0]-1)), read["pos"], read)
82 | read["pos"] += block[0]
83 | del cigar_list[0]
84 |
85 | read["ref"] = None
86 | read["alt"] = None
87 | read["qual"] = DEFAULT_BASE_QUALITY
88 |
89 | continue
90 |
91 | if block is not None and op == "I":
92 | n = block[0]
93 |
94 | # if read["sequence"] == "GTTAATTTTAGAACATTATCATTCCAAAAAAGCAACTTCATAACATCTAGCAGTCACCTCCTTTCCCATTTCTAGC":
95 | # print("[INSERTION i="+str(i)+"] I=" + str(n)+ " Updating alignment_index from " + str(read["alignment_index"]) + " to " + str(read["alignment_index"] + n), read)
96 |
97 | read["alignment_index"] += n
98 | read["ref"] = None
99 | read["alt"] = read["sequence"][read["alignment_index"]]
100 | del cigar_list[0]
101 |
102 | if not cigar_list:
103 | block = None
104 | else:
105 | block = cigar_list[0]
106 | op = block[1]
107 |
108 | if block is not None:
109 | n = block[0]
110 |
111 | # D I M N S
112 | if op == "M":
113 |
114 | # if read["sequence"] == "GTTAATTTTAGAACATTATCATTCCAAAAAAGCAACTTCATAACATCTAGCAGTCACCTCCTTTCCCATTTCTAGC":
115 | # print("[MATCH i="+str(i)+"] M=" + str(n)+ " Updating alignment_index from " + str(read["alignment_index"]) + " to " + str(read["alignment_index"] + 1), read["pos"], read)
116 |
117 | read["pos"] += 1
118 |
119 | block[0] -= 1
120 | read["reference_index"] += 1
121 | read["alignment_index"] += 1
122 |
123 | # if DEBUG:
124 | # print(str(read["reference_index"]), read["reference"][read["reference_index"]], read)
125 | #if read["reference_index"] >= len(read["reference"]): print("i={} \nSEQ={} \nORG={}".format(read["reference_index"], read["reference"], read["object"].get_reference_sequence()))
126 | read["ref"] = read["reference"][read["reference_index"]]
127 | read["alt"] = read["sequence"][read["alignment_index"]]
128 |
129 | # if read["sequence"] == "ATTTTTCTGTTTCTCCCTCAATATCCACCTCATGGAAGTAGATATTCACTAGGTGATATTTTCTAGGCTCTCTTAA":
130 | # print("[MATCH i="+str(i)+"]", "pos="+str(read["pos"]), "ref=" + str(read["ref"]), "alt=" + str(read["alt"]), read)
131 |
132 | if block[0] == 0:
133 | del cigar_list[0]
134 |
135 | elif op == "D":
136 | # if read["sequence"] == "GAAATTTGAAGGTAGAATTGAATACAGATGAACCTCCAATGGTATTCAAGGCTCAGCTGTTTGCGTTGACTGGAGT":
137 | # print("[DELETION i="+str(i)+"] D=" + str(n)+ " Updating reference_index from " + str(read["reference_index"])+ " to " + str(read["reference_index"] + n), read["pos"], read)
138 |
139 | #read["reference_index"] += n # MODIFIED AND COMMENTED OUT ON 26/03/18
140 |
141 | read["pos"] += n
142 | # read["alignment_index"] += 1
143 | read["ref"] = None
144 | # read["ref"] = read["reference"][read["reference_index"]]
145 | read["alt"] = None
146 | del cigar_list[0]
147 |
148 | if read["query_qualities"] is not None:
149 | read["qual"] = read["query_qualities"][read["alignment_index"]]
150 |
151 | p = read["pos"]
152 | if p not in pos_based_read_dictionary: pos_based_read_dictionary[p] = []
153 | pos_based_read_dictionary[p].append(read)
154 |
155 | if DEBUG:
156 | print("[INFO] READS UPDATED IN POSITION " + str(i) + ":" + str(total))
157 |
158 | return pos_based_read_dictionary
159 |
160 | def get_column(pos_based_read_dictionary, reads, splice_positions, last_chr, omopolymeric_positions, target_positions, i):
161 |
162 | if splice_positions:
163 | if i in splice_positions[last_chr]:
164 | if VERBOSE:
165 | sys.stderr.write("[DEBUG] [SPLICE_SITE] Discarding position ({}, {}) because in splice site\n".format(last_chr, i))
166 | return None
167 |
168 | if omopolymeric_positions:
169 | if i in omopolymeric_positions[last_chr]:
170 | if VERBOSE:
171 | sys.stderr.write("[DEBUG] [OMOPOLYMERIC] Discarding position ({}, {}) because omopolymeric\n".format(last_chr, i))
172 | return None
173 |
174 | if target_positions:
175 | if (last_chr in target_positions and i not in target_positions[last_chr]) or ("chr"+last_chr in target_positions and i not in target_positions["chr"+last_chr]):
176 | if VERBOSE:
177 | sys.stderr.write("[DEBUG] [TARGET POSITIONS] Discarding position ({}, {}) because not in target positions\n".format(last_chr, i))
178 | return None
179 |
180 | # edits = {"T": [], "A": [], "C": [], "G": [], "N": []}
181 | edits_no = 0
182 | edits = []
183 | ref = None
184 |
185 | # r1r2distribution = Counter()
186 | r1r2distribution = defaultdict(int)
187 |
188 | strand_column = []
189 | qualities = []
190 | for key in reads:
191 | for read in reads[key]:
192 |
193 | # if DEBUG:
194 | # print("GET_COLUMN Q_NAME="+ str(read["object"].query_name)+ " READ1=" + str(read["object"].is_read1) + " REVERSE=" + str(read["object"].is_reverse) + " i="+str(i) + " READ=" + str(read))
195 |
196 | # Filter the reads by positions
197 | # if not filter_base(read):
198 | # continue
199 |
200 | pos = read["alignment_index"]
201 |
202 | # If the base lies within the first MIN_BASE_POSITION positions of the read
203 | if pos < MIN_BASE_POSITION:
204 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_POSITION]\n")
205 | continue
206 |
207 | # If the base lies within the last MAX_BASE_POSITION positions of the read
208 | if read["length"] - pos < MAX_BASE_POSITION:
209 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MAX_BASE_POSITION]\n")
210 | continue
211 |
212 | # If the base quality is below MIN_BASE_QUALITY
213 | # if read["query_qualities"][read["alignment_index"]] < MIN_BASE_QUALITY:
214 | if read["qual"] < MIN_BASE_QUALITY:
215 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_QUALITY] {} {} {} {} {}\n".format(str(read["query_qualities"]), pos, str(read["query_qualities"][pos]), MIN_BASE_QUALITY, read))
216 | continue
217 |
218 | # elif read["positions"][read["index"]] != i:
219 | if read["pos"] != i:
220 | if DEBUG:
221 | print("[OUT_OF_RANGE] SKIPPING READ i=" + str(i) + " but READ=" + str(read["pos"]))
222 | continue
223 |
224 | if DEBUG:
225 | print("GET_COLUMN Q_NAME="+ str(read["object"].query_name)+ " READ1=" + str(read["object"].is_read1) + " REVERSE=" + str(read["object"].is_reverse) + " i="+str(i) + " READ=" + str(read))
226 |
227 | # j = read["alignment_index"]
228 | # if DEBUG:
229 | # print("GET_COLUMN_OK i="+str(i) + " ALT="+read["sequence"][j]+" READ=" + str(read))
230 |
231 | # ref = read["reference"][read["reference_index"]].upper()
232 | # if j >= len(read["sequence"]):
233 | # print("GET_COLUMN_STRANGE i="+str(i) + " j="+str(j)+" orig="+str(read["alignment_index"])+" READ=" + str(read))
234 | # alt = read["sequence"][j]
235 |
236 | if read["ref"] is None:
237 | if DEBUG:
238 | print("[INVALID] SKIPPING READ i=" + str(i) + " BECAUSE REF is None", read)
239 | continue
240 | if read["alt"] is None:
241 | if DEBUG:
242 | print("[INVALID] SKIPPING READ i=" + str(i) + " BECAUSE ALT is None", read)
243 | continue
244 |
245 | # passed += 1
246 |
247 | # if passed > 8000:
248 | # break
249 |
250 | ref = read["ref"].upper()
251 | alt = read["alt"].upper()
252 |
253 | if DEBUG:
254 | print("\tBEF={} {}".format(ref, alt))
255 |
256 | if ref == "N" or alt == "N":
257 | continue
258 |
259 | # print(read["pos"], ref, alt, strand, strand == 1, read["object"].is_read1, read["object"].is_read2, read["object"].is_reverse )
260 | #ref, alt = fix_strand(read, ref, alt)
261 |
262 | if DEBUG:
263 | print("\tLAT={} {}".format(ref, alt))
264 |
265 | edits.append(alt)
266 |
267 | # q = read["query_qualities"][read["alignment_index"]]
268 | q = read["qual"]
269 | qualities.append(q)
270 |
271 | strand_column.append(read["strand"])
272 | # strand_column.append(get_strand(read))
273 |
274 | if alt != ref:
275 | edits_no += 1
276 |
277 | r1r2distribution[("R1" if read["object"].is_read1 else "R2") + ("-REV" if read["object"].is_reverse else "")] += 1
278 |
279 | if not IS_DNA:
280 | vstrand = 2
281 | if strand != 0:
282 | vstrand = vstand(''.join(strand_column))
283 | if vstrand == "+": vstrand = 1
284 | elif vstrand == "-": vstrand = 0
285 | elif vstrand == "*": vstrand = 2
286 |
287 | if vstrand == 0:
288 | edits = complement_all(edits)
289 | ref = complement(ref)
290 |
291 | if vstrand in [0, 1] and strand_correction:
292 | edits, strand_column, qualities, qualities_positions = normByStrand(edits, strand_column, qualities, vstrand)
293 |
294 | if DEBUG:
295 | print(vstrand, ''.join(strand_column))
296 | else:
297 | vstrand = "*"
298 |
299 | if DEBUG:
300 | print(r1r2distribution)
301 | # counter = defaultdict(str)
302 | # for e in edits: counter[e] += 1
303 | # print(Counter(edits))
304 |
305 | # if i == 62996785:
306 | # print(edits, strand_column, len(qualities), qualities)
307 |
308 | passed = len(edits)
309 |
310 | # counter = Counter(edits)
311 | counter = defaultdict(int)
312 | for e in edits: counter[e] += 1
313 |
314 | # print(Counter(edits), counter)
315 |
316 | mean_q = 0
317 | if DEBUG:
318 | print("Qualities[i="+str(i)+"]="+str(qualities))
319 |
320 | if len(qualities) > 0:
321 | #mean_q = numpy.mean(qualities)
322 | mean_q = float(sum(qualities)) / max(len(qualities), 1)
323 |
324 | # If all the reads are concordant
325 | #if counter[ref] > 0 and len(counter) == 1:
326 | # return None
327 |
328 | if len(counter) == 0:
329 | if VERBOSE:
330 | sys.stderr.write("[VERBOSE] [EMPTY] Discarding position ({}, {}) because the associated counter is empty\n".format(last_chr, i))
331 | return None
332 |
333 | # [A,C,G,T]
334 | distribution = [counter['A'] if 'A' in counter else 0,
335 | counter['C'] if 'C' in counter else 0,
336 | counter['G'] if 'G' in counter else 0,
337 | counter['T'] if 'T' in counter else 0]
338 | ref_count = counter[ref] if ref in counter else 0
339 |
340 | non_zero = 0
341 | for el in counter:
342 | if el != ref and counter[el] > 0:
343 | non_zero += 1
344 |
345 | variants = []
346 | # most_common = None
347 | ratio = 0.0
348 | # most_common = []
349 | # most_common_value = -1
350 | # for el in counter:
351 | # value = counter[el]
352 | # if value > most_common_value:
353 | # most_common_value = value
354 | # most_common = []
355 | # if value == most_common_value:
356 | # most_common.append((el, value))
357 |
358 | # for el in Counter(edits).most_common():
359 | for el in sorted(counter.items(), key=lambda x: x[1], reverse=True):
360 | if el[0] == ref: continue
361 | else:
362 | variants.append(el[0])
363 | # most_common = el
364 | if ratio == 0.0:
365 | ratio = float(el[1]) / (el[1] + ref_count)
366 |
367 | # ratio = 0.0
368 | # if most_common is not None:
369 | # ratio = (float)(most_common[1]) / (most_common[1] + ref_count)
370 |
371 | # if passed > 0:
372 | # print("REF=" + ref)
373 | # print(passed)
374 | # print(edits)
375 | # print(counter)
376 | # print("MOST FREQUENT EDITS=" + str(counter.most_common()))
377 | # print("MOST COMMON=" + str(most_common))
378 | # print(numpy.mean(counter.values()))
379 | # print(distribution)
380 | # print(qualities)
381 | # print(mean_q)
382 | # print("REF COUNT=" + str(ref_count))
383 | # print("ALT/REF % = " + str(ratio))
384 | # raw_input("Press a key:")
385 |
386 | edits_info = {
387 | "edits": edits,
388 | "distribution": distribution,
389 | "mean_quality": mean_q,
390 | "counter": counter,
391 | "non_zero": non_zero,
392 | "edits_no": edits_no,
393 | "ref": ref,
394 | "variants": variants,
395 | "frequency": ratio,
396 | "passed": passed,
397 | "strand": vstrand
398 | }
399 |
400 | # Check that the column passes the filters
401 | if not filter_column(edits_info, i): return None
402 |
403 | # if edits_no > 5:
404 | # print(str(i) + ":" + str(edits_info))
405 | # raw_input("[ALERT] Press enter to continue...")
406 |
407 | return edits_info
408 |
409 | def normByStrand(seq_, strand_, squal_, mystrand_):
410 |
411 | st='+'
412 | if mystrand_== 0: st='-'
413 | seq,strand,qual,squal=[],[],[],''
414 | for i in range(len(seq_)):
415 | if strand_[i]==st:
416 | seq.append(seq_[i])
417 | strand.append(strand_[i])
418 | qual.append(squal_[i])
419 | squal+=chr(squal_[i])
420 | return seq,strand,qual,squal
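     | # For instance, with mystrand_=1 (forward), normByStrand(['A','G'], ['+','-'],
     | # [30, 20], 1) keeps only the bases observed on the '+' strand and returns
     | # (['A'], ['+'], [30], chr(30)): bases discordant with the inferred strand are
     | # dropped before the column statistics are computed.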
421 |
422 | # def fix_strand(read, ref, alt):
423 | # global strand
424 | #
425 | # raw_read = read["object"]
426 | #
427 | # if (strand == 1 and ((raw_read.is_read1 and raw_read.is_reverse) or (raw_read.is_read2 and not raw_read.is_reverse))) or (strand == 2 and ((raw_read.is_read1 and not raw_read.is_reverse) or (raw_read.is_read2 and raw_read.is_reverse))):
428 | # return ref, complement(alt)
429 | #
430 | # return ref, alt
431 | def get_strand(read):
432 | global strand
433 |
434 | raw_read = read["object"]
435 |
436 | if (strand == 1 and ((raw_read.is_read1 and raw_read.is_reverse) or (raw_read.is_read2 and not raw_read.is_reverse))) or (strand == 2 and ((raw_read.is_read1 and not raw_read.is_reverse) or (raw_read.is_read2 and raw_read.is_reverse))):
437 | return "-"
438 |
439 | return "+"
440 |
441 | def filter_read(read):
442 |
443 | # if DEBUG:
444 | # print("[FILTER_READ] F={} QC={} MP={} LEN={} SECOND={} SUPPL={} DUPL={} READ={}".format(read.flag, read.is_qcfail, read.mapping_quality, read.query_length, read.is_secondary, read.is_supplementary, read.is_duplicate, read))
445 |
446 | # Get the flag of the read
447 | f = read.flag
448 |
449 | # if strict_mode:
450 | # try:
451 | # NM = read.get_tag("NM")
452 | # if NM == 0:
453 | # # print("SKIPPING", MD_value, read.query_sequence, read.reference_start)
454 | # return True
455 | # except KeyError:
456 | # pass
457 |
458 | # if strict_mode:
459 | # MD = read.get_tag("MD")
460 | # # print(MD, read.get_reference_sequence(), read.reference_start)
461 | # # MD = MD.split(":")[1]
462 | # try:
463 | # MD_value = int(MD)
464 | # # print("SKIPPING", MD_value, read.query_sequence, read.reference_start)
465 | # return True
466 | # except ValueError:
467 | # # print("NO MD VALUE")
468 | # pass
469 |
470 | # If the read is unmapped (FLAG 77 or 141)
471 | if f == 77 or f == 141:
472 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [NOT_MAPPED] f={}\n".format(str(f)))
473 | return False
474 |
475 | # If the read fails the quality controls (FLAG 512)
476 | if read.is_qcfail:
477 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [QC_FAIL]\n")
478 | return False
479 |
480 | # If the read's mapping quality is below MIN_QUALITY
481 | if read.mapping_quality < MIN_QUALITY:
482 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [MAPQ] {} MIN={}\n".format(read.mapping_quality, MIN_QUALITY))
483 | return False
484 |
485 | # If the read is shorter than MIN_READ_LENGTH
486 | if read.query_length < MIN_READ_LENGTH:
487 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [MIN_READ_LENGTH] {} MIN={}\n".format(read.query_length, MIN_READ_LENGTH))
488 | return False
489 |
490 | # If the read does not map uniquely (FLAG 256 or 2048)
491 | if read.is_secondary or read.is_supplementary:
492 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [IS_SECONDARY][IS_SUPPLEMENTARY]\n")
493 | return False
494 |
495 | # If the read is a PCR duplicate (FLAG 1024)
496 | if read.is_duplicate:
497 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [IS_DUPLICATE]\n")
498 | return False
499 |
500 | # If the read is paired-end but not properly mapped (FLAGs other than 99/147 (+-) or 83/163 (-+))
501 | # 99 = 1+2+32+64 = PAIRED+PROPER_PAIR+MREVERSE+READ1 (+-)
502 | if read.is_paired and not (f == 99 or f == 147 or f == 83 or f == 163):
503 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [NOT_PROPER]\n")
504 | return False
505 |
506 | if read.has_tag('SA'):
507 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [CHIMERIC_READ]\n")
508 | return False
509 |
510 | return True
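     | # In summary, a read survives filter_read only if it is mapped, passes the
     | # vendor quality checks, has mapping quality >= MIN_QUALITY and length >=
     | # MIN_READ_LENGTH, is a primary non-duplicate alignment, is properly paired
     | # (when paired-end) and carries no SA tag (i.e., it is not chimeric).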
511 |
512 | def filter_base(read):
513 |
514 | # pos = read["index"]
515 | pos = read["alignment_index"]
516 |
517 | # If the base lies within the first MIN_BASE_POSITION positions of the read
518 | if pos < MIN_BASE_POSITION:
519 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_POSITION]\n")
520 | return False
521 |
522 | # If the base lies within the last MAX_BASE_POSITION positions of the read
523 | if read["length"] - pos < MAX_BASE_POSITION:
524 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MAX_BASE_POSITION]\n")
525 | return False
526 |
527 | # If the base quality is below MIN_BASE_QUALITY
528 | # if read["query_qualities"][read["alignment_index"]] < MIN_BASE_QUALITY:
529 | if "qual" not in read:
530 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [QUAL MISSING] {} {}\n".format(pos, read))
531 | return False
532 |
533 | if read["qual"] < MIN_BASE_QUALITY:
534 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_QUALITY] {} {} {} {} {}\n".format(str(read["query_qualities"]), pos, str(read["query_qualities"][pos]), MIN_BASE_QUALITY, read))
535 | return False
536 |
537 | return True
538 |
539 | def filter_column(column, i):
540 |
541 | edits = column["edits"]
542 |
543 | if column["mean_quality"] < MIN_QUALITY:
544 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} {} [MIN_MEAN_COLUMN_QUALITY]\n".format(i, column))
545 | return False
546 |
547 | # If the number of bases in the column is below MIN_COLUMN_LENGTH
548 | if len(edits) < MIN_COLUMN_LENGTH:
549 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} {} [MIN_COLUMN_LENGTH]\n".format(i, len(edits)))
550 | return False
551 |
552 | counter = column["counter"]
553 | ref = column["ref"]
554 |
555 | # (for each variant) if the number of bases supporting the variant is below MIN_EDITS_SINGLE
556 | for edit in counter:
557 | if edit != ref and counter[edit] < MIN_EDITS_SINGLE:
558 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} c({})={} [MIN_EDITS_SINGLE] {}\n".format(i, edit, counter[edit], counter))
559 | return False
560 |
561 | # If there are multiple changes with respect to the reference
562 | if len(counter.keys()) > MAX_CHANGES:
563 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} changes={} [MULTIPLE_CHANGES] {}\n".format(i, len(counter.keys()), column))
564 | return False
565 |
566 | # If the total number of substitutions is below MIN_EDITS_NO
567 | if column["edits_no"] < MIN_EDITS_NO:
568 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} {} [MIN_EDITS_NO]\n".format(i, column["edits_no"]))
569 | return False
570 |
571 | return True
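     | # Illustrative example with the default thresholds from parse_options
     | # (MIN_COLUMN_LENGTH=1, MIN_EDITS_SINGLE=1, MIN_EDITS_NO=0, MAX_CHANGES=100):
     | # a column with counter={'A': 10, 'G': 2}, ref='A', edits_no=2 and mean
     | # quality 35 passes every check and is reported with variant 'AG'.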
572 |
573 | def load_omopolymeric_positions(positions, input_file, region):
574 | if input_file is None: return
575 |
576 | sys.stderr.write("Loading omopolymeric positions from file {}\n".format(input_file))
577 |
578 | chromosome = None
579 | start = None
580 | end = None
581 |
582 | if region is not None:
583 | if len(region) >= 1:
584 | chromosome = region[0]
585 | if len(region) >= 2:
586 | start = region[1]
587 | if len(region) >= 3:
588 | end = region[2]
589 |
590 | lines_read = 0
591 | total = 0
592 |
593 | print("Loading omopolymeric positions of {} between {} and {}".format(chromosome, start, end))
594 |
595 | try:
596 | reader = open(input_file, "r")
597 |
598 | for line in reader:
599 | if line.startswith("#"):
600 | continue
601 |
602 | lines_read += 1
603 | if lines_read % 500000 == 0:
604 | sys.stderr.write("{} lines read.\n".format(lines_read))
605 |
606 | fields = line.rstrip().split("\t")
607 | if chromosome is None or fields[0] == chromosome:
608 | chrom = fields[0]
609 | f = int(fields[1])
610 | t = int(fields[2])
611 |
612 | if start is not None: f = max(start, f)
613 | if end is not None: t = min(t, end)
614 |
615 | # print("POSITION {} {} {} {} {} {}".format(str(fields), chromosome, f, t, start, end))
616 |
617 | if chrom not in positions:
618 | positions[chrom] = SortedSet()
619 |
620 | for i in range(f, t):
621 | positions[chrom].add(i)
622 | total += 1
623 |
624 | elif positions:
625 | break
626 |
627 | reader.close()
628 | except IOError as e:
629 | sys.stderr.write("[{}] Omopolymeric positions file not found at {}. Error: {}\n".format(region, input_file, e))
630 |
631 | sys.stderr.write("[{}] {} total omopolymeric positions found.\n".format(region, total))
632 |
633 | def load_chromosome_names(index_file):
634 | names = []
635 |
636 | with open(index_file, "r") as lines:
637 | for line in lines:
638 | names.append(line.split("\t")[0])
639 |
640 | return names
641 |
642 | def load_splicing_file(splicing_file):
643 | splice_positions = {}
644 |
645 | if splicing_file is None: return splice_positions
646 |
647 | sys.stderr.write('Loading known splice sites from file {}\n'.format(splicing_file))
648 |
649 | if splicing_file.endswith("gz"): f = gzip.open(splicing_file, "r")
650 | else: f = open(splicing_file, "r")
651 |
652 | total = 0
653 | total_array = {}
654 |
655 | for i in f:
656 | l=(i.strip()).split()
657 | chrom = l[0]
658 |
659 | if chrom not in splice_positions:
660 | splice_positions[chrom] = SortedSet()
661 | total_array[chrom] = 0
662 |
663 | st,tp,cc = l[4], l[3], int(l[1])
664 |
665 | total += SPLICING_SPAN
666 | total_array[chrom] += SPLICING_SPAN
667 |
668 | if st=='+' and tp=='D':
669 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc+(j+1))
670 | if st=='+' and tp=='A':
671 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc-(j+1))
672 | if st=='-' and tp=='D':
673 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc-(j+1))
674 | if st=='-' and tp=='A':
675 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc+(j+1))
676 |
677 | f.close()
678 |
679 | sys.stderr.write('Loaded {} positions from file {}\n'.format(total, splicing_file))
680 | sys.stderr.write('\tPartial:{}\n'.format(total_array))
681 |
682 | return splice_positions
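     | # The expected splicing-file format, as inferred from the parser above, is one
     | # whitespace-separated record per splice site, with the chromosome in column 1,
     | # the coordinate in column 2, the site type in column 4 (D=donor, A=acceptor)
     | # and the strand in column 5. For example, with the default SPLICING_SPAN of 4,
     | # a line "chr18 14237 . D +" masks positions 14238-14241.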
683 |
684 | def create_omopolymeric_positions(reference_file, omopolymeric_file):
685 |
686 | tic = datetime.datetime.now()
687 |
688 | sys.stderr.write("Creating omopolymeric positions (span={}) from reference file {}\n".format(OMOPOLYMERIC_SPAN, reference_file))
689 |
690 | index_file = reference_file + ".fai"
691 | sys.stderr.write("Loading chromosome names from index file {}\n".format(index_file))
692 | chromosomes = load_chromosome_names(index_file)
693 | sys.stderr.write("{} chromosome names found\n".format(str(len(chromosomes))))
694 |
695 | positions = []
696 |
697 | try:
698 | # Opening reference fasta file
699 | sys.stderr.write("Opening reference file {}.\n".format(reference_file))
700 | fasta_reader = pysam.FastaFile(reference_file)
701 | sys.stderr.write("Reference file {} opened.\n".format(reference_file))
702 |
703 | for chromosome in chromosomes:
704 | sys.stderr.write("Loading reference sequence for chromosome {}\n".format(chromosome))
705 | sequence = fasta_reader.fetch(chromosome).lower()
706 | sys.stderr.write("Reference sequence for chromosome {} loaded (len: {})\n".format(chromosome, len(sequence)))
707 |
708 | equals = 0
709 | last = None
710 | for i, b in enumerate(sequence):
711 |
712 | # if chromosome == "chr18" and i > 190450 and i < 190500:
713 | # print(i, b, last, OMOPOLYMERIC_SPAN, sequence[190450:190480])
714 |
715 | if b == last:
716 | equals += 1
717 | else:
718 | if equals >= OMOPOLYMERIC_SPAN:
719 | # sys.stderr.write("Found a new omopolymeric interval: ({}, {}-{}): {}\n".format(chromosome, i-equals, i, sequence[i-equals:i]))
720 | positions.append((chromosome, i-equals, i, equals, last))
721 |
722 | equals = 1
723 |
724 | last = b
725 |
726 | # Closing
727 | fasta_reader.close()
728 | sys.stderr.write("Reference file {} closed.\n".format(reference_file))
729 |
730 | except ValueError as e:
731 | sys.stderr.write("Error in reading reference file {}: message={}\n".format(reference_file, e))
732 | except IOError:
733 | sys.stderr.write("The reference file {} could not be opened.\n".format(reference_file))
734 |
735 | sys.stderr.write("{} total omopolymeric positions found.\n".format(len(positions)))
736 |
737 | toc = datetime.datetime.now()
738 | sys.stderr.write("Time to produce all the omopolymeric positions: {}\n".format(toc-tic))
739 |
740 | sys.stderr.write("Writing omopolymeric positions to file: {}.\n".format(omopolymeric_file))
741 | writer = open(omopolymeric_file, "w")
742 | writer.write("#" + "\t".join(["Chromosome", "Start", "End", "Length", "Symbol"]) + "\n")
743 | for position in positions:
744 | writer.write("\t".join([str(x) for x in position]) + "\n")
745 | writer.close()
746 | sys.stderr.write("Omopolymeric positions written into file: {}.\n".format(omopolymeric_file))
747 |
748 | def init(samfile, region):
749 |
750 | print("Opening bamfile within region=" + str(region))
751 |
752 | if region is None or len(region) == 0:
753 | return samfile.fetch()
754 |
755 | if len(region) == 1:
756 | try:
757 | return samfile.fetch(region[0])
758 | except ValueError:
759 | return samfile.fetch(region[0].replace("chr", ""))
760 |
761 | else:
762 | try:
763 | return samfile.fetch(region[0], region[1], region[2])
764 | except ValueError:
765 | return samfile.fetch(region[0].replace("chr", ""), region[1], region[2])
766 |
767 | def within_interval(i, region):
768 |
769 | if region is None or len(region) <= 1:
770 | return True
771 |
772 | else:
773 | start = region[1]
774 | end = region[2]
775 | return i >= start and i <= end
776 |
777 | def get_header():
778 | return ["Region", "Position", "Reference", "Strand", "Coverage-q30", "MeanQ", "BaseCount[A,C,G,T]", "AllSubs", "Frequency", "gCoverage-q30", "gMeanQ", "gBaseCount[A,C,G,T]", "gAllSubs", "gFrequency"]
779 |
780 | from collections import Counter
781 | import pickle
782 | def load_target_positions(bed_file, region):
783 | print("Loading target positions from file {} (region:{})".format(bed_file, region))
784 |
785 | # if os.path.exists(bed_file + "save.p"):
786 | # return pickle.load(open( bed_file + "save.p", "rb" ))
787 |
788 | target_positions = {}
789 |
790 | extension = os.path.splitext(bed_file)[1]
791 | handler = None
792 | if extension == ".gz":
793 | handler = gzip.open(bed_file, "r")
794 | else:
795 | handler = open(bed_file, "r")
796 |
797 | read = 0
798 | total_positions = 0
799 | total = Counter()
800 | with handler as file:
801 | for line in file:
802 | read += 1
803 | fields = line.strip().split("\t")
804 | chr = fields[0]
805 | if read % 10000000 == 0: print("[{1}] {0} total lines read. Total positions: {2}".format(read, datetime.datetime.now(), total_positions))
806 |
807 | if region is not None and chr.replace("chr", "") != region[0].replace("chr", ""): continue
808 |
809 | start = int(fields[1])-1
810 |
811 | try:
812 | end = int(fields[2])-1
813 | except (IndexError, ValueError):
814 | end = start # In case the file has 2 columns only or the third column is not an integer
815 |
816 | intersection_start = max(region[1] if region is not None and len(region)>1 else 0, start)
817 | intersection_end = min(region[2] if region is not None and len(region)>2 else sys.maxint, end)
818 |
819 |
820 | # If the target region does not intersect the currently analyzed region
821 | if intersection_end < intersection_start: continue
822 |
823 | # print(line, chr, start, end, intersection_start, intersection_end, total)
824 |
825 | # Add target positions
826 | if chr not in target_positions: target_positions[chr] = SortedSet()
827 | for i in range(intersection_start, intersection_end+1):
828 |
829 | target_positions[chr].add(i)
830 | total[chr] += 1
831 | total_positions += 1
832 |
833 | print("### TARGET POSITIONS ###")
834 | print(total)
835 | print("TOTAL POSITIONS:", sum(total.values()))
836 | # pickle.dump(target_positions, open( bed_file + "save.p", "wb" ) )
837 |
838 | return target_positions
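     | # Note: both the start and end columns are decremented by one (i.e., the BED
     | # file is interpreted as 1-based, inclusive), and only the portion of each
     | # target interval that overlaps the region under analysis is expanded into
     | # single positions.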
839 |
840 | def analyze(options):
841 |
842 | global DEBUG
843 | global activate_debug
844 |
845 | print("[SYSTEM]", "PYSAM VERSION", pysam.__version__)
846 | print("[SYSTEM]", "PYSAM PATH", pysam.__path__)
847 |
848 | interface = 'ib0' if 'ib0' in netifaces.interfaces() else netifaces.interfaces()[0]
849 | hostname = socket.gethostbyaddr(netifaces.ifaddresses(interface)[netifaces.AF_INET][0]['addr'])
850 | pid = os.getpid()
851 | hostname_string = hostname[0] + "|" + hostname[2][0] + "|" + str(pid)
852 |
853 | bamfile = options["bamfile"]
854 | region = options["region"]
855 | reference_file = options["reference"]
856 | output = options["output"]
857 | append = options["append"]
858 | omopolymeric_file = options["omopolymeric_file"]
859 | splicing_file = options["splicing_file"]
860 | create_omopolymeric_file = options["create_omopolymeric_file"]
861 | bed_file = options["bed_file"] if "bed_file" in options else None
862 |
863 | LAUNCH_TIME = datetime.datetime.now()
864 | print("[INFO] ["+str(region)+"] START=" + str(LAUNCH_TIME))
865 |
866 | print("[INFO] Opening BAM file="+bamfile)
867 | samfile = pysam.AlignmentFile(bamfile, "rb")
868 |
869 | target_positions = {}
870 | if bed_file is not None:
871 | target_positions = load_target_positions(bed_file, region)
872 |
873 | omopolymeric_positions = {}
874 | if create_omopolymeric_file is True:
875 | if omopolymeric_file is not None:
876 | create_omopolymeric_positions(reference_file, omopolymeric_file)
877 | else:
878 | print("[ERROR] You asked to create the omopolymeric file, but you did not specify any output file. Exiting.")
879 | return
880 |
881 | load_omopolymeric_positions(omopolymeric_positions, omopolymeric_file, region)
882 | # if not omopolymeric_positions and omopolymeric_file is not None:
883 | # omopolymeric_positions = create_omopolymeric_positions(reference_file, omopolymeric_file)
884 |
885 | splice_positions = []
886 |
887 | if splicing_file:
888 | splice_positions = load_splicing_file(splicing_file)
889 |
890 | # Constants
891 | LAST_READ = None
892 | LOG_INTERVAL = 10000000
893 |
894 | # Take the time
895 | tic = datetime.datetime.now()
896 | first_tic = tic
897 |
898 | total = 0
899 |
900 | reads = dict()
901 |
902 | outputfile = None
903 |
904 | if output is not None:
905 | outputfile = output
906 | else:
907 | prefix = os.path.basename(bamfile)
908 | if region is not None:
909 | prefix += "_" + '_'.join([str(x) for x in region])
910 | outputfile = prefix + "_reditools2_table.gz"
911 |
912 | mode = "a" if append else "w"
913 |
914 | if outputfile.endswith("gz"): writer = gzip.open(outputfile, mode)
915 | else: writer = open(outputfile, mode)
916 |
917 | if not options["remove_header"]:
918 | writer.write("\t".join(get_header()) + "\n")
919 |
920 | # Open the iterator
921 | print("[INFO] Fetching data from bam {}".format(bamfile))
922 | print("[INFO] Narrowing REDItools to region {}".format(region))
923 | sys.stdout.flush()
924 |
925 | reference_reader = None
926 | if reference_file is not None: reference_reader = pysam.FastaFile(reference_file)
927 | chr_ref = None
928 |
929 | iterator = init(samfile, region)
930 |
931 | next_read = next(iterator, LAST_READ)
932 | if next_read is not None:
933 | # next_pos = next_read.get_reference_positions()
934 | # i = next_pos[0]
935 | i = next_read.reference_start
936 |
937 | total += 1
938 |
939 | read = None
940 | # pos = None
941 | last_chr = None
942 | finished = False
943 |
944 | DEBUG_START = region[1] if region is not None and len(region) > 1 else -1
945 | DEBUG_END = region[2] if region is not None and len(region) > 2 else -1
946 | STOP = -1
947 |
948 | while not finished:
949 |
950 | if activate_debug and DEBUG_START > 0 and i >= DEBUG_START-1: DEBUG = True
951 | if activate_debug and DEBUG_END > 0 and i >= DEBUG_END: DEBUG = False
952 | if STOP > 0 and i > STOP: break
953 |
954 | # if i>=46958774:
955 | # print(next_read)
956 | # print_reads(reads, i)
957 | # raw_input()
958 |
959 | if (next_read is LAST_READ and len(reads) == 0) or (region is not None and len(region) >= 3 and i > region[2]):
960 | print("NO MORE READS!")
961 | finished = True
962 | break
963 |
964 | # Jump if we consumed all the reads
965 | if len(reads) == 0:
966 | i = next_read.reference_start
967 | # print("[INFO] READ SET IS EMPTY. JUMP TO "+str(next_pos[0])+"!")
968 | # if len(next_pos) == 0: i = next_read.reference_start
969 | # else: i = next_pos[0]
970 |
971 | # print("P1", next_read.query_name, next_pos)
972 |
973 | # Get all the next read(s)
974 | #while next_read is not LAST_READ and (len(next_pos) > 0 and (next_pos[0] == i or next_pos[-1] == i)): # TODO: why or next_pos[-1] == i?
975 | while next_read is not LAST_READ and next_read.reference_start == i:
976 |
977 | read = next_read
978 | # pos = next_pos
979 |
980 | # When changing chromosome print some statistics
981 | if read is not LAST_READ and read.reference_name != last_chr:
982 |
983 | try:
984 | chr_ref = reference_reader.fetch(read.reference_name)
985 | except KeyError:
986 | chr_ref = reference_reader.fetch("chr" + read.reference_name)
987 |
988 | tac = datetime.datetime.now()
989 | print("[INFO] REFERENCE NAME=" + read.reference_name + " (" + str(tac) + ")\t["+delta(tac, tic)+"]")
990 | sys.stdout.flush()
991 | tic = tac
992 |
993 | last_chr = read.reference_name
994 |
995 | next_read = next(iterator, LAST_READ)
996 | if next_read is not LAST_READ:
997 | total += 1
998 | # next_pos = next_read.get_reference_positions()
999 |
1000 | if total % LOG_INTERVAL == 0:
1001 | print("[{}] [{}] [{}] Total reads loaded: {} [{}] [RAM:{}MB]".format(hostname_string, last_chr, region, total, datetime.datetime.now(), psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)))
1002 | sys.stdout.flush()
1003 |
1004 | # print("P2", next_read.query_name, next_read.get_reference_positions())
1005 |
1006 | #print("[INFO] Adding a read to the set=" + str(read.get_reference_positions()))
1007 |
1008 | # Check that the read passes the filters
1009 | if not filter_read(read): continue
1010 |
1011 | # ref_seq = read.get_reference_sequence()
1012 |
1013 | ref_pos = [x[1] for x in read.get_aligned_pairs() if x[0] is not None and x[1] is not None]
1014 | ref_seq = ''.join([chr_ref[x] for x in ref_pos]).upper()
1015 |
1016 | # if ref_seq != read.get_reference_sequence().upper():
1017 | # print("MY_REF={} \nPY_REF={} \nREAD_NAME={} \nPOSITIONS={} \nREAD={}\n--------------------------".format(ref_seq, read.get_reference_sequence().upper(), read.query_name, read.get_reference_positions(), read.query_sequence))
1018 |
1019 | # raw_input()
1020 |
1021 | # if len(ref_seq) != len(read.query_sequence) or len(pos) != len(read.query_sequence) or len(pos) != len(ref_seq):
1022 | # print("=== DETAILS ===")
1023 | # print("i="+str(i))
1024 | # print("ref_seq="+str(len(ref_seq)))
1025 | # print("seq="+str(len(read.query_sequence)))
1026 | # print("pos="+str(len(pos)))
1027 | # print("qual="+str(len(read.query_qualities)))
1028 | # print("index="+str(read.query_alignment_start))
1029 | # print(ref_seq)
1030 | # print(read.query_sequence)
1031 | # print(pos)
1032 | # print(read.query_qualities)
1033 |
1034 | t = "*"
1035 |
1036 | if not IS_DNA:
1037 | if read.is_read1:
1038 | if strand == 1:
1039 | if read.is_reverse: t='-'
1040 | else: t='+'
1041 | else:
1042 | if read.is_reverse: t='+'
1043 | else: t='-'
1044 | elif read.is_read2:
1045 | if strand == 2:
1046 | if read.is_reverse: t='-'
1047 | else: t='+'
1048 | else:
1049 | if read.is_reverse: t='+'
1050 | else: t='-'
1051 | else: # for single ends
1052 | if strand == 1:
1053 | if read.is_reverse: t='-'
1054 | else: t='+'
1055 | else:
1056 | if read.is_reverse: t='+'
1057 | else: t='-'
1058 |
1059 | qualities = read.query_qualities
1060 | if qualities is None: qualities = [DEFAULT_BASE_QUALITY for x in range(0, len(ref_seq))]
1061 |
1062 | item = {
1063 | # "index": 0,
1064 | "pos": read.reference_start - 1,
1065 | # "pos": i-1,
1066 | "alignment_index": read.query_alignment_start - 1,
1067 | # "alignment_index": -1,
1068 | "reference_index": -1,
1069 | "query_alignment_start": read.query_alignment_start,
1070 | "object": read,
1071 | "reference": ref_seq,
1072 | "reference_len": len(ref_seq),
1073 | "sequence": read.query_sequence,
1074 | # "positions": pos,
1075 | "chromosome": read.reference_name,
1076 | "query_qualities": qualities,
1077 | "qualities_len": len(qualities),
1078 | "length": read.query_length,
1079 | "cigar": read.cigarstring,
1080 | "strand": t
1081 | }
1082 |
1083 | cigar_list = [[int(c), op] for (c, op) in re.findall(r'(\d+)(.)', item["cigar"])]
1084 | # if read.is_reverse:
1085 | # cigar_list.reverse()
1086 | item["cigar_list"] = cigar_list
1087 |
1088 | # if read.query_sequence == "AGGCTCTCTTAATGTAATAAAAGCCATCTATGACAAACCCACAGCCAACATAATACTGAATGGGGAAAAGGTGAAA":
1089 | # print(i, read.reference_start, item, read)
1090 |
1091 | # item["ref"] = item["reference"][item["reference_index"]]
1092 | # item["alt"] = item["sequence"][item["alignment_index"]]
1093 | # item["qual"] = item["query_qualities"][item["alignment_index"]]
1094 |
1095 | # print(item["cigar"])
1096 | # print(item["cigar_list"])
1097 | # print(read.get_aligned_pairs())
1098 | # print("REF START = " + str(read.reference_start))
1099 | # print("REF POS[0] = " + str(item["positions"][0]))
1100 | # print("ALIGN START = " + str(item["alignment_index"]))
1101 | # raw_input("CIGAR STRING PARSED...")
1102 |
1103 | # if item["cigar"] != "76M":
1104 | # item["pairs"] = read.get_aligned_pairs()
1105 |
1106 | # if read.query_sequence == "CACGGACTTTTCCTGAAATTTATTTTTATGTATGTATATCAAACATTGAATTTCTGTTTTCTTCTTTACTGGAATT" and pos[0] == 14233 and pos[-1] == 14308:
1107 | # print("[FILTER_READ] F={} QC={} MP={} LEN={} SECOND={} SUPPL={} DUPL={} PAIRED={} READ={}".format(read.flag, read.is_qcfail, read.mapping_quality, read.query_length, read.is_secondary, read.is_supplementary, read.is_duplicate, read.is_paired, read))
1108 | # raw_input("SPECIAL READ...")
1109 |
1110 | # print(item)
1111 | # raw_input("Press enter to continue...")
1112 |
1113 | # print(item)
1114 | # if i > 15400000:
1115 | # print(read.reference_name, i)
1116 | # raw_input("Press enter to continue...")
1117 |
1118 | end_position = read.reference_end #pos[-1]
1119 | if end_position not in reads:
1120 | reads[end_position] = []
1121 |
1122 | if DEBUG:
1123 | print("Adding item="+str(item))
1124 | reads[end_position].append(item)
1125 |
1126 | # Debug purposes
1127 | # if DEBUG:
1128 | # print("BEFORE UPDATE (i="+str(i)+"):")
1129 | # print_reads(reads, i)
1130 |
1131 | pos_based_read_dictionary = update_reads(reads, i)
1132 |
1133 | column = get_column(pos_based_read_dictionary, reads, splice_positions, last_chr, omopolymeric_positions, target_positions, i)
1134 |
1135 | # Debug purposes
1136 | if DEBUG:
1137 | # print("AFTER UPDATE:");
1138 | # print_reads(reads, i)
1139 | raw_input("Press enter to continue...")
1140 |
1141 | # Go the next position
1142 | i += 1
1143 | # print("Position i"+str(i))
1144 |
1145 | # if DEBUG:
1146 | # print("[DEBUG] WRITING COLUMN IN POSITION {}: {}".format(i, column is not None))
1147 | # print(column)
1148 | # print_reads(reads, i)
1149 |
1150 | if column is not None and within_interval(i, region) and not (strict_mode and column["non_zero"] == 0):
1151 | # head='Region\tPosition\tReference\tStrand\tCoverage-q%i\tMeanQ\tBaseCount[A,C,G,T]\t
1152 | # AllSubs\tFrequency\t
1153 | # gCoverage-q%i\tgMeanQ\tgBaseCount[A,C,G,T]\tgAllSubs\tgFrequency\n' %(MQUAL,gMQUAL)
1154 | # cov,bcomp,subs,freq=BaseCount(seq,ref,MINIMUM_EDITS_FREQUENCY,MIN_EDITS_SINGLE)
1155 | # mqua=meanq(qual,len(seq))
1156 | # line='\t'.join([chr,str(pileupcolumn.pos+1),ref,mystrand,str(cov),mqua,str(bcomp),subs,freq]+['-','-','-','-','-'])+'\n'
1157 | # [A,C,G,T]
1158 |
1159 | writer.write("\t".join([
1160 | last_chr,
1161 | str(i),
1162 | column["ref"],
1163 | str(column["strand"]),
1164 | str(column["passed"]),
1165 | "{0:.2f}".format(column["mean_quality"]),
1166 | str(column["distribution"]),
1167 | " ".join([column["ref"] + el for el in column["variants"]]) if column["non_zero"] >= 1 else "-",
1168 | "{0:.2f}".format(column["frequency"]),
1169 | "\t".join(['-','-','-','-','-'])
1170 | ]) + "\n")
1171 | # if column["passed"] >= 1000: print("WRITTEN LINE {} {} {} {} {}".format(last_chr, str(i), column["ref"], column["strand"], column["passed"]))
1172 | # writer.flush()
1173 | elif VERBOSE:
1174 | sys.stderr.write("[VERBOSE] [NOPRINT] Not printing position ({}, {}) WITHIN_INTERVAL={} STRICT_MODE={} COLUMN={}\n".format(last_chr, i, within_interval(i, region), strict_mode, column))
1175 |
1176 | # Remove old reads
1177 | reads.pop(i-1, None)
1178 |
1179 | if reference_reader is not None: reference_reader.close()
1180 | samfile.close()
1181 | writer.close()
1182 |
1183 | tac = datetime.datetime.now()
1184 | print("[INFO] ["+hostname_string+"] ["+str(region)+"] " + str(total) + " total reads read")
1185 | print("[INFO] ["+hostname_string+"] ["+str(region)+"] END=" + str(tac) + "\t["+delta(tac, tic)+"]")
1186 | print("[INFO] ["+hostname_string+"] ["+str(region).ljust(50)+"] FINAL END=" + str(tac) + " START="+ str(first_tic) + "\t"+ str(region) +"\t[TOTAL COMPUTATION="+delta(tac, first_tic)+"] [LAUNCH TIME:"+str(LAUNCH_TIME)+"] [TOTAL RUN="+delta(tac, LAUNCH_TIME)+"] [READS="+str(total)+"]")
1187 |
1188 | complement_map = {"A":"T", "T":"A", "C":"G", "G":"C"}
1189 | def complement(b):
1190 | return complement_map[b]
1191 |
1192 | def complement_all(sequence):
1193 | return ''.join([complement_map[l] for l in sequence])
1194 |
1195 | def prop(tot,va):
1196 | try: av=float(va)/tot
1197 | except ZeroDivisionError: av=0.0
1198 | return av
1199 |
1200 | def vstand(strand): # strand='+-+-+-++++++-+++'
1201 |
1202 | vv=[(strand.count('+'),'+'),(strand.count('-'),'-'),(strand.count('*'),'*')]
1203 | if vv[0][0]==0 and vv[1][0]==0: return '*'
1204 | if use_strand_confidence: # flag indicating whether to use criterion 2; otherwise criterion 1 is used
1205 | totvv=sum([x[0] for x in vv[:2]])
1206 | if prop(totvv,vv[0][0])>=strand_confidence_value: return '+' # strand_confidence_value is the threshold, between 0 and 1 (default 0.7)
1207 | if prop(totvv,vv[1][0])>=strand_confidence_value: return '-'
1208 | return '*'
1209 | else:
1210 | if vv[0][0]==vv[1][0] and vv[2][0]==0: return '+'
1211 | return max(vv)[1]
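     | # For example, vstand('+++-') returns '+' under both criteria: with
     | # use_strand_confidence and the default strand_confidence_value of 0.7,
     | # because 3/4 = 0.75 >= 0.7; with the maxValue criterion, because '+' has the
     | # highest count. A balanced column such as '++--' instead yields '*' under the
     | # confidence criterion (0.5 < 0.7) and '+' under the maxValue criterion.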
1212 |
1213 | def parse_options():
1214 |
1215 | # Options parsing
1216 | parser = argparse.ArgumentParser(description='REDItools 2.0')
1217 | parser.add_argument('-f', '--file', help='The bam file to be analyzed')
1218 | parser.add_argument('-o', '--output-file', help='The output statistics file')
1219 | parser.add_argument('-S', '--strict', default=False, action='store_true', help='Activate strict mode: only sites with edits will be included in the output')
1220 | parser.add_argument('-s', '--strand', type=int, default=0, help='Strand: this can be 0 (unstranded), 1 (second-strand oriented) or 2 (first-strand oriented)')
1221 | parser.add_argument('-a', '--append-file', action='store_true', help='Appends results to file (and creates if not existing)')
1222 | parser.add_argument('-r', '--reference', help='The reference FASTA file')
1223 | parser.add_argument('-g', '--region', help='The region of the bam file to be analyzed')
1224 | parser.add_argument('-m', '--omopolymeric-file', help='The file containing the omopolymeric positions')
1225 | parser.add_argument('-c', '--create-omopolymeric-file', default=False, help='Whether to create the omopolymeric positions file', action='store_true')
1226 | parser.add_argument('-os', '--omopolymeric-span', type=int, default=5, help='The omopolymeric span, i.e., the minimum length of a homopolymeric stretch to be masked')
1227 | parser.add_argument('-sf', '--splicing-file', help='The file containing the positions of splicing sites')
1228 | parser.add_argument('-ss', '--splicing-span', type=int, default=4, help='The splicing span, i.e., the number of bases masked on each side of a splice site')
1229 | parser.add_argument('-mrl', '--min-read-length', type=int, default=30, help='The minimum read length. Reads whose length is below this value will be discarded.')
1230 | parser.add_argument('-q', '--min-read-quality', type=int, default=20, help='The minimum read quality. Reads whose mapping quality is below this value will be discarded.')
1231 | parser.add_argument('-bq', '--min-base-quality', type=int, default=30, help='The minimum base quality. Bases whose quality is below this value will not be included in the analysis.')
1232 | parser.add_argument('-mbp', '--min-base-position', type=int, default=0, help='The minimum base position. Bases within the first min-base-position positions of the read will not be included in the analysis.')
1233 | parser.add_argument('-Mbp', '--max-base-position', type=int, default=0, help='The maximum base position. Bases within the last max-base-position positions of the read will not be included in the analysis.')
1234 | parser.add_argument('-l', '--min-column-length', type=int, default=1, help='The minimum length of the editing column (per position). Positions whose columns are shorter than this value will not be included in the analysis.')
1235 | parser.add_argument('-men', '--min-edits-per-nucleotide', type=int, default=1, help='The minimum number of editing events for each nucleotide (per position). Positions where a variant base is supported by fewer reads than this value will not be included in the analysis.')
1236 | parser.add_argument('-me', '--min-edits', type=int, default=0, help='The minimum number of editing events (per position). Positions with fewer total editing events than this value will not be included in the analysis.')
1237 | parser.add_argument('-Men', '--max-editing-nucleotides', type=int, default=100, help='The maximum number of edited nucleotides, from 0 to 4 (per position). Positions whose columns contain more than max-editing-nucleotides distinct bases will not be included in the analysis.')
1238 | parser.add_argument('-d', '--debug', default=False, help='REDItools is run in DEBUG mode.', action='store_true')
1239 | parser.add_argument('-T', '--strand-confidence', default=1, help='Strand inference type 1:maxValue 2:useConfidence [1]; maxValue: the most prominent strand count will be used; useConfidence: strand is assigned if over a prefixed frequency confidence (-Tv option)')
1240 | parser.add_argument('-C', '--strand-correction', default=False, help='Strand correction. Once the strand has been inferred, only bases according to this strand will be selected.', action='store_true')
1241 | parser.add_argument('-Tv', '--strand-confidence-value', type=float, default=0.7, help='Strand confidence [0.70]')
1242 | parser.add_argument('-V', '--verbose', default=False, help='Verbose information in stderr', action='store_true')
1243 | parser.add_argument('-H', '--remove-header', default=False, help='Do not include header in output file', action='store_true')
1244 | parser.add_argument('-N', '--dna', default=False, help='Run REDItools 2.0 on DNA-Seq data', action='store_true')
1245 | parser.add_argument('-B', '--bed_file', help='Path of BED file containing target regions')
1246 |
1247 | args = parser.parse_known_args()[0]
1248 | # print(args)
1249 |
1250 | global activate_debug
1251 | activate_debug = args.debug
1252 |
1253 | global VERBOSE
1254 | VERBOSE = args.verbose
1255 |
1256 | bamfile = args.file
1257 | if bamfile is None:
1258 | print("[ERROR] An input bam file is mandatory. Please, provide one (-f|--file)")
1259 | exit(1)
1260 |
1261 | omopolymeric_file = args.omopolymeric_file
1262 | global OMOPOLYMERIC_SPAN
1263 | OMOPOLYMERIC_SPAN = args.omopolymeric_span
1264 | create_omopolymeric_file = args.create_omopolymeric_file
1265 |
1266 | reference_file = args.reference
1267 | if reference_file is None:
1268 | print("[ERROR] An input reference file is mandatory. Please, provide one (-r|--reference)")
1269 | exit(1)
1270 |
1271 | output = args.output_file
1272 | append = args.append_file
1273 |
1274 | global strict_mode
1275 | strict_mode = args.strict
1276 |
1277 | global strand
1278 | strand = args.strand
1279 |
1280 | global strand_correction
1281 | strand_correction = args.strand_correction
1282 |
1283 | global use_strand_confidence
1284 | use_strand_confidence = str(args.strand_confidence) == "2" # only strand inference type 2 uses the confidence criterion
1285 |
1286 | global strand_confidence_value
1287 | strand_confidence_value = float(args.strand_confidence_value)
1288 |
1289 | splicing_file = args.splicing_file
1290 | global SPLICING_SPAN
1291 | SPLICING_SPAN = args.splicing_span
1292 |
1293 | global MIN_READ_LENGTH
1294 | MIN_READ_LENGTH = args.min_read_length
1295 |
1296 | global MIN_QUALITY
1297 | MIN_QUALITY = args.min_read_quality
1298 |
1299 | global MIN_BASE_QUALITY
1300 | MIN_BASE_QUALITY = args.min_base_quality
1301 |
1302 | global DEFAULT_BASE_QUALITY
1303 | DEFAULT_BASE_QUALITY = 30
1304 |
1305 | global MIN_BASE_POSITION
1306 | MIN_BASE_POSITION = args.min_base_position
1307 |
1308 | global MAX_BASE_POSITION
1309 | MAX_BASE_POSITION = args.max_base_position
1310 |
1311 | global MIN_COLUMN_LENGTH
1312 | MIN_COLUMN_LENGTH = args.min_column_length
1313 |
1314 | global MIN_EDITS_SINGLE
1315 | MIN_EDITS_SINGLE = args.min_edits_per_nucleotide
1316 |
1317 | global MIN_EDITS_NO
1318 | MIN_EDITS_NO = args.min_edits
1319 |
1320 | global MAX_CHANGES
1321 | MAX_CHANGES = args.max_editing_nucleotides
1322 |
1323 | global IS_DNA
1324 | IS_DNA = args.dna
1325 |
1326 | bed_file = args.bed_file
1327 |
1328 | if IS_DNA and bed_file is None:
1329 | print("[ERROR] When analyzing DNA-Seq files it is mandatory to provide a BED file containing the positions of target regions (-B|--bed_file)")
1330 | exit(1)
1331 |
1332 | region = None
1333 |
1334 | if args.region:
1335 | region = re.split("[:-]", args.region)
1336 | if not region or len(region) == 2 or (len(region) == 3 and region[1] == region[2]):
1337 | sys.stderr.write("[ERROR] Please provide a region of the form chrom:start-end (with end > start). Region provided: {}".format(region))
1338 | exit(1)
1339 | if len(region) >= 2:
1340 | region[1] = int(region[1])
1341 | region[2] = int(region[2])
1342 |
1343 | options = {
1344 | "bamfile": bamfile,
1345 | "region": region,
1346 | "reference": reference_file,
1347 | "output": output,
1348 | "append": append,
1349 | "omopolymeric_file": omopolymeric_file,
1350 | "create_omopolymeric_file": create_omopolymeric_file,
1351 | "splicing_file": splicing_file,
1352 | "remove_header": args.remove_header,
1353 | "bed_file": bed_file
1354 | }
1355 |
1356 | # print("RUNNING REDItools 2.0 with the following options", options)
1357 |
1358 | return options
1359 |
1360 | # -i /marconi_scratch/userexternal/tflati00/test_picardi/reditools_test/SRR1413602.bam
1361 | # -o editing18_test -f /marconi_scratch/userinternal/tcastign/test_picardi/hg19.fa -c1,1
1362 | # -m20,20 -v1 -q30,30 -e -n0.0 -N0.0 -u -l -p --gzip -H -Y chr18:1-78077248 -F chr18_1_78077248
1363 | #
1364 | # -f /home/flati/data/reditools/SRR1413602.bam -r /home/flati/data/reditools/hg19.fa -g chr18:14237-14238 -m /home/flati/data/reditools/omopolymeric_positions.txt
1365 | #
1366 | # -f /home/flati/data/reditools/SRR1413602.bam
1367 | # -r /home/flati/data/reditools/hg19.fa
1368 | # -g chr18:14237-14238
1369 | # -m /home/flati/data/reditools/omopolymeric_positions.txt
1370 | if __name__ == '__main__':
1371 |
1372 | options = parse_options()
1373 |
1374 | analyze(options)
1375 |
1376 |
1377 |
1378 |
--------------------------------------------------------------------------------
/src/cineca/reditools2_multisample.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | import os
4 | import glob
5 | import sys
6 | import re
7 | import time
8 | from mpi4py import MPI
9 | import datetime
10 | from collections import OrderedDict
11 | import reditools
12 | import argparse
13 | import gc
14 | import socket
15 | import netifaces
16 | from random import shuffle
17 |
18 | ALIGN_CHUNK = 0
19 | STOP_WORKING = 1
20 | IM_FREE = 2
21 | CALCULATE_COVERAGE = 3
22 |
23 | STEP = 10000000
24 |
25 | def weight_function(x):
26 | # x = math.log(1+x)
27 | # return 2.748*10**(-3)*x**3 -0.056*x**2 + 0.376*x + 2.093
28 | return x**3
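     | # The cubic weight makes deeply covered positions dominate the interval
     | # balancing: for example, a position covered 100x weighs 10^6, a thousand
     | # times more than a 10x position (10^3), so intervals are sized by expected
     | # work rather than by genomic width.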
29 |
30 | def get_coverage(coverage_file, region = None):
31 |
32 | # Open the file and read i-th section (jump to the next '\n' character)
33 | n = float(sample_size)
34 | file_size = os.path.getsize(coverage_file)
35 | print("[{}] SIZE OF FILE {}: {} bytes".format(sample_rank, coverage_file, file_size))
36 | start = int(sample_rank*(file_size/n))
37 | end = int((sample_rank+1)*(file_size/n))
38 | print("[{}] [DEBUG] START={} END={}".format(sample_rank, start, end))
39 |
40 | f = open(coverage_file, "r")
41 | f.seek(start)
42 | loaded = start
43 | coverage_partial = 0
44 | with f as lines:
45 | line_no = 0
46 | for line in lines:
47 | if loaded >= end: break
48 | loaded += len(line)
49 |
50 | line_no += 1
51 | if line_no == 1:
52 | if not line.startswith("chr"):
53 | continue
54 |
55 | triple = line.rstrip().split("\t")
56 |
57 | if region is not None:
58 | if triple[0] != region[0]: continue
59 | if len(region) >= 2 and int(triple[1]) < region[1]: continue
60 | if len(region) >= 2 and int(triple[1]) > region[2]: continue
61 |
62 | #if line_no % 10000000 == 0:
63 | # print("[{}] [DEBUG] Read {} lines so far".format(rank, line_no))
64 | cov = int(triple[2])
65 | coverage_partial += weight_function(cov)
66 |
67 | print("[{}] START={} END={} PARTIAL_COVERAGE={}".format(sample_rank, start, end, coverage_partial))
68 |
69 | # Reduce
70 | coverage = None
71 |
72 | coverages = sample_comm.gather(coverage_partial)
73 | if sample_rank == 0:
74 | print("COVERAGES:", str(coverages))
75 | coverage = reduce(lambda x,y: x+y, coverages)
76 |
77 | coverage = sample_comm.bcast(coverage)
78 |
79 | # Return the total
80 | return coverage
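     | # Each rank of a sample's communicator scans its own byte slice of the .cov
     | # file and computes a partial weighted coverage; rank 0 then gathers and sums
     | # the partials and broadcasts the total, so that every rank ends up with the
     | # same total_coverage value.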
81 |
82 | def calculate_intervals(total_coverage, coverage_file, region):
83 | print("[SYSTEM] [{}] Opening coverage file={}".format(sample_rank, coverage_file))
84 | f = open(coverage_file, "r")
85 |
86 | chr = None
87 | start = None
88 | end = None
89 | C = 0
90 | max_interval_width = min(3000000, 3000000000 / sample_size)
91 |
92 | subintervals = []
93 | subtotal = total_coverage / sample_size
94 | print("[SYSTEM] TOTAL={} SUBTOTAL={} MAX_INTERVAL_WIDTH={}".format(total_coverage, subtotal, max_interval_width))
95 |
96 | line_no = 0
97 | with f as lines:
98 | for line in lines:
99 | line_no += 1
100 | if line_no % 1000000 == 0:
101 | print("[SYSTEM] [{}] Time: {} - {} lines loaded.".format(sample_rank, time.time(), line_no))
102 |
103 | fields = line.rstrip().split("\t")
104 |
105 | if region is not None:
106 | if fields[0] != region[0]: continue
107 | if len(region) >= 2 and int(fields[1]) < region[1]: continue
108 | if len(region) >= 3 and int(fields[1]) > region[2]: continue
109 |
110 | # If the interval has become either i) too large or ii) too heavy or iii) spans across two different chromosomes
111 | if C >= subtotal or (chr is not None and fields[0] != chr) or (end is not None and start is not None and (end-start) > max_interval_width):
112 | reason = None
113 | if C >= subtotal: reason = "WEIGHT"
114 | elif chr is not None and fields[0] != chr: reason = "END_OF_CHROMOSOME"
115 | elif end is not None and start is not None and (end-start) > max_interval_width: reason = "MAX_WIDTH"
116 |
117 | interval = (chr, start, end, C, end-start, reason)
118 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(sample_rank, time.time(), interval))
119 | subintervals.append(interval)
120 | chr = None
121 | start = None
122 | end = None
123 | C = 0
124 | if len(fields) < 3: continue
125 |
126 | if chr is None: chr = fields[0]
127 | if start is None: start = int(fields[1])
128 | end = int(fields[1])
129 | C += weight_function(int(fields[2]))
130 |
131 | if C > 0:
132 | reason = "END_OF_CHROMOSOME"
133 | interval = (chr, start, end, C, end-start, reason)
134 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(sample_rank, time.time(), interval))
135 | subintervals.append(interval)
136 |
137 | return subintervals
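     | # An interval is closed as soon as one of three conditions holds: its
     | # accumulated weight reaches total_coverage/sample_size (WEIGHT), the
     | # chromosome changes (END_OF_CHROMOSOME), or its width exceeds
     | # max_interval_width (MAX_WIDTH). For example, with total_coverage=8e9 and
     | # sample_size=8, each interval targets a weight of about 1e9.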
138 |
139 | if __name__ == '__main__':
140 |
141 | # MPI init
142 | comm = MPI.COMM_WORLD
143 | rank = comm.Get_rank()
144 | size = comm.Get_size()
145 |
146 | options = reditools.parse_options()
147 | options["remove_header"] = True
148 |
149 | parser = argparse.ArgumentParser(description='REDItools 2.0')
150 | parser.add_argument('-D', '--coverage-dir', help='The coverage directory containing the coverage file of the sample to analyze divided by chromosome')
151 | parser.add_argument('-t', '--temp-dir', help='The temp directory where to store temporary data for this sample')
152 | parser.add_argument('-Z', '--chromosome-sizes', help='The file with the chromosome sizes')
153 | parser.add_argument('-g', '--region', help='The region of the bam file to be analyzed')
154 | parser.add_argument('-F', '--samples-file', help='The file listing each bam file to be analyzed on a separate line')
155 | args = parser.parse_known_args()[0]
156 |
157 | coverage_dir = args.coverage_dir
158 | temp_dir = args.temp_dir
159 | size_file = args.chromosome_sizes
160 | samples_filepath = args.samples_file
161 |
162 | samples = None
163 | if rank == 0:
164 | samples = []
165 | for line in open(samples_filepath, "r"):
166 | line = line.strip()
167 | samples.append(line)
168 |
169 | # Chronometer data structure
170 | if rank == 0:
171 | chronometer = {}
172 | for sample in samples:
173 | sample = os.path.basename(sample)
174 | sample = ".".join(sample.split(".")[0:-1])
175 | chronometer[sample] = {
176 | # "coverage": 0,
177 | # "intervals": 0,
178 | "parallel": 0
179 | }
180 |
181 | print("CHRONOMETER", chronometer)
182 |
183 | samples = comm.bcast(samples, root=0)
184 |
185 | PROCS_PER_SAMPLE = size / len(samples)
186 | if rank == 0:
187 | print("[{}] PROCESSES_PER_SAMPLE={}".format(rank, PROCS_PER_SAMPLE))
188 |
189 | interface = 'ib0' if 'ib0' in netifaces.interfaces() else netifaces.interfaces()[0]
190 | hostname = socket.gethostbyaddr(netifaces.ifaddresses(interface)[netifaces.AF_INET][0]['addr'])
191 | pid = os.getpid()
192 | # print("[SYSTEM] [TECH] [NODE] RANK:{} HOSTNAME:{} PID:{}".format(rank, hostname, pid))
193 |
194 | # if rank == 0:
195 | # print("[SYSTEM] LAUNCHED PARALLEL REDITOOLS WITH THE FOLLOWING OPTIONS:", options, args)
196 |
197 | region = None
198 | if args.region:
199 | region = re.split("[:-]", args.region)
200 | if not region or len(region) == 2 or (len(region) == 3 and region[1] == region[2]):
201 | sys.stderr.write("[ERROR] Please provide a region of the form chrom:start-end (with end > start). Region provided: {}".format(region))
202 | exit(1)
203 | if len(region) >= 2:
204 | region[1] = int(region[1])
205 | region[2] = int(region[2])
206 |
207 | t1 = time.time()
208 |
209 | # print("I am rank #"+str(rank))
210 |
211 | # COVERAGE SECTION
212 | sample_index = rank/PROCS_PER_SAMPLE
213 | sample_filepath = samples[sample_index]
214 | sample = os.path.basename(sample_filepath)
215 | sample = ".".join(sample.split(".")[0:-1])
216 |
217 | if rank % PROCS_PER_SAMPLE == 0:
218 | print("[{}] SAMPLE_INDEX={} SAMPLE_FILEPATH={} SAMPLE={}".format(rank, sample_index, sample_filepath, sample))
219 |
220 | sample_comm = comm.Split(sample_index)
221 | sample_rank = sample_comm.Get_rank()
222 | sample_size = sample_comm.Get_size()
223 |
224 | coverage_dir += sample + "/"
225 | coverage_file = coverage_dir + sample + ".cov"
226 | temp_dir += sample + "/"
227 |
228 | if not os.path.isfile(coverage_file):
229 | print("[ERROR] Coverage file {} not existing!".format(coverage_file))
230 | exit(1)
231 |
232 | try:
233 | if not os.path.exists(temp_dir):
234 | os.makedirs(temp_dir)
235 | except Exception as e:
236 | print("[WARN] {}".format(e))
237 |
238 | interval_file = temp_dir + "/intervals.txt"
239 | homeworks = []
240 |
241 | if os.path.isfile(interval_file) and os.stat(interval_file).st_size > 0:
242 | if sample_rank == 0:
243 | print("[{}] [{}] [S{}] [RESTART] FOUND INTERVAL FILE {} ".format(rank, sample_rank, sample_index, interval_file))
244 | expected_total = 0
245 | for line in open(interval_file, "r"):
246 | line = line.strip()
247 |
248 | if expected_total == 0:
249 | expected_total = int(line)
250 | continue
251 |
252 | # Interval format: (chr, start, end, C, end-start, reason)
253 | fields = line.split("\t")
254 | for i in range(1, 5):
255 | fields[i] = int(fields[i])
256 | homeworks.append([sample_index] + fields)
257 | print("[{}] [{}] [S{}] [RESTART] INTERVAL FILE #INTERVALS {} ".format(rank, sample_rank, sample_index, len(homeworks)))
258 |
259 | else:
260 | if sample_rank == 0:
261 | print("["+str(rank)+"] [S"+str(sample_index)+"] PRE-COVERAGE TIME " + str(datetime.datetime.now().time()))
262 |
263 | start_cov = time.time()
264 | total_coverage = get_coverage(coverage_file, region)
265 | end_cov = time.time()
266 | elapsed = end_cov - start_cov
267 | print("[{}] [{}] [S{}] [{}] [TOTAL_COVERAGE] {}".format(rank, sample_rank, sample_index, sample, total_coverage))
268 |
269 | # if sample_rank == 0:
270 | # interval_time = [sample, elapsed]
271 | # else:
272 | # interval_time = []
273 |
274 | # interval_times = comm.gather(interval_time)
275 |
276 | # if rank == 0:
277 | # interval_times = list(filter(lambda x: x is not None, interval_times))
278 | # for interval_time in interval_times:
279 | # if len(interval_time) > 0:
280 | # print("INTERVAL_TIME[0]", interval_time[0])
281 | # print("INTERVAL_TIME", interval_time)
282 | #
283 | # chronometer[interval_time[0]]["coverage"] = interval_time[1]
284 |
285 | if sample_rank == 0:
286 | now = datetime.datetime.now().time()
287 | elapsed = time.time() - t1
288 | print("[SYSTEM] [TIME] [MPI] [0] MIDDLE-COVERAGE [now:{}] [elapsed: {}]".format(now, elapsed))
289 |
290 | # Collect all the files with the coverage
291 | files = []
292 | for file in os.listdir(coverage_dir):
293 | if region is not None and file != region[0]: continue
294 | if file.startswith("."): continue
295 | if file.endswith(".cov"): continue
296 | if file.endswith(".txt"): continue
297 | if file == "chrM": continue
298 | files.append(file)
299 | files.sort()
300 |
301 | if sample_rank == 0:
302 | print("[0] [S"+str(sample_index)+"] " + str(len(files)) + " FILES => " + str(files))
303 |
304 | # Master: dispatches the work to the other slaves
305 | if sample_rank == 0:
306 | start_intervals = t1
307 | print("[0] Start time: {}".format(start_intervals))
308 |
309 | done = 0
310 | total = len(files)
311 |
312 | queue = set()
313 | for i in range(1, min(sample_size, total+1)):
314 | file = files.pop()
315 | print("[SYSTEM] [MPI] [0] Sending coverage data "+ str(file) +" to rank " + str(i))
316 | sample_comm.send(file, dest=i, tag=CALCULATE_COVERAGE)
317 | queue.add(i)
318 |
319 | while len(files) > 0:
320 | status = MPI.Status()
321 | subintervals = sample_comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
322 | for subinterval in subintervals:
323 | homeworks.append([sample_index] + list(subinterval))
324 |
325 | done += 1
326 | who = status.Get_source()
327 | queue.remove(who)
328 | now = datetime.datetime.now().time()
329 | elapsed = time.time() - start_intervals
330 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(sample_index, str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue))
331 |
332 | file = files.pop()
333 | print("[SYSTEM] [MPI] [0] [S"+str(sample_index)+"] Sending coverage data "+ str(file) +" to rank " + str(who))
334 | sample_comm.send(file, dest=who, tag=CALCULATE_COVERAGE)
335 | queue.add(who)
336 |
337 | while len(queue) > 0:
338 | status = MPI.Status()
339 | print("[SYSTEM] [MPI] [0] [S{}] Going to receive data from slaves.".format(sample_index))
340 | subintervals = sample_comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
341 | for subinterval in subintervals:
342 | homeworks.append([sample_index] + list(subinterval))
343 |
344 | done += 1
345 | who = status.Get_source()
346 | queue.remove(who)
347 | now = datetime.datetime.now().time()
348 | elapsed = time.time() - start_intervals
349 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(sample_index, str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue))
350 |
351 | # Let them know we finished calculating the coverage
352 | for i in range(1, sample_size):
353 | sample_comm.send(None, dest=i, tag=STOP_WORKING)
354 |
355 | now = datetime.datetime.now().time()
356 | elapsed = time.time() - start_intervals
357 |
358 | interval_file = temp_dir + "/intervals.txt"
359 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] SAVING INTERVALS TO {} [now:{}] [elapsed: {}]".format(sample_index, interval_file, now, elapsed))
360 | writer = open(interval_file, "w")
361 | writer.write(str(len(homeworks)) + "\n")
362 | for homework in homeworks:
363 | writer.write("\t".join([str(x) for x in homework[1:]]) + "\n")
364 | writer.close()
365 |
366 | now = datetime.datetime.now().time()
367 | elapsed = time.time() - start_intervals
368 | # interval_time = [sample, elapsed]
369 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] INTERVALS SAVED TO {} [now:{}] [elapsed: {}]".format(sample_index, interval_file, now, elapsed))
370 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] FINISHED CALCULATING INTERVALS [now:{}] [elapsed: {}]".format(sample_index, now, elapsed))
371 | else:
372 |
373 | # interval_time = []
374 |
375 | while True:
376 | status = MPI.Status()
377 | # Here data is the name of a chromosome.
378 | data = sample_comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
379 | tag = status.Get_tag()
380 | if tag == CALCULATE_COVERAGE:
381 | intervals = calculate_intervals(total_coverage, coverage_dir + data, region)
382 | sample_comm.send(intervals, dest=0, tag=IM_FREE)
383 | elif tag == STOP_WORKING:
384 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] received STOP calculating intervals SIGNAL FROM RANK 0 [{}]".format(str(rank), datetime.datetime.now().time()))
385 | break
386 |
387 | # interval_times = comm.gather(interval_time)
388 | # if rank == 0:
389 | # for interval_time in interval_times:
390 | # if len(interval_time) > 0:
391 | # chronometer[interval_time[0]]["intervals"] = interval_time[1]
392 |
393 | print("[{}] [{}] [S{}] [{}] BEFORE GATHER HOMEWORKS #TOTAL={} intervals".format(rank, sample_rank, sample_index, sample, len(homeworks)))
394 |
395 | # Wait for all intervals to be collected
396 | homeworks = comm.gather(homeworks)
397 | homeworks_by_sample = {}
398 | homeworks_done = {}
399 | if rank == 0:
400 | homeworks = reduce(lambda x, y: x + y, homeworks)  # concatenate the per-rank interval lists gathered above
401 | shuffle(homeworks)
402 |
403 | # Divide intervals by samples
404 | for homework in homeworks:
405 | sample_id = homework[0]
406 | if sample_id not in homeworks_by_sample: homeworks_by_sample[sample_id] = []
407 | homeworks_by_sample[sample_id].append(homework)
408 |
409 | for sample_id in homeworks_by_sample:
410 | homeworks_done[sample_id] = 0
411 |
412 | print("[{}] [{}] [S{}] #TOTAL={} (all intervals)".format(rank, sample_rank, sample_index, len(homeworks)))
413 |
414 | ###########################################################
415 | ######### COMPUTATION SECTION #############################
416 | ###########################################################
417 |
418 | if rank == 0:
419 | done = 0
420 | print("[SYSTEM] [TIME] [MPI] [0] REDItools STARTED. MPI SIZE (PROCS): {} [now: {}]".format(size, datetime.datetime.now().time()))
421 |
422 | t1 = time.time()
423 |
424 | print("Loading chromosomes' sizes!")
425 | chromosomes = OrderedDict()
426 | for line in open(size_file):
427 | (key, val) = line.split()[0:2]
428 | chromosomes[key] = int(val)
429 | print("Sizes:")
430 | print(chromosomes)
431 |
432 | total = len(homeworks)
433 | #print("[SYSTEM] [MPI] [0] HOMEWORKS", total, homeworks)
434 |
435 | start = time.time()
436 |
437 | print("[SYSTEM] [TIME] [MPI] [0] REDItools PILEUP START: [now: {}]".format(datetime.datetime.now().time()))
438 |
439 | queue = set()
440 | for i in range(1, min(size, total + 1)):  # total + 1, as in the coverage dispatch above, so a lone interval is still sent
441 | interval = homeworks.pop()
442 |
443 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(i))
444 | comm.send(interval, dest=i, tag=ALIGN_CHUNK)
445 | queue.add(i)
446 |
447 | while len(homeworks) > 0:
448 | status = MPI.Status()
449 | response = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
450 | done += 1
451 | who = status.Get_source()
452 | queue.remove(who)
453 | now = datetime.datetime.now().time()
454 | elapsed = time.time() - start
455 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue))
456 |
457 | sample_done = samples[response[0]]
458 | sample_done = os.path.basename(sample_done)
459 | sample_done = ".".join(sample_done.split(".")[0:-1])
460 | # print(response)  # debug leftover
461 | duration = response[-1] - response[-2]
462 | chronometer[sample_done]["parallel"] += duration
463 | homeworks_done[response[0]] += 1
464 | if homeworks_done[response[0]] == len(homeworks_by_sample[response[0]]):
465 | print("[SYSTEM] [MPI] [COMPLETE] [{}] [{}] [{}] now:{}".format(sample_done, chronometer[sample_done]["parallel"], str(datetime.timedelta(seconds=chronometer[sample_done]["parallel"])), now))
466 |
467 | interval = homeworks.pop()
468 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(who))
469 | comm.send(interval, dest=who, tag=ALIGN_CHUNK)
470 | queue.add(who)
471 |
472 | while len(queue) > 0:
473 | status = MPI.Status()
474 | response = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
475 | done += 1
476 | who = status.Get_source()
477 | queue.remove(who)
478 | now = datetime.datetime.now().time()
479 | elapsed = time.time() - start
480 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue))
481 |
482 | sample_done = samples[response[0]]
483 | sample_done = os.path.basename(sample_done)
484 | sample_done = ".".join(sample_done.split(".")[0:-1])
485 | duration = response[-1] - response[-2]
486 | chronometer[sample_done]["parallel"] += duration
487 | homeworks_done[response[0]] += 1
488 | if homeworks_done[response[0]] == len(homeworks_by_sample[response[0]]):
489 | print("[SYSTEM] [MPI] [COMPLETE] [{}] [{}] [{}] now:{}".format(sample_done, chronometer[sample_done]["parallel"], str(datetime.timedelta(seconds=chronometer[sample_done]["parallel"])), now))
490 |
491 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending DIE SIGNAL TO RANK " + str(who))
492 | comm.send(None, dest=who, tag=STOP_WORKING)
493 |
494 | t2 = time.time()
495 | elapsed = t2-t1
496 | print("[SYSTEM] [TIME] [MPI] [0] WHOLE PARALLEL ANALYSIS FINISHED. CREATING SETUP FOR MERGING PARTIAL FILES - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(elapsed, t2, datetime.datetime.now().time()))
497 |
498 | #####################################################################
499 | ######### RECOMBINATION OF SINGLE FILES #############################
500 | #####################################################################
501 | for s in samples:
502 |
503 | s = os.path.basename(s)
504 | s = ".".join(s.split(".")[0:-1])
505 |
506 | little_files = []
507 | print("Scanning all files in "+args.temp_dir + s +" matching " + ".*")
508 | for little_file in glob.glob(args.temp_dir + s + "/*"):
509 | if little_file.endswith("chronometer.txt"): continue
510 | if little_file.endswith("files.txt"): continue
511 | if little_file.endswith("intervals.txt"): continue
512 | if little_file.endswith("status.txt"): continue
513 | if little_file.endswith("progress.txt"): continue
514 | if little_file.endswith("times.txt"): continue
515 | if little_file.endswith("groups.txt"): continue
516 |
517 | print(little_file)
518 | pieces = re.sub(r"\..*", "", os.path.basename(little_file)).split("#")
519 | pieces.insert(0, little_file)
520 | little_files.append(pieces)
521 |
522 | # Sort the output files
523 | keys = list(chromosomes.keys())  # chromosome order used to sort the partial files
524 | print("[SYSTEM] {} FILES TO MERGE: {}".format(len(little_files), little_files))
525 | little_files = sorted(little_files, key = lambda x: (keys.index(x[1]), int(x[2])))
526 | print("[SYSTEM] {} FILES TO MERGE (SORTED): {}".format(len(little_files), little_files))
527 |
528 | smallfiles_list_filename = args.temp_dir + s + "/" + "files.txt"
529 | f = open(smallfiles_list_filename, "w")
530 | for little_file in little_files:
531 | f.write(little_file[0] + "\n")
532 | f.close()
533 |
534 | # Chronometer data
535 | chronometer_filename = args.temp_dir + "/" + "chronometer.txt"
536 | f = open(chronometer_filename, "w")
537 | #f.write("\t".join(["SampleID", "Coverage", "Intervals", "Editing", "Coverage (human)", "Intervals (human)", "Editing (human)"]))
538 | for s in chronometer:
539 | # coverage_duration = str(datetime.timedelta(seconds=chronometer[s]["coverage"]))
540 | # interval_duration = str(datetime.timedelta(seconds=chronometer[s]["intervals"]))
541 | parallel_duration = str(datetime.timedelta(seconds=chronometer[s]["parallel"]))
542 | f.write("\t".join([
543 | s,
544 | # str(chronometer[s]["coverage"]),
545 | # str(chronometer[s]["intervals"]),
546 | str(chronometer[s]["parallel"]),
547 | # coverage_duration,
548 | # interval_duration,
549 | parallel_duration]) + "\n")
550 | f.close()
551 |
552 | t2 = time.time()
553 | print("[SYSTEM] [TIME] [MPI] [0] [END] - WHOLE ANALYSIS FINISHED - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(t2-t1, t2, datetime.datetime.now().time()))
554 |
555 | # Slave processes
556 | else:
557 |
558 | while True:
559 | status = MPI.Status()
560 | data = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
561 |
562 | tag = status.Get_tag()
563 | if tag == ALIGN_CHUNK:
564 |
565 | # Process it
566 | time_start = time.time()
567 | time_s = datetime.datetime.now().time()
568 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] REDItools: STARTED {} from rank 0 [{}] Interval: {}".format(str(rank), str(data), time_s, data))
569 |
570 | local_sample_index = data[0]
571 | local_sample_filepath = samples[local_sample_index]
572 | local_sample = os.path.basename(local_sample_filepath)
573 | local_sample = ".".join(local_sample.split(".")[0:-1])
574 | print("[LAUNCH REDITOOLS] {} {} {}".format(local_sample_index, local_sample_filepath, local_sample))
575 |
576 | interval_id = data[1] + "#" + str(data[2]) + "#" + str(data[3])  # e.g. "chr1#1000#2000"; renamed from "id" to avoid shadowing the built-in
577 |
578 | options["bamfile"] = local_sample_filepath
579 | options["region"] = [data[1], data[2], data[3]]
580 | options["output"] = args.temp_dir + local_sample + "/" + id + ".gz"
581 |
582 | print("[MPI] [" + str(rank) + "] COMMAND-LINE:", options)
583 |
584 | if not os.path.exists(options["output"]):  # skip chunks already produced by a previous (interrupted) run
585 | gc.collect()
586 | reditools.analyze(options)
587 |
588 | time_end = time.time()
589 | time_e = datetime.datetime.now().time()
590 | print("[SYSTEM] [TIME] [MPI] [{}] REDItools: FINISHED {} [{}][{}] [TOTAL:{:5.2f}]".format(str(rank), str(data), time_s, datetime.datetime.now().time(), time_end - time_start))
591 |
592 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [SEND] [{}] SENDING IM_FREE tag TO RANK 0 [{}]".format(str(rank), datetime.datetime.now().time()))
593 | comm.send(data + [time_s, time_e, time_start, time_end], dest=0, tag=IM_FREE)
594 | elif tag == STOP_WORKING:
595 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] received DIE SIGNAL FROM RANK 0 [{}]".format(str(rank), datetime.datetime.now().time()))
596 | break
597 |
598 | print("[{}] EXITING [now:{}]".format(rank, time.time()))
599 |
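
Stripped of logging and bookkeeping, both dispatch loops above (coverage files within each sample group, then pileup intervals globally) implement the same tag-based master/worker protocol. Below is a minimal, self-contained sketch of that pattern for reference; the tag values and the toy workload are illustrative, not taken from the repository:

```python
# Minimal master/worker dispatch sketch with mpi4py (illustrative, not repository code).
# Run with e.g.: mpirun -np 4 python sketch.py
from mpi4py import MPI

WORK, IM_FREE, STOP_WORKING = 1, 2, 3  # hypothetical tag constants

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    work = list(range(20))  # stand-in for the interval list ("homeworks")
    queue = set()
    # Prime each worker with one item; len(work) + 1 guards against having
    # fewer items than workers (see the fix at line 440 above).
    for i in range(1, min(size, len(work) + 1)):
        comm.send(work.pop(), dest=i, tag=WORK)
        queue.add(i)
    # Feed whichever worker frees up first until the list is drained
    while work:
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
        comm.send(work.pop(), dest=status.Get_source(), tag=WORK)
    # Collect the last results, then tell every worker to stop
    while queue:
        status = MPI.Status()
        comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status)
        queue.discard(status.Get_source())
    for i in range(1, size):
        comm.send(None, dest=i, tag=STOP_WORKING)
else:
    while True:
        status = MPI.Status()
        item = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == STOP_WORKING:
            break
        result = item * item  # stand-in for reditools.analyze(options)
        comm.send(result, dest=0, tag=IM_FREE)
```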
--------------------------------------------------------------------------------
/src/cineca/reditools_table_to_bed.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | import gzip
4 |
5 | import argparse
6 | if __name__ == '__main__':
7 |
8 | parser = argparse.ArgumentParser(description='REDItools 2.0 table to BED file converter')
9 | parser.add_argument('-i', '--table-file', help='The RNA-editing events table to be converted')
10 | parser.add_argument('-o', '--bed_file', help='The output bed file')
11 | args = parser.parse_known_args()[0]
12 |
13 | input_path = args.table_file  # renamed from "input" to avoid shadowing the built-in
14 | output_path = args.bed_file
15 | 
16 | _, ext = os.path.splitext(input_path)
17 | fd_input = gzip.open(input_path, "r") if ext == ".gz" else open(input_path, "r")
18 | fd_output = open(output_path, "w")
19 |
20 | LOG_INTERVAL = 1000000
21 | last_chr = None
22 | start = None
23 | end = None
24 |
25 | total = 0
26 | with fd_input:
27 | for line in fd_input:
28 | total += 1
29 | if total % LOG_INTERVAL == 0:
30 | sys.stderr.write("[{}] {} lines read from {}\n".format(last_chr, total, input))
31 |
32 | fields = line.strip().split()
33 | chr = fields[0]
34 | pos = int(fields[1])
35 |
36 | if last_chr != chr or (end is not None and pos > end + 1):
37 | if last_chr is not None and start is not None and end is not None:
38 | fd_output.write("{}\t{}\t{}\n".format(last_chr, start, end))
39 | start = None
40 | end = None
41 |
42 | if start is None:
43 | start = pos
44 |
45 | if last_chr != chr:
46 | last_chr = chr
47 |
48 | if end is None or pos == end + 1:
49 | end = pos
50 |
51 | if last_chr is not None and start is not None and end is not None:
52 | fd_output.write("{}\t{}\t{}\n".format(last_chr, start, end))
53 | start = None
54 | end = None
55 |
56 | sys.stderr.write("{} lines read from {}\n".format(total, input))
57 |
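
A hypothetical invocation (file names are placeholders; a `.gz` extension triggers the gzip reader):

> python src/cineca/reditools_table_to_bed.py -i table.txt.gz -o table.bed

The conversion collapses runs of consecutive edited positions into single intervals: positions chr1:5, chr1:6, chr1:7 and chr1:10 become the two lines `chr1 5 7` and `chr1 10 10`. The sketch below restates that rule in isolation (assuming the table is sorted by chromosome and position, as REDItools output is); note that positions are written through unchanged, so the intervals inherit the table's 1-based inclusive coordinates rather than strict 0-based half-open BED:

```python
# Condensed restatement of the coalescing rule above (illustrative, not repository code)
def coalesce(positions):
    """positions: iterable of (chrom, pos) pairs, sorted as in the table."""
    runs = []
    last_chr, start, end = None, None, None
    for chrom, pos in positions:
        # A new chromosome or a gap (pos > end + 1) closes the current run
        if chrom != last_chr or (end is not None and pos > end + 1):
            if start is not None:
                runs.append((last_chr, start, end))
            start = pos
        last_chr, end = chrom, pos
    if start is not None:  # flush the final run
        runs.append((last_chr, start, end))
    return runs

assert coalesce([("chr1", 5), ("chr1", 6), ("chr1", 7), ("chr1", 10)]) == [
    ("chr1", 5, 7), ("chr1", 10, 10)]
```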
--------------------------------------------------------------------------------
/template.html:
--------------------------------------------------------------------------------
(content stripped during extraction: only the page title "Timeline | Basic demo" survives from this 61-line HTML file)
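Judging from the surviving title, this template is most likely the skeleton page for the HTML benchmark timeline described in section 10 of this guide, presumably built around a JavaScript timeline widget whose script and style blocks occupied the stripped lines.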
--------------------------------------------------------------------------------
/test/SRR2135332.bam:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BioinfoUNIBA/REDItools2/17e932fa225477effced64ad5342e7cfd2b7d87b/test/SRR2135332.bam
--------------------------------------------------------------------------------
/test/SRR2135332.bam.bai:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BioinfoUNIBA/REDItools2/17e932fa225477effced64ad5342e7cfd2b7d87b/test/SRR2135332.bam.bai
--------------------------------------------------------------------------------