Timeline | Basic demo

├── .project ├── .pydevproject ├── README.md ├── README_2.md ├── bower.json ├── create_html.sh ├── extract_bed_dynamic.sh ├── extract_coverage.sh ├── extract_coverage_dynamic.sh ├── extract_coverage_slurm.sh ├── extract_coverage_slurm_multisample.sh ├── install.sh ├── merge.sh ├── multisample_test.sh ├── parallel_test.sh ├── parallel_test_slurm.sh ├── parallel_test_slurm_DEPRECATED.sh ├── prepare_test.sh ├── requirements.txt ├── serial_test.sh ├── serial_test_slurm.sh ├── src └── cineca │ ├── __init__.py │ ├── annotate_with_DNA.py │ ├── parallel_reditools.py │ ├── reditools.py │ ├── reditools2_multisample.py │ └── reditools_table_to_bed.py ├── template.html └── test ├── SRR2135332.bam └── SRR2135332.bam.bai /.project: -------------------------------------------------------------------------------- 1 | 2 | 3 | REDItools2.0 4 | 5 | 6 | 7 | 8 | 9 | org.python.pydev.PyDevBuilder 10 | 11 | 12 | 13 | 14 | 15 | org.python.pydev.pythonNature 16 | 17 | 18 | -------------------------------------------------------------------------------- /.pydevproject: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | /${PROJECT_DIR_NAME}/src 5 | 6 | python 2.7 7 | Default 8 | 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # REDItools2 2 | 3 | **REDItools2** is the optimized, parallel multi-node version of [ REDItools](https://github.com/BioinfoUNIBA/REDItools). 4 | 5 | REDItools takes in input a RNA-Seq (or DNA-Seq BAM) file and outputs a table of RNA-Seq editing events. Here is an example of REDItools's output: 6 |

7 | 8 |

9 | 10 | The following image explains the high-level architecture. 11 | 12 |

13 | 14 |

15 | 16 | This version of REDItools shows an average 8x speed improvement over the previous version even when using only the serial-mode: 17 | 18 |

19 | 20 |

130 | 131 | -Spack module loading 132 | >module load autoload spack 133 | 134 | -Installation of python required version (when prompted with ['Do yoy want to proceed?'], answer always y): 135 | 136 | >spack install python@2.7.16 #@ builds a specific version of python. If u want more verbosity, use -d 137 | 138 | >spack module tcl refresh python@2.7.16 139 | 140 | >spack install py-mpi4py^python@2.7.16 141 | 142 | >spack module tcl refresh py-mpi4py^python@2.7.16 143 | 144 | >spack install py-virtualenv^python@2.7.16 145 | 146 | >spack module tcl refresh py-virtualenv^python@2.7.16 147 | 148 | -Installation of REDItools 2.0 required modules 149 | 150 | >module load python/2.7.16--gcc--8.4.0-bgv 151 | 152 | >module load autoload py-mpi4py/3.0.3--gcc--8.4.0-spectrmpi-ac2 153 | 154 | >module load py-virtualenv/16.7.6--gcc--8.4.0-4ut 155 | 156 | >module load profile/global 157 | 158 | >module load samtools/1.12 159 | 160 | -Download of REDItols 2.0 from this repo 161 | 162 | > git clone https://github.com/BioinfoUNIBA/REDItools2.git 163 | 164 | > cd REDItools2 165 | 166 | -Virtualenv activation and dependencies download 167 | 168 | >virtualenv ENV 169 | 170 | >source ENV/bin/activate 171 | 172 | >pip install pysam 173 | 174 | >pip install sortedcontainers 175 | 176 | >pip install psutil 177 | 178 | >pip install netifaces 179 | 180 | -Data test preparation: 181 | 182 | >./prepare_test.sh 183 | 184 | -With a text editor modify the two slurm directives (queue -p and account) of serial_test_slurm.sh: 185 | 186 | >#SBATCH --account= (insert here your account)
187 | >#SBATCH -p m100_all_serial 188 | 189 | -Launch the run test: 190 | 191 | >sbatch serial_test_slurm.sh 192 | 193 | ### 5. The two versions of REDItools 2.0 194 | --- 195 | 196 | This repo includes test data and a test script for checking that dependencies have been installed properly and the basic REDItools command works. 197 | 198 | In order to have all the data you need, run the following commands: 199 | 200 | > cd test 201 | > 202 | > ./prepare_test.sh 203 | 204 | This will download and index the chromosome 21 of the hg19 version of the human genome (from http://hgdownload.cse.ucsc.edu/downloads.html). 205 | Once the script has finished running, you have all you need to perform the tests. 206 | 207 | The software comes with two modalities. Feel free to choose the one which best fits your needs. 208 | 209 | #### 5.1 Serial version (reditools.py) 210 | 211 | In this modality you benefit only from the optimization introduced after the first version. While being significantly faster (with about a 8x factor), you do not exploit the computational power of having multiple cores. On the other hand the setup and launch of REDItools is much easier. 212 | This might be the first modality you might want to give a try when using REDItools2.0 for the first time. 213 | 214 | The serial version of REDItools2.0 can be tested by issuing the following command: 215 | 216 | > serial_test.sh 217 | 218 | or, if you are in a SLURM-based cluster: 219 | 220 | > sbatch serial_test_slurm.sh 221 | 222 | #### 5.2 Parallel version (parallel_reditools.py) 223 | 224 | In this modality you benefit both from the serial optimization and from the parallel computation introduced in this brand new version which exploits the existence of multiple cores, also on multiple nodes, making it a perfect tool on High Performance Computing facilities. 225 | Using this modality requires you to perform a little bit more system setup, but it will definitely pay you off. 226 | 227 | The parallel version leverages on the existence of coverage information which reports for each position the number of supporting reads. 228 | 229 | We assume you already have installed and correctly configured the following tools: 230 | 231 | - **samtools** (http://www.htslib.org/) 232 | - **htslib** (http://www.htslib.org/) 233 | 234 | If you can use *mpi* on your machine (e.g., you are not on a multi-user system and there are no limitations to the jobs you can submit to the system), you can try launching the parallel version of REDItools 2.0 as follows: 235 | 236 | > ./parallel_test.sh 237 | 238 | If you are running on a SLURM-based cluster, instead, run the following command: 239 | 240 | > sbatch ./parallel_test_slurm.sh 241 | 242 | This script: 243 | - first defines a bunch of variables which point to input, output and accessory files; then 244 | - launches the production of coverage data; then 245 | - REDItools 2.0 is launched in parallel, by using the specified number of cores; finally 246 | - results are gathered and written into a single table (parameter *-o* provided in the command line) 247 | 248 | ## Running 249 | 250 | ### 6. Running REDItools 2.0 on your own data 251 | --- 252 | You can now customize the input test scripts to your needs with your input, output and ad-hoc options. 253 | 254 | ### 7. REDItools 2.0 options 255 | --- 256 | #### 7.1 Basic options 257 | In its most basic form, REDItools 2.0 can be invoked with an input BAM file, a reference genome and an output file: 258 | > python src/cineca/reditools.py -f \$INPUT_BAM_FILE -r $REFERENCE -o \$OUTPUT_FILE 259 | 260 | If you want, you can restrict the analysis only to a certain region (e.g., only chr1), by means of the **-g** option : 261 | > python src/cineca/reditools.py -f \$INPUT_BAM_FILE -r $REFERENCE -o \$OUTPUT_FILE -g chr1 262 | > 263 | or a specific interval: 264 | > python src/cineca/reditools.py -f \$INPUT_BAM_FILE -r $REFERENCE -o \$OUTPUT_FILE -g chr1:1000-2000 265 | 266 | For a complete list of options and their usage and meaning, please type: 267 | 268 | > python src/cineca/reditools.py -h 269 | 270 | #### 7.2 Other options 271 | 272 | Here we report the principal options with a detailed explanation for each of them. 273 | The following are the options accepted by the serial version of REDItools: 274 | 275 | > reditools.py [-h] [-f FILE] [-o OUTPUT_FILE] [-S] [-s STRAND] [-a] 276 | [-r REFERENCE] [-g REGION] [-m OMOPOLYMERIC_FILE] [-c] 277 | [-os OMOPOLYMERIC_SPAN] [-sf SPLICING_FILE] 278 | [-ss SPLICING_SPAN] [-mrl MIN_READ_LENGTH] 279 | [-q MIN_READ_QUALITY] [-bq MIN_BASE_QUALITY] 280 | [-mbp MIN_BASE_POSITION] [-Mbp MAX_BASE_POSITION] 281 | [-l MIN_COLUMN_LENGTH] [-men MIN_EDITS_PER_NUCLEOTIDE] 282 | [-me MIN_EDITS] [-Men MAX_EDITING_NUCLEOTIDES] [-d] 283 | [-T STRAND_CONFIDENCE] [-C] [-Tv STRAND_CONFIDENCE_VALUE] 284 | [-V] [-H] [-D] [-B BED_FILE] 285 | > 286 | > **-h**, --help 287 | > show this help message and exit 288 | > 289 | >**-f** FILE, --file FILE 290 | >The bam file to be analyzed 291 | > 292 | >**-o** OUTPUT_FILE, --output-file OUTPUT_FILE 293 | >The output statistics file 294 | > 295 | >**-S**, --strict 296 | > Activate strict mode: only sites with edits will be included in the output 297 | > 298 | >**-s** STRAND, --strand STRAND 299 | >Strand: this can be 0 (unstranded), 1 (secondstrand oriented) or 2 (firststrand oriented) 300 | > 301 | >**-a**, --append-file 302 | >Appends results to file (and creates if not existing) 303 | > 304 | >**-r** REFERENCE, --reference REFERENCE 305 | >The reference FASTA file 306 | > 307 | >**-g** REGION, --region REGION 308 | >The region of the bam file to be analyzed 309 | > 310 | >**-m** OMOPOLYMERIC_FILE, --omopolymeric-file OMOPOLYMERIC_FILE 311 | >The file containing the omopolymeric positions 312 | > 313 | >**-c**, --create-omopolymeric-file 314 | >Whether to create the omopolymeric span 315 | > 316 | >**-os** OMOPOLYMERIC_SPAN, --omopolymeric-span OMOPOLYMERIC_SPAN 317 | >The omopolymeric span 318 | > 319 | >**-sf** SPLICING_FILE, --splicing-file SPLICING_FILE 320 | >The file containing the splicing sites positions 321 | > 322 | >**-ss** SPLICING_SPAN, --splicing-span SPLICING_SPAN 323 | >The splicing span 324 | > 325 | >**-mrl** MIN_READ_LENGTH, --min-read-length MIN_READ_LENGTH 326 | >The minimum read length. Reads whose length is below this value will be discarded. 327 | > 328 | >**-q** MIN_READ_QUALITY, --min-read-quality MIN_READ_QUALITY 329 | >The minimum read quality. Reads whose mapping quality is below this value will be discarded. 330 | > 331 | >**-bq** MIN_BASE_QUALITY, --min-base-quality MIN_BASE_QUALITY 332 | >The minimum base quality. Bases whose quality is below this value will not be included in the analysis. 333 | > 334 | >**-mbp** MIN_BASE_POSITION, --min-base-position MIN_BASE_POSITION 335 | >The minimum base position. Bases which reside in a previous position (in the read) will not be included in the analysis. 336 | > 337 | >**-Mbp** MAX_BASE_POSITION, --max-base-position MAX_BASE_POSITION 338 | >The maximum base position. Bases which reside in a further position (in the read) will not be included in the analysis. 339 | > 340 | >**-l** MIN_COLUMN_LENGTH, --min-column-length MIN_COLUMN_LENGTH 341 | >The minimum length of editing column (per position). Positions whose columns have length below this value will not be included in the analysis. 342 | > 343 | >**-men** MIN_EDITS_PER_NUCLEOTIDE, --min-edits-per-nucleotide MIN_EDITS_PER_NUCLEOTIDE 344 | >The minimum number of editing for events each nucleotide (per position). Positions whose columns have bases with less than min-edits-per-base edits will not be included in the analysis. 345 | > 346 | >**-me** MIN_EDITS, --min-edits MIN_EDITS 347 | > The minimum number of editing events (per position). Positions whose columns have bases with less than 'min-edits-per-base edits' will not be included in the analysis. 348 | > 349 | >**-Men** MAX_EDITING_NUCLEOTIDES, --max-editing-nucleotides MAX_EDITING_NUCLEOTIDES 350 | > The maximum number of editing nucleotides, from 0 to 4 (per position). Positions whose columns have more than 'max-editing-nucleotides' will not be included in the analysis. 351 | > 352 | >**-d**, --debug 353 | >REDItools is run in DEBUG mode. 354 | > 355 | >**-T** STRAND_CONFIDENCE, --strand-confidence STRAND_CONFIDENCE 356 | > Strand inference type 357 | > 1:maxValue 358 | > 2:useConfidence [1]; 359 | > maxValue: the most prominent strand count will be used; 360 | > useConfidence: strand is assigned if over a prefixed frequency confidence (-TV option) 361 | > 362 | >**-C**, --strand-correction 363 | > Strand correction. Once the strand has been inferred, only bases according to this strand will be selected. 364 | > 365 | >**-Tv** STRAND_CONFIDENCE_VALUE, --strand-confidence-value STRAND_CONFIDENCE_VALUE 366 | > Strand confidence [0.70] 367 | > 368 | >**-V**, --verbose 369 | > Verbose information in stderr 370 | > 371 | >**-H**, --remove-header 372 | >Do not include header in output file 373 | > 374 | >**-N**, --dna 375 | >Run REDItools 2.0 on DNA-Seq data 376 | > 377 | >**-B** BED_FILE, --bed_file BED_FILE 378 | > Path of BED file containing target regions 379 | 380 | The parallel version of REDItools 2.0 has also other 4 additional parameters, namely: 381 | >**-G** --coverage-file The coverage file of the sample to analyze 382 | > 383 | >**-D** --coverage-dir The coverage directory containing the coverage file of the sample to analyze divided by chromosome 384 | > 385 | >**-t** --temp-dir The temp directory where to store temporary data for this sample 386 | > 387 | >**-Z** --chromosome-sizes The file with the chromosome sizes 388 | 389 | ### 8. DNA-Seq annotation with REDItools 2.0 390 | 391 | - Analyze your RNA-Seq data (e.g., file *rna.bam*) with any version of REDItools and obtain the corresponding output table (e.g., *rna_table.txt* or *rna_table.txt.gz*); 392 | - Analyze your DNA-Seq data (e.g., *dna.bam*) with REDItools 2.0, providing as input: 393 | 1. The DNA-Seq file (*dna.bam*) (e.g., option *-f* *dna.bam*); 394 | 2. The output RNA-table output of the first step (e.g., option *-B* *rna_table.txt*) 395 | This step will produce the output table (e.g., *dna_table.txt*); 396 | - Annotate the RNA-Seq table by means of the DNA-Seq table by running REDItools2.0 annotator (script *src/cineca/annotate_with_DNA.py*) with the two tables as input (e.g., *rna_table.txt* and *dna_table.txt*) which will produce the final annotated table (e.g., *final_table.txt*). 397 | 398 |

399 | 400 |

401 | 402 | When RNA-editing tables are big (e.g., greater than 1GB in gz format) reading the full table in parallel mode can be really a time-consuming task. In order to optimize the loading of target positions, we have provided a script to convert RNA-editing tables to BED files: 403 | 404 | > python src/cineca/reditools_table_to_bed.py -i RNA_TABLE -o BED_FILE 405 | 406 | This can be further optimized by creating the final BED in parallel: 407 | 408 | > extract_bed_dynamic.sh RNA_TABLE TEMP_DIR SIZE_FILE 409 | 410 | where 411 | - RNA_TABLE is the input RNA-editing table; 412 | - TEMP_DIR is the directory that will contain the output BED file; 413 | - SIZE_FILE is the file containing the chromosome information (e.g., the .fai file of your reference genome). 414 | 415 | Finally run the script *src/cineca/annotate_with_DNA.py*: 416 | 417 | > python src/cineca/annotate_with_DNA.py -r RNA_TABLE -d DNA_TABLE [-Z] 418 | 419 | The option -Z (not mandatory and without arguments) will exclude positions with multiple changes in DNA-Seq. 420 | 421 | #### 8.1 Useful scripts 422 | 423 | In order to ease the annotation of RNA-Seq tables with DNA-Seq information, we also provided two sample scripts that you can customize with your own data: 424 | 425 | - [**WORK IN PROGRESS**] serial_dna_test.sh 426 | - [**WORK IN PROGRESS**] parallel_dna_test.sh 427 | 428 | ### 9. [**WORK IN PROGRESS**] Running REDItools 2.0 in multisample mode 429 | REDItools also supports the launch on multiple samples at the same time. This modality is extremely useful if you have a dataset (i.e., group of homogeneous samples) and wish to run the same analysis on all of them (i.e., with the same options). 430 | 431 | In order to do this, we provided a second script analogous to parallel_reditools.py, called *reditools2_multisample.py* which supports the specification of an additional option -F [SAMPLE_FILE]. SAMPLE_FILE is a file containing the (absolute) path of samples to be analyzed. 432 | It can be launched in the following manner: 433 | 434 | > mpirun src/cineca/reditools2_multisample.py -F $SAMPLE_FILE [OPTIONS] 435 | 436 | where OPTIONS are the same options accepted by the parallel version of REDItools 2.0. 437 | 438 | #### 9.1 Running in multisample mode on a SLURM-based cluster 439 | If you wish to run REDItools 2.0 in multisample mode on a SLURM-based cluster, we provided two scripts that will help you: 440 | 441 | - [**WORK IN PROGRESS**] *extract_coverage_slurm_multisample.sh*: will calculate the coverage data for all the samples in parallel (by using the script *extract_coverage_dynamic.sh*); 442 | - [**WORK IN PROGRESS**] *multisample_test.sh*: will calculate the RNA-editing events tables for all the samples in parallel using MPI. 443 | 444 | First run *extract_coverage_slurm_multisample.sh* and then *multisample_test.sh*. 445 | 446 | ### 10. Displaying benchmarks with REDItools 2.0 (parallel version only) 447 | We also released simple scripts to generate HTML pages containing the snapshot of the amount of time REDItools 2.0 (parallel version) spends on each part of the overall computation for each process (e.g., coverage computation, DIA algorithm, interval analysis, partial results recombination, etc). 448 | 449 | **Note**: this command will work only when launched *after* the parallel computation has completed. 450 | 451 | All you have to do to create the HTML page is launching the following command: 452 | > create_html.sh TEMP_DIR 453 | 454 | where TEMP_DIR is the directory you specified with the -t option; this directory should contain in fact some auxiliary files (e.g., intervals.txt, progress.txt, times.txt and groups.txt) which serve exactly this purpose. 455 | Once created, the HTML page should display time information similar to the following: 456 | 457 |

458 | 459 |

460 | 461 | By means of this visualization you can *hover* on slices to see more in details the statistics for each interval computation as well as *zoom in* and *zoom out* by using the scroll wheel of your mouse. 462 | 463 | Issues 464 | --- 465 | No issues are known so far. For any problem, write to t.flati@cineca.it. 466 | 473 | -------------------------------------------------------------------------------- /README_2.md: -------------------------------------------------------------------------------- 1 | # REDItools2 2 | 3 | **REDItools2** is the optimized, parallel multi-node version of [ REDItools](https://github.com/BioinfoUNIBA/REDItools). 4 | 5 | REDItools takes in input a RNA-Seq (or DNA-Seq BAM) file and outputs a table of RNA-Seq editing events. Here is an example of REDItools's output: 6 |

7 | 8 |

9 | 10 | The following image explains the high-level architecture. 11 | 12 |

13 | 14 |

15 | 16 | This version of REDItools shows an average 8x speed improvement over the previous version even when using only the serial-mode: 17 | 18 |

19 | 20 |

21 | 22 | # Index 23 | 24 | - [1. Python setup](#1-python-setup) 25 | - [2. Environment setup](#2-environment-setup) 26 | - [3. Cloning / downloading](#3-cloning--downloading) 27 | - [4. Installing](#4-installing) 28 | - [5. The two versions of REDItools 2.0](#5-the-two-versions-of-reditools-20) 29 | - [5.1 Serial version](#51-serial-version-reditoolspy) 30 | - [5.2 Parallel version](#52-parallel-version--parallel_reditoolspy) 31 | - [6. Running REDItools 2.0 on your own data](#6-running-reditools-20-on-your-own-data) 32 | - [7. REDItools 2.0 options](#7-reditools-20-options) 33 | - [8. DNA-Seq annotation with REDItools 2.0](#8-dna-seq-annotation-with-reditools-20) 34 | - [9. Running REDItools 2.0 in multisample mode](#9-running-reditools-20-in-multisample-mode) 35 | - [10. Displaying benchmarks in HTML with REDItools 2.0 (parallel version only)](#10-displaying-benchmarks-with-reditools-20-parallel-version-only) 36 | 37 | 38 | ## Installation 39 | 40 | ### 1. Python setup 41 | --- 42 | This guide assumes you have Python <= 2.7 installed in your system. If you do not have Python, please read the [official Python webpage](https://www.python.org/). 43 | 44 | Make sure to have the following packages installed: 45 | 46 | > sudo apt-get install python-dev build-essential libssl-dev libffi-dev libxml2-dev libxslt1-dev zlib1g-dev python-pip zlib-devel zlib zlib1g zlib1g-devel libbz2-dev zlib1g-dev libncurses5-dev libncursesw5-dev liblzma-dev 47 | 48 | Make sure you have you preferred Python version loaded. If you have a single Python version already installed in your system you should do nothing. If you have multiple versions, please be sure to point to a given version; in order to do so check your environmental variables (e.g., PATH). 49 | 50 | If you are running on a cluster (where usually several versions are available) make sure to load a given Python version. For example (if running on CINECA Marconi super computer) the following command would load Python 2.7.12: 51 | > module load autoload python/2.7.12 52 | 53 | Note: REDItools2.0 has been tested with Python 2.7.12. The software comes with no guarantee of being compatible with other versions of Python (e.g., Python >=3). 54 | 55 | ### 2. Environment setup 56 | --- 57 | Make sure the following libraries are installed: 58 | 59 | - htslib (see http://www.htslib.org/download/ and https://www.biostars.org/p/328831/ for instructions) 60 | - samtools: 61 | 62 | > sudo apt-get install samtools 63 | 64 | - tabix: 65 | 66 | > sudo apt-get install tabix 67 | 68 | - an MPI implementation. We suggest OpenMPI, but you can choose whatever you like the most. For installing OpenMPI, try the following command: 69 | > sudo apt-get install openmpi-common libopenmpi-dev 70 | 71 | ### 3. Cloning / Downloading 72 | --- 73 | 74 | The first step is to clone this repository (assumes you have *git* installed in your system - see the [Git official page](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) otherwise): 75 | > git clone https://github.com/tflati/reditools2.0.git 76 | 77 | (alternatively you can download a ZIP package of REDItools2.0 from [here](https://github.com/tflati/reditools2.0/archive/master.zip) and uncompress the archive). 78 | 79 | Move into the project main directory: 80 | > cd reditools2.0 81 | 82 | 83 | ### 4. Installing 84 | --- 85 | 86 | REDItools 2.0 requires a few Python modules to be installed in the environment (e.g., pysam, sortedcontainers, mpi4py, etc.). These can be installed in three ways: 87 | 88 | - **System-level**: in this way the dependencies will be installed in your system and all users in your system will see changes. In order to perform this type of installation you need administrator rights. 89 | To install REDItools2.0 in this modality, just run the following command: 90 | > sudo pip install -r requirements.txt 91 | 92 | - **User-level**: in this way the dependencies will be installed only for your current user, usually in your home directory. In order to perform this type of installation you need only to be logged as a normal user. Note that this type of installation will install additional software in your local Python directory (usually $HOME/.local/lib/python2.7/site-packages/, but it depends on your operating system and distribution). 93 | This is the recommended modality if you do not care about altering your user environment. Note that altering your user environment might lead to software corruption. For example, assume you have already the *pysam* package installed (version 0.6); since REDItools 2.0 requires a version for *pysam* >= 0.9, the installation would uninstall the existing version of pysam and would install the version 0.9, thus altering the state of your environment. Any existing software which relied on version pysam 0.6 might break and stop working. In conclusion, choose this modality at your own risk. 94 | To install REDItools2.0 in this modality, just run the following command: 95 | > pip install -r requirements.txt --user 96 | 97 | - **Environment-level**: in this type of installation you create an isolated virtual environment (initially empty) which will contain any new required software, without creating conflicts with any existing environment or requiring any particular right. 98 | This modality will work regardless of the existing packages already installed in your system (both user and system levels) and thus gives the maximum possible freedom to the final-end user. 99 | This is the recommended modality. 100 | The downside of choosing this modality is a potential duplication of code with respect to other existing environments. For example, assume you already have a given version of *sortedcontainers*; by installing REDItools2.0 at environment-level will download and install a *new* copy of *sortedcontainers* into a new isolated environment (ending up with two copies of the same software present in the system, one inside and one outside the virtual environment). 101 | To install REDItools2.0 in this modality, run the following commands: 102 | 103 | > virtualenv ENV 104 | > 105 | > source ENV/bin/activate 106 | > 107 | > pip install -r requirements.txt 108 | > 109 | > deactivate 110 | 111 | These commands will create a new environment called *ENV* (you can choose any name you like) and will install all dependencies listed in the file *requirements.txt* into it). The commands *activate* and *deactivate* respectively activate (i.e., start/open) and deactivate (i.e., end/close) the virtual environment. 112 | When running the real commands, remember to wrap your commands between and activate and deactivate commands: 113 | 114 | >source ENV/bin/activate 115 | > 116 | >command... 117 | > 118 | >command... 119 | > 120 | >command... 121 | > 122 | >command... 123 | > 124 | >deactivate 125 | 126 | ## Testing 127 | 128 | ### 5. The two versions of REDItools 2.0 129 | --- 130 | 131 | This repo includes test data and a test script for checking that dependencies have been installed properly and the basic REDItools command works. 132 | 133 | In order to have all the data you need, run the following commands: 134 | 135 | > cd test 136 | > 137 | > ./prepare_test.sh 138 | 139 | This will download and index the chromosome 21 of the hg19 version of the human genome (from http://hgdownload.cse.ucsc.edu/downloads.html). 140 | Once the script has finished running, you have all you need to perform the tests. 141 | 142 | The software comes with two modalities. Feel free to choose the one which best fits your needs. 143 | 144 | #### 5.1 Serial version (reditools.py) 145 | 146 | In this modality you benefit only from the optimization introduced after the first version. While being significantly faster (with about a 8x factor), you do not exploit the computational power of having multiple cores. On the other hand the setup and launch of REDItools is much easier. 147 | This might be the first modality you might want to give a try when using REDItools2.0 for the first time. 148 | 149 | The serial version of REDItools2.0 can be tested by issuing the following command: 150 | 151 | > serial_test.sh 152 | 153 | or, if you are in a SLURM-based cluster: 154 | 155 | > sbatch serial_test_slurm.sh 156 | 157 | #### 5.2 Parallel version (parallel_reditools.py) 158 | 159 | In this modality you benefit both from the serial optimization and from the parallel computation introduced in this brand new version which exploits the existence of multiple cores, also on multiple nodes, making it a perfect tool on High Performance Computing facilities. 160 | Using this modality requires you to perform a little bit more system setup, but it will definitely pay you off. 161 | 162 | The parallel version leverages on the existence of coverage information which reports for each position the number of supporting reads. 163 | 164 | We assume you already have installed and correctly configured the following tools: 165 | 166 | - **samtools** (http://www.htslib.org/) 167 | - **htslib** (http://www.htslib.org/) 168 | 169 | If you can use *mpi* on your machine (e.g., you are not on a multi-user system and there are no limitations to the jobs you can submit to the system), you can try launching the parallel version of REDItools 2.0 as follows: 170 | 171 | > ./parallel_test.sh 172 | 173 | If you are running on a SLURM-based cluster, instead, run the following command: 174 | 175 | > sbatch ./parallel_test_slurm.sh 176 | 177 | This script: 178 | - first defines a bunch of variables which point to input, output and accessory files; then 179 | - launches the production of coverage data; then 180 | - REDItools 2.0 is launched in parallel, by using the specified number of cores; finally 181 | - results are gathered and written into a single table (parameter *-o* provided in the command line) 182 | 183 | ## Running 184 | 185 | ### 6. Running REDItools 2.0 on your own data 186 | --- 187 | You can now customize the input test scripts to your needs with your input, output and ad-hoc options. 188 | 189 | ### 7. REDItools 2.0 options 190 | --- 191 | #### 7.1 Basic options 192 | In its most basic form, REDItools 2.0 can be invoked with an input BAM file, a reference genome and an output file: 193 | > python src/cineca/reditools.py -f \$INPUT_BAM_FILE -r $REFERENCE -o \$OUTPUT_FILE 194 | 195 | If you want, you can restrict the analysis only to a certain region (e.g., only chr1), by means of the **-g** option : 196 | > python src/cineca/reditools.py -f \$INPUT_BAM_FILE -r $REFERENCE -o \$OUTPUT_FILE -g chr1 197 | > 198 | or a specific interval: 199 | > python src/cineca/reditools.py -f \$INPUT_BAM_FILE -r $REFERENCE -o \$OUTPUT_FILE -g chr1:1000-2000 200 | 201 | For a complete list of options and their usage and meaning, please type: 202 | 203 | > python src/cineca/reditools.py -h 204 | 205 | #### 7.2 Other options 206 | 207 | Here we report the principal options with a detailed explanation for each of them. 208 | The following are the options accepted by the serial version of REDItools: 209 | 210 | > reditools.py [-h] [-f FILE] [-o OUTPUT_FILE] [-S] [-s STRAND] [-a] 211 | [-r REFERENCE] [-g REGION] [-m OMOPOLYMERIC_FILE] [-c] 212 | [-os OMOPOLYMERIC_SPAN] [-sf SPLICING_FILE] 213 | [-ss SPLICING_SPAN] [-mrl MIN_READ_LENGTH] 214 | [-q MIN_READ_QUALITY] [-bq MIN_BASE_QUALITY] 215 | [-mbp MIN_BASE_POSITION] [-Mbp MAX_BASE_POSITION] 216 | [-l MIN_COLUMN_LENGTH] [-men MIN_EDITS_PER_NUCLEOTIDE] 217 | [-me MIN_EDITS] [-Men MAX_EDITING_NUCLEOTIDES] [-d] 218 | [-T STRAND_CONFIDENCE] [-C] [-Tv STRAND_CONFIDENCE_VALUE] 219 | [-V] [-H] [-D] [-B BED_FILE] 220 | > 221 | > **-h**, --help 222 | > show this help message and exit 223 | > 224 | >**-f** FILE, --file FILE 225 | >The bam file to be analyzed 226 | > 227 | >**-o** OUTPUT_FILE, --output-file OUTPUT_FILE 228 | >The output statistics file 229 | > 230 | >**-S**, --strict 231 | > Activate strict mode: only sites with edits will be included in the output 232 | > 233 | >**-s** STRAND, --strand STRAND 234 | >Strand: this can be 0 (unstranded), 1 (secondstrand oriented) or 2 (firststrand oriented) 235 | > 236 | >**-a**, --append-file 237 | >Appends results to file (and creates if not existing) 238 | > 239 | >**-r** REFERENCE, --reference REFERENCE 240 | >The reference FASTA file 241 | > 242 | >**-g** REGION, --region REGION 243 | >The region of the bam file to be analyzed 244 | > 245 | >**-m** OMOPOLYMERIC_FILE, --omopolymeric-file OMOPOLYMERIC_FILE 246 | >The file containing the omopolymeric positions 247 | > 248 | >**-c**, --create-omopolymeric-file 249 | >Whether to create the omopolymeric span 250 | > 251 | >**-os** OMOPOLYMERIC_SPAN, --omopolymeric-span OMOPOLYMERIC_SPAN 252 | >The omopolymeric span 253 | > 254 | >**-sf** SPLICING_FILE, --splicing-file SPLICING_FILE 255 | >The file containing the splicing sites positions 256 | > 257 | >**-ss** SPLICING_SPAN, --splicing-span SPLICING_SPAN 258 | >The splicing span 259 | > 260 | >**-mrl** MIN_READ_LENGTH, --min-read-length MIN_READ_LENGTH 261 | >The minimum read length. Reads whose length is below this value will be discarded. 262 | > 263 | >**-q** MIN_READ_QUALITY, --min-read-quality MIN_READ_QUALITY 264 | >The minimum read quality. Reads whose mapping quality is below this value will be discarded. 265 | > 266 | >**-bq** MIN_BASE_QUALITY, --min-base-quality MIN_BASE_QUALITY 267 | >The minimum base quality. Bases whose quality is below this value will not be included in the analysis. 268 | > 269 | >**-mbp** MIN_BASE_POSITION, --min-base-position MIN_BASE_POSITION 270 | >The minimum base position. Bases which reside in a previous position (in the read) will not be included in the analysis. 271 | > 272 | >**-Mbp** MAX_BASE_POSITION, --max-base-position MAX_BASE_POSITION 273 | >The maximum base position. Bases which reside in a further position (in the read) will not be included in the analysis. 274 | > 275 | >**-l** MIN_COLUMN_LENGTH, --min-column-length MIN_COLUMN_LENGTH 276 | >The minimum length of editing column (per position). Positions whose columns have length below this value will not be included in the analysis. 277 | > 278 | >**-men** MIN_EDITS_PER_NUCLEOTIDE, --min-edits-per-nucleotide MIN_EDITS_PER_NUCLEOTIDE 279 | >The minimum number of editing for events each nucleotide (per position). Positions whose columns have bases with less than min-edits-per-base edits will not be included in the analysis. 280 | > 281 | >**-me** MIN_EDITS, --min-edits MIN_EDITS 282 | > The minimum number of editing events (per position). Positions whose columns have bases with less than 'min-edits-per-base edits' will not be included in the analysis. 283 | > 284 | >**-Men** MAX_EDITING_NUCLEOTIDES, --max-editing-nucleotides MAX_EDITING_NUCLEOTIDES 285 | > The maximum number of editing nucleotides, from 0 to 4 (per position). Positions whose columns have more than 'max-editing-nucleotides' will not be included in the analysis. 286 | > 287 | >**-d**, --debug 288 | >REDItools is run in DEBUG mode. 289 | > 290 | >**-T** STRAND_CONFIDENCE, --strand-confidence STRAND_CONFIDENCE 291 | > Strand inference type 292 | > 1:maxValue 293 | > 2:useConfidence [1]; 294 | > maxValue: the most prominent strand count will be used; 295 | > useConfidence: strand is assigned if over a prefixed frequency confidence (-TV option) 296 | > 297 | >**-C**, --strand-correction 298 | > Strand correction. Once the strand has been inferred, only bases according to this strand will be selected. 299 | > 300 | >**-Tv** STRAND_CONFIDENCE_VALUE, --strand-confidence-value STRAND_CONFIDENCE_VALUE 301 | > Strand confidence [0.70] 302 | > 303 | >**-V**, --verbose 304 | > Verbose information in stderr 305 | > 306 | >**-H**, --remove-header 307 | >Do not include header in output file 308 | > 309 | >**-N**, --dna 310 | >Run REDItools 2.0 on DNA-Seq data 311 | > 312 | >**-B** BED_FILE, --bed_file BED_FILE 313 | > Path of BED file containing target regions 314 | 315 | The parallel version of REDItools 2.0 has also other 4 additional parameters, namely: 316 | >**-G** --coverage-file The coverage file of the sample to analyze 317 | > 318 | >**-D** --coverage-dir The coverage directory containing the coverage file of the sample to analyze divided by chromosome 319 | > 320 | >**-t** --temp-dir The temp directory where to store temporary data for this sample 321 | > 322 | >**-Z** --chromosome-sizes The file with the chromosome sizes 323 | 324 | ### 8. DNA-Seq annotation with REDItools 2.0 325 | 326 | - Analyze your RNA-Seq data (e.g., file *rna.bam*) with any version of REDItools and obtain the corresponding output table (e.g., *rna_table.txt* or *rna_table.txt.gz*); 327 | - Analyze your DNA-Seq data (e.g., *dna.bam*) with REDItools 2.0, providing as input: 328 | 1. The DNA-Seq file (*dna.bam*) (e.g., option *-f* *dna.bam*); 329 | 2. The output RNA-table output of the first step (e.g., option *-B* *rna_table.txt*) 330 | This step will produce the output table (e.g., *dna_table.txt*); 331 | - Annotate the RNA-Seq table by means of the DNA-Seq table by running REDItools2.0 annotator (script *src/cineca/annotate_with_DNA.py*) with the two tables as input (e.g., *rna_table.txt* and *dna_table.txt*) which will produce the final annotated table (e.g., *final_table.txt*). 332 | 333 |

334 | 335 |

336 | 337 | When RNA-editing tables are big (e.g., greater than 1GB in gz format) reading the full table in parallel mode can be really a time-consuming task. In order to optimize the loading of target positions, we have provided a script to convert RNA-editing tables to BED files: 338 | 339 | > python src/cineca/reditools_table_to_bed.py -i RNA_TABLE -o BED_FILE 340 | 341 | This can be further optimized by creating the final BED in parallel: 342 | 343 | > extract_bed_dynamic.sh RNA_TABLE TEMP_DIR SIZE_FILE 344 | 345 | where 346 | - RNA_TABLE is the input RNA-editing table; 347 | - TEMP_DIR is the directory that will contain the output BED file; 348 | - SIZE_FILE is the file containing the chromosome information (e.g., the .fai file of your reference genome). 349 | 350 | Finally run the script *src/cineca/annotate_with_DNA.py*: 351 | 352 | > python src/cineca/annotate_with_DNA.py -r RNA_TABLE -d DNA_TABLE [-Z] 353 | 354 | The option -Z (not mandatory and without arguments) will exclude positions with multiple changes in DNA-Seq. 355 | 356 | #### 8.1 Useful scripts 357 | 358 | In order to ease the annotation of RNA-Seq tables with DNA-Seq information, we also provided two sample scripts that you can customize with your own data: 359 | 360 | - [**WORK IN PROGRESS**] serial_dna_test.sh 361 | - [**WORK IN PROGRESS**] parallel_dna_test.sh 362 | 363 | ### 9. [**WORK IN PROGRESS**] Running REDItools 2.0 in multisample mode 364 | REDItools also supports the launch on multiple samples at the same time. This modality is extremely useful if you have a dataset (i.e., group of homogeneous samples) and wish to run the same analysis on all of them (i.e., with the same options). 365 | 366 | In order to do this, we provided a second script analogous to parallel_reditools.py, called *reditools2_multisample.py* which supports the specification of an additional option -F [SAMPLE_FILE]. SAMPLE_FILE is a file containing the (absolute) path of samples to be analyzed. 367 | It can be launched in the following manner: 368 | 369 | > mpirun src/cineca/reditools2_multisample.py -F $SAMPLE_FILE [OPTIONS] 370 | 371 | where OPTIONS are the same options accepted by the parallel version of REDItools 2.0. 372 | 373 | #### 9.1 Running in multisample mode on a SLURM-based cluster 374 | If you wish to run REDItools 2.0 in multisample mode on a SLURM-based cluster, we provided two scripts that will help you: 375 | 376 | - [**WORK IN PROGRESS**] *extract_coverage_slurm_multisample.sh*: will calculate the coverage data for all the samples in parallel (by using the script *extract_coverage_dynamic.sh*); 377 | - [**WORK IN PROGRESS**] *multisample_test.sh*: will calculate the RNA-editing events tables for all the samples in parallel using MPI. 378 | 379 | First run *extract_coverage_slurm_multisample.sh* and then *multisample_test.sh*. 380 | 381 | ### 10. Displaying benchmarks with REDItools 2.0 (parallel version only) 382 | We also released simple scripts to generate HTML pages containing the snapshot of the amount of time REDItools 2.0 (parallel version) spends on each part of the overall computation for each process (e.g., coverage computation, DIA algorithm, interval analysis, partial results recombination, etc). 383 | 384 | **Note**: this command will work only when launched *after* the parallel computation has completed. 385 | 386 | All you have to do to create the HTML page is launching the following command: 387 | > create_html.sh TEMP_DIR 388 | 389 | where TEMP_DIR is the directory you specified with the -t option; this directory should contain in fact some auxiliary files (e.g., intervals.txt, progress.txt, times.txt and groups.txt) which serve exactly this purpose. 390 | Once created, the HTML page should display time information similar to the following: 391 | 392 |

393 | 394 |

395 | 396 | By means of this visualization you can *hover* on slices to see more in details the statistics for each interval computation as well as *zoom in* and *zoom out* by using the scroll wheel of your mouse. 397 | 398 | Issues 399 | --- 400 | No issues are known so far. For any problem, write to t.flati@cineca.it. 401 | 408 | -------------------------------------------------------------------------------- /bower.json: -------------------------------------------------------------------------------- 1 | { 2 | "name": "REDItools 2.0", 3 | "authors": [ 4 | "tflati " 5 | ], 6 | "description": "Simple visualization tool", 7 | "main": "template.html", 8 | "license": "MIT", 9 | "homepage": "", 10 | "ignore": [ 11 | "**/.*", 12 | "node_modules", 13 | "bower_components", 14 | "test", 15 | "tests" 16 | ], 17 | "dependencies": { 18 | "vis": "^4.21.0" 19 | } 20 | } 21 | -------------------------------------------------------------------------------- /create_html.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ $# -eq 0 ]; then 4 | echo "[ERROR] Please, remember to provide the temporary directory of interest." 5 | echo -e "Usage:\n\t$0 TEMPORARY_DIR" 6 | exit 1 7 | fi 8 | 9 | TEMPDIR=$1 10 | #cat template.html | sed "s@EVENTS_DATA@$(cat "$1"/times.txt)@g" | sed "s@GROUPS_DATA@$(cat "$1"/groups.txt)@g" > reditools.html 11 | cat template.html | sed "/EVENTS_DATA/{s/EVENTS_DATA//g 12 | r "$1"/times.txt 13 | }" | sed "/GROUPS_DATA/{s/GROUPS_DATA//g 14 | r "$1"/groups.txt 15 | }" > reditools.html 16 | -------------------------------------------------------------------------------- /extract_bed_dynamic.sh: -------------------------------------------------------------------------------- 1 | INPUT=$1 2 | TEMP_DIR=$2 3 | SIZE_FILE=$3 4 | 5 | FILENAME=$(basename $INPUT) 6 | FILE_ID=${FILENAME%%.*} 7 | 8 | if [ ! -d $TEMP_DIR ] 9 | then 10 | mkdir -p $TEMP_DIR 11 | fi 12 | 13 | echo "INPUT=$INPUT" 14 | echo "TEMP=$TEMP_DIR" 15 | echo "CHROMOSOMES=$SIZE_FILE" 16 | echo "FILE_ID=$FILE_ID" 17 | 18 | t1=$(date +%s) 19 | t1_human=$(date) 20 | 21 | echo "[STATS] Dividing input table into pieces ["`date`"]" 22 | zcat $INPUT | cut -f 1,2 | awk '{print $0 >> "'$TEMP_DIR'/"$1".table"}' 23 | # read -n 1 -s -r -p "Press any key to continue" 24 | 25 | echo "[STATS] Creating single chromosome bed files ["`date`"]" 26 | CHROMOSOMES=() 27 | for chrom in $(cat $SIZE_FILE | cut -f 1) 28 | do 29 | CHROMOSOMES[${#CHROMOSOMES[@]}]=$chrom 30 | done 31 | 32 | NUM_CHROMS=$(cat $SIZE_FILE | cut -f 1 | wc -l) 33 | AVAILABLE_CPUS=$(nproc) 34 | CHUNK_SIZE=$(($NUM_CHROMS>$AVAILABLE_CPUS?$AVAILABLE_CPUS:$NUM_CHROMS)) 35 | echo "CHROMOSOMES="$NUM_CHROMS 36 | echo "CHUNK SIZE="$CHUNK_SIZE 37 | start=0 38 | while [ $start -lt $NUM_CHROMS ] 39 | do 40 | echo "NEW BATCH [$(expr $start + 1)-$(expr $start + $CHUNK_SIZE)]" 41 | for i in $(seq $start $(expr $start + $CHUNK_SIZE - 1)) 42 | do 43 | if [ $i -ge $NUM_CHROMS ]; then break; fi 44 | 45 | chrom=${CHROMOSOMES[$i]} 46 | if [ -s $TEMP_DIR/$chrom.table ] 47 | then 48 | echo "Calculating bed file for chromosome $chrom = $TEMP_DIR$chrom" 49 | python src/cineca/reditools_table_to_bed.py -i $TEMP_DIR/$chrom.table -o $TEMP_DIR/$chrom.bed & 50 | fi 51 | done 52 | wait 53 | start=$(expr $start + $CHUNK_SIZE) 54 | done 55 | # read -n 1 -s -r -p "Press any key to continue" 56 | 57 | t2=$(date +%s) 58 | t2_human=$(date) 59 | elapsed_time=$(($t2-$t1)) 60 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 61 | echo "[STATS] [BED CHR] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2 62 | 63 | tmid=$(date +%s) 64 | tmid_human=$(date) 65 | 66 | echo "[STATS] Creating complete BED file $TEMP_DIR$FILE_ID.bed ["`date`"]" 67 | rm $TEMP_DIR$FILE_ID".bed" 68 | for chrom in `cat $SIZE_FILE | cut -f 1` 69 | do 70 | if [ -s $TEMP_DIR$chrom".bed" ] 71 | then 72 | cat $TEMP_DIR$chrom".bed" >> $TEMP_DIR$FILE_ID".bed" 73 | fi 74 | 75 | rm $TEMP_DIR$chrom".table" 76 | rm $TEMP_DIR$chrom".bed" 77 | done 78 | 79 | t2=$(date +%s) 80 | t2_human=$(date) 81 | elapsed_time_mid=$(($t2-$tmid)) 82 | elapsed_time_mid_human=$(date -d@$elapsed_time_mid -u +%H:%M:%S) 83 | echo "[STATS] [BED GLOBAL] [$FILE_ID] START="$tmid_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time_mid" HUMAN="$elapsed_time_mid_human 1>&2 84 | # read -n 1 -s -r -p "Press any key to continue" 85 | 86 | elapsed_time=$(($t2-$t1)) 87 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 88 | echo "[STATS] [BED] [$FILE_ID] START="$t1_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2 89 | 90 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $TEMP_DIR/$FILE_ID-bed-chronometer.txt 91 | 92 | echo "[STATS] Finished creating bed data ["`date`"]" 93 | -------------------------------------------------------------------------------- /extract_coverage.sh: -------------------------------------------------------------------------------- 1 | FILENAME=`basename $1` 2 | COVERAGE_DIR=$2 3 | SIZE_FILE=$3 4 | 5 | FILE_ID=${FILENAME%.*} 6 | 7 | if [ ! -d $COVERAGE_DIR ] 8 | then 9 | mkdir -p $COVERAGE_DIR 10 | fi 11 | 12 | t1=$(date +%s) 13 | t1_human=$(date) 14 | 15 | #samtools depth $1 | grep -vP "\t0$" | tee $COVERAGE_DIR$FILE_ID".cov" | awk '{print($0) > "'$COVERAGE_DIR'"$1}' 16 | echo "[STATS] Creating single chromosome coverage files ["`date`"]" 17 | for chrom in `cat $SIZE_FILE | cut -f 1` 18 | do 19 | echo "Calculating coverage file for chromosome $chrom = $COVERAGE_DIR$chrom" 20 | 21 | if [ $(samtools view $1 | cut -f 3 | grep -q $chrom) ] 22 | then 23 | samtools depth $1 -r ${chrom#chr} | grep -vP "\t0$" > $COVERAGE_DIR$chrom & 24 | else 25 | samtools depth $1 -r $chrom | grep -vP "\t0$" > $COVERAGE_DIR$chrom & 26 | fi 27 | 28 | done 29 | wait 30 | 31 | t2=$(date +%s) 32 | t2_human=$(date) 33 | elapsed_time=$(($t2-$t1)) 34 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 35 | echo "[STATS] [COVERAGE CHR] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2 36 | 37 | tmid=$(date +%s) 38 | tmid_human=$(date) 39 | 40 | echo "[STATS] Creating complete file $COVERAGE_DIR$FILE_ID.cov ["`date`"]" 41 | if [ -s $COVERAGE_DIR$FILE_ID".cov" ] 42 | then 43 | rm $COVERAGE_DIR$FILE_ID".cov" 44 | fi 45 | 46 | for chrom in `cat $SIZE_FILE | cut -f 1` 47 | do 48 | cat $COVERAGE_DIR$chrom >> $COVERAGE_DIR$FILE_ID".cov" 49 | done 50 | 51 | t2=$(date +%s) 52 | t2_human=$(date) 53 | elapsed_time_mid=$(($t2-$tmid)) 54 | elapsed_time_mid_human=$(date -d@$elapsed_time_mid -u +%H:%M:%S) 55 | echo "[STATS] [COVERAGE GLOBAL] [$FILE_ID] START="$tmid_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time_mid" HUMAN="$elapsed_time_mid_human 1>&2 56 | 57 | elapsed_time=$(($t2-$t1)) 58 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 59 | echo "[STATS] [COVERAGE] [$FILE_ID] START="$t1_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2 60 | 61 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $COVERAGE_DIR/$FILE_ID-coverage-chronometer.txt 62 | 63 | echo "[STATS] Finished creating coverage data ["`date`"]" 64 | -------------------------------------------------------------------------------- /extract_coverage_dynamic.sh: -------------------------------------------------------------------------------- 1 | FILENAME=`basename $1` 2 | COVERAGE_DIR=$2 3 | SIZE_FILE=$3 4 | 5 | FILE_ID=${FILENAME%.*} 6 | 7 | if [ ! -d $COVERAGE_DIR ] 8 | then 9 | mkdir -p $COVERAGE_DIR 10 | fi 11 | 12 | t1=$(date +%s) 13 | t1_human=$(date) 14 | 15 | 16 | echo "[STATS] Creating single chromosome coverage files ["`date`"]" 17 | CHROMOSOMES=() 18 | for chrom in $(cat $SIZE_FILE | cut -f 1) 19 | do 20 | CHROMOSOMES[${#CHROMOSOMES[@]}]=$chrom 21 | done 22 | 23 | ############################### 24 | ### PER-CHROMOSOME COVERAGE ### 25 | ############################### 26 | NUM_CHROMS=$(cat $SIZE_FILE | cut -f 1 | wc -l) 27 | AVAILABLE_CPUS=$(nproc) 28 | CHUNK_SIZE=$(($NUM_CHROMS>$AVAILABLE_CPUS?$AVAILABLE_CPUS:$NUM_CHROMS)) 29 | echo "CHROMOSOMES="$NUM_CHROMS 30 | echo "CHUNK SIZE="$CHUNK_SIZE 31 | start=0 32 | while [ $start -lt $NUM_CHROMS ] 33 | do 34 | echo "NEW BATCH [$(expr $start + 1)-$(expr $start + $CHUNK_SIZE)]" 35 | for i in $(seq $start $(expr $start + $CHUNK_SIZE - 1)) 36 | do 37 | if [ $i -ge $NUM_CHROMS ]; then break; fi 38 | 39 | chrom=${CHROMOSOMES[$i]} 40 | 41 | echo "Calculating coverage file for chromosome $chrom = $COVERAGE_DIR$chrom" 42 | 43 | if [ $(samtools view $1 | cut -f 3 | grep -q $chrom) ] 44 | then 45 | samtools depth $1 -r ${chrom#chr} | grep -vP "\t0$" > $COVERAGE_DIR$chrom & 46 | else 47 | samtools depth $1 -r $chrom | grep -vP "\t0$" > $COVERAGE_DIR$chrom & 48 | fi 49 | 50 | done 51 | wait 52 | start=$(expr $start + $CHUNK_SIZE) 53 | done 54 | 55 | t2=$(date +%s) 56 | t2_human=$(date) 57 | elapsed_time=$(($t2-$t1)) 58 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 59 | echo "[STATS] [COVERAGE CHR] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2 60 | 61 | tmid=$(date +%s) 62 | tmid_human=$(date) 63 | 64 | ############################ 65 | ### SINGLE COVERAGE FILE ### 66 | ############################ 67 | echo "[STATS] Creating complete file $COVERAGE_DIR$FILE_ID.cov ["`date`"]" 68 | rm $COVERAGE_DIR$FILE_ID".cov" 69 | for chrom in `cat $SIZE_FILE | cut -f 1` 70 | do 71 | cat $COVERAGE_DIR$chrom >> $COVERAGE_DIR$FILE_ID".cov" 72 | done 73 | 74 | t2=$(date +%s) 75 | t2_human=$(date) 76 | elapsed_time_mid=$(($t2-$tmid)) 77 | elapsed_time_mid_human=$(date -d@$elapsed_time_mid -u +%H:%M:%S) 78 | echo "[STATS] [COVERAGE GLOBAL] [$FILE_ID] START="$tmid_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time_mid" HUMAN="$elapsed_time_mid_human 1>&2 79 | 80 | elapsed_time=$(($t2-$t1)) 81 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 82 | echo "[STATS] [COVERAGE] [$FILE_ID] START="$t1_human" ["$tmid"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2 83 | 84 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $COVERAGE_DIR/$FILE_ID-coverage-chronometer.txt 85 | 86 | echo "[STATS] Finished creating coverage data ["`date`"]" 87 | -------------------------------------------------------------------------------- /extract_coverage_slurm.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=25 3 | #SBATCH --ntasks-per-node=25 4 | #SBATCH --time=02:00:00 5 | ##SBATCH --account=Pra15_3924 6 | #SBATCH --account=cin_staff 7 | #SBATCH -p knl_usr_prod 8 | 9 | # SAMPLE_ID 10 | # SOURCE_BAM_FILE 11 | # COV 12 | # SIZE_FILE 13 | 14 | echo "Launching REDItool COVERAGE on $SAMPLE_ID"; 15 | 16 | module load autoload profile/global 17 | module load autoload samtools 18 | 19 | t1=$(date +%s) 20 | t1_human=$(date) 21 | echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"]" 22 | time ./extract_coverage_dynamic.sh $SOURCE_BAM_FILE $COV $SIZE_FILE 23 | t2=$(date +%s) 24 | t2_human=$(date) 25 | elapsed_time=$(($t2-$t1)) 26 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 27 | echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 28 | -------------------------------------------------------------------------------- /extract_coverage_slurm_multisample.sh: -------------------------------------------------------------------------------- 1 | module load autoload profile/global 2 | module load ig_homo_sapiens/hg19 3 | REFERENCE=$IG_HG19_GENOME"/genome.fa" 4 | export SIZE_FILE=$REFERENCE".fai" 5 | 6 | BASE_DIR="/marconi_scratch/userexternal/tflati00/reditools_paper/" 7 | SAMPLE_FILE=$BASE_DIR"samples-10.txt" 8 | N=$(cat $SAMPLE_FILE | wc -l | cut -d' ' -f 1) 9 | 10 | COVERAGE_DIR=$BASE_DIR"/cov-multisample/" 11 | 12 | for SOURCE_BAM_FILE in $(cat $SAMPLE_FILE | head -n 1) 13 | do 14 | if [ ! -s $SOURCE_BAM_FILE ] 15 | then 16 | echo "File $SOURCE_BAM_FILE does not exists. Skipping." 17 | continue 18 | fi 19 | 20 | export SOURCE_BAM_FILE 21 | export SAMPLE_ID=$(basename $SOURCE_BAM_FILE | sed 's/\.bam//g') 22 | export COV=$COVERAGE_DIR$SAMPLE_ID"/" 23 | export COV_FILE=$COV$SAMPLE_ID".cov" 24 | 25 | mkdir -p $COV 26 | 27 | if [ ! -f $COV_FILE ] 28 | then 29 | echo "[STATS] [COVERAGE] [$SAMPLE_ID]" 30 | sbatch --export=ALL -J cov-$SAMPLE_ID -o $COV/output.txt -e $COV/error.txt ./extract_coverage_slurm.sh & 31 | fi 32 | done -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | virtualenv ENV 4 | 5 | source ENV/bin/activate 6 | 7 | pip install -r requirements.txt 8 | 9 | deactivate -------------------------------------------------------------------------------- /merge.sh: -------------------------------------------------------------------------------- 1 | TABLE_DIR=$1 2 | FINAL_TABLE=$2 3 | THREADS=$3 4 | 5 | echo "Merging files in $TABLE_DIR using $THREADS threads and writing to output=$FINAL_TABLE" 6 | 7 | t1=$(date +%s) 8 | t1_human=$(date) 9 | 10 | if [ ! -s $TABLE_DIR/files.txt ] 11 | then 12 | echo "FILE LIST NOT EXISTING OR EMPTY: "$TABLE_DIR/files.txt 13 | else 14 | OUTPUT_DIR=`dirname $FINAL_TABLE` 15 | if [ ! -d $OUTPUT_DIR ] 16 | then 17 | mkdir -p $OUTPUT_DIR 18 | fi 19 | 20 | zcat $(cat $TABLE_DIR/files.txt) | bgzip -c -@ $THREADS > $FINAL_TABLE 21 | echo "Finished creating final table $FINAL_TABLE ["`date`"]" 22 | 23 | tabix -s 1 -b 2 -e 2 -c Region $FINAL_TABLE 24 | echo "Finished creating index file for file $FINAL_TABLE ["`date`"]" 25 | fi 26 | 27 | t2=$(date +%s) 28 | t2_human=$(date) 29 | elapsed_time=$(($t2-$t1)) 30 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 31 | 32 | FILE_ID=`basename $TABLE_DIR` 33 | 34 | echo "[STATS] [MERGE] [$FILE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 1>&2 35 | echo -e "$FILE_ID\t$elapsed_time\t$elapsed_time_human" > $TABLE_DIR/merge-chronometer.txt 36 | 37 | # echo "Starting creating final table "`date` 38 | # 39 | # i=0; for file in $(ls $TABLE_DIR/*.gz) 40 | # do 41 | # i=$((i + 1)) 42 | # echo $i". "$file" "`date`; zcat $file >> final_file.txt 43 | # done 44 | # 45 | # echo "Compressing final_file.txt "`date` 46 | # time /marconi/home/userexternal/tflati00/pigz-2.4/pigz -c final_file.txt > final_file.gz 47 | -------------------------------------------------------------------------------- /multisample_test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=136 3 | #SBATCH --ntasks-per-node=68 4 | #SBATCH --time=4:00:00 5 | #SBATCH --account=Pra15_3924 6 | #SBATCH -p knl_usr_prod 7 | #SBATCH -e para-RT-MT.e 8 | #SBATCH -o para-RT-MT.o 9 | 10 | cd $SLURM_SUBMIT_DIR 11 | 12 | BASE_DIR="/marconi_scratch/userexternal/tflati00/reditools_paper/" 13 | 14 | OUTPUT_DIR=$BASE_DIR"/output-multisample-2/" 15 | TEMP_DIR=$BASE_DIR"/tmp-multisample-2/" 16 | COVERAGE_DIR=$BASE_DIR"/cov-multisample-2/" 17 | 18 | #DATA_DIR="/home/flati/data/reditools/input/" 19 | DATA_DIR="$CINECA_SCRATCH/public/" 20 | 21 | module load autoload profile/global 22 | module load ig_homo_sapiens/hg19 23 | REFERENCE=$IG_HG19_GENOME"/genome.fa" 24 | #REFERENCE=$DATA_DIR"hg19m.fa" 25 | 26 | OMOPOLYMER_FILE=$DATA_DIR"omopolymeric_positions.txt" 27 | SIZE_FILE=$REFERENCE".fai" 28 | 29 | SAMPLE_FILE=$BASE_DIR"samples.txt" 30 | 31 | # NUM_CORES=68 32 | 33 | if [ ! -s $SAMPLE_FILE ] 34 | then 35 | echo "File $SAMPLE_FILE does not exist. Please, provide an existing file." 36 | exit 37 | fi 38 | 39 | # Environment setup 40 | module load python/2.7.12 41 | source ENV/bin/activate 42 | module load autoload openmpi/1-10.3--gnu--6.1.0 43 | # module load autoload samtools 44 | module load autoload htslib 45 | 46 | # for SOURCE_BAM_FILE in $(cat $SAMPLE_FILE) 47 | # do 48 | # if [ ! -s $SOURCE_BAM_FILE ] 49 | # then 50 | # echo "File $SOURCE_BAM_FILE does not exists. Skipping." 51 | # continue 52 | # fi 53 | # 54 | # SAMPLE_ID=$(basename $SOURCE_BAM_FILE | sed 's/\.bam//g') 55 | # COV=$COVERAGE_DIR$SAMPLE_ID"/" 56 | # COV_FILE=$COV$SAMPLE_ID".cov" 57 | # 58 | # date 59 | # 60 | # if [ ! -f $COV_FILE ] 61 | # then 62 | # echo "Launching REDItool COVERAGE on $SAMPLE_ID (output_dir=$COV)"; 63 | # 64 | # t1=$(date +%s) 65 | # t1_human=$(date) 66 | # echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"]" 67 | # time ./extract_coverage.sh $SOURCE_BAM_FILE $COV $SIZE_FILE & 68 | # t2=$(date +%s) 69 | # t2_human=$(date) 70 | # elapsed_time=$(($t2-$t1)) 71 | # elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 72 | # echo "[STATS] [COVERAGE] [$SAMPLE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 73 | # fi 74 | # done 75 | # wait 76 | 77 | # strand=0 78 | # options="" 79 | # if [ $strand != 0 ] 80 | # then 81 | # options="-C -T 2 -s $strand" 82 | # fi 83 | options="" 84 | 85 | COV_FILE=$COV$SAMPLE_ID".cov" 86 | TEMP=$TEMP_DIR$SAMPLE_ID"/" 87 | OUTPUT=$OUTPUT_DIR/$SAMPLE_ID/table.gz 88 | 89 | # Program launch 90 | echo "START:"`date` 91 | t1=$(date +%s) 92 | t1_human=$(date) 93 | # time mpirun src/cineca/reditools2_multisample.py -F $SAMPLE_FILE -r $REFERENCE -m $OMOPOLYMER_FILE -D $COVERAGE_DIR -t $TEMP_DIR -Z $SIZE_FILE $options 2>&1 | tee MULTI_SAMPLES.log 94 | time mpirun src/cineca/reditools2_multisample.py -F $SAMPLE_FILE -r $REFERENCE -D $COVERAGE_DIR -t $TEMP_DIR -Z $SIZE_FILE $options 2>&1 | tee MULTI_SAMPLES.log 95 | t2=$(date +%s) 96 | t2_human=$(date) 97 | elapsed_time=$(($t2-$t1)) 98 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 99 | echo "[STATS] [PARALLEL] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 100 | 101 | # export PATH=$HTSLIB_HOME/bin/:$PATH 102 | # for SOURCE_BAM_FILE in $(cat $SAMPLE_FILE) 103 | # do 104 | # t1=$(date +%s) 105 | # t1_human=$(date) 106 | # 107 | # SAMPLE_ID=$(basename $SOURCE_BAM_FILE | sed 's/\.bam//g') 108 | # 109 | # COV=$COVERAGE_DIR$SAMPLE_ID"/" 110 | # COV_FILE=$COV$SAMPLE_ID".cov" 111 | # TEMP=$TEMP_DIR$SAMPLE_ID"/" 112 | # OUTPUT=$OUTPUT_DIR/$SAMPLE_ID/table.gz 113 | # 114 | # time ./merge.sh $TEMP $OUTPUT $NUM_CORES & 115 | # t2=$(date +%s) 116 | # t2_human=$(date) 117 | # elapsed_time=$(($t2-$t1)) 118 | # elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 119 | # echo "[STATS] [MERGE] [$SAMPLE_ID] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 120 | # 121 | # echo "[$SAMPLE_ID] END:"`date` 122 | # echo "OK" > $TEMP/status.txt 123 | # done 124 | # wait 125 | -------------------------------------------------------------------------------- /parallel_test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Parallel test # 4 | source ENV/bin/activate 5 | 6 | SOURCE_BAM_FILE="test/SRR2135332.bam" 7 | REFERENCE="test/chr21.fa" 8 | SIZE_FILE="test/chr21.fa.fai" 9 | 10 | NUM_CORES=2 11 | OUTPUT_FILE="test_results/output/parallel_table.txt.gz" 12 | TEMP_DIR="test_results/temp/" 13 | COVERAGE_FILE="test_results/coverage/SRR2135332.chr21.cov" 14 | COVERAGE_DIR="test_results/coverage/" 15 | 16 | ./extract_coverage.sh $SOURCE_BAM_FILE $COVERAGE_DIR $SIZE_FILE 17 | mpirun -np $NUM_CORES src/cineca/parallel_reditools.py -g "chr21" -f $SOURCE_BAM_FILE -o $OUTPUT_FILE -r $REFERENCE -t $TEMP_DIR -Z $SIZE_FILE -G $COVERAGE_FILE -D $COVERAGE_DIR 18 | ./merge.sh $TEMP_DIR $OUTPUT_FILE $NUM_CORES 19 | 20 | deactivate 21 | -------------------------------------------------------------------------------- /parallel_test_slurm.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=REDItools2Job 3 | #SBATCH -N 3 4 | #SBATCH -n 12 5 | #SBATCH -p m100_usr_prod 6 | #SBATCH --time 02:00:00 7 | #SBATCH --account cin_staff 8 | #SBATCH --error REDItools2Job.err 9 | #SBATCH --output REDItools2Job.out 10 | 11 | SAMPLE_ID="SRR2135332" 12 | SOURCE_BAM_FILE="test/SRR2135332.bam" 13 | REFERENCE="test/chr21.fa" 14 | REFERENCE_DNA=$(basename "$REFERENCE") 15 | SIZE_FILE="test/chr21.fa.fai" 16 | NUM_CORES=12 17 | OUTPUT_FILE="test_results/output/parallel_table.txt.gz" 18 | TEMP_DIR="test_results/temp/" 19 | COVERAGE_FILE="test_results/coverage/SRR2135332.cov" 20 | COVERAGE_DIR="test_results/coverage/" 21 | OUTPUT_DIR=$(basename "$OUTPUT_FILE") 22 | 23 | 24 | module load spack 25 | module load python/2.7.16--gcc--8.4.0-bgv 26 | module load autoload py-mpi4py/3.0.3--gcc--8.4.0-spectrmpi-ac2 27 | module load py-virtualenv/16.7.6--gcc--8.4.0-4ut 28 | module load profile/global 29 | module load samtools/1.12 30 | 31 | source ENV/bin/activate 32 | 33 | if [ ! -f $COVERAGE_FILE ] 34 | then 35 | t1=$(date +%s) 36 | t1_human=$(date) 37 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"]" 38 | ./extract_coverage_dynamic.sh $SOURCE_BAM_FILE $COVERAGE_DIR $SIZE_FILE 39 | t2=$(date +%s) 40 | t2_human=$(date) 41 | elapsed_time=$(($t2-$t1)) 42 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 43 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 44 | fi 45 | 46 | 47 | 48 | 49 | strand=0 50 | options="" 51 | if [ $strand != 0 ] 52 | then 53 | options="-C -T 2 -s $strand" 54 | fi 55 | 56 | # Program launch 57 | echo "START:"`date` 58 | t1=$(date +%s) 59 | t1_human=$(date) 60 | 61 | time mpirun -np $NUM_CORES src/cineca/parallel_reditools.py -f $SOURCE_BAM_FILE -o $OUTPUT_FILE -r $REFERENCE -t $TEMP_DIR -Z $SIZE_FILE -G $COVERAGE_FILE -D $COVERAGE_DIR $options 2>&1 | tee $SAMPLE_ID.log 62 | 63 | t2=$(date +%s) 64 | t2_human=$(date) 65 | elapsed_time=$(($t2-$t1)) 66 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 67 | echo "[STATS] [PARALLEL] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 68 | 69 | t1=$(date +%s) 70 | t1_human=$(date) 71 | export PATH=$HTSLIB_HOME/bin/:$PATH 72 | time ./merge.sh $TEMP_DIR $OUTPUT_FILE $NUM_CORES 73 | t2=$(date +%s) 74 | t2_human=$(date) 75 | elapsed_time=$(($t2-$t1)) 76 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 77 | echo "[STATS] [MERGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 78 | 79 | echo "END:"`date` 80 | echo "OK" > $TEMP_DIR/status.txt 81 | 82 | deactivate 83 | -------------------------------------------------------------------------------- /parallel_test_slurm_DEPRECATED.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=REDItools2Job 4 | #SBATCH -N 1 5 | #SBATCH -n 36 6 | #SBATCH -p gll_usr_prod 7 | #SBATCH --mem=115GB 8 | #SBATCH --time 05:00:00 9 | #SBATCH --account ELIX4_manniron 10 | #SBATCH --error REDItools2Job.err 11 | #SBATCH --output REDItools2Job.out 12 | 13 | 14 | ######################################################### 15 | ######## Parameters setting 16 | ######################################################### 17 | 18 | ##SAMPLE_ID is the basename of the sample of interest 19 | SAMPLE_ID="SRR2135332" 20 | 21 | ##bam file to be analysed 22 | SOURCE_BAM_FILE="test/SRR2135332.bam" 23 | 24 | ##reference chromosome or genome 25 | REFERENCE="test/chr21.fa" 26 | REFERENCE_DNA=$(basename "$REFERENCE") 27 | 28 | ##fasta index file created by samtools 29 | SIZE_FILE="test/chr21.fa.fai" 30 | 31 | ##number of utilized cores 32 | NUM_CORES=2 33 | 34 | ##setting output file 35 | OUTPUT_FILE="test_results/output/parallel_table.txt.gz" 36 | TEMP_DIR="test_results/temp/" 37 | 38 | ##setting the coverage file 39 | COVERAGE_FILE="test_results/coverage/SRR2135332.cov" 40 | 41 | ##setting coverage directory 42 | COVERAGE_DIR="test_results/coverage/" 43 | 44 | ##setting output directory 45 | OUTPUT_DIR=$(basename "$OUTPUT_FILE") 46 | 47 | ######################################################### 48 | ######## Modules loading 49 | ######################################################### 50 | 51 | module load profile/bioinf 52 | module load python/2.7.12 53 | module load autoload samtools/1.9 54 | module load autoload profile/global 55 | module load autoload openmpi/3.1.4--gnu--7.3.0 56 | module load autoload samtools 57 | 58 | echo "Launching REDItool on $SAMPLE_ID (output_file=$OUTPUT_FILE)"; 59 | 60 | ######################################################### 61 | ######## Coverage 62 | ######################################################### 63 | 64 | ## If the coverage file doesn’t exist, then the script calculate It. 65 | 66 | if [ ! -f $COVERAGE_FILE ] 67 | then 68 | t1=$(date +%s) 69 | t1_human=$(date) 70 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"]" 71 | ./extract_coverage_dynamic.sh $SOURCE_BAM_FILE $COVERAGE_DIR $SIZE_FILE 72 | t2=$(date +%s) 73 | t2_human=$(date) 74 | elapsed_time=$(($t2-$t1)) 75 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 76 | echo "[STATS] [COVERAGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 77 | fi 78 | 79 | ######################################################### 80 | ######## Parallel Computation 81 | ######################################################### 82 | 83 | strand=0 84 | options="" 85 | if [ $strand != 0 ] 86 | then 87 | options="-C -T 2 -s $strand" 88 | fi 89 | 90 | # Program launch 91 | echo "START:"`date` 92 | t1=$(date +%s) 93 | t1_human=$(date) 94 | 95 | time mpirun src/cineca/parallel_reditools.py -g $REFERENCE_DNA -f $SOURCE_BAM_FILE -r $REFERENCE -G $COVERAGE_FILE -D $COVERAGE_DIR -t $TEMP_DIR -Z $SIZE_FILE $options 2>&1 | tee $SAMPLE_ID.log 96 | t2=$(date +%s) 97 | t2_human=$(date) 98 | elapsed_time=$(($t2-$t1)) 99 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 100 | echo "[STATS] [PARALLEL] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 101 | 102 | ######################################################### 103 | ######## Merging 104 | ######################################################### 105 | 106 | t1=$(date +%s) 107 | t1_human=$(date) 108 | export PATH=$HTSLIB_HOME/bin/:$PATH 109 | time ./merge.sh $TEMP_DIR $OUTPUT $NUM_CORES 110 | t2=$(date +%s) 111 | t2_human=$(date) 112 | elapsed_time=$(($t2-$t1)) 113 | elapsed_time_human=$(date -d@$elapsed_time -u +%H:%M:%S) 114 | echo "[STATS] [MERGE] START="$t1_human" ["$t1"] END="$t2_human" ["$t2"] ELAPSED="$elapsed_time" HUMAN="$elapsed_time_human 115 | 116 | echo "END:"`date` 117 | echo "OK" > $TEMP_DIR/status.txt 118 | -------------------------------------------------------------------------------- /prepare_test.sh: -------------------------------------------------------------------------------- 1 | cd test 2 | 3 | if [ ! -s chr21.fa ] 4 | then 5 | echo "Reference chromosome 21 (Homo Sapiens) not existing. Downloading..." 6 | wget -O chr21.fa.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr21.fa.gz 7 | 8 | echo "Extracting chr21.fa.gz archive" 9 | gzip -d chr21.fa.gz 10 | fi 11 | 12 | if [ ! -s chr21.fa.fai ] 13 | then 14 | echo "Index .fai not found. Indexing chr21.fa" 15 | samtools faidx chr21.fa 16 | fi 17 | 18 | echo "Test(s) ready!" 19 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pysam 2 | sortedcontainers 3 | psutil 4 | netifaces 5 | mpi4py 6 | -------------------------------------------------------------------------------- /serial_test.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source ENV/bin/activate 4 | 5 | # Serial test # 6 | python src/cineca/reditools.py -f test/SRR2135332.bam -r test/chr21.fa -g chr21 -o serial_table.txt 7 | 8 | deactivate 9 | -------------------------------------------------------------------------------- /serial_test_slurm.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 3 | #SBATCH --ntasks-per-node=1 4 | #SBATCH --time=00:10:00 5 | #SBATCH --account=cin_staff 6 | #SBATCH -p knl_usr_prod 7 | #SBATCH -e serial-RT.e 8 | #SBATCH -o serial-RT.o 9 | 10 | # Serial test (SLURM)# 11 | module load python/2.7.12 12 | 13 | source ENV/bin/activate 14 | 15 | python src/cineca/reditools.py -f test/SRR2135332.bam -g chr21 -r test/chr21.fa -o serial_table_slurm.txt 16 | 17 | deactivate 18 | -------------------------------------------------------------------------------- /src/cineca/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BioinfoUNIBA/REDItools2/17e932fa225477effced64ad5342e7cfd2b7d87b/src/cineca/__init__.py -------------------------------------------------------------------------------- /src/cineca/annotate_with_DNA.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import gzip 4 | 5 | columns = { 6 | "Region": 0, 7 | "Position": 1, 8 | "Reference": 2, 9 | "Strand": 3, 10 | "Coverage": 4, 11 | "MeanQ": 5, 12 | "BaseCount": 6, 13 | "AllSubs": 7, 14 | "Frequency": 8, 15 | "gCoverage": 9, 16 | "gMeanQ": 10, 17 | "gBaseCount": 11, 18 | "gAllSubs": 12, 19 | "gFrequency": 13 20 | } 21 | 22 | def read_line(fd): 23 | line = next(fd, None) 24 | return line.strip().split("\t") if line is not None else [] 25 | 26 | def is_smaller_than_or_equal_to(fields1, fields2): 27 | # if not fields1 and not fields2: return True 28 | # if fields1 and not fields2: return True 29 | if not fields1 and fields2: return False 30 | 31 | if fields1 and fields2: 32 | region1 = get(fields1, "Region") 33 | region2 = get(fields2, "Region") 34 | 35 | index1 = chromosomes.index(region1) if region1 in chromosomes else chromosomes.index("chr" + region1) 36 | index2 = chromosomes.index(region2) if region2 in chromosomes else chromosomes.index("chr" + region2) 37 | 38 | # sys.stderr.write(" ".join([str(x) for x in [region1, region2, index1, index2]]) + "\n") 39 | 40 | if index1 < index2: 41 | return True 42 | 43 | if index1 > index2: 44 | return False 45 | 46 | return index1 == index2 and int(get(fields1, "Position")) <= int(get(fields2, "Position")) 47 | 48 | return True 49 | 50 | def get(fields, column): 51 | value = None 52 | 53 | index = columns[column] 54 | if len(fields) >= index: 55 | value = fields[index] 56 | 57 | return value 58 | 59 | comp = {'A':'T','T':'A','C':'G','G':'C'} 60 | indexes = {v: k for k, v in dict(enumerate('ACGT')).iteritems()} 61 | 62 | def annotate(fields1, fields2): 63 | 64 | strand = get(fields1, "Strand") 65 | 66 | if strand == '0': 67 | base_count = eval(get(fields2, "BaseCount")) # BaseCount[A,C,G,T] 68 | fields2[columns["BaseCount"]] = str([base_count[indexes[comp[b]]] for b in 'ACGT']) 69 | 70 | subs = get(fields2, "AllSubs").split(" ") 71 | fields2[columns["AllSubs"]] = " ".join([''.join([comp[b] if b != "-" else b for b in sub]) for sub in subs]) 72 | 73 | for field in ["Coverage", "MeanQ", "BaseCount", "AllSubs", "Frequency"]: 74 | annotation = get(fields2, field) 75 | # if annotation is None: 76 | # print(fields1) 77 | # print(fields2) 78 | # print(field, annotation) 79 | 80 | fields1[columns["g" + field]] = annotation 81 | 82 | chromosomes = [] 83 | def load_chromosomes(fai): 84 | with open(fai, "r") as reader: 85 | for line in reader: 86 | chromosome = line.strip().split("\t")[0] 87 | if chromosome in chromosomes: continue 88 | chromosomes.append(chromosome) 89 | 90 | LOG_INTERVAL = 1000000 91 | 92 | import argparse 93 | if __name__ == '__main__': 94 | 95 | parser = argparse.ArgumentParser(description='REDItools 2.0 annotator') 96 | parser.add_argument('-r', '--rna-file', required=True, help='The RNA-editing events table to be annotated') 97 | parser.add_argument('-d', '--dna-file', required=True, help='The RNA-editing events table as obtained from DNA-Seq data') 98 | parser.add_argument('-R', '--reference', required=True, help='The .fai file of the reference genome containing the ordered chromosomes') 99 | parser.add_argument('-Z', '--only-omozygotes', default=False, action='store_true', help='Exclude positions with multiple changes in DNA-Seq') 100 | args = parser.parse_known_args()[0] 101 | 102 | file1 = args.rna_file 103 | file2 = args.dna_file 104 | fai_file = args.reference 105 | load_chromosomes(fai_file) 106 | only_omozygotes = args.only_omozygotes 107 | 108 | sys.stderr.write("[INFO] {} CHROMOSOMES LOADED\n".format(len(chromosomes))) 109 | 110 | file1root, ext1 = os.path.splitext(file1) 111 | file2root, ext2 = os.path.splitext(file2) 112 | 113 | fd1 = gzip.open(file1, "r") if ext1 == ".gz" else open(file1, "r") 114 | fd2 = gzip.open(file2, "r") if ext2 == ".gz" else open(file2, "r") 115 | fd3 = sys.stdout 116 | 117 | total1 = 0 118 | total2 = 0 119 | last_chr = None 120 | with fd1, fd2, fd3: 121 | 122 | fields1 = read_line(fd1) 123 | total1 += 1 124 | if fields1[0] == "Region": 125 | fields1 = read_line(fd1) 126 | total1 += 1 127 | 128 | fields2 = read_line(fd2) 129 | total2 += 1 130 | if fields2[0] == "Region": 131 | fields2 = read_line(fd2) 132 | total2 += 1 133 | 134 | while fields1 or fields2: 135 | 136 | if fields1[0] != last_chr: 137 | last_chr = fields1[0] 138 | sys.stderr.write("ANALYZING CHROMOSOME " + last_chr + "\n") 139 | 140 | f1_less_than_f2 = is_smaller_than_or_equal_to(fields1, fields2) 141 | f2_less_than_f1 = is_smaller_than_or_equal_to(fields2, fields1) 142 | are_equal = f1_less_than_f2 and f2_less_than_f1 143 | 144 | # sys.stderr.write(str(fields1) + "\n") 145 | # sys.stderr.write(str(fields2) + "\n") 146 | # sys.stderr.write(str(f1_less_than_f2) + " " + str(f2_less_than_f1) + " " + str(are_equal) + "\n") 147 | # raw_input() 148 | 149 | omozigote = True if not fields2 else not are_equal or fields2[columns["AllSubs"]] == "-" 150 | 151 | if are_equal: 152 | annotate(fields1, fields2) 153 | 154 | if fields1: 155 | if not only_omozygotes or omozigote: 156 | fd3.write("\t".join(fields1) + "\n") 157 | else: 158 | sys.stderr.write("[INFO] [{}] Discarding {}:{} because DNA data is not omozygote from {}\n".format(last_chr, fields1[0], fields1[1], file1)) 159 | 160 | if f1_less_than_f2: 161 | fields1 = read_line(fd1) 162 | total1 += 1 163 | 164 | if f2_less_than_f1: 165 | fields2 = read_line(fd2) 166 | total2 += 1 167 | 168 | if total1 % LOG_INTERVAL == 0: 169 | sys.stderr.write("[INFO] [{}] {} lines read from {}\n".format(last_chr, total1, file1)) 170 | 171 | if total2 % LOG_INTERVAL == 0: 172 | sys.stderr.write("[INFO] [{}] {} lines read from {}\n".format(last_chr, total2, file2)) 173 | 174 | sys.stderr.write("[INFO] {} lines read from {}\n".format(total1, file1)) 175 | sys.stderr.write("[INFO] {} lines read from {}\n".format(total2, file2)) 176 | -------------------------------------------------------------------------------- /src/cineca/parallel_reditools.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import glob 5 | import sys 6 | import re 7 | import time 8 | from mpi4py import MPI 9 | from datetime import datetime 10 | from collections import OrderedDict 11 | import reditools 12 | import argparse 13 | import gc 14 | import socket 15 | import netifaces 16 | import json 17 | 18 | ALIGN_CHUNK = 0 19 | STOP_WORKING = 1 20 | IM_FREE = 2 21 | CALCULATE_COVERAGE = 3 22 | 23 | STEP = 10000000 24 | 25 | TIME_STATS = {} 26 | 27 | def get_intervals(intervals, num_intervals): 28 | homeworks = [] 29 | for chromosome in chromosomes.keys(): 30 | print("chromosome:" + chromosome) 31 | chromosome_length = chromosomes[chromosome] 32 | print("len:"+ str(chromosome_length)) 33 | step = STEP 34 | if chromosome == "chrM": 35 | step = (int)(chromosome_length / 100) 36 | chromosome_slices = list(range(1, chromosome_length, step)) + [chromosome_length+1] 37 | print(chromosome_slices) 38 | print(len(chromosome_slices)) 39 | print("#slices:" + str(len(chromosome_slices))) 40 | 41 | for i in range(0, len(chromosome_slices)-1): 42 | homeworks.append((chromosome, chromosome_slices[i], chromosome_slices[i+1]-1)) 43 | return homeworks 44 | 45 | def weight_function(x): 46 | # x = math.log(1+x) 47 | # return 2.748*10**(-3)*x**3 -0.056*x**2 + 0.376*x + 2.093 48 | return x**3 49 | 50 | def get_coverage(coverage_file, region = None): 51 | 52 | # Open the file and read i-th section (jump to the next '\n' character) 53 | n = float(size) 54 | file_size = os.path.getsize(coverage_file) 55 | print("[{}] SIZE OF FILE {}: {} bytes".format(rank, coverage_file, file_size)) 56 | start = int(rank*(file_size/n)) 57 | end = int((rank+1)*(file_size/n)) 58 | print("[{}] [DEBUG] START={} END={}".format(rank, start, end)) 59 | 60 | f = open(coverage_file, "r") 61 | f.seek(start) 62 | loaded = start 63 | coverage_partial = 0 64 | with f as lines: 65 | line_no = 0 66 | for line in lines: 67 | if loaded >= end: continue 68 | loaded += len(line) 69 | 70 | line_no += 1 71 | if line_no == 1: 72 | if not line.startswith("chr"): 73 | continue 74 | 75 | triple = line.rstrip().split("\t") 76 | 77 | if region is not None: 78 | if triple[0] != region[0]: continue 79 | if len(region) >= 2 and int(triple[1]) < region[1]: continue 80 | if len(region) >= 2 and int(triple[1]) > region[2]: continue 81 | 82 | #if line_no % 10000000 == 0: 83 | # print("[{}] [DEBUG] Read {} lines so far".format(rank, line_no)) 84 | cov = int(triple[2]) 85 | coverage_partial += weight_function(cov) 86 | 87 | print("[{}] START={} END={} PARTIAL_COVERAGE={}".format(rank, start, end, coverage_partial)) 88 | 89 | # Reduce 90 | coverage = None 91 | 92 | coverages = comm.gather(coverage_partial) 93 | if rank == 0: 94 | print("COVERAGES:", str(coverages)) 95 | coverage = reduce(lambda x,y: x+y, coverages) 96 | 97 | coverage = comm.bcast(coverage, root=0) 98 | 99 | # Return the total 100 | return coverage 101 | 102 | def calculate_intervals(total_coverage, coverage_file, region): 103 | print("[SYSTEM] [{}] Opening coverage file={}".format(rank, coverage_file)) 104 | f = open(coverage_file, "r") 105 | 106 | chr = None 107 | start = None 108 | end = None 109 | C = 0 110 | max_interval_width = min(3000000, 3000000000 / size) 111 | 112 | subintervals = [] 113 | subtotal = total_coverage / size 114 | print("[SYSTEM] TOTAL={} SUBTOTAL={} MAX_INTERVAL_WIDTH={}".format(total_coverage, subtotal, max_interval_width)) 115 | 116 | line_no = 0 117 | with f as lines: 118 | for line in lines: 119 | line_no += 1 120 | if line_no % 1000000 == 0: 121 | print("[SYSTEM] [{}] Time: {} - {} lines loaded.".format(rank, time.time(), line_no)) 122 | 123 | fields = line.rstrip().split("\t") 124 | 125 | if region is not None: 126 | if fields[0] != region[0]: continue 127 | if len(region) >= 2 and int(fields[1]) < region[1]: continue 128 | if len(region) >= 3 and int(fields[1]) > region[2]: continue 129 | 130 | # If the interval has become either i) too large or ii) too heavy or iii) spans across two different chromosomes 131 | if C >= subtotal or (chr is not None and fields[0] != chr) or (end is not None and start is not None and (end-start) > max_interval_width): 132 | reason = None 133 | if C >= subtotal: reason = "WEIGHT" 134 | elif chr is not None and fields[0] != chr: reason = "END_OF_CHROMOSOME" 135 | elif end is not None and start is not None and (end-start) > max_interval_width: reason = "MAX_WIDTH" 136 | 137 | interval = (chr, start, end, C, end-start, reason) 138 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(rank, time.time(), interval)) 139 | subintervals.append(interval) 140 | chr = None 141 | start = None 142 | end = None 143 | C = 0 144 | if len(fields) < 3: continue 145 | 146 | if chr is None: chr = fields[0] 147 | if start is None: start = int(fields[1]) 148 | end = int(fields[1]) 149 | # C += math.pow(int(fields[2]), 2) 150 | C += weight_function(int(fields[2])) 151 | 152 | if C > 0: 153 | reason = "END_OF_CHROMOSOME" 154 | interval = (chr, start, end, C, end-start, reason) 155 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(rank, time.time(), interval)) 156 | subintervals.append(interval) 157 | 158 | return subintervals 159 | 160 | if __name__ == '__main__': 161 | 162 | # MPI init 163 | comm = MPI.COMM_WORLD 164 | rank = comm.Get_rank() 165 | size = comm.Get_size() 166 | 167 | options = reditools.parse_options() 168 | options["remove_header"] = True 169 | 170 | parser = argparse.ArgumentParser(description='REDItools 2.0') 171 | parser.add_argument('-G', '--coverage-file', help='The coverage file of the sample to analyze') 172 | parser.add_argument('-D', '--coverage-dir', help='The coverage directory containing the coverage file of the sample to analyze divided by chromosome') 173 | parser.add_argument('-t', '--temp-dir', help='The temp directory where to store temporary data for this sample') 174 | parser.add_argument('-Z', '--chromosome-sizes', help='The file with the chromosome sizes') 175 | parser.add_argument('-g', '--region', help='The region of the bam file to be analyzed') 176 | args = parser.parse_known_args()[0] 177 | 178 | coverage_file = args.coverage_file 179 | coverage_dir = args.coverage_dir 180 | temp_dir = args.temp_dir 181 | size_file = args.chromosome_sizes 182 | 183 | if not os.path.isfile(coverage_file): 184 | print("[ERROR] Coverage file {} not existing!".format(coverage_file)) 185 | exit(1) 186 | 187 | # output = options["output"] 188 | # format = output.split(".")[-1] 189 | # hostname = socket.gethostname() 190 | # host = socket.gethostbyname(hostname) 191 | # fqdn = socket.getfqdn() 192 | interface = 'ib0' if 'ib0' in netifaces.interfaces() else netifaces.interfaces()[0] 193 | hostname = socket.gethostbyaddr(netifaces.ifaddresses(interface)[netifaces.AF_INET][0]['addr']) 194 | pid = os.getpid() 195 | print("[SYSTEM] [TECH] [NODE] RANK:{} HOSTNAME:{} PID:{}".format(rank, hostname, pid)) 196 | 197 | if rank == 0: 198 | print("[SYSTEM] LAUNCHED PARALLEL REDITOOLS WITH THE FOLLOWING OPTIONS:", options, args) 199 | 200 | region = None 201 | if args.region: 202 | region = re.split("[:-]", args.region) 203 | if not region or len(region) == 2 or (len(region) == 3 and region[1] == region[2]): 204 | sys.stderr.write("[ERROR] Please provide a region of the form chrom:start-end (with end > start). Region provided: {}".format(region)) 205 | exit(1) 206 | if len(region) >= 2: 207 | region[1] = int(region[1]) 208 | region[2] = int(region[2]) 209 | 210 | t1 = time.time() 211 | 212 | print("I am rank #"+str(rank)) 213 | 214 | time_data = {} 215 | time_data["periods"] = [] 216 | time_data["groups"] = [] 217 | for i in range(0, size): 218 | time_data["groups"].append([]) 219 | 220 | if rank == 0: 221 | time_data["periods"].append({"id": "INTERVALS", "content": "Intervals", "start": str(datetime.now()), "type": "background"}) 222 | 223 | # COVERAGE SECTION 224 | try: 225 | if not os.path.exists(temp_dir): 226 | os.makedirs(temp_dir) 227 | except Exception as e: 228 | print("[WARN] {}".format(e)) 229 | 230 | interval_file = temp_dir + "/intervals.txt" 231 | homeworks = [] 232 | if os.path.isfile(interval_file) and os.stat(interval_file).st_size > 0: 233 | if rank == 0: 234 | print("[0] [RESTART] FOUND INTERVAL FILE {} ".format(interval_file)) 235 | expected_total = 0 236 | for line in open(interval_file, "r"): 237 | line = line.strip() 238 | 239 | if expected_total == 0: 240 | expected_total = int(line) 241 | continue 242 | 243 | # Interval format: (chr, start, end, C, end-start, reason) 244 | fields = line.split("\t") 245 | for i in range(1, 5): 246 | fields[i] = int(fields[i]) 247 | homeworks.append(fields) 248 | else: 249 | if rank == 0: 250 | time_data["periods"].append({"id": "COVERAGE", "content": "Total coverage", "start": str(datetime.now()), "type": "background"}) 251 | print("[0] PRE-COVERAGE TIME " + str(datetime.now().time())) 252 | 253 | total_coverage = get_coverage(coverage_file, region) 254 | # print("TOTAL COVERAGE", str(total_coverage)) 255 | 256 | if rank == 0: 257 | time_data["periods"][-1]["end"] = str(datetime.now()) 258 | now = datetime.now().time() 259 | elapsed = time.time() - t1 260 | print("[SYSTEM] [TIME] [MPI] [0] MIDDLE-COVERAGE [now:{}] [elapsed: {}]".format(now, elapsed)) 261 | 262 | # Collect all the files with the coverage 263 | files = [] 264 | for file in os.listdir(coverage_dir): 265 | if region is not None and file != region[0]: continue 266 | if file.startswith("."): continue 267 | if file.endswith(".cov"): continue 268 | if file == "chrM": continue 269 | if file.endswith("chronometer.txt"): continue 270 | 271 | files.append(file) 272 | files.sort() 273 | 274 | if rank == 0: 275 | print("[0] " + str(len(files)) + " FILES => " + str(files)) 276 | 277 | ''' 278 | # Assign interval calculation to slaves 279 | fps = int(len(files) / size) 280 | if fps == 0: fps = 1 281 | print("Files per mpi process: " + str(fps)) 282 | subintervals = [] 283 | for i in range(0, size): 284 | if rank == i: 285 | from_file = i*fps 286 | to_file = i*fps+fps if i len(files): continue 288 | if to_file > len(files): continue 289 | 290 | print("[{}] Processing from file {} to file {} = {}".format(rank, from_file, to_file, files[from_file:to_file])) 291 | 292 | for file in files[from_file:to_file]: 293 | file_intervals = calculate_intervals(total_coverage, "pieces/" + file) 294 | for interv in file_intervals: 295 | subintervals.append(interv) 296 | 297 | # Gather all the intervals calculated from the slaves 298 | all_subintervals = [] 299 | if rank == 0: 300 | intervals = None 301 | all_subintervals = comm.gather(subintervals) 302 | print("[0] {} total intervals received.".format(len(all_subintervals))) 303 | homeworks = reduce(lambda x,y: x+y, all_subintervals) 304 | print("[0] {} total intervals aggregated.".format(len(homeworks))) 305 | for interval in homeworks: 306 | print(interval) 307 | ''' 308 | 309 | # Master: dispatches the work to the other slaves 310 | if rank == 0: 311 | start_intervals = t1 312 | print("[0] Start time: {}".format(start_intervals)) 313 | 314 | done = 0 315 | total = len(files) 316 | 317 | queue = set() 318 | for i in range(1, min(size, total+1)): 319 | file = files.pop() 320 | print("[SYSTEM] [MPI] [0] Sending coverage data "+ str(file) +" to rank " + str(i)) 321 | comm.send(file, dest=i, tag=CALCULATE_COVERAGE) 322 | queue.add(i) 323 | 324 | while len(files) > 0: 325 | status = MPI.Status() 326 | subintervals = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 327 | for subinterval in subintervals: 328 | homeworks.append(subinterval) 329 | 330 | done += 1 331 | who = status.Get_source() 332 | queue.remove(who) 333 | now = datetime.now().time() 334 | elapsed = time.time() - start_intervals 335 | print("[SYSTEM] [TIME] [MPI] [0] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue)) 336 | 337 | file = files.pop() 338 | print("[SYSTEM] [MPI] [0] Sending coverage data "+ str(file) +" to rank " + str(who)) 339 | comm.send(file, dest=who, tag=CALCULATE_COVERAGE) 340 | queue.add(who) 341 | 342 | while len(queue) > 0: 343 | status = MPI.Status() 344 | print("[SYSTEM] [MPI] [0] Going to receive data from slaves.") 345 | subintervals = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 346 | for subinterval in subintervals: 347 | homeworks.append(subinterval) 348 | 349 | done += 1 350 | who = status.Get_source() 351 | queue.remove(who) 352 | now = datetime.now().time() 353 | elapsed = time.time() - start_intervals 354 | print("[SYSTEM] [TIME] [MPI] [0] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue)) 355 | 356 | now = datetime.now().time() 357 | elapsed = time.time() - start_intervals 358 | 359 | interval_file = temp_dir + "/intervals.txt" 360 | print("[SYSTEM] [TIME] [MPI] [0] SAVING INTERVALS TO {} [now:{}] [elapsed: {}]".format(interval_file, now, elapsed)) 361 | writer = open(interval_file, "w") 362 | writer.write(str(len(homeworks)) + "\n") 363 | for homework in homeworks: 364 | writer.write("\t".join([str(x) for x in homework]) + "\n") 365 | writer.close() 366 | 367 | now = datetime.now().time() 368 | elapsed = time.time() - start_intervals 369 | print("[SYSTEM] [TIME] [MPI] [0] INTERVALS SAVED TO {} [now:{}] [elapsed: {}]".format(interval_file, now, elapsed)) 370 | 371 | print("[SYSTEM] [TIME] [MPI] [0] FINISHED CALCULATING INTERVALS [now:{}] [elapsed: {}]".format(now, elapsed)) 372 | 373 | TIME_STATS["COVERAGE"] = { 374 | "start": start_intervals, 375 | "end": time.time(), 376 | "elapsed": elapsed 377 | } 378 | 379 | if rank == 0: 380 | 381 | time_data["periods"][0]["end"] = str(datetime.now()) 382 | 383 | ########################################################### 384 | ######### COMPUTATION SECTION ############################# 385 | ########################################################### 386 | done = 0 387 | parallel_time_section_data = {"id": "ANALYSIS", "content": "Parallel", "start": str(datetime.now()), "type": "background"} 388 | time_data["periods"].append(parallel_time_section_data) 389 | print("[SYSTEM] [TIME] [MPI] [0] REDItools STARTED. MPI SIZE (PROCS): {} [now: {}]".format(size, datetime.now().time())) 390 | 391 | intervals_done = set() 392 | progress_file = temp_dir + "/progress.txt" 393 | if os.path.exists(progress_file): 394 | with open(progress_file, "r") as file: 395 | for line in file: 396 | pieces = line.strip().split() 397 | chromosome = pieces[1].split(":")[0] 398 | start, end = pieces[1].split(":")[1].split("-") 399 | interval_done = (chr, start, end) 400 | intervals_done.add(interval_done) 401 | 402 | t1 = time.time() 403 | 404 | print("Loading chromosomes' sizes!") 405 | chromosomes = OrderedDict() 406 | for line in open(size_file): 407 | (key, val) = line.split()[0:2] 408 | chromosomes[key] = int(val) 409 | print("Sizes:") 410 | print(chromosomes) 411 | 412 | homeworks_to_remove = set() 413 | for hw in homeworks: 414 | interval = (hw[0], hw[1], hw[2]) 415 | if interval in intervals_done: 416 | homeworks_to_remove.add(hw) 417 | for hw in homeworks_to_remove: 418 | homeworks.remove(hw) 419 | 420 | something_to_analyze = True 421 | if len(homeworks) == 0: 422 | something_to_analyze = False 423 | 424 | if something_to_analyze: 425 | intervals_done_writer = open(progress_file, "w") 426 | 427 | total = len(homeworks) 428 | print("[SYSTEM] [MPI] [0] HOMEWORKS", total, homeworks) 429 | #shuffle(homeworks) 430 | 431 | start = time.time() 432 | 433 | print("[SYSTEM] [TIME] [MPI] [0] REDItools PILEUP START: [now: {}]".format(datetime.now().time())) 434 | 435 | queue = set() 436 | for i in range(1, min(size, total)): 437 | interval = homeworks.pop() 438 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(i)) 439 | id_event = str(i)+"#"+str(len(time_data["groups"][i])) 440 | time_data["groups"][i].append({"id": id_event, "content": id_event, "start": str(datetime.now()), "group": i, 441 | "extra": { 442 | "interval": "{}:{}-{}".format(interval[0], interval[1], interval[2]), 443 | "weight": str(interval[3]), 444 | "width": str(interval[4]), 445 | "reason": str(interval[5]) 446 | }}) 447 | comm.send(interval, dest=i, tag=ALIGN_CHUNK) 448 | queue.add(i) 449 | 450 | while len(homeworks) > 0: 451 | status = MPI.Status() 452 | comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 453 | done += 1 454 | who = status.Get_source() 455 | queue.remove(who) 456 | now = datetime.now().time() 457 | elapsed = time.time() - start 458 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue)) 459 | time_data["groups"][who][-1]["end"] = str(datetime.now()) 460 | time_data["groups"][who][-1]["extra"]["duration"] = str(datetime.strptime(time_data["groups"][who][-1]["end"], '%Y-%m-%d %H:%M:%S.%f') - datetime.strptime(time_data["groups"][who][-1]["start"], '%Y-%m-%d %H:%M:%S.%f')) 461 | time_data["groups"][who][-1]["extra"]["done"] = done 462 | time_data["groups"][who][-1]["extra"]["total"] = total 463 | time_data["groups"][who][-1]["extra"]["total (%)"] = "{:.2f}%".format(100 * float(done)/total) 464 | 465 | interval = time_data["groups"][who][-1]["extra"]["interval"] 466 | intervals_done_writer.write("{}\t{}\t{}\n".format(who, interval, temp_dir + "/" + interval.replace(":", "#") + ".gz")) 467 | intervals_done_writer.flush() 468 | 469 | interval = homeworks.pop() 470 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(who)) 471 | id_event = str(who)+"#"+str(len(time_data["groups"][who])) 472 | 473 | time_data["groups"][who].append({"id": id_event, "content": id_event, "start": str(datetime.now()), "group": who, 474 | "extra": { 475 | "interval": "{}:{}-{}".format(interval[0], interval[1], interval[2]), 476 | "weight": str(interval[3]), 477 | "width": str(interval[4]), 478 | "reason": str(interval[5]) 479 | }}) 480 | comm.send(interval, dest=who, tag=ALIGN_CHUNK) 481 | queue.add(who) 482 | 483 | while len(queue) > 0: 484 | status = MPI.Status() 485 | comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 486 | done += 1 487 | who = status.Get_source() 488 | queue.remove(who) 489 | now = datetime.now().time() 490 | elapsed = time.time() - start 491 | time_data["groups"][who][-1]["end"] = str(datetime.now()) 492 | time_data["groups"][who][-1]["extra"]["duration"] = str(datetime.strptime(time_data["groups"][who][-1]["end"], '%Y-%m-%d %H:%M:%S.%f') - datetime.strptime(time_data["groups"][who][-1]["start"], '%Y-%m-%d %H:%M:%S.%f')) 493 | time_data["groups"][who][-1]["extra"]["done"] = done 494 | time_data["groups"][who][-1]["extra"]["total"] = total 495 | time_data["groups"][who][-1]["extra"]["total (%)"] = "{:.2f}%".format(100 * float(done)/total) 496 | 497 | interval = time_data["groups"][who][-1]["extra"]["interval"] 498 | intervals_done_writer.write("{}\t{}\t{}\n".format(who, interval, temp_dir + interval.replace(":", "#") + ".gz")) 499 | intervals_done_writer.flush() 500 | 501 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue)) 502 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending DIE SIGNAL TO RANK " + str(who)) 503 | comm.send(None, dest=who, tag=STOP_WORKING) 504 | 505 | parallel_time_section_data["end"] = str(datetime.now()) 506 | if something_to_analyze: 507 | intervals_done_writer.close() 508 | 509 | ################################################# 510 | ########### WRITE TIME DATA ##################### 511 | ################################################# 512 | events = [] 513 | for period in time_data["periods"]: 514 | events.append(period) 515 | 516 | for group in time_data["groups"]: 517 | for event in group: 518 | extras = [] 519 | for key, value in event["extra"].items(): 520 | extras.append("{}: {}".format(key, value)) 521 | 522 | event["title"] = "
".join(extras) 523 | events.append(event) 524 | 525 | groups = [] 526 | for i in range(0, size): 527 | groups.append({"id": i, "content": "MPI Proc. #"+str(i)}) 528 | 529 | 530 | time_file = temp_dir + "times.txt" 531 | f = open(time_file, "w") 532 | json.dump(events, f) 533 | f.close() 534 | 535 | group_file = temp_dir + "groups.txt" 536 | f = open(group_file, "w") 537 | json.dump(groups, f) 538 | f.close() 539 | 540 | # We have finished processing all the chunks. Let's notify this to slaves 541 | # for i in range(1, size): 542 | # print("[SYSTEM] [MPI] [0] Sending DIE SIGNAL TO RANK " + str(i)) 543 | # comm.send(None, dest=i, tag=STOP_WORKING) 544 | 545 | ##################################################################### 546 | ######### RECOMBINATION OF SINGLE FILES ############################# 547 | ##################################################################### 548 | t2 = time.time() 549 | elapsed = t2-t1 550 | print("[SYSTEM] [TIME] [MPI] [0] WHOLE PARALLEL ANALYSIS FINISHED. CREATING SETUP FOR MERGING PARTIAL FILES - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(elapsed, t2, datetime.now().time())) 551 | TIME_STATS["COMPUTATION"] = { 552 | "start": t1, 553 | "end": t2, 554 | "elapsed": elapsed 555 | } 556 | 557 | little_files = [] 558 | print("Scanning all files in "+temp_dir+" matching " + ".*") 559 | for little_file in glob.glob(temp_dir + "/*"): 560 | if little_file.endswith("chronometer.txt"): continue 561 | if little_file.endswith("files.txt"): continue 562 | if little_file.endswith("intervals.txt"): continue 563 | if little_file.endswith("status.txt"): continue 564 | if little_file.endswith("progress.txt"): continue 565 | if little_file.endswith("times.txt"): continue 566 | if little_file.endswith("groups.txt"): continue 567 | 568 | print(little_file) 569 | pieces = re.sub("\..*", "", os.path.basename(little_file)).split("#") 570 | pieces.insert(0, little_file) 571 | little_files.append(pieces) 572 | 573 | # Sort the output files 574 | keys = chromosomes.keys() 575 | print("[SYSTEM] "+str(len(little_files))+" FILES TO MERGE: ", little_files) 576 | little_files = sorted(little_files, key = lambda x: (keys.index(x[1]) if x[1] in keys else keys.index("chr"+x[1]), int(x[2]))) 577 | print("[SYSTEM] "+str(len(little_files))+" FILES TO MERGE (SORTED): ", little_files) 578 | 579 | smallfiles_list_filename = temp_dir + "files.txt" 580 | f = open(smallfiles_list_filename, "w") 581 | for little_file in little_files: 582 | f.write(little_file[0] + "\n") 583 | f.close() 584 | 585 | # Open the final output file 586 | # output_dir = os.path.dirname(output) 587 | # if not os.path.exists(output_dir): 588 | # os.makedirs(output_dir) 589 | # final_file = gzip.open(output, "w") 590 | 591 | # final_file.write("\t".join(reditools.get_header()) + "\n") 592 | 593 | # total = len(little_files) 594 | # done = 0 595 | # for little_file in little_files: 596 | # print("Writing ", little_file) 597 | # file = little_file[0] 598 | # 599 | # f = gzip.open(file) 600 | # final_file.write(f.read()) 601 | # f.close() 602 | # 603 | # done = done + 1 604 | # print(file + "\t["+str(done)+"/"+str(total)+" - {:.2%}]".format(done/float(total))) 605 | # 606 | # final_file.close() 607 | 608 | t2 = time.time() 609 | print("[SYSTEM] [TIME] [MPI] [0] [END] - WHOLE ANALYSIS FINISHED - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(t2-t1, t2, datetime.now().time())) 610 | 611 | if "COVERAGE" in TIME_STATS: 612 | print("[STATS] [COVERAGE] START={} END={} ELAPSED={}".format(TIME_STATS["COVERAGE"]["start"], TIME_STATS["COVERAGE"]["end"], TIME_STATS["COVERAGE"]["elapsed"])) 613 | 614 | if "COMPUTATION" in TIME_STATS: 615 | print("[STATS] [COMPUTATION] START={} END={} ELAPSED={}".format(TIME_STATS["COMPUTATION"]["start"], TIME_STATS["COMPUTATION"]["end"], TIME_STATS["COMPUTATION"]["elapsed"])) 616 | 617 | # Slave processes 618 | if rank > 0: 619 | 620 | while(True): 621 | # Execute bowtie, view and sort 622 | status = MPI.Status() 623 | data = comm.recv(source=0, tag=MPI.ANY_TAG, status=status) 624 | 625 | tag = status.Get_tag() 626 | if tag == CALCULATE_COVERAGE: 627 | intervals = calculate_intervals(total_coverage, coverage_dir + data, region) 628 | comm.send(intervals, dest=0, tag=IM_FREE) 629 | if tag == ALIGN_CHUNK: 630 | 631 | # Process it 632 | time_start = time.time() 633 | time_s = datetime.now().time() 634 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] REDItools: STARTED {} from rank 0 [{}]".format(str(rank), str(data), time_s)) 635 | 636 | # Command: python REDItoolDnaRna_1.04_n.py -i $INPUT -o editing -f hg19.fa -t $THREADS 637 | # -c 1,1 -m 20,20 -v 0 -q 30,30 -s 2 -g 2 -S -e -n 0.0 -N 0.0 -u -l -H -Y $CHR:$LEFT-$RIGHT -F $CHR_$LEFT_$RIGHT 638 | # Command REDItools2.0: reditools2.0/src/cineca/reditools.py -f /gss/gss_work/DRES_HAIdA/gtex/SRR1413602/SRR1413602.bam 639 | # -r ../../hg19.fa -g chr18:14237-14238 640 | 641 | id = data[0] + "#" + str(data[1]) + "#" + str(data[2]) 642 | 643 | options["region"] = [data[0], data[1], data[2]] 644 | options["output"] = temp_dir + "/" + id + ".gz" 645 | 646 | print("[MPI] [" + str(rank) + "] COMMAND-LINE:", options) 647 | 648 | gc.collect() 649 | reditools.analyze(options) 650 | 651 | time_end = time.time() 652 | print("[SYSTEM] [TIME] [MPI] [{}] REDItools: FINISHED {} [{}][{}] [TOTAL:{:5.2f}]".format(str(rank), str(data), time_s, datetime.now().time(), time_end - time_start)) 653 | 654 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [SEND] [{}] SENDING IM_FREE tag TO RANK 0 [{}]".format(str(rank), datetime.now().time())) 655 | comm.send(None, dest=0, tag=IM_FREE) 656 | elif tag == STOP_WORKING: 657 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] received DIE SIGNAL FROM RANK 0 [{}]".format(str(rank), datetime.now().time())) 658 | break 659 | 660 | print("[{}] EXITING [now:{}]".format(rank, time.time())) 661 | -------------------------------------------------------------------------------- /src/cineca/reditools.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | ''' 4 | Created on 09 gen 2017 5 | 6 | @author: flati 7 | ''' 8 | 9 | import pysam 10 | import sys 11 | import datetime 12 | from collections import defaultdict 13 | import gzip 14 | from sortedcontainers import SortedSet 15 | import os 16 | import argparse 17 | import re 18 | import psutil 19 | import socket 20 | import netifaces 21 | 22 | DEBUG = False 23 | 24 | def delta(t2, t1): 25 | delta = t2 - t1 26 | hours, remainder = divmod(delta.seconds, 3600) 27 | minutes, seconds = divmod(remainder, 60) 28 | 29 | return "%02d:%02d:%02d" % (hours, minutes, seconds) 30 | 31 | def print_reads(reads, i): 32 | total = 0 33 | for key in reads: 34 | total += len(reads[key]) 35 | print("[INFO] E[i="+str(key)+"]["+str(len(reads[key]))+"] strand=" + str(strand)) 36 | for read in reads[key]: 37 | # index = read["index"] 38 | index = read["alignment_index"] 39 | 40 | print("[INFO] \tR:" + str(read["reference"]) + " [r1="+str(read["object"].is_read1)+", r2="+str(read["object"].is_read2) +", reverse="+str(read["object"].is_reverse) +", pos="+str(read["pos"])+", alignment_index=" + str(index) + ", reference_start="+str(read["object"].reference_start)+" , align_start="+str(read["object"].query_alignment_start) + ", cigar=" + str(read["cigar"])+ ", cigar_list=" + str(read["cigar_list"]) + ", "+ str(len(read["query_qualities"]))+ ", " + str(read["query_qualities"]) + "]") 41 | print("[INFO] \tQ:" + str(read["sequence"])) 42 | print("READS[i="+str(i)+"] = " + str(total)) 43 | 44 | def update_reads(reads, i): 45 | if DEBUG: 46 | print("[INFO] UPDATING READS IN POSITION " + str(i)) 47 | 48 | pos_based_read_dictionary = {} 49 | 50 | total = 0 51 | 52 | for ending_position in reads: 53 | for read in reads[ending_position]: 54 | 55 | cigar_list = read["cigar_list"] 56 | if len(cigar_list) == 0: 57 | # print("EXCEPTION: CIGAR LIST IS EMPTY") 58 | continue 59 | 60 | if read["pos"] >= i: 61 | #print("READ POSITION " + str(read["pos"]) + " IS GREATER THAN i=" + str(i)) 62 | continue 63 | 64 | total += 1 65 | 66 | block = cigar_list[0] 67 | op = block[1] 68 | 69 | if op == "S": 70 | 71 | del cigar_list[0] 72 | 73 | if not cigar_list: 74 | block = None 75 | else: 76 | block = cigar_list[0] 77 | op = block[1] 78 | 79 | elif op == "N": 80 | # if read["sequence"] == "ATTTTTCTGTTTCTCCCTCAATATCCACCTCATGGAAGTAGATATTCACTAGGTGATATTTTCTAGGCTCTCTTAA": 81 | # print("[NNNN i="+str(i)+"] N=" + str(block[0])+ " Updating pos from " + str(read["pos"])+ " to " + str(read["pos"] + (block[0]-1)), read["pos"], read) 82 | read["pos"] += block[0] 83 | del cigar_list[0] 84 | 85 | read["ref"] = None 86 | read["alt"] = None 87 | read["qual"] = DEFAULT_BASE_QUALITY 88 | 89 | continue 90 | 91 | if block is not None and op == "I": 92 | n = block[0] 93 | 94 | # if read["sequence"] == "GTTAATTTTAGAACATTATCATTCCAAAAAAGCAACTTCATAACATCTAGCAGTCACCTCCTTTCCCATTTCTAGC": 95 | # print("[INSERTION i="+str(i)+"] I=" + str(n)+ " Updating alignment_index from " + str(read["alignment_index"]) + " to " + str(read["alignment_index"] + n), read) 96 | 97 | read["alignment_index"] += n 98 | read["ref"] = None 99 | read["alt"] = read["sequence"][read["alignment_index"]] 100 | del cigar_list[0] 101 | 102 | if not cigar_list: 103 | block = None 104 | else: 105 | block = cigar_list[0] 106 | op = block[1] 107 | 108 | if block is not None: 109 | n = block[0] 110 | 111 | # D I M N S 112 | if op == "M": 113 | 114 | # if read["sequence"] == "GTTAATTTTAGAACATTATCATTCCAAAAAAGCAACTTCATAACATCTAGCAGTCACCTCCTTTCCCATTTCTAGC": 115 | # print("[MATCH i="+str(i)+"] M=" + str(n)+ " Updating alignment_index from " + str(read["alignment_index"]) + " to " + str(read["alignment_index"] + 1), read["pos"], read) 116 | 117 | read["pos"] += 1 118 | 119 | block[0] -= 1 120 | read["reference_index"] += 1 121 | read["alignment_index"] += 1 122 | 123 | # if DEBUG: 124 | # print(str(read["reference_index"]), read["reference"][read["reference_index"]], read) 125 | #if read["reference_index"] >= len(read["reference"]): print("i={} \nSEQ={} \nORG={}".format(read["reference_index"], read["reference"], read["object"].get_reference_sequence())) 126 | read["ref"] = read["reference"][read["reference_index"]] 127 | read["alt"] = read["sequence"][read["alignment_index"]] 128 | 129 | # if read["sequence"] == "ATTTTTCTGTTTCTCCCTCAATATCCACCTCATGGAAGTAGATATTCACTAGGTGATATTTTCTAGGCTCTCTTAA": 130 | # print("[MATCH i="+str(i)+"]", "pos="+str(read["pos"]), "ref=" + str(read["ref"]), "alt=" + str(read["alt"]), read) 131 | 132 | if block[0] == 0: 133 | del cigar_list[0] 134 | 135 | elif op == "D": 136 | # if read["sequence"] == "GAAATTTGAAGGTAGAATTGAATACAGATGAACCTCCAATGGTATTCAAGGCTCAGCTGTTTGCGTTGACTGGAGT": 137 | # print("[DELETION i="+str(i)+"] D=" + str(n)+ " Updating reference_index from " + str(read["reference_index"])+ " to " + str(read["reference_index"] + n), read["pos"], read) 138 | 139 | #read["reference_index"] += n # MODIFICATO E COMMENTATO IL 26/03/18 140 | 141 | read["pos"] += n 142 | # read["alignment_index"] += 1 143 | read["ref"] = None 144 | # read["ref"] = read["reference"][read["reference_index"]] 145 | read["alt"] = None 146 | del cigar_list[0] 147 | 148 | if read["query_qualities"] is not None: 149 | read["qual"] = read["query_qualities"][read["alignment_index"]] 150 | 151 | p = read["pos"] 152 | if p not in pos_based_read_dictionary: pos_based_read_dictionary[p] = [] 153 | pos_based_read_dictionary[p].append(read) 154 | 155 | if DEBUG: 156 | print("[INFO] READS UPDATED IN POSITION " + str(i) + ":" + str(total)) 157 | 158 | return pos_based_read_dictionary 159 | 160 | def get_column(pos_based_read_dictionary, reads, splice_positions, last_chr, omopolymeric_positions, target_positions, i): 161 | 162 | if splice_positions: 163 | if i in splice_positions[last_chr]: 164 | if VERBOSE: 165 | sys.stderr.write("[DEBUG] [SPLICE_SITE] Discarding position ({}, {}) because in splice site\n".format(last_chr, i)) 166 | return None 167 | 168 | if omopolymeric_positions: 169 | if i in omopolymeric_positions[last_chr]: 170 | if VERBOSE: 171 | sys.stderr.write("[DEBUG] [OMOPOLYMERIC] Discarding position ({}, {}) because omopolymeric\n".format(last_chr, i)) 172 | return None 173 | 174 | if target_positions: 175 | if (last_chr in target_positions and i not in target_positions[last_chr]) or ("chr"+last_chr in target_positions and i not in target_positions["chr"+last_chr]): 176 | if VERBOSE: 177 | sys.stderr.write("[DEBUG] [TARGET POSITIONS] Discarding position ({}, {}) because not in target positions\n".format(last_chr, i)) 178 | return None 179 | 180 | # edits = {"T": [], "A": [], "C": [], "G": [], "N": []} 181 | edits_no = 0 182 | edits = [] 183 | ref = None 184 | 185 | # r1r2distribution = Counter() 186 | r1r2distribution = defaultdict(int) 187 | 188 | strand_column = [] 189 | qualities = [] 190 | for key in reads: 191 | for read in reads[key]: 192 | 193 | # if DEBUG: 194 | # print("GET_COLUMN Q_NAME="+ str(read["object"].query_name)+ " READ1=" + str(read["object"].is_read1) + " REVERSE=" + str(read["object"].is_reverse) + " i="+str(i) + " READ=" + str(read)) 195 | 196 | # Filter the reads by positions 197 | # if not filter_base(read): 198 | # continue 199 | 200 | pos = read["alignment_index"] 201 | 202 | # Se il carattere e' nelle prime X posizioni della read 203 | if pos < MIN_BASE_POSITION: 204 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_POSITION]\n") 205 | continue 206 | 207 | # Se il carattere e' nelle ultime Y posizioni della read 208 | if read["length"] - pos < MAX_BASE_POSITION: 209 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MAX_BASE_POSITION]\n") 210 | continue 211 | 212 | # Se la qualita' e' < Q 213 | # if read["query_qualities"][read["alignment_index"]] < MIN_BASE_QUALITY: 214 | if read["qual"] < MIN_BASE_QUALITY: 215 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_QUALITY] {} {} {} {} {}\n".format(str(read["query_qualities"]), pos, str(read["query_qualities"][pos]), MIN_BASE_QUALITY, read)) 216 | continue 217 | 218 | # elif read["positions"][read["index"]] != i: 219 | if read["pos"] != i: 220 | if DEBUG: 221 | print("[OUT_OF_RANGE] SKIPPING READ i=" + str(i) + " but READ=" + str(read["pos"])) 222 | continue 223 | 224 | if DEBUG: 225 | print("GET_COLUMN Q_NAME="+ str(read["object"].query_name)+ " READ1=" + str(read["object"].is_read1) + " REVERSE=" + str(read["object"].is_reverse) + " i="+str(i) + " READ=" + str(read)) 226 | 227 | # j = read["alignment_index"] 228 | # if DEBUG: 229 | # print("GET_COLUMN_OK i="+str(i) + " ALT="+read["sequence"][j]+" READ=" + str(read)) 230 | 231 | # ref = read["reference"][read["reference_index"]].upper() 232 | # if j >= len(read["sequence"]): 233 | # print("GET_COLUMN_STRANGE i="+str(i) + " j="+str(j)+" orig="+str(read["alignment_index"])+" READ=" + str(read)) 234 | # alt = read["sequence"][j] 235 | 236 | if read["ref"] == None: 237 | if DEBUG: 238 | print("[INVALID] SKIPPING READ i=" + str(i) + " BECAUSE REF is None", read) 239 | continue 240 | if read["alt"] == None: 241 | if DEBUG: 242 | print("[INVALID] SKIPPING READ i=" + str(i) + " BECAUSE ALT is None", read) 243 | continue 244 | 245 | # passed += 1 246 | 247 | # if passed > 8000: 248 | # break 249 | 250 | ref = read["ref"].upper() 251 | alt = read["alt"].upper() 252 | 253 | if DEBUG: 254 | print("\tBEF={} {}".format(ref, alt)) 255 | 256 | if ref == "N" or alt == "N": 257 | continue 258 | 259 | # print(read["pos"], ref, alt, strand, strand == 1, read["object"].is_read1, read["object"].is_read2, read["object"].is_reverse ) 260 | #ref, alt = fix_strand(read, ref, alt) 261 | 262 | if DEBUG: 263 | print("\tLAT={} {}".format(ref, alt)) 264 | 265 | edits.append(alt) 266 | 267 | # q = read["query_qualities"][read["alignment_index"]] 268 | q = read["qual"] 269 | qualities.append(q) 270 | 271 | strand_column.append(read["strand"]) 272 | # strand_column.append(get_strand(read)) 273 | 274 | if alt != ref: 275 | edits_no += 1 276 | 277 | r1r2distribution[("R1" if read["object"].is_read1 else "R2") + ("-REV" if read["object"].is_reverse else "")] += 1 278 | 279 | if not IS_DNA: 280 | vstrand = 2 281 | if strand != 0: 282 | vstrand = vstand(''.join(strand_column)) 283 | if vstrand == "+": vstrand = 1 284 | elif vstrand == "-": vstrand = 0 285 | elif vstrand == "*": vstrand = 2 286 | 287 | if vstrand == 0: 288 | edits = complement_all(edits) 289 | ref = complement(ref) 290 | 291 | if vstrand in [0, 1] and strand_correction: 292 | edits, strand_column, qualities, qualities_positions = normByStrand(edits, strand_column, qualities, vstrand) 293 | 294 | if DEBUG: 295 | print(vstrand, ''.join(strand_column)) 296 | else: 297 | vstrand = "*" 298 | 299 | if DEBUG: 300 | print(r1r2distribution) 301 | # counter = defaultdict(str) 302 | # for e in edits: counter[e] += 1 303 | # print(Counter(edits)) 304 | 305 | # if i == 62996785: 306 | # print(edits, strand_column, len(qualities), qualities) 307 | 308 | passed = len(edits) 309 | 310 | # counter = Counter(edits) 311 | counter = defaultdict(int) 312 | for e in edits: counter[e] += 1 313 | 314 | # print(Counter(edits), counter) 315 | 316 | mean_q = 0 317 | if DEBUG: 318 | print("Qualities[i="+str(i)+"]="+str(qualities)) 319 | 320 | if len(qualities) > 0: 321 | #mean_q = numpy.mean(qualities) 322 | mean_q = float(sum(qualities)) / max(len(qualities), 1) 323 | 324 | # If all the reads are concordant 325 | #if counter[ref] > 0 and len(counter) == 1: 326 | # return None 327 | 328 | if len(counter) == 0: 329 | if VERBOSE: 330 | sys.stderr.write("[VERBOSE] [EMPTY] Discarding position ({}, {}) because the associated counter is empty\n".format(last_chr, i)) 331 | return None 332 | 333 | # [A,C,G,T] 334 | distribution = [counter['A'] if 'A' in counter else 0, 335 | counter['C'] if 'C' in counter else 0, 336 | counter['G'] if 'G' in counter else 0, 337 | counter['T'] if 'T' in counter else 0] 338 | ref_count = counter[ref] if ref in counter else 0 339 | 340 | non_zero = 0 341 | for el in counter: 342 | if el != ref and counter[el] > 0: 343 | non_zero += 1 344 | 345 | variants = [] 346 | # most_common = None 347 | ratio = 0.0 348 | # most_common = [] 349 | # most_common_value = -1 350 | # for el in counter: 351 | # value = counter[el] 352 | # if value > most_common_value: 353 | # most_common_value = value 354 | # most_common = [] 355 | # if value == most_common_value: 356 | # most_common.append((el, value)) 357 | 358 | # for el in Counter(edits).most_common(): 359 | for el in sorted(counter.items(), key=lambda x: x[1], reverse=True): 360 | if el[0] == ref: continue 361 | else: 362 | variants.append(el[0]) 363 | # most_common = el 364 | if ratio == 0.0: 365 | ratio = (float)(el[1]) / (el[1] + ref_count) 366 | 367 | # ratio = 0.0 368 | # if most_common is not None: 369 | # ratio = (float)(most_common[1]) / (most_common[1] + ref_count) 370 | 371 | # if passed > 0: 372 | # print("REF=" + ref) 373 | # print(passed) 374 | # print(edits) 375 | # print(counter) 376 | # print("MOST FREQUENT EDITS=" + str(counter.most_common())) 377 | # print("MOST COMMON=" + str(most_common)) 378 | # print(numpy.mean(counter.values())) 379 | # print(distribution) 380 | # print(qualities) 381 | # print(mean_q) 382 | # print("REF COUNT=" + str(ref_count)) 383 | # print("ALT/REF % = " + str(ratio)) 384 | # raw_input("Press a key:") 385 | 386 | edits_info = { 387 | "edits": edits, 388 | "distribution": distribution, 389 | "mean_quality": mean_q, 390 | "counter": counter, 391 | "non_zero": non_zero, 392 | "edits_no": edits_no, 393 | "ref": ref, 394 | "variants": variants, 395 | "frequency": ratio, 396 | "passed": passed, 397 | "strand": vstrand 398 | } 399 | 400 | # Check that the column passes the filters 401 | if not filter_column(edits_info, i): return None 402 | 403 | # if edits_no > 5: 404 | # print(str(i) + ":" + str(edits_info)) 405 | # raw_input("[ALERT] Press enter to continue...") 406 | 407 | return edits_info; 408 | 409 | def normByStrand(seq_, strand_, squal_, mystrand_): 410 | 411 | st='+' 412 | if mystrand_== 0: st='-' 413 | seq,strand,qual,squal=[],[],[],'' 414 | for i in range(len(seq_)): 415 | if strand_[i]==st: 416 | seq.append(seq_[i]) 417 | strand.append(strand_[i]) 418 | qual.append(squal_[i]) 419 | squal+=chr(squal_[i]) 420 | return seq,strand,qual,squal 421 | 422 | # def fix_strand(read, ref, alt): 423 | # global strand 424 | # 425 | # raw_read = read["object"] 426 | # 427 | # if (strand == 1 and ((raw_read.is_read1 and raw_read.is_reverse) or (raw_read.is_read2 and not raw_read.is_reverse))) or (strand == 2 and ((raw_read.is_read1 and not raw_read.is_reverse) or (raw_read.is_read2 and raw_read.is_reverse))): 428 | # return ref, complement(alt) 429 | # 430 | # return ref, alt 431 | def get_strand(read): 432 | global strand 433 | 434 | raw_read = read["object"] 435 | 436 | if (strand == 1 and ((raw_read.is_read1 and raw_read.is_reverse) or (raw_read.is_read2 and not raw_read.is_reverse))) or (strand == 2 and ((raw_read.is_read1 and not raw_read.is_reverse) or (raw_read.is_read2 and raw_read.is_reverse))): 437 | return "-" 438 | 439 | return "+" 440 | 441 | def filter_read(read): 442 | 443 | # if DEBUG: 444 | # print("[FILTER_READ] F={} QC={} MP={} LEN={} SECOND={} SUPPL={} DUPL={} READ={}".format(read.flag, read.is_qcfail, read.mapping_quality, read.query_length, read.is_secondary, read.is_supplementary, read.is_duplicate, read)) 445 | 446 | # Get the flag of the read 447 | f = read.flag 448 | 449 | # if strict_mode: 450 | # try: 451 | # NM = read.get_tag("NM") 452 | # if NM == 0: 453 | # # print("SKIPPING", MD_value, read.query_sequence, read.reference_start) 454 | # return True 455 | # except KeyError: 456 | # pass 457 | 458 | # if strict_mode: 459 | # MD = read.get_tag("MD") 460 | # # print(MD, read.get_reference_sequence(), read.reference_start) 461 | # # MD = MD.split(":")[1] 462 | # try: 463 | # MD_value = int(MD) 464 | # # print("SKIPPING", MD_value, read.query_sequence, read.reference_start) 465 | # return True 466 | # except ValueError: 467 | # # print("NO MD VALUE") 468 | # pass 469 | 470 | # Se la read non e' mappata (FLAG 77 o 141) 471 | if f == 77 or f == 141: 472 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [NOT_MAPPED] f={}\n".format(str(f))) 473 | return False 474 | 475 | # Se la read non passa i quality controls (FLAG 512) 476 | if read.is_qcfail: 477 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [QC_FAIL]\n") 478 | return False 479 | 480 | # Se la read ha un MAPQ < di 30 481 | if read.mapping_quality < MIN_QUALITY: 482 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [MAPQ] {} MIN={}\n".format(read.mapping_quality, MIN_QUALITY)) 483 | return False 484 | 485 | # Se la read ha una lunghezza < XX 486 | if read.query_length < MIN_READ_LENGTH: 487 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [MIN_READ_LENGTH] {} MIN={}\n".format(read.query_length, MIN_READ_LENGTH)) 488 | return False 489 | 490 | # Se la read non mappa in modo unico (FLAG 256 o 2048) 491 | if read.is_secondary or read.is_supplementary: 492 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [IS_SECONDARY][IS_SUPPLEMENTARY]\n") 493 | return False 494 | 495 | # Se la read e' un duplicato di PCR (FLAG 1024) 496 | if read.is_duplicate: 497 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [IS_DUPLICATE]\n") 498 | return False 499 | 500 | # Se la read e' paired-end ma non mappa in modo proprio (FLAG diversi da 99/147(+-) o 83/163(-+)) 501 | # 99 = 1+2+32+64 = PAIRED+PROPER_PAIR+MREVERSE+READ1 (+-) 502 | if read.is_paired and not (f == 99 or f == 147 or f == 83 or f == 163): 503 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [NOT_PROPER]\n") 504 | return False 505 | 506 | if read.has_tag('SA'): 507 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED FILTER [CHIMERIC_READ]\n") 508 | return False 509 | 510 | return True 511 | 512 | def filter_base(read): 513 | 514 | # pos = read["index"] 515 | pos = read["alignment_index"] 516 | 517 | # Se il carattere e' nelle prime X posizioni della read 518 | if pos < MIN_BASE_POSITION: 519 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_POSITION]\n") 520 | return False 521 | 522 | # Se il carattere e' nelle ultime Y posizioni della read 523 | if read["length"] - pos < MAX_BASE_POSITION: 524 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MAX_BASE_POSITION]\n") 525 | return False 526 | 527 | # Se la qualita' e' < Q 528 | # if read["query_qualities"][read["alignment_index"]] < MIN_BASE_QUALITY: 529 | if "qual" not in read: 530 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [QUAL MISSING] {} {}\n".format(pos, read)) 531 | return False 532 | 533 | if read["qual"] < MIN_BASE_QUALITY: 534 | if VERBOSE: sys.stderr.write("[DEBUG] APPLIED BASE FILTER [MIN_BASE_QUALITY] {} {} {} {} {}\n".format(str(read["query_qualities"]), pos, str(read["query_qualities"][pos]), MIN_BASE_QUALITY, read)) 535 | return False 536 | 537 | return True 538 | 539 | def filter_column(column, i): 540 | 541 | edits = column["edits"] 542 | 543 | if column["mean_quality"] < MIN_QUALITY: 544 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} {} [MIN_MEAN_COLUMN_QUALITY]\n".format(i, column)) 545 | return False 546 | 547 | # Se il numero di caratteri e' < X 548 | if len(edits) < MIN_COLUMN_LENGTH: 549 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} {} [MIN_COLUMN_LENGTH]\n".format(i, len(edits))) 550 | return False 551 | 552 | counter = column["counter"] 553 | ref = column["ref"] 554 | 555 | # (per ogni variazione) se singolarmente il numero delle basi che supportano la variazione e' < X 556 | for edit in counter: 557 | if edit != ref and counter[edit] < MIN_EDITS_SINGLE: 558 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} c({})={} [MIN_EDITS_SINGLE] {}\n".format(i, edit, counter[edit], counter)) 559 | return False 560 | 561 | # Se esistono multipli cambi rispetto al reference 562 | if len(counter.keys()) > MAX_CHANGES: 563 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} changes={} [MULTIPLE_CHANGES] {}\n".format(i, len(counter.keys()), column)) 564 | return False 565 | 566 | # Se tutte le sostituzioni sono < Y 567 | if column["edits_no"] < MIN_EDITS_NO: 568 | if VERBOSE: sys.stderr.write("[DEBUG] DISCARDING COLUMN i={} {} [MIN_EDITS_NO]\n".format(i, column["edits_no"])) 569 | return False 570 | 571 | return True 572 | 573 | def load_omopolymeric_positions(positions, input_file, region): 574 | if input_file is None: return 575 | 576 | sys.stderr.write("Loading omopolymeric positions from file {}\n".format(input_file)) 577 | 578 | chromosome = None 579 | start = None 580 | end = None 581 | 582 | if region is not None: 583 | if len(region) >= 1: 584 | chromosome = region[0] 585 | if len(region) >= 2: 586 | start = region[1] 587 | if len(region) >= 3: 588 | end = region[2] 589 | 590 | lines_read = 0 591 | total = 0 592 | 593 | print("Loading omopolymeric positions of {} between {} and {}".format(chromosome, start, end)) 594 | 595 | try: 596 | reader = open(input_file, "r") 597 | 598 | for line in reader: 599 | if line.startswith("#"): 600 | continue 601 | 602 | lines_read += 1 603 | if lines_read % 500000 == 0: 604 | sys.stderr.write("{} lines read.\n".format(lines_read)) 605 | 606 | fields = line.rstrip().split("\t") 607 | if chromosome is None or fields[0] == chromosome: 608 | chrom = fields[0] 609 | f = int(fields[1]) 610 | t = int(fields[2]) 611 | 612 | if start is not None: f = max(start, f) 613 | if end is not None: t = min(t, end) 614 | 615 | # print("POSITION {} {} {} {} {} {}".format(str(fields), chromosome, f, t, start, end)) 616 | 617 | if chrom not in positions: 618 | positions[chrom] = SortedSet() 619 | 620 | for i in range(f, t): 621 | positions[chrom].add(i) 622 | total += 1 623 | 624 | elif positions: 625 | break 626 | 627 | reader.close() 628 | except IOError as e: 629 | sys.stderr.write("[{}] Omopolymeric positions file not found at {}. Error: {}\n".format(region, input_file, e)) 630 | 631 | sys.stderr.write("[{}] {} total omopolymeric positions found.\n".format(region, total)) 632 | 633 | def load_chromosome_names(index_file): 634 | names = [] 635 | 636 | with open(index_file, "r") as lines: 637 | for line in lines: 638 | names.append(line.split("\t")[0]) 639 | 640 | return names 641 | 642 | def load_splicing_file(splicing_file): 643 | splice_positions = {} 644 | 645 | if splicing_file is None: return splice_positions 646 | 647 | sys.stderr.write('Loading known splice sites from file {}\n'.format(splicing_file)) 648 | 649 | if splicing_file.endswith("gz"): f = gzip.open(splicing_file, "r") 650 | else: f = open(splicing_file, "r") 651 | 652 | total = 0 653 | total_array = {} 654 | 655 | for i in f: 656 | l=(i.strip()).split() 657 | chrom = l[0] 658 | 659 | if chrom not in splice_positions: 660 | splice_positions[chrom] = SortedSet() 661 | total_array[chrom] = 0 662 | 663 | st,tp,cc = l[4], l[3], int(l[1]) 664 | 665 | total += SPLICING_SPAN 666 | total_array[chrom] += SPLICING_SPAN 667 | 668 | if st=='+' and tp=='D': 669 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc+(j+1)) 670 | if st=='+' and tp=='A': 671 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc-(j+1)) 672 | if st=='-' and tp=='D': 673 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc-(j+1)) 674 | if st=='-' and tp=='A': 675 | for j in range(SPLICING_SPAN): splice_positions[chrom].add(cc+(j+1)) 676 | 677 | f.close() 678 | 679 | sys.stderr.write('Loaded {} positions from file {}\n'.format(total, splicing_file)) 680 | sys.stderr.write('\tPartial:{}\n'.format(total_array)) 681 | 682 | return splice_positions 683 | 684 | def create_omopolymeric_positions(reference_file, omopolymeric_file): 685 | 686 | tic = datetime.datetime.now() 687 | 688 | sys.stderr.write("Creating omopolymeric positions (span={}) from reference file {}\n".format(OMOPOLYMERIC_SPAN, reference_file)) 689 | 690 | index_file = reference_file + ".fai" 691 | sys.stderr.write("Loading chromosome names from index file {}\n".format(index_file)) 692 | chromosomes = load_chromosome_names(index_file) 693 | sys.stderr.write("{} chromosome names found\n".format(str(len(chromosomes)))) 694 | 695 | positions = [] 696 | 697 | try: 698 | # Opening reference fasta file 699 | sys.stderr.write("Opening reference file {}.\n".format(reference_file)) 700 | fasta_reader = pysam.FastaFile(reference_file) 701 | sys.stderr.write("Reference file {} opened.\n".format(reference_file)) 702 | 703 | for chromosome in chromosomes: 704 | sys.stderr.write("Loading reference sequence for chromosome {}\n".format(chromosome)) 705 | sequence = fasta_reader.fetch(chromosome).lower() 706 | sys.stderr.write("Reference sequence for chromosome {} loaded (len: {})\n".format(chromosome, len(sequence))) 707 | 708 | equals = 0 709 | last = None 710 | for i, b in enumerate(sequence): 711 | 712 | # if chromosome == "chr18" and i > 190450 and i < 190500: 713 | # print(i, b, last, OMOPOLYMERIC_SPAN, sequence[190450:190480]) 714 | 715 | if b == last: 716 | equals += 1 717 | else: 718 | if equals >= OMOPOLYMERIC_SPAN: 719 | # sys.stderr.write("Found a new omopolymeric interval: ({}, {}-{}): {}\n".format(chromosome, i-equals, i, sequence[i-equals:i])) 720 | positions.append((chromosome, i-equals, i, equals, last)) 721 | 722 | equals = 1 723 | 724 | last = b 725 | 726 | # Closing 727 | fasta_reader.close() 728 | sys.stderr.write("Reference file {} closed.\n".format(reference_file)) 729 | 730 | except ValueError as e: 731 | sys.stderr.write("Error in reading reference file {}: message={}\n".format(reference_file, e)) 732 | except IOError: 733 | sys.stderr.write("The reference file {} could not be opened.\n".format(reference_file)) 734 | 735 | sys.stderr.write("{} total omopolymeric positions found.\n".format(len(positions))) 736 | 737 | toc = datetime.datetime.now() 738 | sys.stderr.write("Time to produce all the omopolymeric positions: {}\n".format(toc-tic)) 739 | 740 | sys.stderr.write("Writing omopolymeric positions to file: {}.\n".format(omopolymeric_file)) 741 | writer = open(omopolymeric_file, "w") 742 | writer.write("#" + "\t".join(["Chromomosome", "Start", "End", "Length", "Symbol"]) + "\n") 743 | for position in positions: 744 | writer.write("\t".join([str(x) for x in position]) + "\n") 745 | writer.close() 746 | sys.stderr.write("Omopolymeric positions written into file: {}.\n".format(omopolymeric_file)) 747 | 748 | def init(samfile, region): 749 | 750 | print("Opening bamfile within region=" + str(region)) 751 | 752 | if region is None or len(region) == 0: 753 | return samfile.fetch() 754 | 755 | if len(region) == 1: 756 | try: 757 | return samfile.fetch(region[0]) 758 | except ValueError: 759 | return samfile.fetch(region[0].replace("chr", "")) 760 | 761 | else: 762 | try: 763 | return samfile.fetch(region[0], region[1], region[2]) 764 | except ValueError: 765 | return samfile.fetch(region[0].replace("chr", ""), region[1], region[2]) 766 | 767 | def within_interval(i, region): 768 | 769 | if region is None or len(region) <= 1: 770 | return True 771 | 772 | else: 773 | start = region[1] 774 | end = region[2] 775 | return i >= start and i <= end 776 | 777 | def get_header(): 778 | return ["Region", "Position", "Reference", "Strand", "Coverage-q30", "MeanQ", "BaseCount[A,C,G,T]", "AllSubs", "Frequency", "gCoverage-q30", "gMeanQ", "gBaseCount[A,C,G,T]", "gAllSubs", "gFrequency"] 779 | 780 | from collections import Counter 781 | import pickle 782 | def load_target_positions(bed_file, region): 783 | print("Loading target positions from file {} (region:{})".format(bed_file, region)) 784 | 785 | # if os.path.exists(bed_file + "save.p"): 786 | # return pickle.load(open( bed_file + "save.p", "rb" )) 787 | 788 | target_positions = {} 789 | 790 | extension = os.path.splitext(bed_file)[1] 791 | handler = None 792 | if extension == ".gz": 793 | handler = gzip.open(bed_file, "r") 794 | else: 795 | handler = open(bed_file, "r") 796 | 797 | read = 0 798 | total_positions = 0 799 | total = Counter() 800 | with handler as file: 801 | for line in file: 802 | read += 1 803 | fields = line.strip().split("\t") 804 | chr = fields[0] 805 | if read % 10000000 == 0: print("[{1}] {0} total lines read. Total positions: {2}".format(read, datetime.datetime.now(), total_positions)) 806 | 807 | if region != None and chr.replace("chr", "") != region[0].replace("chr", ""): continue 808 | 809 | start = int(fields[1])-1 810 | 811 | try: 812 | end = int(fields[2])-1 813 | except: 814 | end = start # In case the file has 2 columns only or the third column is not an integer 815 | 816 | intersection_start = max(region[1] if region is not None and len(region)>1 else 0, start) 817 | intersection_end = min(region[2] if region is not None and len(region)>2 else sys.maxint, end) 818 | 819 | 820 | # If the target region does not intersect the currently analyzed region 821 | if intersection_end < intersection_start: continue 822 | 823 | # print(line, chr, start, end, intersection_start, intersection_end, total) 824 | 825 | # Add target positions 826 | if chr not in target_positions: target_positions[chr] = SortedSet() 827 | for i in range(intersection_start, intersection_end+1): 828 | 829 | target_positions[chr].add(i) 830 | total[chr] += 1 831 | total_positions += 1 832 | 833 | print("### TARGET POSITIONS ###") 834 | print(total) 835 | print("TOTAL POSITIONS:", sum(total.values())) 836 | # pickle.dump(target_positions, open( bed_file + "save.p", "wb" ) ) 837 | 838 | return target_positions 839 | 840 | def analyze(options): 841 | 842 | global DEBUG 843 | global activate_debug 844 | 845 | print("[SYSTEM]", "PYSAM VERSION", pysam.__version__) 846 | print("[SYSTEM]", "PYSAM PATH", pysam.__path__) 847 | 848 | interface = 'ib0' if 'ib0' in netifaces.interfaces() else netifaces.interfaces()[0] 849 | hostname = socket.gethostbyaddr(netifaces.ifaddresses(interface)[netifaces.AF_INET][0]['addr']) 850 | pid = os.getpid() 851 | hostname_string = hostname[0] + "|" + hostname[2][0] + "|" + str(pid) 852 | 853 | bamfile = options["bamfile"] 854 | region = options["region"] 855 | reference_file = options["reference"] 856 | output = options["output"] 857 | append = options["append"] 858 | omopolymeric_file = options["omopolymeric_file"] 859 | splicing_file = options["splicing_file"] 860 | create_omopolymeric_file = options["create_omopolymeric_file"] 861 | bed_file = options["bed_file"] if "bed_file" in options else None 862 | 863 | LAUNCH_TIME = datetime.datetime.now() 864 | print("[INFO] ["+str(region)+"] START=" + str(LAUNCH_TIME)) 865 | 866 | print("[INFO] Opening BAM file="+bamfile) 867 | samfile = pysam.AlignmentFile(bamfile, "rb") 868 | 869 | target_positions = {} 870 | if bed_file is not None: 871 | target_positions = load_target_positions(bed_file, region) 872 | 873 | omopolymeric_positions = {} 874 | if create_omopolymeric_file is True: 875 | if omopolymeric_file is not None: 876 | create_omopolymeric_positions(reference_file, omopolymeric_file) 877 | else: 878 | print("[ERROR] You asked to create the omopolymeric file, but you did not specify any output file. Exiting.") 879 | return 880 | 881 | load_omopolymeric_positions(omopolymeric_positions, omopolymeric_file, region) 882 | # if not omopolymeric_positions and omopolymeric_file is not None: 883 | # omopolymeric_positions = create_omopolymeric_positions(reference_file, omopolymeric_file) 884 | 885 | splice_positions = [] 886 | 887 | if splicing_file: 888 | splice_positions = load_splicing_file(splicing_file) 889 | 890 | # Constants 891 | LAST_READ = None 892 | LOG_INTERVAL = 10000000 893 | 894 | # Take the time 895 | tic = datetime.datetime.now() 896 | first_tic = tic 897 | 898 | total = 0 899 | 900 | reads = dict() 901 | 902 | outputfile = None 903 | 904 | if output is not None: 905 | outputfile = output 906 | else: 907 | prefix = os.path.basename(bamfile) 908 | if region is not None: 909 | prefix += "_" + '_'.join([str(x) for x in region]) 910 | outputfile = prefix + "_reditools2_table.gz" 911 | 912 | mode = "a" if append else "w" 913 | 914 | if outputfile.endswith("gz"): writer = gzip.open(outputfile, mode) 915 | else: writer = open(outputfile, mode) 916 | 917 | if not options["remove_header"]: 918 | writer.write("\t".join(get_header()) + "\n") 919 | 920 | # Open the iterator 921 | print("[INFO] Fetching data from bam {}".format(bamfile)) 922 | print("[INFO] Narrowing REDItools to region {}".format(region)) 923 | sys.stdout.flush() 924 | 925 | reference_reader = None 926 | if reference_file is not None: reference_reader = pysam.FastaFile(reference_file) 927 | chr_ref = None 928 | 929 | iterator = init(samfile, region) 930 | 931 | next_read = next(iterator, LAST_READ) 932 | if next_read is not None: 933 | # next_pos = next_read.get_reference_positions() 934 | # i = next_pos[0] 935 | i = next_read.reference_start 936 | 937 | total += 1 938 | 939 | read = None 940 | # pos = None 941 | last_chr = None 942 | finished = False 943 | 944 | DEBUG_START = region[1] if region is not None and len(region) > 1 else -1 945 | DEBUG_END = region[2] if region is not None and len(region) > 2 else -1 946 | STOP = -1 947 | 948 | while not finished: 949 | 950 | if activate_debug and DEBUG_START > 0 and i >= DEBUG_START-1: DEBUG = True 951 | if activate_debug and DEBUG_END > 0 and i >= DEBUG_END: DEBUG = False 952 | if STOP > 0 and i > STOP: break 953 | 954 | # if i>=46958774: 955 | # print(next_read) 956 | # print_reads(reads, i) 957 | # raw_input() 958 | 959 | if (next_read is LAST_READ and len(reads) == 0) or (region is not None and len(region) >= 3 and i > region[2]): 960 | print("NO MORE READS!") 961 | finished = True 962 | break 963 | 964 | # Jump if we consumed all the reads 965 | if len(reads) == 0: 966 | i = next_read.reference_start 967 | # print("[INFO] READ SET IS EMPTY. JUMP TO "+str(next_pos[0])+"!") 968 | # if len(next_pos) == 0: i = next_read.reference_start 969 | # else: i = next_pos[0] 970 | 971 | # print("P1", next_read.query_name, next_pos) 972 | 973 | # Get all the next read(s) 974 | #while next_read is not LAST_READ and (len(next_pos) > 0 and (next_pos[0] == i or next_pos[-1] == i)): # TODO: why or next_pos[-1] == i? 975 | while next_read is not LAST_READ and next_read.reference_start == i: 976 | 977 | read = next_read 978 | # pos = next_pos 979 | 980 | # When changing chromosome print some statistics 981 | if read is not LAST_READ and read.reference_name != last_chr: 982 | 983 | try: 984 | chr_ref = reference_reader.fetch(read.reference_name) 985 | except KeyError: 986 | chr_ref = reference_reader.fetch("chr" + read.reference_name) 987 | 988 | tac = datetime.datetime.now() 989 | print("[INFO] REFERENCE NAME=" + read.reference_name + " (" + str(tac) + ")\t["+delta(tac, tic)+"]") 990 | sys.stdout.flush() 991 | tic = tac 992 | 993 | last_chr = read.reference_name 994 | 995 | next_read = next(iterator, LAST_READ) 996 | if next_read is not LAST_READ: 997 | total += 1 998 | # next_pos = next_read.get_reference_positions() 999 | 1000 | if total % LOG_INTERVAL == 0: 1001 | print("[{}] [{}] [{}] Total reads loaded: {} [{}] [RAM:{}MB]".format(hostname_string, last_chr, region, total, datetime.datetime.now(), psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024))) 1002 | sys.stdout.flush() 1003 | 1004 | # print("P2", next_read.query_name, next_read.get_reference_positions()) 1005 | 1006 | #print("[INFO] Adding a read to the set=" + str(read.get_reference_positions())) 1007 | 1008 | # Check that the read passes the filters 1009 | if not filter_read(read): continue 1010 | 1011 | # ref_seq = read.get_reference_sequence() 1012 | 1013 | ref_pos = [x[1] for x in read.get_aligned_pairs() if x[0] is not None and x[1] is not None] 1014 | ref_seq = ''.join([chr_ref[x] for x in ref_pos]).upper() 1015 | 1016 | # if ref_seq != read.get_reference_sequence().upper(): 1017 | # print("MY_REF={} \nPY_REF={} \nREAD_NAME={} \nPOSITIONS={} \nREAD={}\n--------------------------".format(ref_seq, read.get_reference_sequence().upper(), read.query_name, read.get_reference_positions(), read.query_sequence)) 1018 | 1019 | # raw_input() 1020 | 1021 | # if len(ref_seq) != len(read.query_sequence) or len(pos) != len(read.query_sequence) or len(pos) != len(ref_seq): 1022 | # print("=== DETAILS ===") 1023 | # print("i="+str(i)) 1024 | # print("ref_seq="+str(len(ref_seq))) 1025 | # print("seq="+str(len(read.query_sequence))) 1026 | # print("pos="+str(len(pos))) 1027 | # print("qual="+str(len(read.query_qualities))) 1028 | # print("index="+str(read.query_alignment_start)) 1029 | # print(ref_seq) 1030 | # print(read.query_sequence) 1031 | # print(pos) 1032 | # print(read.query_qualities) 1033 | 1034 | t = "*" 1035 | 1036 | if not IS_DNA: 1037 | if read.is_read1: 1038 | if strand == 1: 1039 | if read.is_reverse: t='-' 1040 | else: t='+' 1041 | else: 1042 | if read.is_reverse: t='+' 1043 | else: t='-' 1044 | elif read.is_read2: 1045 | if strand == 2: 1046 | if read.is_reverse: t='-' 1047 | else: t='+' 1048 | else: 1049 | if read.is_reverse: t='+' 1050 | else: t='-' 1051 | else: # for single ends 1052 | if strand == 1: 1053 | if read.is_reverse: t='-' 1054 | else: t='+' 1055 | else: 1056 | if read.is_reverse: t='+' 1057 | else: t='-' 1058 | 1059 | qualities = read.query_qualities 1060 | if qualities is None: qualities = [DEFAULT_BASE_QUALITY for x in range(0, len(ref_seq))] 1061 | 1062 | item = { 1063 | # "index": 0, 1064 | "pos": read.reference_start - 1, 1065 | # "pos": i-1, 1066 | "alignment_index": read.query_alignment_start - 1, 1067 | # "alignment_index": -1, 1068 | "reference_index": -1, 1069 | "query_alignment_start": read.query_alignment_start, 1070 | "object": read, 1071 | "reference": ref_seq, 1072 | "reference_len": len(ref_seq), 1073 | "sequence": read.query_sequence, 1074 | # "positions": pos, 1075 | "chromosome": read.reference_name, 1076 | "query_qualities": qualities, 1077 | "qualities_len": len(qualities), 1078 | "length": read.query_length, 1079 | "cigar": read.cigarstring, 1080 | "strand": t 1081 | } 1082 | 1083 | cigar_list = [[int(c), op] for (c, op) in re.findall('(\d+)(.)', item["cigar"])] 1084 | # if read.is_reverse: 1085 | # cigar_list.reverse() 1086 | item["cigar_list"] = cigar_list 1087 | 1088 | # if read.query_sequence == "AGGCTCTCTTAATGTAATAAAAGCCATCTATGACAAACCCACAGCCAACATAATACTGAATGGGGAAAAGGTGAAA": 1089 | # print(i, read.reference_start, item, read) 1090 | 1091 | # item["ref"] = item["reference"][item["reference_index"]] 1092 | # item["alt"] = item["sequence"][item["alignment_index"]] 1093 | # item["qual"] = item["query_qualities"][item["alignment_index"]] 1094 | 1095 | # print(item["cigar"]) 1096 | # print(item["cigar_list"]) 1097 | # print(read.get_aligned_pairs()) 1098 | # print("REF START = " + str(read.reference_start)) 1099 | # print("REF POS[0] = " + str(item["positions"][0])) 1100 | # print("ALIGN START = " + str(item["alignment_index"])) 1101 | # raw_input("CIGAR STRING PARSED...") 1102 | 1103 | # if item["cigar"] != "76M": 1104 | # item["pairs"] = read.get_aligned_pairs() 1105 | 1106 | # if read.query_sequence == "CACGGACTTTTCCTGAAATTTATTTTTATGTATGTATATCAAACATTGAATTTCTGTTTTCTTCTTTACTGGAATT" and pos[0] == 14233 and pos[-1] == 14308: 1107 | # print("[FILTER_READ] F={} QC={} MP={} LEN={} SECOND={} SUPPL={} DUPL={} PAIRED={} READ={}".format(read.flag, read.is_qcfail, read.mapping_quality, read.query_length, read.is_secondary, read.is_supplementary, read.is_duplicate, read.is_paired, read)) 1108 | # raw_input("SPECIAL READ...") 1109 | 1110 | # print(item) 1111 | # raw_input("Press enter to continue...") 1112 | 1113 | # print(item) 1114 | # if i > 15400000: 1115 | # print(read.reference_name, i) 1116 | # raw_input("Press enter to continue...") 1117 | 1118 | end_position = read.reference_end #pos[-1] 1119 | if end_position not in reads: 1120 | reads[end_position] = [] 1121 | 1122 | if DEBUG: 1123 | print("Adding item="+str(item)) 1124 | reads[end_position].append(item) 1125 | 1126 | # Debug purposes 1127 | # if DEBUG: 1128 | # print("BEFORE UPDATE (i="+str(i)+"):") 1129 | # print_reads(reads, i) 1130 | 1131 | pos_based_read_dictionary = update_reads(reads, i) 1132 | 1133 | column = get_column(pos_based_read_dictionary, reads, splice_positions, last_chr, omopolymeric_positions, target_positions, i) 1134 | 1135 | # Debug purposes 1136 | if DEBUG: 1137 | # print("AFTER UPDATE:"); 1138 | # print_reads(reads, i) 1139 | raw_input("Press enter to continue...") 1140 | 1141 | # Go the next position 1142 | i += 1 1143 | # print("Position i"+str(i)) 1144 | 1145 | # if DEBUG: 1146 | # print("[DEBUG] WRITING COLUMN IN POSITION {}: {}".format(i, column is not None)) 1147 | # print(column) 1148 | # print_reads(reads, i) 1149 | 1150 | if column is not None and within_interval(i, region) and not (strict_mode and column["non_zero"] == 0): 1151 | # head='Region\tPosition\tReference\tStrand\tCoverage-q%i\tMeanQ\tBaseCount[A,C,G,T]\t 1152 | # AllSubs\tFrequency\t 1153 | # gCoverage-q%i\tgMeanQ\tgBaseCount[A,C,G,T]\tgAllSubs\tgFrequency\n' %(MQUAL,gMQUAL) 1154 | # cov,bcomp,subs,freq=BaseCount(seq,ref,MINIMUM_EDITS_FREQUENCY,MIN_EDITS_SINGLE) 1155 | # mqua=meanq(qual,len(seq)) 1156 | # line='\t'.join([chr,str(pileupcolumn.pos+1),ref,mystrand,str(cov),mqua,str(bcomp),subs,freq]+['-','-','-','-','-'])+'\n' 1157 | # [A,C,G,T] 1158 | 1159 | writer.write("\t".join([ 1160 | last_chr, 1161 | str(i), 1162 | column["ref"], 1163 | str(column["strand"]), 1164 | str(column["passed"]), 1165 | "{0:.2f}".format(column["mean_quality"]), 1166 | str(column["distribution"]), 1167 | " ".join([column["ref"] + el for el in column["variants"]]) if column["non_zero"] >= 1 else "-", 1168 | "{0:.2f}".format(column["frequency"]), 1169 | "\t".join(['-','-','-','-','-']) 1170 | ]) + "\n") 1171 | # if column["passed"] >= 1000: print("WRITTEN LINE {} {} {} {} {}".format(last_chr, str(i), column["ref"], column["strand"], column["passed"])) 1172 | # writer.flush() 1173 | elif VERBOSE: 1174 | sys.stderr.write("[VERBOSE] [NOPRINT] Not printing position ({}, {}) WITHIN_INTERVAL={} STRICT_MODE={} COLUMN={}\n".format(last_chr, i, within_interval(i, region), strict_mode, column)) 1175 | 1176 | # Remove old reads 1177 | reads.pop(i-1, None) 1178 | 1179 | if reference_reader is not None: reference_reader.close() 1180 | samfile.close() 1181 | writer.close() 1182 | 1183 | tac = datetime.datetime.now() 1184 | print("[INFO] ["+hostname_string+"] ["+str(region)+"] " + str(total) + " total reads read") 1185 | print("[INFO] ["+hostname_string+"] ["+str(region)+"] END=" + str(tac) + "\t["+delta(tac, tic)+"]") 1186 | print("[INFO] ["+hostname_string+"] ["+str(region).ljust(50)+"] FINAL END=" + str(tac) + " START="+ str(first_tic) + "\t"+ str(region) +"\t[TOTAL COMPUTATION="+delta(tac, first_tic)+"] [LAUNCH TIME:"+str(LAUNCH_TIME)+"] [TOTAL RUN="+delta(tac, LAUNCH_TIME)+"] [READS="+str(total)+"]") 1187 | 1188 | complement_map = {"A":"T", "T":"A", "C":"G", "G":"C"} 1189 | def complement(b): 1190 | return complement_map[b] 1191 | 1192 | def complement_all(sequence): 1193 | return ''.join([complement_map[l] for l in sequence]) 1194 | 1195 | def prop(tot,va): 1196 | try: av=float(va)/tot 1197 | except: av=0.0 1198 | return av 1199 | 1200 | def vstand(strand): # strand='+-+-+-++++++-+++' 1201 | 1202 | vv=[(strand.count('+'),'+'),(strand.count('-'),'-'),(strand.count('*'),'*')] 1203 | if vv[0][0]==0 and vv[1][0]==0: return '*' 1204 | if use_strand_confidence: #flag che indica se usare il criterio 2, altrimenti usa il criterio 1 1205 | totvv=sum([x[0] for x in vv[:2]]) 1206 | if prop(totvv,vv[0][0])>=strand_confidence_value: return '+' # strand_confidence_value e' il valore soglia, compreso tra 0 e 1, default 0.7 1207 | if prop(totvv,vv[1][0])>=strand_confidence_value: return '-' 1208 | return '*' 1209 | else: 1210 | if vv[0][0]==vv[1][0] and vv[2][0]==0: return '+' 1211 | return max(vv)[1] 1212 | 1213 | def parse_options(): 1214 | 1215 | # Options parsing 1216 | parser = argparse.ArgumentParser(description='REDItools 2.0') 1217 | parser.add_argument('-f', '--file', help='The bam file to be analyzed') 1218 | parser.add_argument('-o', '--output-file', help='The output statistics file') 1219 | parser.add_argument('-S', '--strict', default=False, action='store_true', help='Activate strict mode: only sites with edits will be included in the output') 1220 | parser.add_argument('-s', '--strand', type=int, default=0, help='Strand: this can be 0 (unstranded), 1 (secondstrand oriented) or 2 (firststrand oriented)') 1221 | parser.add_argument('-a', '--append-file', action='store_true', help='Appends results to file (and creates if not existing)') 1222 | parser.add_argument('-r', '--reference', help='The reference FASTA file') 1223 | parser.add_argument('-g', '--region', help='The region of the bam file to be analyzed') 1224 | parser.add_argument('-m', '--omopolymeric-file', help='The file containing the omopolymeric positions') 1225 | parser.add_argument('-c', '--create-omopolymeric-file', default=False, help='Whether to create the omopolymeric span', action='store_true') 1226 | parser.add_argument('-os', '--omopolymeric-span', type=int, default=5, help='The omopolymeric span') 1227 | parser.add_argument('-sf', '--splicing-file', help='The file containing the splicing sites positions') 1228 | parser.add_argument('-ss', '--splicing-span', type=int, default=4, help='The splicing span') 1229 | parser.add_argument('-mrl', '--min-read-length', type=int, default=30, help='The minimum read length. Reads whose length is below this value will be discarded.') 1230 | parser.add_argument('-q', '--min-read-quality', type=int, default=20, help='The minimum read quality. Reads whose mapping quality is below this value will be discarded.') 1231 | parser.add_argument('-bq', '--min-base-quality', type=int, default=30, help='The minimum base quality. Bases whose quality is below this value will not be included in the analysis.') 1232 | parser.add_argument('-mbp', '--min-base-position', type=int, default=0, help='The minimum base position. Bases which reside in a previous position (in the read) will not be included in the analysis.') 1233 | parser.add_argument('-Mbp', '--max-base-position', type=int, default=0, help='The maximum base position. Bases which reside in a further position (in the read) will not be included in the analysis.') 1234 | parser.add_argument('-l', '--min-column-length', type=int, default=1, help='The minimum length of editing column (per position). Positions whose columns have length below this value will not be included in the analysis.') 1235 | parser.add_argument('-men', '--min-edits-per-nucleotide', type=int, default=1, help='The minimum number of editing for events each nucleotide (per position). Positions whose columns have bases with less than min-edits-per-base edits will not be included in the analysis.') 1236 | parser.add_argument('-me', '--min-edits', type=int, default=0, help='The minimum number of editing events (per position). Positions whose columns have bases with less than \'min-edits-per-base edits\' will not be included in the analysis.') 1237 | parser.add_argument('-Men', '--max-editing-nucleotides', type=int, default=100, help='The maximum number of editing nucleotides, from 0 to 4 (per position). Positions whose columns have more than \'max-editing-nucleotides\' will not be included in the analysis.') 1238 | parser.add_argument('-d', '--debug', default=False, help='REDItools is run in DEBUG mode.', action='store_true') 1239 | parser.add_argument('-T', '--strand-confidence', default=1, help='Strand inference type 1:maxValue 2:useConfidence [1]; maxValue: the most prominent strand count will be used; useConfidence: strand is assigned if over a prefixed frequency confidence (-TV option)') 1240 | parser.add_argument('-C', '--strand-correction', default=False, help='Strand correction. Once the strand has been inferred, only bases according to this strand will be selected.', action='store_true') 1241 | parser.add_argument('-Tv', '--strand-confidence-value', type=float, default=0.7, help='Strand confidence [0.70]') 1242 | parser.add_argument('-V', '--verbose', default=False, help='Verbose information in stderr', action='store_true') 1243 | parser.add_argument('-H', '--remove-header', default=False, help='Do not include header in output file', action='store_true') 1244 | parser.add_argument('-N', '--dna', default=False, help='Run REDItools 2.0 on DNA-Seq data', action='store_true') 1245 | parser.add_argument('-B', '--bed_file', help='Path of BED file containing target regions') 1246 | 1247 | args = parser.parse_known_args()[0] 1248 | # print(args) 1249 | 1250 | global activate_debug 1251 | activate_debug = args.debug 1252 | 1253 | global VERBOSE 1254 | VERBOSE = args.verbose 1255 | 1256 | bamfile = args.file 1257 | if bamfile is None: 1258 | print("[ERROR] An input bam file is mandatory. Please, provide one (-f|--file)") 1259 | exit(1) 1260 | 1261 | omopolymeric_file = args.omopolymeric_file 1262 | global OMOPOLYMERIC_SPAN 1263 | OMOPOLYMERIC_SPAN = args.omopolymeric_span 1264 | create_omopolymeric_file = args.create_omopolymeric_file 1265 | 1266 | reference_file = args.reference 1267 | if reference_file is None: 1268 | print("[ERROR] An input reference file is mandatory. Please, provide one (-r|--reference)") 1269 | exit(1) 1270 | 1271 | output = args.output_file 1272 | append = args.append_file 1273 | 1274 | global strict_mode 1275 | strict_mode = args.strict 1276 | 1277 | global strand 1278 | strand = args.strand 1279 | 1280 | global strand_correction 1281 | strand_correction = args.strand_correction 1282 | 1283 | global use_strand_confidence 1284 | use_strand_confidence = bool(args.strand_confidence) 1285 | 1286 | global strand_confidence_value 1287 | strand_confidence_value = float(args.strand_confidence_value) 1288 | 1289 | splicing_file = args.splicing_file 1290 | global SPLICING_SPAN 1291 | SPLICING_SPAN = args.splicing_span 1292 | 1293 | global MIN_READ_LENGTH 1294 | MIN_READ_LENGTH = args.min_read_length 1295 | 1296 | global MIN_QUALITY 1297 | MIN_QUALITY = args.min_read_quality 1298 | 1299 | global MIN_BASE_QUALITY 1300 | MIN_BASE_QUALITY = args.min_base_quality 1301 | 1302 | global DEFAULT_BASE_QUALITY 1303 | DEFAULT_BASE_QUALITY = 30 1304 | 1305 | global MIN_BASE_POSITION 1306 | MIN_BASE_POSITION = args.min_base_position 1307 | 1308 | global MAX_BASE_POSITION 1309 | MAX_BASE_POSITION = args.max_base_position 1310 | 1311 | global MIN_COLUMN_LENGTH 1312 | MIN_COLUMN_LENGTH = args.min_column_length 1313 | 1314 | global MIN_EDITS_SINGLE 1315 | MIN_EDITS_SINGLE = args.min_edits_per_nucleotide 1316 | 1317 | global MIN_EDITS_NO 1318 | MIN_EDITS_NO = args.min_edits 1319 | 1320 | global MAX_CHANGES 1321 | MAX_CHANGES = args.max_editing_nucleotides 1322 | 1323 | global IS_DNA 1324 | IS_DNA = args.dna 1325 | 1326 | bed_file = args.bed_file 1327 | 1328 | if IS_DNA and bed_file is None: 1329 | print("[ERROR] When analyzing DNA-Seq files it is mandatory to provide a BED file containing the positions of target regions (-B|--bed_file)") 1330 | exit(1) 1331 | 1332 | region = None 1333 | 1334 | if args.region: 1335 | region = re.split("[:-]", args.region) 1336 | if not region or len(region) == 2 or (len(region) == 3 and region[1] == region[2]): 1337 | sys.stderr.write("[ERROR] Please provide a region of the form chrom:start-end (with end > start). Region provided: {}".format(region)) 1338 | exit(1) 1339 | if len(region) >= 2: 1340 | region[1] = int(region[1]) 1341 | region[2] = int(region[2]) 1342 | 1343 | options = { 1344 | "bamfile": bamfile, 1345 | "region": region, 1346 | "reference": reference_file, 1347 | "output": output, 1348 | "append": append, 1349 | "omopolymeric_file": omopolymeric_file, 1350 | "create_omopolymeric_file": create_omopolymeric_file, 1351 | "splicing_file": splicing_file, 1352 | "remove_header": args.remove_header, 1353 | "bed_file": bed_file 1354 | } 1355 | 1356 | # print("RUNNING REDItools 2.0 with the following options", options) 1357 | 1358 | return options 1359 | 1360 | # -i /marconi_scratch/userexternal/tflati00/test_picardi/reditools_test/SRR1413602.bam 1361 | # -o editing18_test -f /marconi_scratch/userinternal/tcastign/test_picardi/hg19.fa -c1,1 1362 | # -m20,20 -v1 -q30,30 -e -n0.0 -N0.0 -u -l -p --gzip -H -Y chr18:1-78077248 -F chr18_1_78077248 1363 | # 1364 | # -f /home/flati/data/reditools/SRR1413602.bam -r /home/flati/data/reditools/hg19.fa -g chr18:14237-14238 -m /home/flati/data/reditools/omopolymeric_positions.txt 1365 | # 1366 | # -f /home/flati/data/reditools/SRR1413602.bam 1367 | # -r /home/flati/data/reditools/hg19.fa 1368 | # -g chr18:14237-14238 1369 | # -m /home/flati/data/reditools/omopolymeric_positions.txt 1370 | if __name__ == '__main__': 1371 | 1372 | options = parse_options() 1373 | 1374 | analyze(options) 1375 | 1376 | 1377 | 1378 | -------------------------------------------------------------------------------- /src/cineca/reditools2_multisample.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import glob 5 | import sys 6 | import re 7 | import time 8 | from mpi4py import MPI 9 | import datetime 10 | from collections import OrderedDict 11 | import reditools 12 | import argparse 13 | import gc 14 | import socket 15 | import netifaces 16 | from random import shuffle 17 | 18 | ALIGN_CHUNK = 0 19 | STOP_WORKING = 1 20 | IM_FREE = 2 21 | CALCULATE_COVERAGE = 3 22 | 23 | STEP = 10000000 24 | 25 | def weight_function(x): 26 | # x = math.log(1+x) 27 | # return 2.748*10**(-3)*x**3 -0.056*x**2 + 0.376*x + 2.093 28 | return x**3 29 | 30 | def get_coverage(coverage_file, region = None): 31 | 32 | # Open the file and read i-th section (jump to the next '\n' character) 33 | n = float(sample_size) 34 | file_size = os.path.getsize(coverage_file) 35 | print("[{}] SIZE OF FILE {}: {} bytes".format(sample_rank, coverage_file, file_size)) 36 | start = int(sample_rank*(file_size/n)) 37 | end = int((sample_rank+1)*(file_size/n)) 38 | print("[{}] [DEBUG] START={} END={}".format(sample_rank, start, end)) 39 | 40 | f = open(coverage_file, "r") 41 | f.seek(start) 42 | loaded = start 43 | coverage_partial = 0 44 | with f as lines: 45 | line_no = 0 46 | for line in lines: 47 | if loaded >= end: continue 48 | loaded += len(line) 49 | 50 | line_no += 1 51 | if line_no == 1: 52 | if not line.startswith("chr"): 53 | continue 54 | 55 | triple = line.rstrip().split("\t") 56 | 57 | if region is not None: 58 | if triple[0] != region[0]: continue 59 | if len(region) >= 2 and int(triple[1]) < region[1]: continue 60 | if len(region) >= 2 and int(triple[1]) > region[2]: continue 61 | 62 | #if line_no % 10000000 == 0: 63 | # print("[{}] [DEBUG] Read {} lines so far".format(rank, line_no)) 64 | cov = int(triple[2]) 65 | coverage_partial += weight_function(cov) 66 | 67 | print("[{}] START={} END={} PARTIAL_COVERAGE={}".format(sample_rank, start, end, coverage_partial)) 68 | 69 | # Reduce 70 | coverage = None 71 | 72 | coverages = sample_comm.gather(coverage_partial) 73 | if sample_rank == 0: 74 | print("COVERAGES:", str(coverages)) 75 | coverage = reduce(lambda x,y: x+y, coverages) 76 | 77 | coverage = sample_comm.bcast(coverage) 78 | 79 | # Return the total 80 | return coverage 81 | 82 | def calculate_intervals(total_coverage, coverage_file, region): 83 | print("[SYSTEM] [{}] Opening coverage file={}".format(sample_rank, coverage_file)) 84 | f = open(coverage_file, "r") 85 | 86 | chr = None 87 | start = None 88 | end = None 89 | C = 0 90 | max_interval_width = min(3000000, 3000000000 / sample_size) 91 | 92 | subintervals = [] 93 | subtotal = total_coverage / sample_size 94 | print("[SYSTEM] TOTAL={} SUBTOTAL={} MAX_INTERVAL_WIDTH={}".format(total_coverage, subtotal, max_interval_width)) 95 | 96 | line_no = 0 97 | with f as lines: 98 | for line in lines: 99 | line_no += 1 100 | if line_no % 1000000 == 0: 101 | print("[SYSTEM] [{}] Time: {} - {} lines loaded.".format(sample_rank, time.time(), line_no)) 102 | 103 | fields = line.rstrip().split("\t") 104 | 105 | if region is not None: 106 | if fields[0] != region[0]: continue 107 | if len(region) >= 2 and int(fields[1]) < region[1]: continue 108 | if len(region) >= 3 and int(fields[1]) > region[2]: continue 109 | 110 | # If the interval has become either i) too large or ii) too heavy or iii) spans across two different chromosomes 111 | if C >= subtotal or (chr is not None and fields[0] != chr) or (end is not None and start is not None and (end-start) > max_interval_width): 112 | reason = None 113 | if C >= subtotal: reason = "WEIGHT" 114 | elif chr is not None and fields[0] != chr: reason = "END_OF_CHROMOSOME" 115 | elif end is not None and start is not None and (end-start) > max_interval_width: reason = "MAX_WIDTH" 116 | 117 | interval = (chr, start, end, C, end-start, reason) 118 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(sample_rank, time.time(), interval)) 119 | subintervals.append(interval) 120 | chr = None 121 | start = None 122 | end = None 123 | C = 0 124 | if len(fields) < 3: continue 125 | 126 | if chr is None: chr = fields[0] 127 | if start is None: start = int(fields[1]) 128 | end = int(fields[1]) 129 | C += weight_function(int(fields[2])) 130 | 131 | if C > 0: 132 | reason = "END_OF_CHROMOSOME" 133 | interval = (chr, start, end, C, end-start, reason) 134 | print("[SYSTEM] [{}] Time: {} - Discovered new interval={}".format(sample_rank, time.time(), interval)) 135 | subintervals.append(interval) 136 | 137 | return subintervals 138 | 139 | if __name__ == '__main__': 140 | 141 | # MPI init 142 | comm = MPI.COMM_WORLD 143 | rank = comm.Get_rank() 144 | size = comm.Get_size() 145 | 146 | options = reditools.parse_options() 147 | options["remove_header"] = True 148 | 149 | parser = argparse.ArgumentParser(description='REDItools 2.0') 150 | parser.add_argument('-D', '--coverage-dir', help='The coverage directory containing the coverage file of the sample to analyze divided by chromosome') 151 | parser.add_argument('-t', '--temp-dir', help='The temp directory where to store temporary data for this sample') 152 | parser.add_argument('-Z', '--chromosome-sizes', help='The file with the chromosome sizes') 153 | parser.add_argument('-g', '--region', help='The region of the bam file to be analyzed') 154 | parser.add_argument('-F', '--samples-file', help='The file listing each bam file to be analyzed on a separate line') 155 | args = parser.parse_known_args()[0] 156 | 157 | coverage_dir = args.coverage_dir 158 | temp_dir = args.temp_dir 159 | size_file = args.chromosome_sizes 160 | samples_filepath = args.samples_file 161 | 162 | samples = None 163 | if rank == 0: 164 | samples = [] 165 | for line in open(samples_filepath, "r"): 166 | line = line.strip() 167 | samples.append(line) 168 | 169 | # Chronometer data structure 170 | if rank == 0: 171 | chronometer = {} 172 | for sample in samples: 173 | sample = os.path.basename(sample) 174 | sample = ".".join(sample.split(".")[0:-1]) 175 | chronometer[sample] = { 176 | # "coverage": 0, 177 | # "intervals": 0, 178 | "parallel": 0 179 | } 180 | 181 | print("CHRONOMETER", chronometer) 182 | 183 | samples = comm.bcast(samples, root=0) 184 | 185 | PROCS_PER_SAMPLE = size / len(samples) 186 | if rank == 0: 187 | print("[{}] PROCESSES_PER_SAMPLE={}".format(rank, PROCS_PER_SAMPLE)) 188 | 189 | interface = 'ib0' if 'ib0' in netifaces.interfaces() else netifaces.interfaces()[0] 190 | hostname = socket.gethostbyaddr(netifaces.ifaddresses(interface)[netifaces.AF_INET][0]['addr']) 191 | pid = os.getpid() 192 | # print("[SYSTEM] [TECH] [NODE] RANK:{} HOSTNAME:{} PID:{}".format(rank, hostname, pid)) 193 | 194 | # if rank == 0: 195 | # print("[SYSTEM] LAUNCHED PARALLEL REDITOOLS WITH THE FOLLOWING OPTIONS:", options, args) 196 | 197 | region = None 198 | if args.region: 199 | region = re.split("[:-]", args.region) 200 | if not region or len(region) == 2 or (len(region) == 3 and region[1] == region[2]): 201 | sys.stderr.write("[ERROR] Please provide a region of the form chrom:start-end (with end > start). Region provided: {}".format(region)) 202 | exit(1) 203 | if len(region) >= 2: 204 | region[1] = int(region[1]) 205 | region[2] = int(region[2]) 206 | 207 | t1 = time.time() 208 | 209 | # print("I am rank #"+str(rank)) 210 | 211 | # COVERAGE SECTION 212 | sample_index = rank/PROCS_PER_SAMPLE 213 | sample_filepath = samples[sample_index] 214 | sample = os.path.basename(sample_filepath) 215 | sample = ".".join(sample.split(".")[0:-1]) 216 | 217 | if rank % PROCS_PER_SAMPLE == 0: 218 | print("[{}] SAMPLE_INDEX={} SAMPLE_FILEPATH={} SAMPLE={}".format(rank, sample_index, sample_filepath, sample)) 219 | 220 | sample_comm = comm.Split(sample_index) 221 | sample_rank = sample_comm.Get_rank() 222 | sample_size = sample_comm.Get_size() 223 | 224 | coverage_dir += sample + "/" 225 | coverage_file = coverage_dir + sample + ".cov" 226 | temp_dir += sample + "/" 227 | 228 | if not os.path.isfile(coverage_file): 229 | print("[ERROR] Coverage file {} not existing!".format(coverage_file)) 230 | exit(1) 231 | 232 | try: 233 | if not os.path.exists(temp_dir): 234 | os.makedirs(temp_dir) 235 | except Exception as e: 236 | print("[WARN] {}".format(e)) 237 | 238 | interval_file = temp_dir + "/intervals.txt" 239 | homeworks = [] 240 | 241 | if os.path.isfile(interval_file) and os.stat(interval_file).st_size > 0: 242 | if sample_rank == 0: 243 | print("[{}] [{}] [S{}] [RESTART] FOUND INTERVAL FILE {} ".format(rank, sample_rank, sample_index, interval_file)) 244 | expected_total = 0 245 | for line in open(interval_file, "r"): 246 | line = line.strip() 247 | 248 | if expected_total == 0: 249 | expected_total = int(line) 250 | continue 251 | 252 | # Interval format: (chr, start, end, C, end-start, reason) 253 | fields = line.split("\t") 254 | for i in range(1, 5): 255 | fields[i] = int(fields[i]) 256 | homeworks.append([sample_index] + fields) 257 | print("[{}] [{}] [S{}] [RESTART] INTERVAL FILE #INTERVALS {} ".format(rank, sample_rank, sample_index, len(homeworks))) 258 | 259 | else: 260 | if sample_rank == 0: 261 | print("["+str(rank)+"] [S"+str(sample_index)+"] PRE-COVERAGE TIME " + str(datetime.datetime.now().time())) 262 | 263 | start_cov = time.time() 264 | total_coverage = get_coverage(coverage_file, region) 265 | end_cov = time.time() 266 | elapsed = end_cov - start_cov 267 | print("[{}] [{}] [S{}] [{}] [TOTAL_COVERAGE] {}".format(rank, sample_rank, sample_index, sample, total_coverage)) 268 | 269 | # if sample_rank == 0: 270 | # interval_time = [sample, elapsed] 271 | # else: 272 | # interval_time = [] 273 | 274 | # interval_times = comm.gather(interval_time) 275 | 276 | # if rank == 0: 277 | # interval_times = list(filter(lambda x: x is not None, interval_times)) 278 | # for interval_time in interval_times: 279 | # if len(interval_time) > 0: 280 | # print("INTERVAL_TIME[0]", interval_time[0]) 281 | # print("INTERVAL_TIME", interval_time) 282 | # 283 | # chronometer[interval_time[0]]["coverage"] = interval_time[1] 284 | 285 | if sample_rank == 0: 286 | now = datetime.datetime.now().time() 287 | elapsed = time.time() - t1 288 | print("[SYSTEM] [TIME] [MPI] [0] MIDDLE-COVERAGE [now:{}] [elapsed: {}]".format(now, elapsed)) 289 | 290 | # Collect all the files with the coverage 291 | files = [] 292 | for file in os.listdir(coverage_dir): 293 | if region is not None and file != region[0]: continue 294 | if file.startswith("."): continue 295 | if file.endswith(".cov"): continue 296 | if file.endswith(".txt"): continue 297 | if file == "chrM": continue 298 | files.append(file) 299 | files.sort() 300 | 301 | if sample_rank == 0: 302 | print("[0] [S"+str(sample_index)+"] " + str(len(files)) + " FILES => " + str(files)) 303 | 304 | # Master: dispatches the work to the other slaves 305 | if sample_rank == 0: 306 | start_intervals = t1 307 | print("[0] Start time: {}".format(start_intervals)) 308 | 309 | done = 0 310 | total = len(files) 311 | 312 | queue = set() 313 | for i in range(1, min(sample_size, total+1)): 314 | file = files.pop() 315 | print("[SYSTEM] [MPI] [0] Sending coverage data "+ str(file) +" to rank " + str(i)) 316 | sample_comm.send(file, dest=i, tag=CALCULATE_COVERAGE) 317 | queue.add(i) 318 | 319 | while len(files) > 0: 320 | status = MPI.Status() 321 | subintervals = sample_comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 322 | for subinterval in subintervals: 323 | homeworks.append([sample_index] + list(subinterval)) 324 | 325 | done += 1 326 | who = status.Get_source() 327 | queue.remove(who) 328 | now = datetime.datetime.now().time() 329 | elapsed = time.time() - start_intervals 330 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(sample_index, str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue)) 331 | 332 | file = files.pop() 333 | print("[SYSTEM] [MPI] [0] [S"+str(sample_index)+"] Sending coverage data "+ str(file) +" to rank " + str(who)) 334 | sample_comm.send(file, dest=who, tag=CALCULATE_COVERAGE) 335 | queue.add(who) 336 | 337 | while len(queue) > 0: 338 | status = MPI.Status() 339 | print("[SYSTEM] [MPI] [0] [S{}] Going to receive data from slaves.".format(sample_index)) 340 | subintervals = sample_comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 341 | for subinterval in subintervals: 342 | homeworks.append([sample_index] + list(subinterval)) 343 | 344 | done += 1 345 | who = status.Get_source() 346 | queue.remove(who) 347 | now = datetime.datetime.now().time() 348 | elapsed = time.time() - start_intervals 349 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] COVERAGE RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [#intervals: {}] [{}/{}][{:.2f}%] [Queue:{}]".format(sample_index, str(who), now, elapsed, len(homeworks), done, total, 100 * float(done)/total, queue)) 350 | 351 | # Let them know we finished calculating the coverage 352 | for i in range(1, sample_size): 353 | sample_comm.send(None, dest=i, tag=STOP_WORKING) 354 | 355 | now = datetime.datetime.now().time() 356 | elapsed = time.time() - start_intervals 357 | 358 | interval_file = temp_dir + "/intervals.txt" 359 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] SAVING INTERVALS TO {} [now:{}] [elapsed: {}]".format(sample_index, interval_file, now, elapsed)) 360 | writer = open(interval_file, "w") 361 | writer.write(str(len(homeworks)) + "\n") 362 | for homework in homeworks: 363 | writer.write("\t".join([str(x) for x in homework[1:]]) + "\n") 364 | writer.close() 365 | 366 | now = datetime.datetime.now().time() 367 | elapsed = time.time() - start_intervals 368 | # interval_time = [sample, elapsed] 369 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] INTERVALS SAVED TO {} [now:{}] [elapsed: {}]".format(sample_index, interval_file, now, elapsed)) 370 | print("[SYSTEM] [TIME] [MPI] [0] [S{}] FINISHED CALCULATING INTERVALS [now:{}] [elapsed: {}]".format(sample_index, now, elapsed)) 371 | else: 372 | 373 | # interval_time = [] 374 | 375 | while(True): 376 | status = MPI.Status() 377 | # Here data is the name of a chromosome. 378 | data = sample_comm.recv(source=0, tag=MPI.ANY_TAG, status=status) 379 | tag = status.Get_tag() 380 | if tag == CALCULATE_COVERAGE: 381 | intervals = calculate_intervals(total_coverage, coverage_dir + data, region) 382 | sample_comm.send(intervals, dest=0, tag=IM_FREE) 383 | elif tag == STOP_WORKING: 384 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] received STOP calculating intervals SIGNAL FROM RANK 0 [{}]".format(str(rank), datetime.datetime.now().time())) 385 | break 386 | 387 | # interval_times = comm.gather(interval_time) 388 | # if rank == 0: 389 | # for interval_time in interval_times: 390 | # if len(interval_time) > 0: 391 | # chronometer[interval_time[0]]["intervals"] = interval_time[1] 392 | 393 | print("[{}] [{}] [S{}] [{}] BEFORE GATHER HOMEWORKS #TOTAL={} intervals".format(rank, sample_rank, sample_index, sample, len(homeworks))) 394 | 395 | # Wait for all intervals to be collected 396 | homeworks = comm.gather(homeworks) 397 | homeworks_by_sample = {} 398 | homeworks_done = {} 399 | if rank == 0: 400 | homeworks = reduce(lambda x,y: x+y, homeworks) 401 | shuffle(homeworks) 402 | 403 | # Divide intervals by samples 404 | for homework in homeworks: 405 | sample_id = homework[0] 406 | if sample_id not in homeworks_by_sample: homeworks_by_sample[sample_id] = [] 407 | homeworks_by_sample[sample_id].append(homework) 408 | 409 | for sample_id in homeworks_by_sample: 410 | homeworks_done[sample_id] = 0 411 | 412 | print("[{}] [{}] [S{}] #TOTAL={} (all intervals)".format(rank, sample_rank, sample_index, len(homeworks))) 413 | 414 | ########################################################### 415 | ######### COMPUTATION SECTION ############################# 416 | ########################################################### 417 | 418 | if rank == 0: 419 | done = 0 420 | print("[SYSTEM] [TIME] [MPI] [0] REDItools STARTED. MPI SIZE (PROCS): {} [now: {}]".format(size, datetime.datetime.now().time())) 421 | 422 | t1 = time.time() 423 | 424 | print("Loading chromosomes' sizes!") 425 | chromosomes = OrderedDict() 426 | for line in open(size_file): 427 | (key, val) = line.split()[0:2] 428 | chromosomes[key] = int(val) 429 | print("Sizes:") 430 | print(chromosomes) 431 | 432 | total = len(homeworks) 433 | #print("[SYSTEM] [MPI] [0] HOMEWORKS", total, homeworks) 434 | 435 | start = time.time() 436 | 437 | print("[SYSTEM] [TIME] [MPI] [0] REDItools PILEUP START: [now: {}]".format(datetime.datetime.now().time())) 438 | 439 | queue = set() 440 | for i in range(1, min(size, total)): 441 | interval = homeworks.pop() 442 | 443 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(i)) 444 | comm.send(interval, dest=i, tag=ALIGN_CHUNK) 445 | queue.add(i) 446 | 447 | while len(homeworks) > 0: 448 | status = MPI.Status() 449 | response = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 450 | done += 1 451 | who = status.Get_source() 452 | queue.remove(who) 453 | now = datetime.datetime.now().time() 454 | elapsed = time.time() - start 455 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue)) 456 | 457 | sample_done = samples[response[0]] 458 | sample_done = os.path.basename(sample_done) 459 | sample_done = ".".join(sample_done.split(".")[0:-1]) 460 | print(response) 461 | duration = response[-1] - response[-2] 462 | chronometer[sample_done]["parallel"] += duration 463 | homeworks_done[response[0]] += 1 464 | if homeworks_done[response[0]] == len(homeworks_by_sample[response[0]]): 465 | print("[SYSTEM] [MPI] [COMPLETE] [{}] [{}] [{}] now:{}".format(sample_done, chronometer[sample_done]["parallel"], str(datetime.timedelta(seconds=chronometer[sample_done]["parallel"])), now)) 466 | 467 | interval = homeworks.pop() 468 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending data "+ str(interval) +" to rank " + str(who)) 469 | comm.send(interval, dest=who, tag=ALIGN_CHUNK) 470 | queue.add(who) 471 | 472 | while len(queue) > 0: 473 | status = MPI.Status() 474 | response = comm.recv(source=MPI.ANY_SOURCE, tag=IM_FREE, status=status) 475 | done += 1 476 | who = status.Get_source() 477 | queue.remove(who) 478 | now = datetime.datetime.now().time() 479 | elapsed = time.time() - start 480 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [0] RECEIVED IM_FREE SIGNAL FROM RANK {} [now:{}] [elapsed:{}] [{}/{}][{:.2f}%] [Queue:{}]".format(str(who), now, elapsed, done, total, 100 * float(done)/total, queue)) 481 | 482 | sample_done = samples[response[0]] 483 | sample_done = os.path.basename(sample_done) 484 | sample_done = ".".join(sample_done.split(".")[0:-1]) 485 | duration = response[-1] - response[-2] 486 | chronometer[sample_done]["parallel"] += duration 487 | homeworks_done[response[0]] += 1 488 | if homeworks_done[response[0]] == len(homeworks_by_sample[response[0]]): 489 | print("[SYSTEM] [MPI] [COMPLETE] [{}] [{}] [{}] now:{}".format(sample_done, chronometer[sample_done]["parallel"], str(datetime.timedelta(seconds=chronometer[sample_done]["parallel"])), now)) 490 | 491 | print("[SYSTEM] [MPI] [SEND/RECV] [SEND] [0] Sending DIE SIGNAL TO RANK " + str(who)) 492 | comm.send(None, dest=who, tag=STOP_WORKING) 493 | 494 | t2 = time.time() 495 | elapsed = t2-t1 496 | print("[SYSTEM] [TIME] [MPI] [0] WHOLE PARALLEL ANALYSIS FINISHED. CREATING SETUP FOR MERGING PARTIAL FILES - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(elapsed, t2, datetime.datetime.now().time())) 497 | 498 | ##################################################################### 499 | ######### RECOMBINATION OF SINGLE FILES ############################# 500 | ##################################################################### 501 | for s in samples: 502 | 503 | s = os.path.basename(s) 504 | s = ".".join(s.split(".")[0:-1]) 505 | 506 | little_files = [] 507 | print("Scanning all files in "+args.temp_dir + s +" matching " + ".*") 508 | for little_file in glob.glob(args.temp_dir + s + "/*"): 509 | if little_file.endswith("chronometer.txt"): continue 510 | if little_file.endswith("files.txt"): continue 511 | if little_file.endswith("intervals.txt"): continue 512 | if little_file.endswith("status.txt"): continue 513 | if little_file.endswith("progress.txt"): continue 514 | if little_file.endswith("times.txt"): continue 515 | if little_file.endswith("groups.txt"): continue 516 | 517 | print(little_file) 518 | pieces = re.sub("\..*", "", os.path.basename(little_file)).split("#") 519 | pieces.insert(0, little_file) 520 | little_files.append(pieces) 521 | 522 | # Sort the output files 523 | keys = chromosomes.keys() 524 | print("[SYSTEM] "+str(len(little_files))+" FILES TO MERGE: ", little_files) 525 | little_files = sorted(little_files, key = lambda x: (keys.index(x[1]), int(x[2]))) 526 | print("[SYSTEM] "+str(len(little_files))+" FILES TO MERGE (SORTED): ", little_files) 527 | 528 | smallfiles_list_filename = args.temp_dir + s + "/" + "files.txt" 529 | f = open(smallfiles_list_filename, "w") 530 | for little_file in little_files: 531 | f.write(little_file[0] + "\n") 532 | f.close() 533 | 534 | # Chronometer data 535 | chronometer_filename = args.temp_dir + "/" + "chronometer.txt" 536 | f = open(chronometer_filename, "w") 537 | #f.write("\t".join(["SampleID", "Coverage", "Intervals", "Editing", "Coverage (human)", "Intervals (human)", "Coverage (human)"])) 538 | for s in chronometer: 539 | # coverage_duration = str(datetime.timedelta(seconds=chronometer[s]["coverage"])) 540 | # interval_duration = str(datetime.timedelta(seconds=chronometer[s]["intervals"])) 541 | parallel_duration = str(datetime.timedelta(seconds=chronometer[s]["parallel"])) 542 | f.write("\t".join([ 543 | s, 544 | # str(chronometer[s]["coverage"]), 545 | # str(chronometer[s]["intervals"]), 546 | str(chronometer[s]["parallel"]), 547 | # coverage_duration, 548 | # interval_duration, 549 | parallel_duration]) + "\n") 550 | f.close() 551 | 552 | t2 = time.time() 553 | print("[SYSTEM] [TIME] [MPI] [0] [END] - WHOLE ANALYSIS FINISHED - Total elapsed time [{:5.5f}] [{}] [now: {}]".format(t2-t1, t2, datetime.datetime.now().time())) 554 | 555 | # Slave processes 556 | else: 557 | 558 | while(True): 559 | status = MPI.Status() 560 | data = comm.recv(source=0, tag=MPI.ANY_TAG, status=status) 561 | 562 | tag = status.Get_tag() 563 | if tag == ALIGN_CHUNK: 564 | 565 | # Process it 566 | time_start = time.time() 567 | time_s = datetime.datetime.now().time() 568 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] REDItools: STARTED {} from rank 0 [{}] Interval: {}".format(str(rank), str(data), time_s, data)) 569 | 570 | local_sample_index = data[0] 571 | local_sample_filepath = samples[local_sample_index] 572 | local_sample = os.path.basename(local_sample_filepath) 573 | local_sample = ".".join(local_sample.split(".")[0:-1]) 574 | print("[LAUNCH REDITOOLS] {} {} {}".format(local_sample_index, local_sample_filepath, local_sample)) 575 | 576 | id = data[1] + "#" + str(data[2]) + "#" + str(data[3]) 577 | 578 | options["bamfile"] = local_sample_filepath 579 | options["region"] = [data[1], data[2], data[3]] 580 | options["output"] = args.temp_dir + local_sample + "/" + id + ".gz" 581 | 582 | print("[MPI] [" + str(rank) + "] COMMAND-LINE:", options) 583 | 584 | if not os.path.exists(args.temp_dir + local_sample + "/" + id + ".gz"): 585 | gc.collect() 586 | reditools.analyze(options) 587 | 588 | time_end = time.time() 589 | time_e = datetime.datetime.now().time() 590 | print("[SYSTEM] [TIME] [MPI] [{}] REDItools: FINISHED {} [{}][{}] [TOTAL:{:5.2f}]".format(str(rank), str(data), time_s, datetime.datetime.now().time(), time_end - time_start)) 591 | 592 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [SEND] [{}] SENDING IM_FREE tag TO RANK 0 [{}]".format(str(rank), datetime.datetime.now().time())) 593 | comm.send(data + [time_s, time_e, time_start, time_end], dest=0, tag=IM_FREE) 594 | elif tag == STOP_WORKING: 595 | print("[SYSTEM] [TIME] [MPI] [SEND/RECV] [RECV] [{}] received DIE SIGNAL FROM RANK 0 [{}]".format(str(rank), datetime.datetime.now().time())) 596 | break 597 | 598 | print("[{}] EXITING [now:{}]".format(rank, time.time())) 599 | -------------------------------------------------------------------------------- /src/cineca/reditools_table_to_bed.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import gzip 4 | 5 | import argparse 6 | if __name__ == '__main__': 7 | 8 | parser = argparse.ArgumentParser(description='REDItools 2.0 table to BED file converter') 9 | parser.add_argument('-i', '--table-file', help='The RNA-editing events table to be converted') 10 | parser.add_argument('-o', '--bed_file', help='The output bed file') 11 | args = parser.parse_known_args()[0] 12 | 13 | input = args.table_file 14 | output = args.bed_file 15 | 16 | input_root, ext = os.path.splitext(input) 17 | fd_input = gzip.open(input, "r") if ext == ".gz" else open(input, "r") 18 | fd_output = open(output, "w") 19 | 20 | LOG_INTERVAL = 1000000 21 | last_chr = None 22 | start = None 23 | end = None 24 | 25 | total = 0 26 | with fd_input: 27 | for line in fd_input: 28 | total += 1 29 | if total % LOG_INTERVAL == 0: 30 | sys.stderr.write("[{}] {} lines read from {}\n".format(last_chr, total, input)) 31 | 32 | fields = line.strip().split() 33 | chr = fields[0] 34 | pos = int(fields[1]) 35 | 36 | if last_chr != chr or (end is not None and pos > end + 1): 37 | if last_chr is not None and start is not None and end is not None: 38 | fd_output.write("{}\t{}\t{}\n".format(last_chr, start, end)) 39 | start = None 40 | end = None 41 | 42 | if start is None: 43 | start = pos 44 | 45 | if last_chr != chr: 46 | last_chr = chr 47 | 48 | if end is None or pos == end + 1: 49 | end = pos 50 | 51 | if last_chr is not None and start is not None and end is not None: 52 | fd_output.write("{}\t{}\t{}\n".format(last_chr, start, end)) 53 | start = None 54 | end = None 55 | 56 | sys.stderr.write("{} lines read from {}\n".format(total, input)) 57 | -------------------------------------------------------------------------------- /template.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Timeline | Basic demo 5 | 6 | 11 | 12 | 13 | 14 | 15 | 16 |

17 | 18 | 59 | 60 | 61 | -------------------------------------------------------------------------------- /test/SRR2135332.bam: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BioinfoUNIBA/REDItools2/17e932fa225477effced64ad5342e7cfd2b7d87b/test/SRR2135332.bam -------------------------------------------------------------------------------- /test/SRR2135332.bam.bai: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/BioinfoUNIBA/REDItools2/17e932fa225477effced64ad5342e7cfd2b7d87b/test/SRR2135332.bam.bai --------------------------------------------------------------------------------