├── .DS_Store ├── README.md ├── figures ├── .DS_Store ├── DI_formula.png ├── HiCtool_1mb_normalized.png ├── HiCtool_1mb_normalized_95th.png ├── HiCtool_1mb_observed.png ├── HiCtool_GM12878_HEK293T_1mb_observed.png ├── HiCtool_chr6_1mb_PC1.png ├── HiCtool_chr6_1mb_correlation_matrix.png ├── HiCtool_chr6_1mb_normalized_enrich.png ├── HiCtool_chr6_1mb_normalized_enrich_histogram.png ├── HiCtool_chr6_1mb_normalized_fend.png ├── HiCtool_chr6_40kb_normalized_enrich_50000000_54000000.png ├── HiCtool_chr6_40kb_normalized_enrich_histogram_50000000_54000000.png ├── HiCtool_chr6_40kb_normalized_fend_50000000_54000000.png ├── HiCtool_chr6_40kb_normalized_fend_histogram_50000000_54000000.png ├── HiCtool_chr6_DI.png ├── HiCtool_chr6_DI_HMM.png ├── HiCtool_chr6_chr3_1mb_0-50_0-80mb_normalized.png ├── HiCtool_chr6_chr3_1mb_normalized.png ├── HiCtool_chr6_chr3_1mb_normalized_histogram.png ├── HiCtool_chr6_chr3_1mb_observed.png ├── HiCtool_chr6_chr3_1mb_observed_histogram.png ├── HiCtool_chr6_chr6_1mb_0-80mb_normalized.png ├── HiCtool_chr6_chr6_1mb_normalized.png ├── HiCtool_chr6_chr6_1mb_observed.png ├── HiCtool_chr6_chr6_40kb_normalized_domains_50000000_54000000.png ├── HiCtool_chr6_chr6_40kb_normalized_domains_80000000_120000000.png ├── HiCtool_compression.png ├── HiCtool_topological_domains_flowchart.png ├── SRR1658570_1.fastq_truncated_reads.png └── SRR1658570_2.fastq_truncated_reads.png ├── scripts ├── .DS_Store ├── Hi-Corrector1.2 │ ├── .DS_Store │ ├── Manual_HiCorrector_1.2.pdf │ ├── bin │ │ ├── export_norm_data │ │ ├── ic │ │ ├── ic_mep │ │ ├── ic_mes │ │ ├── split_data │ │ └── split_data_parallel │ ├── example │ │ ├── contact.matrix │ │ ├── output_ic_mes │ │ │ ├── contact.matrix.bias │ │ │ ├── contact.matrix.log │ │ │ └── contact.matrix.norm │ │ ├── run_example_ic.sh │ │ ├── run_example_ic_mep.sh │ │ ├── run_example_ic_mes.sh │ │ └── run_export_norm_data.sh │ └── src │ │ ├── Makefile │ │ ├── export_norm_data.c │ │ ├── ic.c │ │ ├── ic_mep.c │ │ ├── ic_mes.c │ │ ├── 
matutils.c │ │ ├── matutils.h │ │ ├── my_getopt-1.5 │ │ ├── ChangeLog │ │ ├── LICENSE │ │ ├── Makefile │ │ ├── README │ │ ├── getopt.3 │ │ ├── getopt.h │ │ ├── getopt.txt │ │ ├── main.c │ │ ├── my_getopt.c │ │ └── my_getopt.h │ │ ├── split_data.c │ │ └── split_data_parallel.c ├── HiCtool_TAD_analysis.py ├── HiCtool_add_fend_gc_content.py ├── HiCtool_add_fend_mappability.py ├── HiCtool_artificial_reads.py ├── HiCtool_build_enzyme_fastq.py ├── HiCtool_compartment_analysis.py ├── HiCtool_fend_file_add_gc_map.sh ├── HiCtool_generate_fend_file.sh ├── HiCtool_global_map_analysis.py ├── HiCtool_global_map_observed.sh ├── HiCtool_global_map_rows.py ├── HiCtool_hifive.py ├── HiCtool_normalize_global_matrix.sh ├── HiCtool_pre_truncation.py ├── HiCtool_run_preprocessing.sh ├── HiCtool_utilities.py ├── HiCtool_yaffe_tanay.py └── chromSizes │ ├── dm6.chrom.sizes │ ├── hg19.chrom.sizes │ ├── hg38.chrom.sizes │ ├── mm10.chrom.sizes │ ├── mm9.chrom.sizes │ ├── susScr11.chrom.sizes │ └── susScr3.chrom.sizes └── tutorial ├── .DS_Store ├── HiCtool_compressed_format.md ├── HiCtool_flowchart.png ├── HiCtool_utility_code.md ├── ReadMe.md ├── compartment.md ├── data-preprocessing.md ├── normalization-matrix-balancing.md ├── normalization-yaffe-tanay.md └── tad-analysis.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/.DS_Store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # HiCtool: a standardized pipeline to process and visualize Hi-C data (v2.2) 2 | 3 | HiCtool is an open-source bioinformatic tool based on Python, which integrates several software to perform a standardized Hi-C data analysis, from the processing of raw data, to the visualization of heatmaps and the identification of topologically associated 
domains (TADs) and A/B compartments. 4 | 5 | ## Table of Contents 6 | 7 | - [Overview](#overview) 8 | - [Installation](#installation) 9 | - [Tutorial](#tutorial) 10 | - [Reference](#reference) 11 | - [Version history](#version-history) 12 | - [Support](#support) 13 | 14 | ## Overview 15 | 16 | HiCtool leads the user step by step through a pipeline that consists of the following sections: 17 | 18 | - Data preprocessing 19 | - Data normalization and visualization 20 | - A/B compartment analysis 21 | - TAD analysis 22 | 23 | HiCtool can normalize contact data following two approaches: 24 | 25 | - **The explicit-factor correction method reported by [Yaffe and Tanay](https://www.nature.com/articles/ng.947)** and performed by the library [HiFive](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0806-y). In this case, only intra-chromosomal analysis is performed, one chromosome at a time, and only single heatmaps can be plotted. In addition, topological domains can be plotted over the heatmap at a resolution of 40kb or lower. **This approach also allows computing the O/E contact matrix, deriving the Pearson correlation matrix, and performing A/B compartment analysis**. 26 | - **The matrix balancing approach performed by [Hi-Corrector](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4380031/).** In this case, a global analysis is performed including all the chromosomes and both intra- and inter-chromosomal maps. It is possible to visualize either a single intra- or inter-chromosomal heatmap or the global all-by-all chromosome heatmap (for the global heatmap visualization, resolution could be limited by your hardware). In addition, topological domains can be plotted over the intra-chromosomal heatmap (resolution of 40kb or lower), and the same maps from different samples can be plotted side by side for easy comparison. 
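The matrix balancing idea in the second approach can be illustrated with a small NumPy sketch. This is not Hi-Corrector's code: it is a generic symmetric scaling loop (Sinkhorn-style, with square-root damping chosen here so that small dense examples converge reliably). The function name `balance_matrix` and its parameters `max_iter`, `row_sum`, and `tol` are hypothetical names used only for illustration, and the sketch assumes a symmetric, non-negative matrix with at least one non-empty row.

```python
import numpy as np


def balance_matrix(contacts, max_iter=100, row_sum=1.0, tol=1e-10):
    """Rescale a symmetric contact matrix so all non-empty rows share one sum.

    Illustrative sketch only (hypothetical helper, not part of HiCtool
    or Hi-Corrector). Returns the balanced matrix and the bias vector.
    """
    matrix = np.asarray(contacts, dtype=float).copy()  # leave the input untouched
    bias = np.ones(matrix.shape[0])
    for _ in range(max_iter):
        sums = matrix.sum(axis=1)
        nonzero = sums > 0
        # Per-row correction relative to the mean coverage of non-empty rows.
        # The square root damps the update so the symmetric scaling settles
        # instead of oscillating.
        factors = np.ones_like(sums)
        factors[nonzero] = np.sqrt(sums[nonzero] / sums[nonzero].mean())
        # Divide every entry by the factor of its row and of its column.
        matrix /= np.outer(factors, factors)
        bias *= factors
        if np.abs(factors - 1.0).max() < tol:
            break
    # Rescale so that non-empty rows sum to the requested total.
    sums = matrix.sum(axis=1)
    matrix *= row_sum / sums[sums > 0].mean()
    return matrix, bias
```

The returned `bias` vector plays the role of per-bin bias factors: the balanced matrix equals the raw matrix divided by `bias[i] * bias[j]`, up to one global constant fixed by `row_sum`.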
27 | 28 | ## Installation 29 | 30 | HiCtool is organized as a pipeline of single Unix commands for ease of use. To use HiCtool, you need to install the following Python libraries, packages and software. Everything is open source. After that, you need the HiCtool source code: **[Click here to download HiCtool](https://github.com/Zhong-Lab-UCSD/HiCtool/archive/master.zip)** and unzip the file; all the scripts are in the ``scripts`` folder. The [Hi-Corrector](http://zhoulab.usc.edu/Hi-Corrector/) source code is already included in this folder. 31 | 32 | **1. Python libraries [for Python 2.7]:** 33 | 34 | - [numpy](http://scipy.org/) 35 | - [scipy](http://scipy.org/) 36 | - [matplotlib](http://matplotlib.org/) 37 | - [math](https://docs.python.org/2/library/math.html) 38 | - [matplotlib.pyplot](http://matplotlib.org/api/pyplot_api.html#module-matplotlib.pyplot/) 39 | - [csv](https://docs.python.org/2/library/csv.html) 40 | - [pybedtools](https://daler.github.io/pybedtools/) 41 | - [pandas](https://pandas.pydata.org/) 42 | - [multiprocessing](https://docs.python.org/2/library/multiprocessing.html) 43 | - [biopython](http://biopython.org/) 44 | - [sklearn-decomposition](https://scikit-learn.org/) 45 | 46 | **2. Python packages:** 47 | 48 | - [hifive (v1.4)](http://bxlab-hifive.readthedocs.org/en/latest/introduction.html) 49 | - [hmmlearn](https://github.com/hmmlearn/hmmlearn) 50 | 51 | **3. Other software:** 52 | 53 | - [BEDTools](http://bedtools.readthedocs.org/en/latest/) 54 | - [Bowtie 2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) 55 | - [SAMTools](http://samtools.sourceforge.net/) 56 | - [SRA Toolkit](http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump) 57 | 58 | ## Tutorial 59 | 60 | We have compiled a full tutorial to show the usage of the pipeline. Please check the [Tutorial Homepage](./tutorial/ReadMe.md). 61 | 62 | ## Reference 63 | 64 | HiCtool was developed by Riccardo Calandrelli and Qiuyang Wu from Dr. 
Sheng Zhong's Lab at the University of California San Diego. 65 | 66 | If you use HiCtool, please cite the paper: 67 | 68 | Calandrelli, R., Wu, Q., Guan, J., & Zhong, S. (2018). GITAR: An open source tool for analysis and visualization of Hi-C data. *Genomics, Proteomics & Bioinformatics.* 69 | 70 | ## Version history 71 | 72 | ### March 27, 2020 73 | 74 | - Version 2.2 released: 75 | 76 | - A/B compartment analysis is now included. 77 | - Calculation and plotting of the Pearson correlation matrix from the O/E contact matrix (only Yaffe-Tanay method). 78 | - Calculation and plotting of the eigenvector for A/B compartment annotation. 79 | - Small bug fixes. 80 | 81 | ### July 25, 2019 82 | 83 | - Version 2.1.1 released: 84 | 85 | - Optimized pre-processing and normalization to allow the analysis of very deeply sequenced files. 86 | - Added parameter to allow custom mapped reads filtering at the pre-processing step. 87 | - Bedpe file generation step added for pre-processed read pairs. 88 | - Deduplication step added in the pre-processing pipeline. 89 | - Optimized and faster FEND file generation for adding GC content and mappability score. 90 | - Small bug fixes. 91 | 92 | ### May 20, 2019 93 | 94 | - Version 2.1 released: 95 | 96 | - HiCtool is now based only on Unix commands. A Python script is provided which contains utility functions for I/O of files generated with HiCtool in the Python console. 97 | - The entire pre-processing is now performed with a single command. 98 | - Possibility of processing Hi-C data generated with a cocktail of restriction enzymes (e.g. as in the Arima Kit) or with restriction enzymes including "N" in their sequence. 99 | - HiCtool now includes additional species besides hg38 and mm10: hg19, mm9, dm6, susScr3, susScr11. 100 | - New function to visualize heatmaps from different samples or conditions on a side-by-side view for comparison. 101 | - Small bug fixes. 
102 | 103 | ### March 28, 2019 104 | 105 | - Version 2.0 released: 106 | 107 | - HiCtool code is now hosted on GitHub. 108 | - Added inter-chromosomal analysis and visualization. 109 | - Included an additional global normalization method based on a matrix balancing approach. 110 | - New function to plot the all-by-all chromosomes global contact matrix. 111 | - Possibility of saving contact matrices in tab separated format. 112 | - Possibility of plotting topological domains over the heatmaps. 113 | - Small bug fixes. 114 | 115 | ### December 2015 - October 2018 116 | 117 | - The initial release of HiCtool v1.0 came out in late 2015. GITAR manuscript (including HiCtool) published in October 2018. 118 | 119 | ## Support 120 | 121 | For issues related to the use of HiCtool or if you want to report a bug, please contact Riccardo Calandrelli at . 122 | -------------------------------------------------------------------------------- /figures/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/.DS_Store -------------------------------------------------------------------------------- /figures/DI_formula.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/DI_formula.png -------------------------------------------------------------------------------- /figures/HiCtool_1mb_normalized.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_1mb_normalized.png -------------------------------------------------------------------------------- /figures/HiCtool_1mb_normalized_95th.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_1mb_normalized_95th.png -------------------------------------------------------------------------------- /figures/HiCtool_1mb_observed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_1mb_observed.png -------------------------------------------------------------------------------- /figures/HiCtool_GM12878_HEK293T_1mb_observed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_GM12878_HEK293T_1mb_observed.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_1mb_PC1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_1mb_PC1.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_1mb_correlation_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_1mb_correlation_matrix.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_1mb_normalized_enrich.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_1mb_normalized_enrich.png 
-------------------------------------------------------------------------------- /figures/HiCtool_chr6_1mb_normalized_enrich_histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_1mb_normalized_enrich_histogram.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_1mb_normalized_fend.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_1mb_normalized_fend.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_40kb_normalized_enrich_50000000_54000000.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_40kb_normalized_enrich_50000000_54000000.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_40kb_normalized_enrich_histogram_50000000_54000000.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_40kb_normalized_enrich_histogram_50000000_54000000.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_40kb_normalized_fend_50000000_54000000.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_40kb_normalized_fend_50000000_54000000.png 
-------------------------------------------------------------------------------- /figures/HiCtool_chr6_40kb_normalized_fend_histogram_50000000_54000000.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_40kb_normalized_fend_histogram_50000000_54000000.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_DI.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_DI.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_DI_HMM.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_DI_HMM.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr3_1mb_0-50_0-80mb_normalized.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr3_1mb_0-50_0-80mb_normalized.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr3_1mb_normalized.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr3_1mb_normalized.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr3_1mb_normalized_histogram.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr3_1mb_normalized_histogram.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr3_1mb_observed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr3_1mb_observed.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr3_1mb_observed_histogram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr3_1mb_observed_histogram.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr6_1mb_0-80mb_normalized.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr6_1mb_0-80mb_normalized.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr6_1mb_normalized.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr6_1mb_normalized.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr6_1mb_observed.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr6_1mb_observed.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr6_40kb_normalized_domains_50000000_54000000.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr6_40kb_normalized_domains_50000000_54000000.png -------------------------------------------------------------------------------- /figures/HiCtool_chr6_chr6_40kb_normalized_domains_80000000_120000000.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_chr6_chr6_40kb_normalized_domains_80000000_120000000.png -------------------------------------------------------------------------------- /figures/HiCtool_compression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_compression.png -------------------------------------------------------------------------------- /figures/HiCtool_topological_domains_flowchart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/HiCtool_topological_domains_flowchart.png -------------------------------------------------------------------------------- /figures/SRR1658570_1.fastq_truncated_reads.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/SRR1658570_1.fastq_truncated_reads.png 
-------------------------------------------------------------------------------- /figures/SRR1658570_2.fastq_truncated_reads.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/figures/SRR1658570_2.fastq_truncated_reads.png -------------------------------------------------------------------------------- /scripts/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/.DS_Store -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/.DS_Store -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/Manual_HiCorrector_1.2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/Manual_HiCorrector_1.2.pdf -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/bin/export_norm_data: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/bin/export_norm_data -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/bin/ic: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/bin/ic -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/bin/ic_mep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/bin/ic_mep -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/bin/ic_mes: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/bin/ic_mes -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/bin/split_data: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/bin/split_data -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/bin/split_data_parallel: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/scripts/Hi-Corrector1.2/bin/split_data_parallel -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/example/output_ic_mes/contact.matrix.log: -------------------------------------------------------------------------------- 1 | ic_mes arguments: 2 | input file: /mnt/extraids/OceanStor-SysCmn-2/rcalandrelli/Software/Hi-Corrector1.2/example/contact.matrix 3 | has header line: 0 4 | has header column: 0 5 | memory size for this job: 10 MB 6 | total_rows: 1000 7 | maxiter: 100 8 | output file: 
/mnt/extraids/OceanStor-SysCmn-2/rcalandrelli/Software/Hi-Corrector1.2/example/output_ic_mes/contact.matrix.bias 9 | All data is divided to 1 parts: 1000 10 | 11 | partition sizes (1 partitions): 1000 12 | 13 | Iteration 1: 14 | Part 1: 1 1000 15 | Iteration 2: 16 | Part 1: 1 1000 17 | Iteration 3: 18 | Part 1: 1 1000 19 | Iteration 4: 20 | Part 1: 1 1000 21 | Iteration 5: 22 | Part 1: 1 1000 23 | Iteration 6: 24 | Part 1: 1 1000 25 | Iteration 7: 26 | Part 1: 1 1000 27 | Iteration 8: 28 | Part 1: 1 1000 29 | Iteration 9: 30 | Part 1: 1 1000 31 | Iteration 10: 32 | Part 1: 1 1000 33 | Iteration 11: 34 | Part 1: 1 1000 35 | Iteration 12: 36 | Part 1: 1 1000 37 | Iteration 13: 38 | Part 1: 1 1000 39 | Iteration 14: 40 | Part 1: 1 1000 41 | Iteration 15: 42 | Part 1: 1 1000 43 | Iteration 16: 44 | Part 1: 1 1000 45 | Iteration 17: 46 | Part 1: 1 1000 47 | Iteration 18: 48 | Part 1: 1 1000 49 | Iteration 19: 50 | Part 1: 1 1000 51 | Iteration 20: 52 | Part 1: 1 1000 53 | Iteration 21: 54 | Part 1: 1 1000 55 | Iteration 22: 56 | Part 1: 1 1000 57 | Iteration 23: 58 | Part 1: 1 1000 59 | Iteration 24: 60 | Part 1: 1 1000 61 | Iteration 25: 62 | Part 1: 1 1000 63 | Iteration 26: 64 | Part 1: 1 1000 65 | Iteration 27: 66 | Part 1: 1 1000 67 | Iteration 28: 68 | Part 1: 1 1000 69 | Iteration 29: 70 | Part 1: 1 1000 71 | Iteration 30: 72 | Part 1: 1 1000 73 | Iteration 31: 74 | Part 1: 1 1000 75 | Iteration 32: 76 | Part 1: 1 1000 77 | Iteration 33: 78 | Part 1: 1 1000 79 | Iteration 34: 80 | Part 1: 1 1000 81 | Iteration 35: 82 | Part 1: 1 1000 83 | Iteration 36: 84 | Part 1: 1 1000 85 | Iteration 37: 86 | Part 1: 1 1000 87 | Iteration 38: 88 | Part 1: 1 1000 89 | Iteration 39: 90 | Part 1: 1 1000 91 | Iteration 40: 92 | Part 1: 1 1000 93 | Iteration 41: 94 | Part 1: 1 1000 95 | Iteration 42: 96 | Part 1: 1 1000 97 | Iteration 43: 98 | Part 1: 1 1000 99 | Iteration 44: 100 | Part 1: 1 1000 101 | Iteration 45: 102 | Part 1: 1 1000 103 | Iteration 46: 104 | Part 1: 1 1000 105 
| Iteration 47: 106 | Part 1: 1 1000 107 | Iteration 48: 108 | Part 1: 1 1000 109 | Iteration 49: 110 | Part 1: 1 1000 111 | Iteration 50: 112 | Part 1: 1 1000 113 | Iteration 51: 114 | Part 1: 1 1000 115 | Iteration 52: 116 | Part 1: 1 1000 117 | Iteration 53: 118 | Part 1: 1 1000 119 | Iteration 54: 120 | Part 1: 1 1000 121 | Iteration 55: 122 | Part 1: 1 1000 123 | Iteration 56: 124 | Part 1: 1 1000 125 | Iteration 57: 126 | Part 1: 1 1000 127 | Iteration 58: 128 | Part 1: 1 1000 129 | Iteration 59: 130 | Part 1: 1 1000 131 | Iteration 60: 132 | Part 1: 1 1000 133 | Iteration 61: 134 | Part 1: 1 1000 135 | Iteration 62: 136 | Part 1: 1 1000 137 | Iteration 63: 138 | Part 1: 1 1000 139 | Iteration 64: 140 | Part 1: 1 1000 141 | Iteration 65: 142 | Part 1: 1 1000 143 | Iteration 66: 144 | Part 1: 1 1000 145 | Iteration 67: 146 | Part 1: 1 1000 147 | Iteration 68: 148 | Part 1: 1 1000 149 | Iteration 69: 150 | Part 1: 1 1000 151 | Iteration 70: 152 | Part 1: 1 1000 153 | Iteration 71: 154 | Part 1: 1 1000 155 | Iteration 72: 156 | Part 1: 1 1000 157 | Iteration 73: 158 | Part 1: 1 1000 159 | Iteration 74: 160 | Part 1: 1 1000 161 | Iteration 75: 162 | Part 1: 1 1000 163 | Iteration 76: 164 | Part 1: 1 1000 165 | Iteration 77: 166 | Part 1: 1 1000 167 | Iteration 78: 168 | Part 1: 1 1000 169 | Iteration 79: 170 | Part 1: 1 1000 171 | Iteration 80: 172 | Part 1: 1 1000 173 | Iteration 81: 174 | Part 1: 1 1000 175 | Iteration 82: 176 | Part 1: 1 1000 177 | Iteration 83: 178 | Part 1: 1 1000 179 | Iteration 84: 180 | Part 1: 1 1000 181 | Iteration 85: 182 | Part 1: 1 1000 183 | Iteration 86: 184 | Part 1: 1 1000 185 | Iteration 87: 186 | Part 1: 1 1000 187 | Iteration 88: 188 | Part 1: 1 1000 189 | Iteration 89: 190 | Part 1: 1 1000 191 | Iteration 90: 192 | Part 1: 1 1000 193 | Iteration 91: 194 | Part 1: 1 1000 195 | Iteration 92: 196 | Part 1: 1 1000 197 | Iteration 93: 198 | Part 1: 1 1000 199 | Iteration 94: 200 | Part 1: 1 1000 201 | Iteration 95: 202 | Part 1: 1 
1000 203 | Iteration 96: 204 | Part 1: 1 1000 205 | Iteration 97: 206 | Part 1: 1 1000 207 | Iteration 98: 208 | Part 1: 1 1000 209 | Iteration 99: 210 | Part 1: 1 1000 211 | Iteration 100: 212 | Part 1: 1 1000 213 | >> ic_mes computation total time: 0:0:16 214 | Write output to file: /mnt/extraids/OceanStor-SysCmn-2/rcalandrelli/Software/Hi-Corrector1.2/example/output_ic_mes/contact.matrix.bias 215 | Write time: 0:0:0 216 | Done. 217 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/example/run_example_ic.sh: -------------------------------------------------------------------------------- 1 | checkMakeDirectory(){ 2 | echo -e "checking directory: $1" 3 | if [ ! -e "$1" ]; then 4 | echo -e "\tmakedir $1" 5 | mkdir -p "$1" 6 | fi 7 | } 8 | 9 | cmd="$PWD/../bin/ic" 10 | output_dir="$PWD/output_ic" # output directory 11 | checkMakeDirectory $output_dir 12 | 13 | # input parameters 14 | total_rows=1000 # total number of rows in the input contact matrix 15 | max_iter=10 # total number of iterations that are needed to run in the ic algorithm 16 | input_mat_file="$PWD/contact.matrix" # input contact matrix file, each line is a row, numbers are separated by TAB char 17 | has_header_line=0 # input file doesn't have header line 18 | has_header_column=0 # input file doesn't have header column 19 | output_file="$output_dir/contact.matrix.bias_factors" # output file consists of a vector of bias factors 20 | log_file="$output_dir/contact.matrix.log" # log file recording the verbose console output of the ic command 21 | 22 | # run the command 23 | echo "$cmd $input_mat_file $total_rows $max_iter $has_header_line $has_header_column $output_file > $log_file" 24 | $cmd $input_mat_file $total_rows $max_iter $has_header_line $has_header_column $output_file > $log_file 25 | 26 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/example/run_example_ic_mep.sh: 
-------------------------------------------------------------------------------- 1 | checkMakeDirectory(){ 2 | echo -e "checking directory: $1" 3 | if [ ! -e "$1" ]; then 4 | echo -e "\tmakedir $1" 5 | mkdir -p "$1" 6 | fi 7 | } 8 | 9 | output_dir="$PWD/output_ic_mep" # output directory 10 | checkMakeDirectory $output_dir 11 | 12 | # input parameters 13 | num_processors=5 14 | cmd="/usr/usc/openmpi/default/bin/mpirun -np $num_processors $PWD/../bin/ic_mep" 15 | mem_per_task="5" # memory used for loading data (in MegaBytes) 16 | total_rows=1000 # total number of rows in the input contact matrix 17 | max_iter=10 # total number of iterations that are needed to run in the ic algorithm 18 | input_mat_file="$PWD/contact.matrix" # input contact matrix file, each line is a row, numbers are separated by TAB char 19 | output_file="$output_dir/contact.matrix.bias_factors" # output file consists of a vector of bias factors 20 | log_file="$output_dir/contact.matrix.log" # log file recording the verbose console output of the ic command 21 | jobID="contact.test" 22 | 23 | # run the command 24 | echo "$cmd --inputFile=$input_mat_file --numTask=$num_processors --memSizePerTask=$mem_per_task --numRows=$total_rows --maxIteration=$max_iter --jobID=$jobID --outputFile=$output_file > $log_file" 25 | $cmd --inputFile=$input_mat_file --numTask=$num_processors --memSizePerTask=$mem_per_task --numRows=$total_rows --maxIteration=$max_iter --jobID=$jobID --outputFile=$output_file > $log_file 26 | 27 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/example/run_example_ic_mes.sh: -------------------------------------------------------------------------------- 1 | checkMakeDirectory(){ 2 | echo -e "checking directory: $1" 3 | if [ ! 
-e "$1" ]; then 4 | echo -e "\tmakedir $1" 5 | mkdir -p "$1" 6 | fi 7 | } 8 | 9 | cmd="$PWD/../bin/ic_mes" 10 | output_dir="$PWD/output_ic_mes" # output directory 11 | checkMakeDirectory $output_dir 12 | 13 | # input parameters 14 | total_mem="10" # memory used for loading data (in MegaBytes) 15 | total_rows=1000 # total number of rows in the input contact matrix 16 | max_iter=100 # total number of iterations that are needed to run in the ic algorithm 17 | input_mat_file="$PWD/contact.matrix" # input contact matrix file, each line is a row, numbers are separated by TAB char 18 | has_header_line=0 # input file doesn't have header line 19 | has_header_column=0 # input file doesn't have header column 20 | output_file="$output_dir/contact.matrix.bias" # output file consists of a vector of bias factors 21 | log_file="$output_dir/contact.matrix.log" # log file recording the verbose console output of the ic command 22 | 23 | # run the command 24 | echo "$cmd $input_mat_file $total_mem $total_rows $max_iter $has_header_line $has_header_column $output_file > $log_file" 25 | $cmd $input_mat_file $total_mem $total_rows $max_iter $has_header_line $has_header_column $output_file > $log_file 26 | 27 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/example/run_export_norm_data.sh: -------------------------------------------------------------------------------- 1 | checkMakeDirectory(){ 2 | echo -e "checking directory: $1" 3 | if [ ! -e "$1" ]; then 4 | echo -e "\tmakedir $1" 5 | mkdir -p "$1" 6 | fi 7 | } 8 | 9 | # export_norm_data <#rows/columns> 10 | cmd="$PWD/../bin/export_norm_data" 11 | output_dir="$PWD/output_ic_mes" # output directory. 
You may modify this output directory 12 | 13 | # input parameters 14 | total_mem="10" # memory used for loading data (in MegaBytes) 15 | total_rows=1000 # total number of rows in the input contact matrix 16 | input_mat_file="$PWD/contact.matrix" # input contact matrix file, each line is a row, numbers are separated by TAB char 17 | has_header_line=0 # input file doesn't have header line 18 | has_header_column=0 # input file doesn't have header column 19 | bias_factor_file="$output_dir/contact.matrix.bias" # input file consists of a vector of bias factors 20 | row_sum_after_norm=10000 # target sum of each row after normalization 21 | output_file="$output_dir/contact.matrix.norm" # output file containing the normalized contact matrix 22 | 23 | # run the command 24 | echo "$cmd $input_mat_file $total_rows $has_header_line $has_header_column $total_mem $bias_factor_file $row_sum_after_norm $output_file" 25 | $cmd $input_mat_file $total_rows $has_header_line $has_header_column $total_mem $bias_factor_file $row_sum_after_norm $output_file 26 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/Makefile: -------------------------------------------------------------------------------- 1 | # compiler and options for ANSI C 2 | CC=cc 3 | CFLAGS= -Wall -O3 4 | LIBS= -lm 5 | 6 | # mpi compiler and options for ANSI C and mpi library 7 | CCMPI=/usr/usc/openmpi/default/bin/mpicc 8 | CFLAGSMPI= 9 | LIBSMPI= -L/usr/usc/openmpi/1.6.4/lib 10 | INCLUDE= 11 | 12 | # getopt functions library 13 | getoptDIR=./my_getopt-1.5 14 | 15 | # programs to be compiled and built 16 | OBJ_GETOPT= my_getopt.o 17 | OBJECTS1= ic.o matutils.o 18 | OBJECTS2= ic_mes.o matutils.o 19 | OBJECTS3= ic_mep.o matutils.o ${OBJ_GETOPT} 20 | OBJECTS4= split_data.o matutils.o 21 | OBJECTS5= split_data_parallel.o matutils.o 22 | OBJECTS6= export_norm_data.o matutils.o 23 | PROG1=ic 24 | PROG2=ic_mes 25 | PROG3=ic_mep 26 | PROG4=split_data 27 | PROG5=split_data_parallel 28 | PROG6=export_norm_data 29 | 30 | 
all: ${PROG1} ${PROG2} ${PROG3} ${PROG4} ${PROG5} ${PROG6} 31 | 32 | ${PROG1}: ${OBJECTS1} 33 | ${CC} ${CFLAGS} ${OBJECTS1} ${LIBS} -o ${PROG1} 34 | cp ${PROG1} ../bin 35 | 36 | ${PROG2}: ${OBJECTS2} 37 | ${CC} ${CFLAGS} ${OBJECTS2} ${LIBS} -o ${PROG2} 38 | cp ${PROG2} ../bin 39 | 40 | ${PROG3}: ${OBJECTS3} 41 | ${CCMPI} ${CFLAGSMPI} ${OBJECTS3} ${LIBSMPI} ${LIBS} -o ${PROG3} 42 | cp ${PROG3} ../bin 43 | 44 | ${PROG4}: ${OBJECTS4} 45 | ${CC} ${CFLAGS} ${OBJECTS4} ${LIBS} -o ${PROG4} 46 | cp ${PROG4} ../bin 47 | 48 | ${PROG5}: ${OBJECTS5} 49 | ${CCMPI} ${CFLAGSMPI} ${OBJECTS5} ${LIBSMPI} ${LIBS} -o ${PROG5} 50 | cp ${PROG5} ../bin 51 | 52 | ${PROG6}: ${OBJECTS6} 53 | ${CC} ${CFLAGS} ${OBJECTS6} ${LIBS} -o ${PROG6} 54 | cp ${PROG6} ../bin 55 | 56 | matutils.o: matutils.c matutils.h 57 | ${CC} ${CFLAGS} ${INCLUDE} -c $< -o $@ 58 | 59 | ic.o: ic.c matutils.h 60 | ${CC} ${CFLAGS} ${INCLUDE} -c $< -o $@ 61 | 62 | ic_mes.o: ic_mes.c matutils.h 63 | ${CC} ${CFLAGS} ${INCLUDE} -c $< -o $@ 64 | 65 | ic_mep.o: ic_mep.c matutils.h 66 | ${CCMPI} ${CFLAGS} ${INCLUDE} -c $< -o $@ 67 | 68 | ic_mep2.o: ic_mep2.c matutils.h 69 | ${CCMPI} ${CFLAGS} ${INCLUDE} -c $< -o $@ 70 | 71 | split_data.o: split_data.c matutils.h 72 | ${CC} ${CFLAGS} ${INCLUDE} -c $< -o $@ 73 | 74 | split_data_parallel.o: split_data_parallel.c matutils.h 75 | ${CCMPI} ${CFLAGS} ${INCLUDE} -c $< -o $@ 76 | 77 | export_norm_data.o: export_norm_data.c matutils.h 78 | ${CC} ${CFLAGS} ${INCLUDE} -c $< -o $@ 79 | 80 | my_getopt.o: $(getoptDIR)/my_getopt.c $(getoptDIR)/my_getopt.h $(getoptDIR)/getopt.h 81 | cd $(getoptDIR) 82 | $(CC) $(CFLAGS) -I$(getoptDIR) -c $(getoptDIR)/my_getopt.c -o my_getopt.o 83 | 84 | # --- remove binary and executable files 85 | clean: 86 | rm -f *.o 87 | rm -f ${PROG1} 88 | rm -f ${PROG2} 89 | rm -f ${PROG3} 90 | rm -f ${PROG4} 91 | rm -f ${PROG5} 92 | rm -f ${PROG6} 93 | -------------------------------------------------------------------------------- 
/scripts/Hi-Corrector1.2/src/export_norm_data.c: -------------------------------------------------------------------------------- 1 | /* 2 | * export_norm_data.c 3 | * 4 | * Created on: March 24, 2015 5 | * Author: Wenyuan Li 6 | */ 7 | 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include "matutils.h" 13 | 14 | void print_usage(FILE* stream) 15 | { 16 | fprintf(stream, "Usage:\n"); 17 | fprintf(stream, "\texport_norm_data <#rows/columns> \n"); 18 | // fprintf(stream, "\nOptions:\n\t: FULL_MATRIX_TEXT\n"); 19 | fprintf(stream, "\nVersion: 1.2\n"); 20 | fprintf(stream, "\nAuthor: Wenyuan Li\n"); 21 | fprintf(stream, "Date: March 24, 2015\n"); 22 | } 23 | 24 | int main(int argc, char *argv[]) { 25 | int ret, i; 26 | char input_mat_file[MAXCHAR], input_bias_vec_file[MAXCHAR]; 27 | float mem_size_of_task; 28 | int part_size; 29 | float x; 30 | float row_sum_=1.0; 31 | float row_sum_after_norm=1.0; 32 | int total_column; 33 | int has_header_line; 34 | int has_header_column; 35 | char output_file[MAXCHAR]; 36 | VEC_FLOAT bias_vec; 37 | VEC_INT part_sizes; 38 | int start_row, end_row; 39 | MATRIX_FLOAT mat; // submatrix loaded, its size is "4*part_size*total_column" bytes (float is 4 bytes in both 32bit and 64bit machine: see http://docs.oracle.com/cd/E18752_01/html/817-6223/chp-typeopexpr-2.html) 40 | 41 | if (argc!=9) { 42 | fprintf(stderr,"Error: export_norm_data arguments are insufficient.\n\n"); 43 | print_usage(stderr); 44 | exit(EXIT_FAILURE); 45 | } 46 | strcpy(input_mat_file, argv[1]); 47 | total_column = atoi(argv[2]); 48 | has_header_line = atoi(argv[3]); 49 | has_header_column = atoi(argv[4]); 50 | mem_size_of_task = (float)atof(argv[5]); 51 | strcpy(input_bias_vec_file, argv[6]); 52 | row_sum_after_norm = (float)atof(argv[7]); 53 | strcpy(output_file, argv[8]); 54 | 55 | printf("export_norm_data arguments:\n"); 56 | printf("\tinput raw matrix file: %s\n", input_mat_file); 57 | printf("\t\thas header line: %d\n", has_header_line); 58 | 
printf("\t\thas header column: %d\n", has_header_column); 59 | printf("\ttotal_rows: %d\n", total_column); 60 | printf("\tinput bias vector file: %s\n", input_bias_vec_file); 61 | printf("\tmemory size for this job: %g MB\n", mem_size_of_task); 62 | printf("\trow sum after normalization: %f\n", row_sum_after_norm); 63 | printf("\toutput file: %s\n", output_file); 64 | fflush( stdout ); 65 | 66 | ret = quick_check_partitions_by_mem_size(total_column, total_column, mem_size_of_task, &x); 67 | if (ret==FALSE) { 68 | fprintf(stderr, "Error: max_memory_size (%gMB) for this task cannot afford to load the matrix (size=%d rows X %d columns, each row as a partition needs %lf MB memory)\n", mem_size_of_task, total_column, total_column, x); 69 | exit(EXIT_FAILURE); 70 | } 71 | 72 | calc_partitions_by_mem_size(total_column, total_column, mem_size_of_task, &part_sizes); 73 | part_size = max_VEC_INT(part_sizes); 74 | printf("All data is divided to %d parts: ", part_sizes.n); 75 | put_VEC_INT(stdout, part_sizes, ", "); 76 | printf("\npartition sizes (%d partitions): ", part_sizes.n); 77 | put_VEC_INT(stdout, part_sizes, ", "); 78 | printf("\n"); 79 | 80 | init_VEC_FLOAT( &bias_vec ); 81 | ret = read_VEC_FLOAT( &bias_vec, input_bias_vec_file ); 82 | if (ret==0) { 83 | fprintf(stderr, "Error: Wrong reading %s\n", input_bias_vec_file); 84 | exit(EXIT_FAILURE); 85 | } 86 | init_MATRIX_FLOAT( &mat ); 87 | 88 | for (i=0, start_row=1; i 12 | #include 13 | #include 14 | #include 15 | #include "matutils.h" 16 | 17 | void print_usage(FILE* stream) 18 | { 19 | fprintf(stream, "Usage:\n"); 20 | fprintf(stream, "\tic <#rows/columns> \n"); 21 | fprintf(stream, "\n\t\thas header line: '1' indicates input matrix file has a header line; otherwise 0;"); 22 | fprintf(stream, "\n\t\thas header column: '1' indicates input matrix file has a header column; otherwise 0.\n"); 23 | fprintf(stream, "\nVersion: 1.1\n"); 24 | fprintf(stream, "Author: Wenyuan Li\n"); 25 | fprintf(stream, "Date: October 17, 
2014\n"); 26 | } 27 | 28 | int main(int argc, char *argv[]) { 29 | int ret; 30 | time_t t_start, t_end; 31 | double t_elapse; // seconds 32 | char input_mat_file[MAXCHAR]; 33 | float p = 1.0; 34 | int total_column; 35 | int maxiter; 36 | int has_header_line; 37 | int has_header_column; 38 | char output_file[MAXCHAR]; 39 | MATRIX_FLOAT mat; // data matrix loaded 40 | int iter; 41 | VEC_FLOAT b, t; 42 | 43 | if (argc!=7) { 44 | fprintf(stderr,"Error: Arguments are insufficient.\n\n"); 45 | print_usage(stderr); 46 | exit(EXIT_FAILURE); 47 | } 48 | strcpy(input_mat_file, argv[1]); 49 | total_column = atoi(argv[2]); 50 | maxiter = atoi(argv[3]); 51 | has_header_line = atoi(argv[4]); 52 | has_header_column = atoi(argv[5]); 53 | strcpy(output_file, argv[6]); 54 | 55 | printf("ic arguments:\n"); 56 | printf("\tinput file: %s\n", input_mat_file); 57 | printf("\t\thas header line: %d\n", has_header_line); 58 | printf("\t\thas header column: %d\n", has_header_column); 59 | printf("\ttotal_column: %d\n", total_column); 60 | printf("\tmaxiter: %d\n", maxiter); 61 | printf("\toutput file: %s\n", output_file); 62 | fflush( stdout ); 63 | 64 | t_start = time(NULL); 65 | create_ones_VEC_FLOAT( total_column, &b); 66 | create_VEC_FLOAT( total_column, &t); 67 | init_MATRIX_FLOAT( &mat ); 68 | ret = read_efficient_MATRIX_FLOAT_with_header(input_mat_file, 1, total_column, total_column, has_header_line, has_header_column, &mat); 69 | printf("Load data done\n"); 70 | fflush( stdout ); 71 | if (ret==EXIT_SUCCESS) { 72 | for (iter=0; iter> ic computation total time - %d:%d:%d\n", (int)floor(t_elapse/3600.0), (int)floor(fmod(t_elapse,3600.0)/60.0), (int)fmod(t_elapse,60.0)); 94 | fflush( stdout ); 95 | 96 | // write output 97 | t_start = time(NULL); 98 | printf("Write output to file: %s\n", output_file); 99 | write_VEC_FLOAT(b, output_file); 100 | t_end = time(NULL); 101 | t_elapse = difftime( t_end, t_start ); 102 | printf("Write time: %d:%d:%d\n", (int)floor(t_elapse/3600.0), 
(int)floor(fmod(t_elapse,3600.0)/60.0), (int)fmod(t_elapse,60.0)); 103 | 104 | // free space 105 | free_MATRIX_FLOAT( &mat ); 106 | free_VEC_FLOAT( &b ); 107 | free_VEC_FLOAT( &t ); 108 | printf("Done.\n"); 109 | return EXIT_SUCCESS; 110 | } 111 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/ic_mes.c: -------------------------------------------------------------------------------- 1 | /* 2 | ============================================================================ 3 | Name : ic_mes.c 4 | Created on: Jun 25, 2014 5 | Revised on: Oct 17, 2014 6 | 1. Add the function that can read input matrix file with header line or column 7 | 8 | Author : Wenyuan Li 9 | Version : 1.1 10 | 11 | ============================================================================ 12 | */ 13 | 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include "matutils.h" 19 | 20 | void print_usage(FILE* stream) 21 | { 22 | fprintf(stream, "Usage:\n"); 23 | fprintf(stream, "\tic_mes <#rows/columns> \n"); 24 | fprintf(stream, "\nVersion: 1.1\n"); 25 | fprintf(stream, "Author: Wenyuan Li\n"); 26 | fprintf(stream, "Date: October 17, 2014\n"); 27 | } 28 | 29 | int main(int argc, char *argv[]) { 30 | int ret, i; 31 | time_t t_start, t_end; 32 | double t_elapse; // seconds 33 | char input_mat_file[MAXCHAR]; 34 | float p = 1.0; 35 | float mem_size_of_task; 36 | int part_size; 37 | int total_column; 38 | int maxiter; 39 | int has_header_line; 40 | int has_header_column; 41 | char output_file[MAXCHAR]; 42 | VEC_INT part_sizes; 43 | int start_row, end_row, iter; 44 | MATRIX_FLOAT mat; // submatrix loaded, its size is "4*part_size*total_column" bytes (float is 4 bytes in both 32bit and 64bit machine: see http://docs.oracle.com/cd/E18752_01/html/817-6223/chp-typeopexpr-2.html) 45 | MATRIX_FLOAT d_save; // accumulated factors (2 vectors, one for used, the other for backup), its size is"4*2*total_column" bytes 46 | VEC_FLOAT d_temp; // 
norms of rows, its size is "4*part_size" bytes 47 | float *d_save_use_ptr, *d_save_new_ptr; 48 | float x; 49 | 50 | if (argc!=8) { 51 | fprintf(stderr,"Error: ic_mes arguments are insufficient.\n\n"); 52 | print_usage(stderr); 53 | exit(EXIT_FAILURE); 54 | } 55 | strcpy(input_mat_file, argv[1]); 56 | mem_size_of_task = (float)atof(argv[2]); 57 | total_column = atoi(argv[3]); 58 | maxiter = atoi(argv[4]); 59 | has_header_line = atoi(argv[5]); 60 | has_header_column = atoi(argv[6]); 61 | strcpy(output_file, argv[7]); 62 | 63 | printf("ic_mes arguments:\n"); 64 | printf("\tinput file: %s\n", input_mat_file); 65 | printf("\t\thas header line: %d\n", has_header_line); 66 | printf("\t\thas header column: %d\n", has_header_column); 67 | printf("\tmemory size for this job: %g MB\n", mem_size_of_task); 68 | printf("\ttotal_rows: %d\n", total_column); 69 | printf("\tmaxiter: %d\n", maxiter); 70 | printf("\toutput file: %s\n", output_file); 71 | fflush( stdout ); 72 | 73 | ret = quick_check_partitions_by_mem_size(total_column, total_column, mem_size_of_task, &x); 74 | if (ret==FALSE) { 75 | fprintf(stderr, "Error: max_memory_size (%gMB) for this task cannot afford to compute on the matrix (size=%d rows X %d columns, each row as a partition needs %lf MB memory)\n", mem_size_of_task, total_column, total_column, x); 76 | exit(EXIT_FAILURE); 77 | } 78 | 79 | t_start = time(NULL); 80 | calc_partitions_by_mem_size(total_column, total_column, mem_size_of_task, &part_sizes); 81 | part_size = max_VEC_INT(part_sizes); 82 | printf("All data is divided to %d parts: ", part_sizes.n); 83 | put_VEC_INT(stdout, part_sizes, ", "); 84 | printf("\npartition sizes (%d partitions): ", part_sizes.n); 85 | put_VEC_INT(stdout, part_sizes, ", "); 86 | printf("\n"); 87 | 88 | init_MATRIX_FLOAT( &mat ); 89 | create_ones_VEC_FLOAT( part_size, &d_temp ); 90 | create_ones_MATRIX_FLOAT( 2, total_column, &d_save); 91 | 92 | for (iter=0; iter> ic_mes computation total time: %d:%d:%d\n", 
(int)floor(t_elapse/3600.0), (int)floor(fmod(t_elapse,3600.0)/60.0), (int)fmod(t_elapse,60.0)); 129 | 130 | // write output 131 | t_start = time(NULL); 132 | printf("Write output to file: %s\n", output_file); 133 | write_VEC_FLOAT_2(d_save_new_ptr, total_column, "\n", output_file); 134 | t_end = time(NULL); 135 | t_elapse = difftime( t_end, t_start ); 136 | printf("Write time: %d:%d:%d\n", (int)floor(t_elapse/3600.0), (int)floor(fmod(t_elapse,3600.0)/60.0), (int)fmod(t_elapse,60.0)); 137 | 138 | // free space 139 | free_VEC_INT( &part_sizes ); 140 | free_MATRIX_FLOAT( &mat ); 141 | free_MATRIX_FLOAT( &d_save ); 142 | free_VEC_FLOAT( &d_temp ); 143 | printf("Done.\n"); 144 | return EXIT_SUCCESS; 145 | } 146 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/matutils.h: -------------------------------------------------------------------------------- 1 | /* 2 | * matutils.h 3 | * 4 | * Created on: Jun 19, 2014 5 | * Author: Wenyuan Li 6 | */ 7 | 8 | #ifndef MATUTILS_H_ 9 | #define MATUTILS_H_ 10 | 11 | /*--------------------------- 12 | Data Types 13 | ---------------------------*/ 14 | typedef int BOOL; 15 | #define TRUE 1 16 | #define FALSE 0 17 | 18 | /*--------------------------- 19 | Constants 20 | ---------------------------*/ 21 | #define MAXCHAR 1024 22 | #define MAX_LINE_LENGTH 1000000 23 | #define MAXCHAR_ID 64 24 | #define MEM_NOT_CONSIDERED 0 // in MB 25 | 26 | /*--------------------------- 27 | Data structure 28 | ---------------------------*/ 29 | typedef struct _vector_i { 30 | int n; 31 | int* v; 32 | } VEC_INT; 33 | 34 | typedef struct _vector_i2 { 35 | int n; 36 | int* v1; 37 | int* v2; 38 | } VEC_INT2; 39 | 40 | typedef struct _vector_f { 41 | int n; 42 | float* v; 43 | } VEC_FLOAT; 44 | 45 | typedef struct _vector_d { 46 | int n; 47 | double* v; 48 | } VEC_DOUBLE; 49 | 50 | typedef struct _matrix_f { 51 | int start_row; 52 | int end_row; 53 | int cnt_row; 54 | int total_column; 55 | int 
real_cnt_row; 56 | float** v; 57 | } MATRIX_FLOAT; 58 | 59 | typedef struct _matrix_d { 60 | int start_row; 61 | int end_row; 62 | int cnt_row; 63 | int total_column; 64 | double** v; 65 | } MATRIX_DOUBLE; 66 | 67 | 68 | 69 | /*--------------------------- 70 | Routines 71 | ---------------------------*/ 72 | int getline_gnu(char **lineptr, size_t *n, FILE *stream); 73 | 74 | void init_VEC_INT(VEC_INT* vec); 75 | void free_VEC_INT(VEC_INT* vec); 76 | void create_VEC_INT(int n, VEC_INT* vec); 77 | void put_VEC_INT(FILE* stream, VEC_INT x, char* delim); 78 | int max_VEC_INT(VEC_INT vec); 79 | 80 | void init_VEC_INT2(VEC_INT2* vec); 81 | void free_VEC_INT2(VEC_INT2* vec); 82 | void create_VEC_INT2(int n, VEC_INT2* vec); 83 | void put_VEC_INT2(FILE* stream, VEC_INT2 x, char* delim); 84 | 85 | void init_VEC_FLOAT(VEC_FLOAT* vec); 86 | void create_VEC_FLOAT(int n, VEC_FLOAT* vec); 87 | void create_ones_VEC_FLOAT(int n, VEC_FLOAT* vec); 88 | void addnumber_VEC_FLOAT( float key, VEC_FLOAT* vec ); 89 | void free_VEC_FLOAT(VEC_FLOAT* vec); 90 | int read_VEC_FLOAT(VEC_FLOAT* vec, char* file); 91 | int read_VEC_FLOAT_2(float* vec, int len, char* file); 92 | void assign_init_to_VEC_FLOAT(VEC_FLOAT* vec, float init_v); 93 | void write_VEC_FLOAT(VEC_FLOAT x, char* file); 94 | void write_VEC_FLOAT_2(float *x, int len, char* delim, char* file); 95 | void put_VEC_FLOAT(FILE* stream, VEC_FLOAT x, char* delim); 96 | void put_VEC_FLOAT_2(FILE* stream, float* x, int len, char* delim); 97 | void ones_VEC_FLOAT( VEC_FLOAT* vec ); 98 | void elementwise_div_vector_MATRIX_FLOAT(MATRIX_FLOAT* mat, VEC_FLOAT norms); 99 | void elementwise_div_value_MATRIX_FLOAT(MATRIX_FLOAT* mat, float v); 100 | void elementwise_div_vector_MATRIX_FLOAT_2(MATRIX_FLOAT* mat, float* norms, int norms_len); 101 | void elementwise_multi_vector_VEC_FLOAT(VEC_FLOAT* dst, int start_idx,int end_idx, 102 | VEC_FLOAT src); 103 | void elementwise_multi_vector_VEC_FLOAT_2(float* src1, int start_idx,int end_idx, 104 | 
VEC_FLOAT src2, float* dst); 105 | /* dst is assumed to be allocated before the call; indexes are 0-based; 106 | * start_idx and end_idx are both indexes into the dst vector */ 107 | int copy_VEC_FLOAT(VEC_FLOAT *dst, int start_idx, int end_idx, VEC_FLOAT src); 108 | 109 | void init_MATRIX_FLOAT(MATRIX_FLOAT* mat); 110 | void create_ones_MATRIX_FLOAT(int cnt_row, int total_column, MATRIX_FLOAT* mat); 111 | void add_morerows_MATRIX_FLOAT(int n_additional_row, int total_column, MATRIX_FLOAT* mat); 112 | void free_MATRIX_FLOAT( MATRIX_FLOAT* mat ); 113 | int read_MATRIX_FLOAT(char* file, int start_row, int end_row, int total_column, 114 | MATRIX_FLOAT* mat); 115 | int read_efficient_MATRIX_FLOAT(char* file, int start_row, int end_row, int total_column, 116 | MATRIX_FLOAT* mat); 117 | int read_efficient_MATRIX_FLOAT_with_header(char* file, int start_row, int end_row, int total_column, 118 | int has_header_row, int has_header_column, MATRIX_FLOAT* mat); 119 | int read_write_efficient_MATRIX_FLOAT(char* file, int start_row, int end_row, int total_column, 120 | MATRIX_FLOAT* mat, char* out_file); 121 | int read_write_part_of_MATRIX_FLOAT(char* file, int start_row, int end_row, int total_column, char* out_file); 122 | int read_write_part_of_MATRIX_FLOAT_alt(FILE* stream, int cnt_row, char* out_file); 123 | void write_MATRIX_FLOAT(char* file, MATRIX_FLOAT mat); 124 | void append_MATRIX_FLOAT_text(char* file, MATRIX_FLOAT mat); 125 | void put_MATRIX_FLOAT(FILE* stream, MATRIX_FLOAT mat); 126 | float norm1_float(float* v, int len); 127 | void norm_rows_MATRIX_FLOAT(MATRIX_FLOAT mat, float p, VEC_FLOAT* norms); 128 | 129 | void partitions( int n, int part_size, VEC_INT* part_sizes ); 130 | void partitions2( int n, int n_part, VEC_INT* part_sizes ); 131 | void calc_partitions_by_mem_size(int cnt_rows, int total_column, float max_mem_size, 132 | VEC_INT* part_sizes); 133 | int quick_check_partitions_by_mem_size(int cnt_rows, int total_column, float max_mem_size, float *x); 134 | #endif /* 
MATUTILS_H_ */ 135 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/ChangeLog: -------------------------------------------------------------------------------- 1 | 2006-12-09 Benjamin C. W. Sittler 2 | 3 | * my_getopt.c: add my_getopt_reset to reset the argument parser 4 | 5 | * README: updated email address, updated for version 1.5 6 | 7 | 2002-07-26 Benjamin C. W. Sittler 8 | 9 | * README: updated for version 1.4 10 | 11 | * my_getopt.c: now we include explicitly for those 12 | systems that narrowly (mis-)interpret ANSI C and POSIX 13 | (_my_getopt_internal): added an explicit cast to size_t to make 14 | g++ happy 15 | 16 | * getopt.h, my_getopt.h: added extern "C" { ... } for C++ 17 | compilation (thanks to Jeff Lawson and others) 18 | 19 | 2001-08-20 Benjamin C. W. Sittler 20 | 21 | * getopt.h (getopt_long_only): fixed typo (thanks to Justin Lee 22 | ) 23 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/LICENSE: -------------------------------------------------------------------------------- 1 | my_getopt - a command-line argument parser 2 | Copyright 1997-2001, Benjamin Sittler 3 | 4 | Permission is hereby granted, free of charge, to any person 5 | obtaining a copy of this software and associated documentation 6 | files (the "Software"), to deal in the Software without 7 | restriction, including without limitation the rights to use, copy, 8 | modify, merge, publish, distribute, sublicense, and/or sell copies 9 | of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be 13 | included in all copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 16 | EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 17 | MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 18 | NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 19 | HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 20 | WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 21 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 22 | DEALINGS IN THE SOFTWARE. 23 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/Makefile: -------------------------------------------------------------------------------- 1 | all: copy 2 | 3 | # Compiler options 4 | #CCOPTS = -g -O3 -Wall -Werror 5 | CCOPTS = 6 | 7 | # Compiler 8 | CC = gcc -Wall -Werror 9 | #CC = cc 10 | 11 | # Linker 12 | LD = $(CC) 13 | 14 | # Utility to remove a file 15 | RM = rm 16 | 17 | OBJS = main.o my_getopt.o 18 | 19 | copy: $(OBJS) 20 | $(LD) -o $@ $(OBJS) 21 | 22 | clean: 23 | $(RM) -f copy $(OBJS) *~ 24 | 25 | %.o: %.c getopt.h my_getopt.h 26 | $(CC) $(CCOPTS) -o $@ -c $< 27 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/README: -------------------------------------------------------------------------------- 1 | my_getopt - a command-line argument parser 2 | Copyright 1997-2006, Benjamin Sittler 3 | 4 | The author can be reached by sending email to . 5 | 6 | The version of my_getopt in this package (1.5) has a BSD-like license; 7 | see the file LICENSE for details. Version 1.0 of my_getopt was similar 8 | to the GPL'ed version of my_getopt included with SMOKE-16 Version 1, 9 | Release 19990717. 
SMOKE-16 packages are available from: 10 | 11 | http://geocities.com/bsittler/#smoke16 12 | 13 | OVERVIEW OF THE ARGUMENT PARSER 14 | =============================== 15 | 16 | The getopt(), getopt_long() and getopt_long_only() functions parse 17 | command line arguments. The argc and argv parameters passed to these 18 | functions correspond to the argument count and argument list passed to 19 | your program's main() function at program start-up. Element 0 of the 20 | argument list conventionally contains the name of your program. Any 21 | remaining arguments starting with "-" (except for "-" or "--" by 22 | themselves) are option arguments, some of which include option values. This 23 | family of getopt() functions allows intermixed option and non-option 24 | arguments anywhere in the argument list, except that "--" by itself 25 | causes the remaining elements of the argument list to be treated as 26 | non-option arguments. 27 | 28 | [ See the parts of this document labeled "DOCUMENTATION" and 29 | "WHY RE-INVENT THE WHEEL?" for more information. ] 30 | 31 | FILES 32 | ===== 33 | 34 | The following four files constitute the my_getopt package: 35 | 36 | LICENSE - license and warranty information for my_getopt 37 | my_getopt.c - implementation of my getopt replacement 38 | my_getopt.h - interface for my getopt replacement 39 | getopt.h - a header file to make my getopt look like GNU getopt 40 | 41 | USAGE 42 | ===== 43 | 44 | To use my_getopt in your application, add the following line to 45 | your main program source: 46 | 47 | #include "getopt.h" 48 | 49 | This line should appear after your standard system header files to 50 | avoid conflicting with your system's built-in getopt. 51 | 52 | Then compile my_getopt.c into my_getopt.o, and link my_getopt.o into 53 | your application: 54 | 55 | $ cc -c my_getopt.c 56 | $ ld -o app app.o ... 
my_getopt.o 57 | 58 | To avoid conflicting with standard library functions, the function 59 | names and global variables used by my_getopt all begin with `my_'. To 60 | ensure compatibility with existing C programs, the `getopt.h' header 61 | file uses the C preprocessor to redefine names like getopt, optarg, 62 | optind, and so forth to my_getopt, my_optarg, my_optind, etc. 63 | 64 | SAMPLE PROGRAM 65 | ============== 66 | 67 | There is also a public-domain sample program: 68 | 69 | main.c - main() for a sample program using my_getopt 70 | Makefile - build script for the sample program (called `copy') 71 | 72 | To build and test the sample program: 73 | 74 | $ make 75 | $ ./copy -help 76 | $ ./copy -version 77 | 78 | The sample program bears a slight resemblance to the UNIX `cat' 79 | utility, but can be used to rot13-encode streams, and can redirect output 80 | to a file. 81 | 82 | DOCUMENTATION 83 | ============= 84 | 85 | There is not yet any real documentation for my_getopt. For the moment, 86 | use the Linux manual page for getopt. It has its own copyright and 87 | license; view the file `getopt.3' in a text editor for more details. 88 | 89 | getopt.3 - the manual page for GNU getopt 90 | getopt.txt - preformatted copy of the manual page for GNU getopt, 91 | for your convenience 92 | 93 | WHY RE-INVENT THE WHEEL? 94 | ======================== 95 | 96 | I re-implemented getopt, getopt_long, and getopt_long_only because 97 | there were noticeable bugs in several versions of the GNU 98 | implementations, and because the GNU versions aren't always available 99 | on some systems (*BSD, for example.) Other systems don't include any 100 | sort of standard argument parser (Win32 with Microsoft tools, for 101 | example, has no getopt.) 
102 | 103 | These should do all the expected Unix- and GNU-style argument 104 | parsing, including permutation, bunching, long options with single or 105 | double dashes (double dashes are required if you use 106 | my_getopt_long,) and optional arguments for both long and short 107 | options. A word consisting of double dashes all by itself halts argument 108 | parsing. A required long option argument can be in the same word as 109 | the option name, separated by '=', or in the next word. An optional 110 | long option argument must be in the same word as the option name, 111 | separated by '='. 112 | 113 | As with the GNU versions, a '+' prefix to the short option 114 | specification (or the POSIXLY_CORRECT environment variable) disables 115 | permutation, a '-' prefix to the short option specification returns 1 116 | for non-options, ':' after a short option indicates a required 117 | argument, and '::' after a short option specification indicates an 118 | optional argument (which must appear in the same word.) If you'd like 119 | to receive ':' instead of '?' for missing option arguments, prefix the 120 | short option specification with ':'. 121 | 122 | The original intent was to re-implement the documented behavior of 123 | the GNU versions, but I have found it necessary to emulate some of 124 | the undocumented behavior as well. Some programs depend on it. 125 | 126 | KNOWN BUGS 127 | ========== 128 | 129 | The GNU versions support POSIX-style -W "name=value" long 130 | options. Currently, my_getopt does not support these, because I 131 | don't have any documentation on them (other than the fact that they 132 | are enabled by "W;" in the short option specification.) As a 133 | temporary workaround, my_getopt treats "W;" in the short option 134 | string identically to "W:". 135 | 136 | The GNU versions support internationalized/localized 137 | messages. Currently, my_getopt does not. 
138 | 139 | There should be re-entrant versions of all these functions so that 140 | multiple threads can parse arguments simultaneously. 141 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/getopt.3: -------------------------------------------------------------------------------- 1 | .\" (c) 1993 by Thomas Koenig (ig25@rz.uni-karlsruhe.de) 2 | .\" 3 | .\" Permission is granted to make and distribute verbatim copies of this 4 | .\" manual provided the copyright notice and this permission notice are 5 | .\" preserved on all copies. 6 | .\" 7 | .\" Permission is granted to copy and distribute modified versions of this 8 | .\" manual under the conditions for verbatim copying, provided that the 9 | .\" entire resulting derived work is distributed under the terms of a 10 | .\" permission notice identical to this one 11 | .\" 12 | .\" Since the Linux kernel and libraries are constantly changing, this 13 | .\" manual page may be incorrect or out-of-date. The author(s) assume no 14 | .\" responsibility for errors or omissions, or for damages resulting from 15 | .\" the use of the information contained herein. The author(s) may not 16 | .\" have taken the same level of care in the production of this manual, 17 | .\" which is licensed free of charge, as they might when working 18 | .\" professionally. 19 | .\" 20 | .\" Formatted or processed versions of this manual, if unaccompanied by 21 | .\" the source, must acknowledge the copyright and authors of this work. 22 | .\" License. 23 | .\" Modified Sat Jul 24 19:27:50 1993 by Rik Faith (faith@cs.unc.edu) 24 | .\" Modified Mon Aug 30 22:02:34 1995 by Jim Van Zandt 25 | .\" longindex is a pointer, has_arg can take 3 values, using consistent 26 | .\" names for optstring and longindex, "\n" in formats fixed. Documenting 27 | .\" opterr and getopt_long_only. Clarified explanations (borrowing heavily 28 | .\" from the source code). 
29 | .TH GETOPT 3 "Aug 30, 1995" "GNU" "Linux Programmer's Manual" 30 | .SH NAME 31 | getopt \- Parse command line options 32 | .SH SYNOPSIS 33 | .nf 34 | .B #include <unistd.h> 35 | .sp 36 | .BI "int getopt(int " argc ", char * const " argv[] "," 37 | .BI " const char *" optstring ");" 38 | .sp 39 | .BI "extern char *" optarg ; 40 | .BI "extern int " optind ", " opterr ", " optopt ; 41 | .sp 42 | .B #include <getopt.h> 43 | .sp 44 | .BI "int getopt_long(int " argc ", char * const " argv[] ", 45 | .BI " const char *" optstring , 46 | .BI " const struct option *" longopts ", int *" longindex ");" 47 | .sp 48 | .BI "int getopt_long_only(int " argc ", char * const " argv[] ", 49 | .BI " const char *" optstring , 50 | .BI " const struct option *" longopts ", int *" longindex ");" 51 | .fi 52 | .SH DESCRIPTION 53 | The 54 | .B getopt() 55 | function parses the command line arguments. Its arguments 56 | .I argc 57 | and 58 | .I argv 59 | are the argument count and array as passed to the 60 | .B main() 61 | function on program invocation. 62 | An element of \fIargv\fP that starts with `-' (and is not exactly "-" or "--") 63 | is an option element. The characters of this element 64 | (aside from the initial `-') are option characters. If \fBgetopt()\fP 65 | is called repeatedly, it returns successively each of the option characters 66 | from each of the option elements. 67 | .PP 68 | If \fBgetopt()\fP finds another option character, it returns that 69 | character, updating the external variable \fIoptind\fP and a static 70 | variable \fInextchar\fP so that the next call to \fBgetopt()\fP can 71 | resume the scan with the following option character or 72 | \fIargv\fP-element. 73 | .PP 74 | If there are no more option characters, \fBgetopt()\fP returns 75 | \fBEOF\fP. Then \fIoptind\fP is the index in \fIargv\fP of the first 76 | \fIargv\fP-element that is not an option. 77 | .PP 78 | .I optstring 79 | is a string containing the legitimate option characters.
If such a 80 | character is followed by a colon, the option requires an argument, so 81 | \fBgetopt()\fP places a pointer to the following text in the same 82 | \fIargv\fP-element, or the text of the following \fIargv\fP-element, in 83 | .IR optarg . 84 | Two colons mean an option takes 85 | an optional arg; if there is text in the current \fIargv\fP-element, 86 | it is returned in \fIoptarg\fP, otherwise \fIoptarg\fP is set to zero. 87 | .PP 88 | By default, \fBgetopt()\fP permutes the contents of \fIargv\fP as it 89 | scans, so that eventually all the non-options are at the end. Two 90 | other modes are also implemented. If the first character of 91 | \fIoptstring\fP is `+' or the environment variable POSIXLY_CORRECT is 92 | set, then option processing stops as soon as a non-option argument is 93 | encountered. If the first character of \fIoptstring\fP is `-', then 94 | each non-option \fIargv\fP-element is handled as if it were the argument of 95 | an option with character code 1. (This is used by programs that were 96 | written to expect options and other \fIargv\fP-elements in any order 97 | and that care about the ordering of the two.) 98 | The special argument `--' forces an end of option-scanning regardless 99 | of the scanning mode. 100 | .PP 101 | If \fBgetopt()\fP does not recognize an option character, it prints an 102 | error message to stderr, stores the character in \fIoptopt\fP, and 103 | returns `?'. The calling program may prevent the error message by 104 | setting \fIopterr\fP to 0. 105 | .PP 106 | The 107 | .B getopt_long() 108 | function works like 109 | .B getopt() 110 | except that it also accepts long options, introduced by two dashes. 111 | Long option names may be abbreviated if the abbreviation is 112 | unique or is an exact match for some defined option. A long option 113 | may take a parameter, of the form 114 | .B --arg=param 115 | or 116 | .BR "--arg param" .
117 | .PP 118 | .I longopts 119 | is a pointer to the first element of an array of 120 | .B struct option 121 | declared in 122 | .B <getopt.h> 123 | as 124 | .nf 125 | .sp 126 | .in 10 127 | struct option { 128 | .in 14 129 | const char *name; 130 | int has_arg; 131 | int *flag; 132 | int val; 133 | .in 10 134 | }; 135 | .fi 136 | .PP 137 | The meanings of the different fields are: 138 | .TP 139 | .I name 140 | is the name of the long option. 141 | .TP 142 | .I has_arg 143 | is: 144 | \fBno_argument\fP (or 0) if the option does not take an argument, 145 | \fBrequired_argument\fP (or 1) if the option requires an argument, or 146 | \fBoptional_argument\fP (or 2) if the option takes an optional argument. 147 | .TP 148 | .I flag 149 | specifies how results are returned for a long option. If \fIflag\fP 150 | is \fBNULL\fP, then \fBgetopt_long()\fP returns \fIval\fP. (For 151 | example, the calling program may set \fIval\fP to the equivalent short 152 | option character.) Otherwise, \fBgetopt_long()\fP returns 0, and 153 | \fIflag\fP points to a variable which is set to \fIval\fP if the 154 | option is found, but left unchanged if the option is not found. 155 | .TP 156 | \fIval\fP 157 | is the value to return, or to load into the variable pointed 158 | to by \fIflag\fP. 159 | .PP 160 | The last element of the array has to be filled with zeroes. 161 | .PP 162 | If \fIlongindex\fP is not \fBNULL\fP, it 163 | points to a variable which is set to the index of the long option relative to 164 | .IR longopts . 165 | .PP 166 | \fBgetopt_long_only()\fP is like \fBgetopt_long()\fP, but `-' as well 167 | as `--' can indicate a long option. If an option that starts with `-' 168 | (not `--') doesn't match a long option, but does match a short option, 169 | it is parsed as a short option instead.
170 | .SH "RETURN VALUE" 171 | The 172 | .B getopt() 173 | function returns the option character if the option was found 174 | successfully, `:' if there was a missing parameter for one of the 175 | options, `?' for an unknown option character, or \fBEOF\fP 176 | for the end of the option list. 177 | .PP 178 | \fBgetopt_long()\fP and \fBgetopt_long_only()\fP also return the option 179 | character when a short option is recognized. For a long option, they 180 | return \fIval\fP if \fIflag\fP is \fBNULL\fP, and 0 otherwise. Error 181 | and EOF returns are the same as for \fBgetopt()\fP, plus `?' for an 182 | ambiguous match or an extraneous parameter. 183 | .SH "ENVIRONMENT VARIABLES" 184 | .TP 185 | .SM 186 | .B POSIXLY_CORRECT 187 | If this is set, then option processing stops as soon as a non-option 188 | argument is encountered. 189 | .SH "EXAMPLE" 190 | The following example program, from the source code, illustrates the 191 | use of 192 | .BR getopt_long() 193 | with most of its features. 194 | .nf 195 | .sp 196 | #include 197 | 198 | int 199 | main (argc, argv) 200 | int argc; 201 | char **argv; 202 | { 203 | int c; 204 | int digit_optind = 0; 205 | 206 | while (1) 207 | { 208 | int this_option_optind = optind ? 
optind : 1; 209 | int option_index = 0; 210 | static struct option long_options[] = 211 | { 212 | {"add", 1, 0, 0}, 213 | {"append", 0, 0, 0}, 214 | {"delete", 1, 0, 0}, 215 | {"verbose", 0, 0, 0}, 216 | {"create", 1, 0, 'c'}, 217 | {"file", 1, 0, 0}, 218 | {0, 0, 0, 0} 219 | }; 220 | 221 | c = getopt_long (argc, argv, "abc:d:012", 222 | long_options, &option_index); 223 | if (c == -1) 224 | break; 225 | 226 | switch (c) 227 | { 228 | case 0: 229 | printf ("option %s", long_options[option_index].name); 230 | if (optarg) 231 | printf (" with arg %s", optarg); 232 | printf ("\\n"); 233 | break; 234 | 235 | case '0': 236 | case '1': 237 | case '2': 238 | if (digit_optind != 0 && digit_optind != this_option_optind) 239 | printf ("digits occur in two different argv-elements.\\n"); 240 | digit_optind = this_option_optind; 241 | printf ("option %c\\n", c); 242 | break; 243 | 244 | case 'a': 245 | printf ("option a\\n"); 246 | break; 247 | 248 | case 'b': 249 | printf ("option b\\n"); 250 | break; 251 | 252 | case 'c': 253 | printf ("option c with value `%s'\\n", optarg); 254 | break; 255 | 256 | case 'd': 257 | printf ("option d with value `%s'\\n", optarg); 258 | break; 259 | 260 | case '?': 261 | break; 262 | 263 | default: 264 | printf ("?? getopt returned character code 0%o ??\\n", c); 265 | } 266 | } 267 | 268 | if (optind < argc) 269 | { 270 | printf ("non-option ARGV-elements: "); 271 | while (optind < argc) 272 | printf ("%s ", argv[optind++]); 273 | printf ("\\n"); 274 | } 275 | 276 | exit (0); 277 | } 278 | .fi 279 | .SH "BUGS" 280 | This manpage is confusing. 281 | .SH "CONFORMING TO" 282 | .TP 283 | \fBgetopt()\fP: 284 | POSIX.1, provided the environment variable POSIXLY_CORRECT is set. 285 | Otherwise, the elements of \fIargv\fP aren't really const, because we 286 | permute them. We pretend they're const in the prototype to be 287 | compatible with other systems. 
288 | 289 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/getopt.h: -------------------------------------------------------------------------------- 1 | /* 2 | * getopt.h - cpp wrapper for my_getopt to make it look like getopt. 3 | * Copyright 1997, 2000, 2001, 2002, Benjamin Sittler 4 | * 5 | * Permission is hereby granted, free of charge, to any person 6 | * obtaining a copy of this software and associated documentation 7 | * files (the "Software"), to deal in the Software without 8 | * restriction, including without limitation the rights to use, copy, 9 | * modify, merge, publish, distribute, sublicense, and/or sell copies 10 | * of the Software, and to permit persons to whom the Software is 11 | * furnished to do so, subject to the following conditions: 12 | * 13 | * The above copyright notice and this permission notice shall be 14 | * included in all copies or substantial portions of the Software. 15 | * 16 | * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 19 | * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 20 | * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 21 | * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 22 | * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 23 | * DEALINGS IN THE SOFTWARE. 
24 | */ 25 | 26 | #ifndef MY_WRAPPER_GETOPT_H_INCLUDED 27 | #define MY_WRAPPER_GETOPT_H_INCLUDED 28 | 29 | #ifdef __cplusplus 30 | extern "C" { 31 | #endif 32 | 33 | #include "my_getopt.h" 34 | 35 | #undef getopt 36 | #define getopt my_getopt 37 | #undef getopt_long 38 | #define getopt_long my_getopt_long 39 | #undef getopt_long_only 40 | #define getopt_long_only my_getopt_long_only 41 | #undef _getopt_internal 42 | #define _getopt_internal _my_getopt_internal 43 | #undef opterr 44 | #define opterr my_opterr 45 | #undef optind 46 | #define optind my_optind 47 | #undef optopt 48 | #define optopt my_optopt 49 | #undef optarg 50 | #define optarg my_optarg 51 | 52 | #ifdef __cplusplus 53 | } 54 | #endif 55 | 56 | #endif /* MY_WRAPPER_GETOPT_H_INCLUDED */ 57 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/getopt.txt: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | GETOPT(3) Linux Programmer's Manual GETOPT(3) 5 | 6 | 7 | NAME 8 | getopt - Parse command line options 9 | 10 | SYNOPSIS 11 | #include <unistd.h> 12 | 13 | int getopt(int argc, char * const argv[], 14 | const char *optstring); 15 | 16 | extern char *optarg; 17 | extern int optind, opterr, optopt; 18 | 19 | #include <getopt.h> 20 | 21 | int getopt_long(int argc, char * const argv[], 22 | const char *optstring, 23 | const struct option *longopts, int *longindex); 24 | 25 | int getopt_long_only(int argc, char * const argv[], 26 | const char *optstring, 27 | const struct option *longopts, int *longindex); 28 | 29 | DESCRIPTION 30 | The getopt() function parses the command line arguments. 31 | Its arguments argc and argv are the argument count and 32 | array as passed to the main() function on program invoca- 33 | tion. An element of argv that starts with `-' (and is not 34 | exactly "-" or "--") is an option element. The characters 35 | of this element (aside from the initial `-') are option 36 | characters.
If getopt() is called repeatedly, it returns 37 | successively each of the option characters from each of 38 | the option elements. 39 | 40 | If getopt() finds another option character, it returns 41 | that character, updating the external variable optind and 42 | a static variable nextchar so that the next call to 43 | getopt() can resume the scan with the following option 44 | character or argv-element. 45 | 46 | If there are no more option characters, getopt() returns 47 | EOF. Then optind is the index in argv of the first argv- 48 | element that is not an option. 49 | 50 | optstring is a string containing the legitimate option 51 | characters. If such a character is followed by a colon, 52 | the option requires an argument, so getopt() places a 53 | pointer to the following text in the same argv-element, or 54 | the text of the following argv-element, in optarg. Two 55 | colons mean an option takes an optional arg; if there is 56 | text in the current argv-element, it is returned in 57 | optarg, otherwise optarg is set to zero. 58 | 59 | By default, getopt() permutes the contents of argv as it 60 | scans, so that eventually all the non-options are at the 61 | 62 | 63 | 64 | GNU Aug 30, 1995 1 65 | 66 | 67 | 68 | 69 | 70 | GETOPT(3) Linux Programmer's Manual GETOPT(3) 71 | 72 | 73 | end. Two other modes are also implemented. If the first 74 | character of optstring is `+' or the environment variable 75 | POSIXLY_CORRECT is set, then option processing stops as 76 | soon as a non-option argument is encountered. If the 77 | first character of optstring is `-', then each non-option 78 | argv-element is handled as if it were the argument of an 79 | option with character code 1. (This is used by programs 80 | that were written to expect options and other argv-ele- 81 | ments in any order and that care about the ordering of the 82 | two.) The special argument `--' forces an end of option- 83 | scanning regardless of the scanning mode.
84 | 85 | If getopt() does not recognize an option character, it 86 | prints an error message to stderr, stores the character in 87 | optopt, and returns `?'. The calling program may prevent 88 | the error message by setting opterr to 0. 89 | 90 | The getopt_long() function works like getopt() except that 91 | it also accepts long options, started out by two dashes. 92 | Long option names may be abbreviated if the abbreviation 93 | is unique or is an exact match for some defined option. A 94 | long option may take a parameter, of the form --arg=param 95 | or --arg param. 96 | 97 | longopts is a pointer to the first element of an array of 98 | struct option declared in <getopt.h> as 99 | 100 | struct option { 101 | const char *name; 102 | int has_arg; 103 | int *flag; 104 | int val; 105 | }; 106 | 107 | The meanings of the different fields are: 108 | 109 | name is the name of the long option. 110 | 111 | has_arg 112 | is: no_argument (or 0) if the option does not take 113 | an argument, required_argument (or 1) if the option 114 | requires an argument, or optional_argument (or 2) 115 | if the option takes an optional argument. 116 | 117 | flag specifies how results are returned for a long 118 | option. If flag is NULL, then getopt_long() 119 | returns val. (For example, the calling program may 120 | set val to the equivalent short option character.) 121 | Otherwise, getopt_long() returns 0, and flag points 122 | to a variable which is set to val if the option is 123 | found, but left unchanged if the option is not 124 | found. 125 | 126 | val is the value to return, or to load into the 127 | 128 | 129 | 130 | GNU Aug 30, 1995 2 131 | 132 | 133 | 134 | 135 | 136 | GETOPT(3) Linux Programmer's Manual GETOPT(3) 137 | 138 | 139 | variable pointed to by flag. 140 | 141 | The last element of the array has to be filled with 142 | zeroes. 143 | 144 | If longindex is not NULL, it points to a variable which is 145 | set to the index of the long option relative to longopts.
146 | 147 | getopt_long_only() is like getopt_long(), but `-' as well 148 | as `--' can indicate a long option. If an option that 149 | starts with `-' (not `--') doesn't match a long option, 150 | but does match a short option, it is parsed as a short 151 | option instead. 152 | 153 | RETURN VALUE 154 | The getopt() function returns the option character if the 155 | option was found successfully, `:' if there was a missing 156 | parameter for one of the options, `?' for an unknown 157 | option character, or EOF for the end of the option list. 158 | 159 | getopt_long() and getopt_long_only() also return the 160 | option character when a short option is recognized. For a 161 | long option, they return val if flag is NULL, and 0 other- 162 | wise. Error and EOF returns are the same as for getopt(), 163 | plus `?' for an ambiguous match or an extraneous parame- 164 | ter. 165 | 166 | ENVIRONMENT VARIABLES 167 | POSIXLY_CORRECT 168 | If this is set, then option processing stops as 169 | soon as a non-option argument is encountered. 170 | 171 | EXAMPLE 172 | The following example program, from the source code, 173 | illustrates the use of getopt_long() with most of its fea- 174 | tures. 175 | 176 | #include 177 | 178 | int 179 | main (argc, argv) 180 | int argc; 181 | char **argv; 182 | { 183 | int c; 184 | int digit_optind = 0; 185 | 186 | while (1) 187 | { 188 | int this_option_optind = optind ? 
optind : 1; 189 | int option_index = 0; 190 | static struct option long_options[] = 191 | { 192 | {"add", 1, 0, 0}, 193 | 194 | 195 | 196 | GNU Aug 30, 1995 3 197 | 198 | 199 | 200 | 201 | 202 | GETOPT(3) Linux Programmer's Manual GETOPT(3) 203 | 204 | 205 | {"append", 0, 0, 0}, 206 | {"delete", 1, 0, 0}, 207 | {"verbose", 0, 0, 0}, 208 | {"create", 1, 0, 'c'}, 209 | {"file", 1, 0, 0}, 210 | {0, 0, 0, 0} 211 | }; 212 | 213 | c = getopt_long (argc, argv, "abc:d:012", 214 | long_options, &option_index); 215 | if (c == -1) 216 | break; 217 | 218 | switch (c) 219 | { 220 | case 0: 221 | printf ("option %s", long_options[option_index].name); 222 | if (optarg) 223 | printf (" with arg %s", optarg); 224 | printf ("\n"); 225 | break; 226 | 227 | case '0': 228 | case '1': 229 | case '2': 230 | if (digit_optind != 0 && digit_optind != this_option_optind) 231 | printf ("digits occur in two different argv-elements.\n"); 232 | digit_optind = this_option_optind; 233 | printf ("option %c\n", c); 234 | break; 235 | 236 | case 'a': 237 | printf ("option a\n"); 238 | break; 239 | 240 | case 'b': 241 | printf ("option b\n"); 242 | break; 243 | 244 | case 'c': 245 | printf ("option c with value `%s'\n", optarg); 246 | break; 247 | 248 | case 'd': 249 | printf ("option d with value `%s'\n", optarg); 250 | break; 251 | 252 | case '?': 253 | break; 254 | 255 | default: 256 | printf ("?? getopt returned character code 0%o ??\n", c); 257 | } 258 | } 259 | 260 | 261 | 262 | GNU Aug 30, 1995 4 263 | 264 | 265 | 266 | 267 | 268 | GETOPT(3) Linux Programmer's Manual GETOPT(3) 269 | 270 | 271 | if (optind < argc) 272 | { 273 | printf ("non-option ARGV-elements: "); 274 | while (optind < argc) 275 | printf ("%s ", argv[optind++]); 276 | printf ("\n"); 277 | } 278 | 279 | exit (0); 280 | } 281 | 282 | BUGS 283 | This manpage is confusing. 284 | 285 | CONFORMING TO 286 | getopt(): 287 | POSIX.1, provided the environment variable 288 | POSIXLY_CORRECT is set. 
Otherwise, the elements of 289 | argv aren't really const, because we permute them. 290 | We pretend they're const in the prototype to be 291 | compatible with other systems. 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 | 325 | 326 | 327 | 328 | GNU Aug 30, 1995 5 329 | 330 | 331 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/main.c: -------------------------------------------------------------------------------- 1 | /* 2 | * copy - test program for my getopt() re-implementation 3 | * 4 | * This program is in the public domain. 5 | */ 6 | 7 | #define VERSION \ 8 | "0.3" 9 | 10 | #define COPYRIGHT \ 11 | "This program is in the public domain." 12 | 13 | /* for isprint(), printf(), fopen(), perror(), getenv(), strcmp(), etc. */ 14 | #include <ctype.h> 15 | #include <stdio.h> 16 | #include <stdlib.h> 17 | #include <string.h> 18 | 19 | /* for my getopt() re-implementation */ 20 | #include "getopt.h" 21 | 22 | /* the default verbosity level is 0 (no verbose reporting) */ 23 | static unsigned verbose = 0; 24 | 25 | /* print version and copyright information */ 26 | static void 27 | version(char *progname) 28 | { 29 | printf("%s version %s\n" 30 | "%s\n", 31 | progname, 32 | VERSION, 33 | COPYRIGHT); 34 | } 35 | 36 | /* print a help summary */ 37 | static void 38 | help(char *progname) 39 | { 40 | printf("Usage: %s [options] [FILE]...\n" 41 | "Options:\n" 42 | "-h or -help show this message and exit\n" 43 | "-append append to the output file\n" 44 | "-o FILE or\n" 45 | "-output FILE send output to FILE (default is stdout)\n" 46 | "-r or --rotate rotate letters 13 positions (rot13)\n" 47 | "-rNUM or\n" 48 | "--rotate=NUM rotate letters NUM positions\n" 49 | "-truncate truncate the output file " 50 | "(this is the default)\n" 51 | "-v or -verbose increase the level of verbosity by 1 " 52 |
"(the default is 0)\n" 53 | "-vNUM or\n" 54 | "-verbose=NUM set the level of verbosity to NUM\n" 55 | "-V or -version print program version and exit\n" 56 | "\n" 57 | "This program reads the specified FILEs " 58 | "(or stdin if none are given)\n" 59 | "and writes their bytes to the specified output FILE " 60 | "(or stdout if none is\n" 61 | "given.) It can optionally rotate letters.\n", 62 | progname); 63 | } 64 | 65 | /* print usage information to stderr */ 66 | static void 67 | usage(char *progname) 68 | { 69 | fprintf(stderr, 70 | "Summary: %s [-help] [-version] [options] [FILE]...\n", 71 | progname); 72 | } 73 | 74 | /* input file handler -- returns nonzero or exit()s on failure */ 75 | static int 76 | handle(char *progname, 77 | FILE *infile, char *infilename, 78 | FILE *outfile, char *outfilename, 79 | int rotate) 80 | { 81 | int c; 82 | unsigned long bytes_copied = 0; 83 | 84 | if (verbose > 2) 85 | { 86 | fprintf(stderr, 87 | "%s: copying from `%s' to `%s'\n", 88 | progname, 89 | infilename, 90 | outfilename); 91 | } 92 | while ((c = getc(infile)) != EOF) 93 | { 94 | if (rotate && isalpha(c)) 95 | { 96 | const char *letters = "abcdefghijklmnopqrstuvwxyz"; 97 | char *match; 98 | if ((match = strchr(letters, tolower(c)))) 99 | { 100 | char rc = letters[(match - letters + rotate) % 26]; 101 | if (isupper(c)) 102 | rc = toupper(rc); 103 | c = rc; 104 | } 105 | } 106 | if (putc(c, outfile) == EOF) 107 | { 108 | perror(outfilename); 109 | exit(1); 110 | } 111 | bytes_copied ++; 112 | } 113 | if (! 
feof(infile)) 114 | { 115 | perror(infilename); 116 | return 1; 117 | } 118 | if (verbose > 2) 119 | { 120 | fprintf(stderr, 121 | "%s: %lu bytes copied from `%s' to `%s'\n", 122 | progname, 123 | bytes_copied, 124 | infilename, 125 | outfilename); 126 | } 127 | return 0; 128 | } 129 | 130 | /* argument parser and dispatcher */ 131 | int 132 | main(int argc, char * argv[]) 133 | { 134 | /* the program name */ 135 | char *progname = argv[0]; 136 | /* during argument parsing, opt contains the return value from getopt() */ 137 | int opt; 138 | /* the output filename is initially 0 (a.k.a. stdout) */ 139 | char *outfilename = 0; 140 | /* the default return value is initially 0 (success) */ 141 | int retval = 0; 142 | /* initially we truncate */ 143 | int append = 0; 144 | /* initially we don't rotate letters */ 145 | int rotate = 0; 146 | 147 | /* short options string */ 148 | char *shortopts = "Vho:r::v::"; 149 | /* long options list */ 150 | struct option longopts[] = 151 | { 152 | /* name, has_arg, flag, val */ /* longind */ 153 | { "append", no_argument, 0, 0 }, /* 0 */ 154 | { "truncate", no_argument, 0, 0 }, /* 1 */ 155 | { "version", no_argument, 0, 'V' }, /* 3 */ 156 | { "help", no_argument, 0, 'h' }, /* 4 */ 157 | { "output", required_argument, 0, 'o' }, /* 5 */ 158 | { "rotate", optional_argument, 0, 'r' }, /* 6 */ 159 | { "verbose", optional_argument, 0, 'v' }, /* 7 */ 160 | /* end-of-list marker */ 161 | { 0, 0, 0, 0 } 162 | }; 163 | /* long option list index */ 164 | int longind = 0; 165 | 166 | /* 167 | * print a warning when the POSIXLY_CORRECT environment variable will 168 | * interfere with argument placement 169 | */ 170 | if (getenv("POSIXLY_CORRECT")) 171 | { 172 | fprintf(stderr, 173 | "%s: " 174 | "Warning: implicit argument reordering disallowed by " 175 | "POSIXLY_CORRECT\n", 176 | progname); 177 | } 178 | 179 | /* parse all options from the command line */ 180 | while ((opt = 181 | getopt_long_only(argc, argv, shortopts, longopts, &longind)) != 
-1) 182 | switch (opt) 183 | { 184 | case 0: /* a long option without an equivalent short option */ 185 | switch (longind) 186 | { 187 | case 0: /* -append */ 188 | append = 1; 189 | break; 190 | case 1: /* -truncate */ 191 | append = 0; 192 | break; 193 | default: /* something unexpected has happened */ 194 | fprintf(stderr, 195 | "%s: " 196 | "getopt_long_only unexpectedly returned %d for `--%s'\n", 197 | progname, 198 | opt, 199 | longopts[longind].name); 200 | return 1; 201 | } 202 | break; 203 | case 'V': /* -version */ 204 | version(progname); 205 | return 0; 206 | case 'h': /* -help */ 207 | help(progname); 208 | return 0; 209 | case 'r': /* -rotate[=NUM] */ 210 | if (optarg) 211 | { 212 | /* we use this while trying to parse a numeric argument */ 213 | char ignored; 214 | if (sscanf(optarg, 215 | "%d%c", 216 | &rotate, 217 | &ignored) != 1) 218 | { 219 | fprintf(stderr, 220 | "%s: " 221 | "rotation `%s' is not a number\n", 222 | progname, 223 | optarg); 224 | usage(progname); 225 | return 2; 226 | } 227 | /* normalize rotation */ 228 | while (rotate < 0) 229 | { 230 | rotate += 26; 231 | } 232 | rotate %= 26; 233 | } 234 | else 235 | rotate = 13; 236 | break; 237 | case 'o': /* -output=FILE */ 238 | outfilename = optarg; 239 | /* we allow "-" as a synonym for stdout here */ 240 | if (! 
strcmp(optarg, "-")) 241 | { 242 | outfilename = 0; 243 | } 244 | break; 245 | case 'v': /* -verbose[=NUM] */ 246 | if (optarg) 247 | { 248 | /* we use this while trying to parse a numeric argument */ 249 | char ignored; 250 | if (sscanf(optarg, 251 | "%u%c", 252 | &verbose, 253 | &ignored) != 1) 254 | { 255 | fprintf(stderr, 256 | "%s: " 257 | "verbosity level `%s' is not a number\n", 258 | progname, 259 | optarg); 260 | usage(progname); 261 | return 2; 262 | } 263 | } 264 | else 265 | verbose ++; 266 | break; 267 | case '?': /* getopt_long_only noticed an error */ 268 | usage(progname); 269 | return 2; 270 | default: /* something unexpected has happened */ 271 | fprintf(stderr, 272 | "%s: " 273 | "getopt_long_only returned an unexpected value (%d)\n", 274 | progname, 275 | opt); 276 | return 1; 277 | } 278 | 279 | /* re-open stdout to outfilename, if requested */ 280 | if (outfilename) 281 | { 282 | if (! freopen(outfilename, (append ? "a" : "w"), stdout)) 283 | { 284 | perror(outfilename); 285 | return 1; 286 | } 287 | } 288 | else 289 | { 290 | /* make a human-readable version of the output filename "-" */ 291 | outfilename = "stdout"; 292 | /* you can't truncate stdout */ 293 | append = 1; 294 | } 295 | 296 | if (verbose) 297 | { 298 | fprintf(stderr, 299 | "%s: verbosity level is %u; %s `%s'; rotation %d\n", 300 | progname, 301 | verbose, 302 | (append ? "appending to" : "truncating"), 303 | outfilename, 304 | rotate); 305 | } 306 | 307 | if (verbose > 1) 308 | { 309 | fprintf(stderr, 310 | "%s: %d input file(s) were given\n", 311 | progname, 312 | ((argc > optind) ? (argc - optind) : 0)); 313 | } 314 | 315 | if (verbose > 3) 316 | { 317 | fprintf(stderr, 318 | "\topterr: %d\n\toptind: %d\n\toptopt: %d (%c)\n\toptarg: %s\n", 319 | opterr, 320 | optind, 321 | optopt, optopt, 322 | optarg ? 
optarg : "(null)"); 323 | } 324 | 325 | /* handle each of the input files (or stdin, if no files were given) */ 326 | if (optind < argc) 327 | { 328 | int argindex; 329 | 330 | for (argindex = optind; argindex < argc; argindex ++) 331 | { 332 | char *infilename = argv[argindex]; 333 | FILE *infile; 334 | 335 | /* we allow "-" as a synonym for stdin here */ 336 | if (! strcmp(infilename, "-")) 337 | { 338 | infile = stdin; 339 | infilename = "stdin"; 340 | } 341 | else if (! (infile = fopen(infilename, "r"))) 342 | { 343 | perror(infilename); 344 | retval = 1; 345 | continue; 346 | } 347 | if (handle(progname, 348 | infile, infilename, 349 | stdout, outfilename, 350 | rotate)) 351 | { 352 | retval = 1; 353 | fclose(infile); 354 | continue; 355 | } 356 | if ((infile != stdin) && fclose(infile)) 357 | { 358 | perror(infilename); 359 | retval = 1; 360 | } 361 | } 362 | } 363 | else 364 | { 365 | retval = 366 | handle(progname, 367 | stdin, "stdin", 368 | stdout, outfilename, 369 | rotate); 370 | } 371 | 372 | /* close stdout */ 373 | if (fclose(stdout)) 374 | { 375 | perror(outfilename); 376 | return 1; 377 | } 378 | 379 | if (verbose > 3) 380 | { 381 | fprintf(stderr, 382 | "%s: normal return, exit code is %d\n", 383 | progname, 384 | retval); 385 | } 386 | return retval; 387 | } 388 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/my_getopt.c: -------------------------------------------------------------------------------- 1 | /* 2 | * my_getopt.c - my re-implementation of getopt.
3 | * Copyright 1997, 2000, 2001, 2002, 2006, Benjamin Sittler 4 | * 5 | * Permission is hereby granted, free of charge, to any person 6 | * obtaining a copy of this software and associated documentation 7 | * files (the "Software"), to deal in the Software without 8 | * restriction, including without limitation the rights to use, copy, 9 | * modify, merge, publish, distribute, sublicense, and/or sell copies 10 | * of the Software, and to permit persons to whom the Software is 11 | * furnished to do so, subject to the following conditions: 12 | * 13 | * The above copyright notice and this permission notice shall be 14 | * included in all copies or substantial portions of the Software. 15 | * 16 | * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 19 | * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 20 | * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 21 | * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 22 | * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 23 | * DEALINGS IN THE SOFTWARE. 24 | */ 25 | 26 | #include 27 | #include 28 | #include 29 | #include 30 | #include "my_getopt.h" 31 | 32 | int my_optind=1, my_opterr=1, my_optopt=0; 33 | char *my_optarg=0; 34 | 35 | /* reset argument parser to start-up values */ 36 | int my_getopt_reset(void) 37 | { 38 | my_optind = 1; 39 | my_opterr = 1; 40 | my_optopt = 0; 41 | my_optarg = 0; 42 | return 0; 43 | } 44 | 45 | /* this is the plain old UNIX getopt, with GNU-style extensions. */ 46 | /* if you're porting some piece of UNIX software, this is all you need. 
*/ 47 | /* this supports GNU-style permution and optional arguments */ 48 | 49 | int my_getopt(int argc, char * argv[], const char *opts) 50 | { 51 | static int charind=0; 52 | const char *s; 53 | char mode, colon_mode; 54 | int off = 0, opt = -1; 55 | 56 | if(getenv("POSIXLY_CORRECT")) colon_mode = mode = '+'; 57 | else { 58 | if((colon_mode = *opts) == ':') off ++; 59 | if(((mode = opts[off]) == '+') || (mode == '-')) { 60 | off++; 61 | if((colon_mode != ':') && ((colon_mode = opts[off]) == ':')) 62 | off ++; 63 | } 64 | } 65 | my_optarg = 0; 66 | if(charind) { 67 | my_optopt = argv[my_optind][charind]; 68 | for(s=opts+off; *s; s++) if(my_optopt == *s) { 69 | charind++; 70 | if((*(++s) == ':') || ((my_optopt == 'W') && (*s == ';'))) { 71 | if(argv[my_optind][charind]) { 72 | my_optarg = &(argv[my_optind++][charind]); 73 | charind = 0; 74 | } else if(*(++s) != ':') { 75 | charind = 0; 76 | if(++my_optind >= argc) { 77 | if(my_opterr) fprintf(stderr, 78 | "%s: option requires an argument -- %c\n", 79 | argv[0], my_optopt); 80 | opt = (colon_mode == ':') ? ':' : '?'; 81 | goto my_getopt_ok; 82 | } 83 | my_optarg = argv[my_optind++]; 84 | } 85 | } 86 | opt = my_optopt; 87 | goto my_getopt_ok; 88 | } 89 | if(my_opterr) fprintf(stderr, 90 | "%s: illegal option -- %c\n", 91 | argv[0], my_optopt); 92 | opt = '?'; 93 | if(argv[my_optind][++charind] == '\0') { 94 | my_optind++; 95 | charind = 0; 96 | } 97 | my_getopt_ok: 98 | if(charind && ! 
argv[my_optind][charind]) { 99 | my_optind++; 100 | charind = 0; 101 | } 102 | } else if((my_optind >= argc) || 103 | ((argv[my_optind][0] == '-') && 104 | (argv[my_optind][1] == '-') && 105 | (argv[my_optind][2] == '\0'))) { 106 | my_optind++; 107 | opt = -1; 108 | } else if((argv[my_optind][0] != '-') || 109 | (argv[my_optind][1] == '\0')) { 110 | char *tmp; 111 | int i, j, k; 112 | 113 | if(mode == '+') opt = -1; 114 | else if(mode == '-') { 115 | my_optarg = argv[my_optind++]; 116 | charind = 0; 117 | opt = 1; 118 | } else { 119 | for(i=j=my_optind; i j) { 124 | tmp=argv[--i]; 125 | for(k=i; k+1 argc) my_optind = argc; 137 | return opt; 138 | } 139 | 140 | /* this is the extended getopt_long{,_only}, with some GNU-like 141 | * extensions. Implements _getopt_internal in case any programs 142 | * expecting GNU libc getopt call it. 143 | */ 144 | 145 | int _my_getopt_internal(int argc, char * argv[], const char *shortopts, 146 | const struct option *longopts, int *longind, 147 | int long_only) 148 | { 149 | char mode, colon_mode = *shortopts; 150 | int shortoff = 0, opt = -1; 151 | 152 | if(getenv("POSIXLY_CORRECT")) colon_mode = mode = '+'; 153 | else { 154 | if((colon_mode = *shortopts) == ':') shortoff ++; 155 | if(((mode = shortopts[shortoff]) == '+') || (mode == '-')) { 156 | shortoff++; 157 | if((colon_mode != ':') && ((colon_mode = shortopts[shortoff]) == ':')) 158 | shortoff ++; 159 | } 160 | } 161 | my_optarg = 0; 162 | if((my_optind >= argc) || 163 | ((argv[my_optind][0] == '-') && 164 | (argv[my_optind][1] == '-') && 165 | (argv[my_optind][2] == '\0'))) { 166 | my_optind++; 167 | opt = -1; 168 | } else if((argv[my_optind][0] != '-') || 169 | (argv[my_optind][1] == '\0')) { 170 | char *tmp; 171 | int i, j, k; 172 | 173 | opt = -1; 174 | if(mode == '+') return -1; 175 | else if(mode == '-') { 176 | my_optarg = argv[my_optind++]; 177 | return 1; 178 | } 179 | for(i=j=my_optind; i j) { 186 | tmp=argv[--i]; 187 | for(k=i; k+1= argc) { 240 | opt = (colon_mode 
== ':') ? ':' : '?'; 241 | if(my_opterr) fprintf(stderr, 242 | "%s: option `--%s' requires an argument\n", 243 | argv[0], longopts[found].name); 244 | } else my_optarg = argv[my_optind]; 245 | } 246 | if(!opt) { 247 | if (longind) *longind = found; 248 | if(!longopts[found].flag) opt = longopts[found].val; 249 | else *(longopts[found].flag) = longopts[found].val; 250 | } 251 | my_optind++; 252 | } else if(!hits) { 253 | if(offset == 1) opt = my_getopt(argc, argv, shortopts); 254 | else { 255 | opt = '?'; 256 | if(my_opterr) fprintf(stderr, 257 | "%s: unrecognized option `%s'\n", 258 | argv[0], argv[my_optind++]); 259 | } 260 | } else { 261 | opt = '?'; 262 | if(my_opterr) fprintf(stderr, 263 | "%s: option `%s' is ambiguous\n", 264 | argv[0], argv[my_optind++]); 265 | } 266 | } 267 | if (my_optind > argc) my_optind = argc; 268 | return opt; 269 | } 270 | 271 | int my_getopt_long(int argc, char * argv[], const char *shortopts, 272 | const struct option *longopts, int *longind) 273 | { 274 | return _my_getopt_internal(argc, argv, shortopts, longopts, longind, 0); 275 | } 276 | 277 | int my_getopt_long_only(int argc, char * argv[], const char *shortopts, 278 | const struct option *longopts, int *longind) 279 | { 280 | return _my_getopt_internal(argc, argv, shortopts, longopts, longind, 1); 281 | } 282 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/my_getopt-1.5/my_getopt.h: -------------------------------------------------------------------------------- 1 | /* 2 | * my_getopt.h - interface to my re-implementation of getopt. 
3 | * Copyright 1997, 2000, 2001, 2002, 2006, Benjamin Sittler 4 | * 5 | * Permission is hereby granted, free of charge, to any person 6 | * obtaining a copy of this software and associated documentation 7 | * files (the "Software"), to deal in the Software without 8 | * restriction, including without limitation the rights to use, copy, 9 | * modify, merge, publish, distribute, sublicense, and/or sell copies 10 | * of the Software, and to permit persons to whom the Software is 11 | * furnished to do so, subject to the following conditions: 12 | * 13 | * The above copyright notice and this permission notice shall be 14 | * included in all copies or substantial portions of the Software. 15 | * 16 | * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 17 | * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 18 | * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 19 | * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT 20 | * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, 21 | * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 22 | * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 23 | * DEALINGS IN THE SOFTWARE. 
24 | */ 25 | 26 | #ifndef MY_GETOPT_H_INCLUDED 27 | #define MY_GETOPT_H_INCLUDED 28 | 29 | #ifdef __cplusplus 30 | extern "C" { 31 | #endif 32 | 33 | /* reset argument parser to start-up values */ 34 | extern int my_getopt_reset(void); 35 | 36 | /* UNIX-style short-argument parser */ 37 | extern int my_getopt(int argc, char * argv[], const char *opts); 38 | 39 | extern int my_optind, my_opterr, my_optopt; 40 | extern char *my_optarg; 41 | 42 | struct option { 43 | const char *name; 44 | int has_arg; 45 | int *flag; 46 | int val; 47 | }; 48 | 49 | /* human-readable values for has_arg */ 50 | #undef no_argument 51 | #define no_argument 0 52 | #undef required_argument 53 | #define required_argument 1 54 | #undef optional_argument 55 | #define optional_argument 2 56 | 57 | /* GNU-style long-argument parsers */ 58 | extern int my_getopt_long(int argc, char * argv[], const char *shortopts, 59 | const struct option *longopts, int *longind); 60 | 61 | extern int my_getopt_long_only(int argc, char * argv[], const char *shortopts, 62 | const struct option *longopts, int *longind); 63 | 64 | extern int _my_getopt_internal(int argc, char * argv[], const char *shortopts, 65 | const struct option *longopts, int *longind, 66 | int long_only); 67 | 68 | #ifdef __cplusplus 69 | } 70 | #endif 71 | 72 | #endif /* MY_GETOPT_H_INCLUDED */ 73 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/split_data.c: -------------------------------------------------------------------------------- 1 | /* 2 | * split_data.c 3 | * 4 | * Created on: August 14, 2014 5 | * Author: Wenyuan Li 6 | */ 7 | 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include "matutils.h" 13 | 14 | void split_data_and_write_for_one_worker(char *input_mat_file, int taskid, int start_row, int end_row, 15 | int total_column, float mem_size_of_task, char *output_dir, char *job_id) 16 | { 17 | int i, ret; 18 | int cnt_row, sub_start_row, sub_end_row; 19 | 
VEC_INT part_sizes; // its size is small and thus ignored. 20 | char temp_data_file[MAXCHAR]; 21 | // start_row and end_row are 1-based 22 | // receive info of which partition assigned to this worker task 23 | cnt_row = end_row - start_row + 1; 24 | calc_partitions_by_mem_size(cnt_row, total_column, mem_size_of_task, &part_sizes); 25 | printf("worker %d: %d - %d (%d parts), part sizes:", taskid,start_row,end_row,part_sizes.n); 26 | put_VEC_INT(stdout, part_sizes, ", "); 27 | 28 | for (i=0, sub_start_row=start_row; i <#rows/columns> <#task> \n"); 49 | fprintf(stream, "\nVersion: 1.0\n"); 50 | fprintf(stream, "\nNote: #task includes both master and worker tasks.\n"); 51 | fprintf(stream, "\nAuthor: Wenyuan Li\n"); 52 | fprintf(stream, "Date: August 14, 2014\n"); 53 | } 54 | 55 | int main(int argc, char ** argv) { 56 | int numtasks, /* number of tasks in partition */ 57 | taskid, /* a task identifier */ 58 | numworkers; /* number of worker tasks */ 59 | int i, ret; 60 | float x; 61 | time_t t_start, t_end; 62 | double t_elapse; // seconds 63 | char input_mat_file[MAXCHAR]; 64 | int total_column = atoi(argv[2]); 65 | float mem_size_of_task = (float)atof(argv[5]); 66 | char output_file[MAXCHAR], output_dir[MAXCHAR], job_id[MAXCHAR]; 67 | VEC_INT part_sizes; 68 | int start_row, end_row; 69 | VEC_INT2 part_info_of_workers; 70 | 71 | if (argc!=7) { 72 | fprintf(stderr,"Error: split_data arguments (%d) are insufficient.\n\n", argc); 73 | print_usage(stderr); 74 | exit(EXIT_FAILURE); 75 | } 76 | numtasks = atoi(argv[4]); 77 | strcpy(input_mat_file, argv[1]); 78 | strcpy(output_dir, argv[3]); 79 | strcpy(job_id, argv[6]); 80 | i = strlen(output_dir); 81 | if (output_dir[i-1]!='/') 82 | strcat(output_dir,"/"); 83 | 84 | numworkers = numtasks-1; 85 | 86 | /**************************** master task ************************************/ 87 | printf("split_data arguments:\n"); 88 | printf("\tinput file: %s\n", input_mat_file); 89 | printf("\ttotal_column: %d\n", total_column); 90 | 
printf("\toutput directory: %s\n", output_dir); 91 | printf("\t#task: %d\n", numtasks); 92 | printf("\tmemory size for each cpu: %g\n", mem_size_of_task); 93 | fflush( stdout ); 94 | 95 | t_start = time(NULL); 96 | 97 | partitions2( total_column, numworkers, &part_sizes ); // #rows for each task 98 | printf("split_data has %d tasks (both master and %d workers).\n", numtasks, numworkers); 99 | printf("master: partition sizes (%d partitions): ", part_sizes.n); 100 | put_VEC_INT(stdout, part_sizes, ", "); 101 | fflush( stdout ); 102 | if (numworkers!=part_sizes.n) { 103 | fprintf(stderr, "Error (master): #parts %d != numworkers %d\n", part_sizes.n, numworkers); 104 | exit(EXIT_FAILURE); 105 | } 106 | 107 | // send partition info to each worker 108 | printf("split data\n"); fflush( stdout ); 109 | create_VEC_INT2(numworkers, &part_info_of_workers); 110 | for (taskid=1, start_row=1; taskid<=numworkers; taskid++) { 111 | // the i-th part 112 | // start_row and end_row are 1-based index 113 | i = taskid - 1; 114 | end_row = start_row + part_sizes.v[i] - 1; 115 | ret = quick_check_partitions_by_mem_size(part_sizes.v[i], total_column, mem_size_of_task, &x); 116 | if (ret==FALSE) { 117 | fprintf(stderr, "Error: max_memory_size (%gMB) for each task cannot afford to compute on the matrix (size=%d rows X %d columns, each row as a partition needs %lf MB memory)\n", mem_size_of_task, part_sizes.v[i], total_column, x); 118 | exit(EXIT_FAILURE); 119 | } 120 | part_info_of_workers.v1[taskid-1] = start_row; 121 | part_info_of_workers.v2[taskid-1] = end_row; 122 | printf("master: worker %d - (%d, %d)\n", taskid, start_row, end_row); fflush( stdout ); 123 | 124 | split_data_and_write_for_one_worker(input_mat_file, taskid, start_row, end_row, total_column, 125 | mem_size_of_task, output_dir, job_id); 126 | 127 | start_row = end_row + 1; 128 | } 129 | 130 | t_end = time(NULL); 131 | t_elapse = difftime( t_end, t_start ); 132 | printf("\t>> split_data total time - %d:%d:%d\n",
(int)floor(t_elapse/3600.0), (int)floor(fmod(t_elapse,3600.0)/60.0), (int)fmod(t_elapse,60.0)); 133 | 134 | // free space 135 | printf("free space\n"); fflush( stdout ); 136 | free_VEC_INT( &part_sizes ); 137 | free_VEC_INT2( &part_info_of_workers ); 138 | 139 | printf ("Done.\n"); 140 | return(EXIT_SUCCESS); 141 | } 142 | 143 | 144 | -------------------------------------------------------------------------------- /scripts/Hi-Corrector1.2/src/split_data_parallel.c: -------------------------------------------------------------------------------- 1 | /* 2 | * split_data_parallel.c 3 | * 4 | * Created on: August 14, 2014 5 | * Author: Wenyuan Li 6 | */ 7 | 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include "matutils.h" 13 | #include 14 | 15 | #define MAPI_USER_ABORT 999 16 | 17 | #define MASTER 0 /* taskid of first task */ 18 | #define PARTITION_TAG 1 /* message type: partition info */ 19 | #define ITERATION_BEGIN_TAG 2 /* message type: begin an iteration */ 20 | #define D_SAVE_RETURN_TAG 3 /* message type: return d_save info to master */ 21 | #define END_TAG 4 /* message type: end signal */ 22 | #define ERROR_TAG 99 /* message type: end signal */ 23 | 24 | void print_usage(FILE* stream) 25 | { 26 | fprintf(stream, "Usage:\n"); 27 | fprintf(stream, "\tsplit_data_parallel <#rows/columns> <#task> \n"); 28 | fprintf(stream, "\nVersion: 1.0\n"); 29 | fprintf(stream, "\nAuthor: Wenyuan Li\n"); 30 | fprintf(stream, "Date: August 14, 2014\n"); 31 | } 32 | 33 | int main(int argc, char ** argv) { 34 | int numtasks, /* number of tasks in partition */ 35 | taskid, /* a task identifier */ 36 | numworkers, /* number of worker tasks */ 37 | dest; /* task id of message destination */ 38 | MPI_Status status; 39 | int i, ret, cnt_recv, finished; 40 | float x; 41 | time_t t_start, t_end; 42 | double t_elapse; // seconds 43 | char input_mat_file[MAXCHAR], temp_data_file[MAXCHAR]; 44 | int total_column = atoi(argv[2]); 45 | float mem_size_of_task = (float)atof(argv[5]); 
46 | char output_dir[MAXCHAR], job_id[MAXCHAR]; 47 | VEC_INT part_sizes; 48 | int cnt_row, start_row, end_row, sub_start_row, sub_end_row; 49 | 50 | if (argc!=7) { 51 | fprintf(stderr,"Error: split_data_parallel arguments (%d) are insufficient.\n\n", argc); 52 | print_usage(stderr); 53 | exit(EXIT_FAILURE); 54 | } 55 | numtasks = atoi(argv[4]); 56 | strcpy(input_mat_file, argv[1]); 57 | strcpy(output_dir, argv[3]); 58 | strcpy(job_id, argv[6]); 59 | i = strlen(output_dir); 60 | if (output_dir[i-1]!='/') 61 | strcat(output_dir,"/"); 62 | 63 | MPI_Init(&argc, &argv); 64 | MPI_Comm_rank(MPI_COMM_WORLD,&taskid); 65 | MPI_Comm_size(MPI_COMM_WORLD,&numtasks); 66 | 67 | if (numtasks < 2 ) { 68 | fprintf(stderr,"Error: Need at least two MPI tasks. Quitting...\n"); 69 | MPI_Abort(MPI_COMM_WORLD, MAPI_USER_ABORT); 70 | exit(EXIT_SUCCESS); 71 | } 72 | numworkers = numtasks-1; 73 | 74 | /**************************** master task ************************************/ 75 | if (taskid == MASTER) { 76 | printf("split_data_parallel arguments:\n"); 77 | printf("\tinput file: %s\n", input_mat_file); 78 | printf("\ttotal_column: %d\n", total_column); 79 | printf("\toutput directory: %s\n", output_dir); 80 | printf("\t#task: %d\n", numtasks); 81 | printf("\tmemory size for each cpu: %g\n", mem_size_of_task); 82 | fflush( stdout ); 83 | 84 | t_start = time(NULL); 85 | 86 | partitions2( total_column, numworkers, &part_sizes ); // #rows for each task 87 | printf("split_data_parallel has started with %d tasks (both master and %d workers).\n", numtasks, numworkers); 88 | printf("master: partition sizes (%d partitions): ", part_sizes.n); 89 | put_VEC_INT(stdout, part_sizes, ", "); 90 | fflush( stdout ); 91 | if (numworkers!=part_sizes.n) { 92 | fprintf(stderr, "Error (master): #parts %d != numworkers %d", part_sizes.n, numworkers); 93 | MPI_Abort(MPI_COMM_WORLD, MAPI_USER_ABORT); 94 | exit(EXIT_FAILURE); 95 | } 96 | 97 | // send partition info to each worker 98 | printf("master: send 
partition info to each worker\n"); fflush( stdout ); 99 | for (dest=1, start_row=1; dest<=numworkers; dest++) { 100 | // the i-th part 101 | // start_row and end_row are 1-based index 102 | i = dest - 1; 103 | end_row = start_row + part_sizes.v[i] - 1; 104 | ret = quick_check_partitions_by_mem_size(part_sizes.v[i], total_column, mem_size_of_task, &x); 105 | if (ret==FALSE) { 106 | fprintf(stderr, "Error (master): max_memory_size (%gMB) for each task cannot afford to compute on the matrix (size=%d rows X %d columns, each row as a partition needs %lf MB memory)\n", mem_size_of_task, part_sizes.v[i], total_column, x); 107 | MPI_Abort(MPI_COMM_WORLD, MAPI_USER_ABORT); 108 | exit(EXIT_FAILURE); 109 | } 110 | MPI_Send(&start_row, 1, MPI_INT, dest, PARTITION_TAG, MPI_COMM_WORLD); 111 | MPI_Send(&end_row, 1, MPI_INT, dest, PARTITION_TAG, MPI_COMM_WORLD); 112 | printf("master: worker %d - (%d, %d)\n", dest, start_row, end_row); fflush( stdout ); 113 | start_row = end_row + 1; 114 | } 115 | 116 | // receive finish message from all workers 117 | printf("master: will receive finish message from all workers\n"); fflush( stdout ); 118 | cnt_recv = 0; 119 | while (cnt_recv> split_data_parallel master computation total time - %d:%d:%d\n", (int)floor(t_elapse/3600.0), (int)floor(fmod(t_elapse,3600.0)/60.0), (int)fmod(t_elapse,60.0)); 130 | 131 | // free space 132 | free_VEC_INT( &part_sizes ); 133 | 134 | printf ("Done (master).\n"); fflush( stdout ); 135 | } 136 | 137 | /**************************** worker task ************************************/ 138 | if (taskid > MASTER) { 139 | // start_row and end_row are 1-based 140 | // receive info of which partition assigned to this worker task 141 | MPI_Recv(&start_row, 1, MPI_INT, MASTER, PARTITION_TAG, MPI_COMM_WORLD, &status); 142 | MPI_Recv(&end_row, 1, MPI_INT, MASTER, PARTITION_TAG, MPI_COMM_WORLD, &status); 143 | cnt_row = end_row - start_row + 1; 144 | calc_partitions_by_mem_size(cnt_row, total_column, mem_size_of_task, 
&part_sizes); 145 | printf("worker %d/%d: %d - %d (%d parts), part sizes:", taskid,numworkers,start_row,end_row,part_sizes.n); 146 | put_VEC_INT(stdout, part_sizes, ", "); 147 | 148 | for (i=0, sub_start_row=start_row; i.') 126 | 127 | chromosomes = open(parameters['chromSizes_path'] + parameters['species'] + '.chrom.sizes', 'r') 128 | chromosomes_list = [] 129 | while True: 130 | try: 131 | line2list = next(chromosomes).split('\n')[0].split('\t') 132 | chromosomes_list.append(line2list[0]) 133 | except StopIteration: 134 | break 135 | 136 | threads = int(parameters['threads']) 137 | 138 | if threads > len(chromosomes_list): 139 | parser.error("Input a number of threads less or equal than the number of chromosomes (" + str(len(chromosomes_list)) + ").") 140 | else: 141 | pass 142 | 143 | print "Adding GC content information in parallel using " + parameters['threads'] + " threads..." 144 | print "Start: " + strftime("%Y-%m-%d %H:%M:%S", gmtime()) 145 | pool = Pool(processes=threads) 146 | pool.map(add_gc_content, chromosomes_list) 147 | print "End: " + strftime("%Y-%m-%d %H:%M:%S", gmtime()) 148 | -------------------------------------------------------------------------------- /scripts/HiCtool_add_fend_mappability.py: -------------------------------------------------------------------------------- 1 | # Add the mappability score of 500bp upstream and downstream to the restriction site per each single chromosome. 2 | 3 | # Usage: python2.7 HiCtool_add_fend_mappability.py [-h] [options] 4 | # Options: 5 | # -h, --help show this help message and exit 6 | # -c CHROMSIZES_PATH Path to the folder chromSizes with trailing slash at the end. 7 | # -s SPECIES Species. It has to be one of those present under the chromSizes path. Example: for human hg38 type here "hg38". 8 | # -r RESTRICTION_SITES_PATH Path to the folder with the restriction sites bed files. 
9 | # -a ARTIFICIAL_READS_PATH Path to the folder with the artificial reads files (one per chromosome). 10 | # -p THREADS Number of parallel threads to use. It has to be less than or equal to the number of chromosomes. 11 | 12 | # Output files: 13 | # Fend bed file with additional mappability score information added in two columns: mappability_upstream and mappability_downstream. 14 | 15 | from optparse import OptionParser 16 | from time import gmtime, strftime 17 | import pybedtools 18 | import numpy as np 19 | import os 20 | import pandas as pd 21 | from multiprocessing import Pool 22 | 23 | parameters = {'chromSizes_path': None, 24 | 'species': None, 25 | 'restrictionsites_path': None, 26 | 'artificial_reads_path': None, 27 | 'threads': None} 28 | 29 | def add_mappability_score(chromosome): 30 | """ 31 | For a single chromosome, add to each fragment of the fend file the mappability score of the 500 bp upstream and 32 | downstream of the restriction site. 33 | Arguments: 34 | chromosome (str): chromosome number (example for chromosome 1: '1'). 35 | Returns: None. 36 | Output: 37 | Fend bed file with additional mappability information added in two columns: 38 | mappability_upstream and mappability_downstream. 39 | """ 40 | 41 | output_path = parameters['restrictionsites_path'] 42 | artificial_reads_path = parameters['artificial_reads_path'] 43 | 44 | # Building a bed file from restrictionsites with start coordinate as 500 bp upstream of each position and end coordinate as position. 45 | upstream = pd.read_csv(output_path + 'chr' + chromosome + '.bed', sep = '\t', header = None) 46 | upstream.iloc[:,4] = range(upstream.shape[0]) 47 | upstream.iloc[:,2] = upstream.iloc[:,1] 48 | upstream.iloc[:,1] = upstream.iloc[:,1] - 500 49 | upstream.loc[upstream.loc[:,1] < 0, 1] = 0 50 | a_up = pybedtools.BedTool.from_dataframe(upstream) 51 | 52 | # Building a bed file from restrictionsites with start coordinate as position and end coordinate as 500 bp downstream of each position.
53 | downstream = pd.read_csv(output_path + 'chr' + chromosome + '.bed', sep = '\t', header = None) 54 | downstream.iloc[:,4] = range(downstream.shape[0]) 55 | downstream.iloc[:,1] = downstream.iloc[:,2] 56 | downstream.iloc[:,2] = downstream.iloc[:,2] + 500 57 | a_down = pybedtools.BedTool.from_dataframe(downstream) 58 | 59 | result = pd.read_table(output_path + 'chr' + chromosome + '.bed',header=None) 60 | result.columns = ['chr','start','stop','name','score','strand'] 61 | result['mappability_upstream'] = '' # adding a field to store the upstream mappability score 62 | result['mappability_downstream'] = '' # adding a field to store the downstream mappability score 63 | 64 | b = pybedtools.BedTool(artificial_reads_path + 'chr' + chromosome + '.txt') # load the per-chromosome artificial reads file 65 | 66 | aup_and_b = a_up.intersect(b,wa=True,wb=True).to_dataframe() 67 | adown_and_b = a_down.intersect(b,wa=True,wb=True).to_dataframe() 68 | 69 | aup_and_b.columns = ['chr','start','stop','name','index','strand','chr_map','start_map','stop_map','map_up'] 70 | adown_and_b.columns = ['chr','start','stop','name','index','strand','chr_map','start_map','stop_map','map_down'] 71 | 72 | def compute_mappability_score(values): # fraction of overlapping artificial reads whose score is above 30 73 | return float(sum(values>30))/float(len(values)) 74 | 75 | df_up = aup_and_b.groupby('index')['map_up'].apply(compute_mappability_score).to_frame() 76 | df_down = adown_and_b.groupby('index')['map_down'].apply(compute_mappability_score).to_frame() 77 | df_up['ind'] = df_up.index 78 | df_down['ind'] = df_down.index 79 | df = pd.merge(df_up,df_down,how='outer') 80 | 81 | r1 = 0 82 | for r in np.array(df.ind): 83 | if result.loc[r,'strand'] == '+': 84 | result.loc[r,'mappability_upstream'] = np.array(df.loc[r1,'map_up']).astype(np.str) 85 | result.loc[r,'mappability_downstream'] = np.array(df.loc[r1,'map_down']).astype(np.str) 86 | if result.loc[r,'strand'] == '-': 87 | result.loc[r,'mappability_upstream'] = np.array(df.loc[r1,'map_down']).astype(np.str) 88 |
result.loc[r,'mappability_downstream'] = np.array(df.loc[r1,'map_up']).astype(np.str) 89 | r1 += 1 90 | 91 | result.score = 1 92 | result.to_csv(path_or_buf=output_path + 'chr' + chromosome + '_map.bed',sep='\t',header=False,index=False) 93 | 94 | print "Mappability score for chr" + chromosome + " complete." 95 | 96 | 97 | if __name__ == '__main__': 98 | 99 | usage = 'Usage: python2.7 HiCtool_add_fend_mappability.py [-h] [options]' 100 | parser = OptionParser(usage = 'python2.7 %prog -c chromSizes_path -s species -r restriction_sites_path -a artificial_reads_path -p threads') 101 | parser.add_option('-c', dest='chromSizes_path', type='string', help='Path to the folder chromSizes with trailing slash at the end.') 102 | parser.add_option('-s', dest='species', type='string', help='Species. It has to be one of those present under the chromSizes path. Example: for human hg38 type here "hg38".') 103 | parser.add_option('-r', dest='restrictionsites_path', type='string', help='Path to the folder with the restriction sites bed files.') 104 | parser.add_option('-a', dest='artificial_reads_path', type='string', help='Path to the folder with the artificial reads files, one per chromosome.') 105 | parser.add_option('-p', dest='threads', type='string', help='Number of parallel threads to use. 
It has to be less than or equal to the number of chromosomes.') 106 | (options, args) = parser.parse_args( ) 107 | 108 | if options.chromSizes_path is None: 109 | parser.error('-h for help or provide the chromSizes path!') 110 | else: 111 | pass 112 | if options.species is None: 113 | parser.error('-h for help or provide the species!') 114 | else: 115 | pass 116 | if options.restrictionsites_path is None: 117 | parser.error('-h for help or provide the restrictionsites bed files path!') 118 | else: 119 | pass 120 | if options.artificial_reads_path is None: 121 | parser.error('-h for help or provide the path to artificial reads files!') 122 | else: 123 | pass 124 | if options.threads is None: 125 | parser.error('-h for help or provide the number of threads!') 126 | else: 127 | pass 128 | 129 | parameters['chromSizes_path'] = options.chromSizes_path 130 | parameters['species'] = options.species 131 | parameters['restrictionsites_path'] = options.restrictionsites_path 132 | parameters['artificial_reads_path'] = options.artificial_reads_path 133 | parameters['threads'] = options.threads 134 | 135 | if parameters['species'] + ".chrom.sizes" not in os.listdir(parameters['chromSizes_path']): 136 | available_species = ', '.join([x.split('.')[0] for x in os.listdir(parameters['chromSizes_path'])]) 137 | parser.error('Wrong species inserted! Check the species spelling or insert an available species: ' + available_species + '. 
If your species is not listed, please contact Riccardo Calandrelli at .') 138 | 139 | chromosomes = open(parameters['chromSizes_path'] + parameters['species'] + '.chrom.sizes', 'r') 140 | chromosomes_list = [] 141 | while True: 142 | try: 143 | line2list = next(chromosomes).split('\n')[0].split('\t') 144 | chromosomes_list.append(line2list[0]) 145 | except StopIteration: 146 | break 147 | 148 | threads = int(parameters['threads']) 149 | 150 | if threads > len(chromosomes_list): 151 | parser.error("Input a number of threads less or equal than the number of chromosomes (" + str(len(chromosomes_list)) + ").") 152 | else: 153 | pass 154 | 155 | print "Adding mappability score information in parallel using " + parameters['threads'] + " threads..." 156 | print "Start: " + strftime("%Y-%m-%d %H:%M:%S", gmtime()) 157 | pool = Pool(processes=threads) 158 | pool.map(add_mappability_score, chromosomes_list) 159 | print "End: " + strftime("%Y-%m-%d %H:%M:%S", gmtime()) 160 | 161 | -------------------------------------------------------------------------------- /scripts/HiCtool_artificial_reads.py: -------------------------------------------------------------------------------- 1 | # Split the input genome into 50 bp artificial reads every 10 bp and save the output in fastq format. 2 | 3 | # Usage: python2.7 HiCtool_artificial_reads.py [-h] [options] 4 | # Options: 5 | # -h, --help show this help message and exit 6 | # -g GENOME_FILE Genome fasta file. 7 | # -o OUTPUT_READS_FILE Output fastq file to save reads. 8 | 9 | # Output file: 10 | # Fastq file containing all the artificial reads. A fixed and intermediate score "I" is given to each base of each read. 
11 | 12 | from optparse import OptionParser 13 | from Bio import SeqIO 14 | from time import gmtime, strftime 15 | 16 | parameters = {'genome_file': None, 17 | 'output_reads_file': None} 18 | 19 | class artificial_reads: 20 | def __init__(self, parameters): 21 | self.generate_artificial_reads(parameters) 22 | 23 | def generate_artificial_reads(self, parameters): 24 | with open(parameters['genome_file'], "rU") as handle: 25 | records = list(SeqIO.parse(handle,"fasta")) 26 | 27 | for i in range(len(records)): 28 | sequence = records[i].seq 29 | reads = [str(sequence[x:x+50]) for x in range(0,len(sequence),10)] 30 | 31 | with open(parameters['output_reads_file'], "a") as fout: 32 | for j in range(len(reads)): 33 | if len(reads[j]) == 50: 34 | read = "@" + str(records[i].id) + "." + str(j) + "\n" + reads[j] + "\n" + "+" + "\n" + "I"*50 + "\n" 35 | fout.write(read) 36 | 37 | 38 | if __name__ == '__main__': 39 | 40 | usage = 'Usage: python2.7 HiCtool_artificial_reads.py [-h] [options]' 41 | parser = OptionParser(usage = 'python2.7 %prog -g genome_file -o output_reads_file') 42 | parser.add_option('-g', dest='genome_file', type='string', help='Genome fasta file.') 43 | parser.add_option('-o', dest='output_reads_file', type='string', help='Output fastq file to save reads.') 44 | (options, args) = parser.parse_args( ) 45 | 46 | if options.genome_file == None: 47 | parser.error('-h for help or provide the genome fasta file!') 48 | else: 49 | pass 50 | if options.output_reads_file == None: 51 | parser.error('-h for help or provide the output fastq file!') 52 | else: 53 | pass 54 | 55 | parameters['genome_file'] = options.genome_file 56 | parameters['output_reads_file'] = options.output_reads_file 57 | 58 | print "Generating artificial reads..." 
59 | print "Start: " + strftime("%Y-%m-%d %H:%M:%S", gmtime()) 60 | artificial_reads(parameters) 61 | print "End: " + strftime("%Y-%m-%d %H:%M:%S", gmtime()) 62 | -------------------------------------------------------------------------------- /scripts/HiCtool_build_enzyme_fastq.py: -------------------------------------------------------------------------------- 1 | # Generate the fastq file used to align the restriction enzyme(s) to generate the fend file. 2 | 3 | # Usage: python2.7 HiCtool_build_enzyme_fastq.py [-h] -e RESTRICTION_ENZYMES 4 | # Options: 5 | # -h, --help show this help message and exit 6 | # -e RESTRICTION_ENZYMES Restriction enzyme (or restriction enzymes passed like a Python list: [enzyme1,enzyme2]). Choose between: HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl. Arima kit uses a combination of MboI and Hinfl. 7 | 8 | # Output files: 9 | # Fastq file of the restriction enzyme(s). 10 | 11 | from optparse import OptionParser 12 | 13 | parameters = {'restriction_enzymes': None} 14 | 15 | class enzyme_fastq: 16 | def __init__(self, parameters): 17 | self.generate_re_fastq(parameters) 18 | 19 | def generate_re_fastq(self, parameters): 20 | 21 | re_info = dict() # rs: restriction site sequence; lj: ligation junction sequence 22 | re_info['HindIII']={'rs':'AAGCTT'} 23 | re_info['MboI']={'rs':'GATC'} 24 | re_info['DpnII']={'rs':'GATC'} 25 | re_info['Sau3AI']={'rs':'GATC'} 26 | re_info['BglII']={'rs':'AGATCT'} 27 | re_info['NcoI']={'rs':'CCATGG'} 28 | re_info['Hinfl']={'rs':'GANTC'} 29 | 30 | restriction_enzymes = map(str, parameters['restriction_enzymes'].strip('[]').split(',')) 31 | 32 | # Check that all the restriction enzymes are available 33 | for i in restriction_enzymes: 34 | if i not in re_info.keys(): 35 | print "ERROR! " + i + " is not among the available restriction enzymes (HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl)! Check the spelling or contact Riccardo Calandrelli at ." 
36 | return 37 | 38 | output_file = open ('restriction_enzyme.fastq', 'w') 39 | for restriction_enzyme in restriction_enzymes: 40 | if restriction_enzyme != 'Hinfl': 41 | output_file.write("@" + restriction_enzyme + '\n' + re_info[restriction_enzyme]['rs'] + '\n+\n' + ''.join(['I']*len(re_info[restriction_enzyme]['rs'])) + '\n') 42 | else: 43 | for i in ['A','C','G','T']: 44 | output_file.write("@" + restriction_enzyme + '\n' + re_info[restriction_enzyme]['rs'].replace('N',i) + '\n+\n' + ''.join(['I']*len(re_info[restriction_enzyme]['rs'])) + '\n') 45 | output_file.close() 46 | 47 | 48 | if __name__ == '__main__': 49 | 50 | usage = 'Usage: python2.7 HiCtool_build_enzyme_fastq.py [-h] -e RESTRICTION_ENZYMES' 51 | parser = OptionParser(usage = 'python2.7 %prog -e restriction_enzymes') 52 | parser.add_option('-e', dest='restriction_enzymes', type='string', help='Restriction enzyme (or restriction enzymes passed like a Python list: [enzyme1,enzyme2]). Choose between: HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl. 
Arima kit uses both MboI and Hinfl.') 53 | (options, args) = parser.parse_args( ) 54 | 55 | if options.restriction_enzymes is None: 56 | parser.error('-h for help or provide the restriction enzyme(s)!') 57 | else: 58 | pass 59 | 60 | parameters['restriction_enzymes'] = options.restriction_enzymes 61 | 62 | enzyme_fastq(parameters) -------------------------------------------------------------------------------- /scripts/HiCtool_fend_file_add_gc_map.sh: -------------------------------------------------------------------------------- 1 | while getopts h:o:s:p:b:m: option 2 | do 3 | case "${option}" 4 | in 5 | h) hictoolPath=${OPTARG};; # The path of the HiCtool scripts 6 | o) directory=${OPTARG};; # The output directory (with trailing slash) to work on and save the fend file 7 | s) species=${OPTARG};; # Species under analysis: hg19, hg38, mm9, mm10, dm6, susScr3, susScr11 are available 8 | p) threads=${OPTARG};; # The number of threads to use 9 | b) gc5Base=${OPTARG};; # gc5Base.bw file with the GC content information (download it from this website: hgdownload.cse.ucsc.edu/gbdb/your_species/bbi/) or GC_info directory path if the GC content information has already been calculated for the reference genome. 10 | m) genomeFasta=${OPTARG};; # genome in fasta format to add the mappability score to the fend file or mappability_info directory path if the mappability score information has already been calculated for the reference genome.
11 | esac 12 | done 13 | 14 | if [ $species = 'hg38' -o $species = 'hg19' ] 15 | then 16 | chromosomes=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY") 17 | elif [ $species = 'mm10' -o $species = 'mm9' ] 18 | then 19 | chromosomes=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chrX" "chrY") 20 | elif [ $species = 'dm6' ] 21 | then 22 | chromosomes=("chr2L" "chr2R" "chr3L" "chr3R" "chr4" "chrX" "chrY") 23 | elif [ $species = 'susScr3' -o $species = 'susScr11' ] 24 | then 25 | chromosomes=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chrX" "chrY") 26 | else 27 | echo "ERROR! Wrong species inserted! Check the species spelling or insert an available species: hg38, hg19, mm10, mm9, dm6, susScr3, susScr11. If your species is not listed, please contact Riccardo Calandrelli at ." 28 | exit 29 | fi 30 | 31 | ### Create output directory 32 | checkMakeDirectory(){ 33 | if [ ! -e "$1" ]; then 34 | mkdir -p "$1" 35 | fi 36 | } 37 | checkMakeDirectory $directory 38 | 39 | echo "Start fend file update: $(date)" 40 | cd $directory 41 | 42 | ## Splitting restrictionsites.bed to generate separate files to be used either to add GC content information or mappability information or both. 43 | if ! [ -z "$gc5Base" ] || ! [ -z "$genomeFasta" ] 44 | then 45 | echo -n "Splitting restrictionsites.bed, one bed file per each chromosome ... " 46 | for i in "${chromosomes[@]}"; do 47 | awk -v var="$i" '(NR>1) && ($1==var)' restrictionsites.bed > $i.bed 48 | done 49 | echo "Done!" 50 | fi 51 | 52 | ### Adding GC content information 53 | if ! 
[ -z "$gc5Base" ] 54 | then 55 | if [ $threads -gt "${#chromosomes[@]}" ] 56 | then 57 | python_threads="${#chromosomes[@]}" 58 | else 59 | python_threads=$threads 60 | fi 61 | 62 | if ! [ -d "$gc5Base" ] 63 | then 64 | echo -n "Splitting GC content information, one file per each chromosome ... " 65 | bigWigToBedGraph $gc5Base gc5Base.bedGraph 66 | checkMakeDirectory GC_info 67 | for i in "${chromosomes[@]}"; do 68 | awk -v var="$i" '(NR>1) && ($1==var)' gc5Base.bedGraph | awk -v OFS='\t' '{print $1, $2, $3, $4}' > GC_info/$i.txt 69 | done 70 | echo "Done!" 71 | 72 | python2.7 $hictoolPath"HiCtool_add_fend_gc_content.py" -c $hictoolPath"chromSizes/" -s $species -r $directory -g $directory"GC_info/" -p $python_threads 73 | else 74 | python2.7 $hictoolPath"HiCtool_add_fend_gc_content.py" -c $hictoolPath"chromSizes/" -s $species -r $directory -g $gc5Base -p $python_threads 75 | fi 76 | fi 77 | 78 | ### Adding mappability score information 79 | if ! [ -z "$genomeFasta" ] 80 | then 81 | if [ $threads -gt "${#chromosomes[@]}" ] 82 | then 83 | python_threads="${#chromosomes[@]}" 84 | else 85 | python_threads=$threads 86 | fi 87 | 88 | if ! [ -d "$genomeFasta" ] 89 | then 90 | checkMakeDirectory mappability_info 91 | python2.7 $hictoolPath"HiCtool_artificial_reads.py" -g $genomeFasta -o $directory"mappability_info/artificial_reads.fastq" 92 | 93 | echo -n "Mapping artificial reads ... " 94 | (bowtie2 -p $threads -x $genomeIndex mappability_info/artificial_reads.fastq -S mappability_info/artificial_reads.sam) 2>mappability_info/artificial_reads.log 95 | rm mappability_info/artificial_reads.fastq 96 | echo "Done!" 97 | 98 | echo -n "Selecting mapped reads only and generating separate files per each chromosome with mappability score information ... 
" 99 | samtools view -F 4 mappability_info/artificial_reads.sam | \ 100 | awk -v OFS='\t' '{print $3, $4-1, $4-1+50, $5}' > mappability_info/artificial_reads_mapped.txt 101 | 102 | rm mappability_info/artificial_reads.sam 103 | 104 | for i in "${chromosomes[@]}"; do 105 | awk -v var="$i" '(NR>1) && ($1==var)' mappability_info/artificial_reads_mapped.txt | awk -v OFS='\t' '{print $1, $2, $3, $4}' > mappability_info/$i.txt 106 | done 107 | rm mappability_info/artificial_reads_mapped.txt 108 | echo "Done!" 109 | 110 | python2.7 $hictoolPath"HiCtool_add_fend_mappability.py" -c $hictoolPath"chromSizes/" -s $species -r $directory -a $directory"mappability_info/" -p $python_threads 111 | else 112 | python2.7 $hictoolPath"HiCtool_add_fend_mappability.py" -c $hictoolPath"chromSizes/" -s $species -r $directory -a $genomeFasta -p $python_threads 113 | fi 114 | fi 115 | 116 | 117 | ### Merge files together and generate the final fend bed file 118 | if ! [ -z "$gc5Base" ] && [ -z "$genomeFasta" ] 119 | then 120 | echo -e 'chr\tstart\tstop\tname\tscore\tstrand\tgc' > header.txt 121 | echo -n "Merging bed files together and parsing ... " 122 | for i in "${chromosomes[@]}"; do 123 | sort -k 2,2n "$i"_gc.bed | cat >> restrictionsites_temp.bed 124 | done 125 | 126 | awk -v OFS='\t' '{print $1, $2, $3, $4, $5, $6, $7 "," $8}' restrictionsites_temp.bed | \ 127 | cat header.txt - > restrictionsites_gc.bed 128 | 129 | rm restrictionsites_temp.bed 130 | rm header.txt 131 | 132 | echo "Done!" 133 | echo "End fend file update: $(date)" 134 | echo "Your fend file is restrictionsites_gc.bed" 135 | 136 | elif [ -z "$gc5Base" ] && ! [ -z "$genomeFasta" ] 137 | then 138 | echo -e 'chr\tstart\tstop\tname\tscore\tstrand\tmappability' > header.txt 139 | echo -n "Merging bed files together, selecting reads with MAP_score >= 0.5 and parsing ... 
" 140 | for i in "${chromosomes[@]}"; do 141 | sort -k 2,2n "$i"_map.bed | cat >> restrictionsites_temp.bed 142 | done 143 | 144 | awk '(NR>1) && ($7 >= 0.5) && ($8 >= 0.5)' restrictionsites_temp.bed | \ 145 | awk -v OFS='\t' '{print $1, $2, $3, $4, $5, $6, $7 "," $8}' | \ 146 | cat header.txt - > restrictionsites_map.bed 147 | 148 | rm restrictionsites_temp.bed 149 | rm header.txt 150 | 151 | echo "Done!" 152 | echo "End fend file update: $(date)" 153 | echo "Your fend file is restrictionsites_map.bed" 154 | 155 | elif ! [ -z "$gc5Base" ] && ! [ -z "$genomeFasta" ] 156 | then 157 | echo -e 'chr\tstart\tstop\tname\tscore\tstrand\tgc\tmappability' > header.txt 158 | echo -n "Merging bed files together, selecting reads with MAP_score >= 0.5 and parsing ... " 159 | for i in "${chromosomes[@]}"; do 160 | paste -d'\t' "$i"_gc.bed <(awk -F'\t' '{print $7"\t"$8}' "$i"_map.bed) | sort -k 2,2n | cat >> restrictionsites_temp.bed 161 | done 162 | 163 | awk '(NR>1) && ($9 >= 0.5) && ($10 >= 0.5)' restrictionsites_temp.bed | \ 164 | awk -v OFS='\t' '{print $1, $2, $3, $4, $5, $6, $7 "," $8, $9 "," $10}' | \ 165 | cat header.txt - > restrictionsites_gc_map.bed 166 | 167 | rm restrictionsites_temp.bed 168 | rm header.txt 169 | 170 | echo "Done!" 
171 | echo "End fend file update: $(date)" 172 | echo "Your fend file is restrictionsites_gc_map.bed" 173 | fi 174 | -------------------------------------------------------------------------------- /scripts/HiCtool_generate_fend_file.sh: -------------------------------------------------------------------------------- 1 | while getopts h:o:e:g:s:p:b:m: option 2 | do 3 | case "${option}" 4 | in 5 | h) hictoolPath=${OPTARG};; # The path of the HiCtool scripts 6 | o) directory=${OPTARG};; # The output directory to work on and save the fend file 7 | e) restrictionEnzyme=${OPTARG};; # The restriction enzyme (or enzymes passed in the form of a Python list) 8 | g) genomeIndex=${OPTARG};; # The Bowtie2 genome indexes of the reference genome 9 | s) species=${OPTARG};; # Species under analysis: hg19, hg38, mm9, mm10, dm6, susScr3, susScr11 are available 10 | p) threads=${OPTARG};; # The number of threads to use 11 | esac 12 | done 13 | 14 | if [ $species = 'hg38' -o $species = 'hg19' ] 15 | then 16 | chromosomes=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY") 17 | elif [ $species = 'mm10' -o $species = 'mm9' ] 18 | then 19 | chromosomes=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chrX" "chrY") 20 | elif [ $species = 'dm6' ] 21 | then 22 | chromosomes=("chr2L" "chr2R" "chr3L" "chr3R" "chr4" "chrX" "chrY") 23 | elif [ $species = 'susScr3' -o $species = 'susScr11' ] 24 | then 25 | chromosomes=("chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chrX" "chrY") 26 | else 27 | echo "ERROR! Wrong species inserted! Check the species spelling or insert an available species: hg38, hg19, mm10, mm9, dm6, susScr3, susScr11. 
If your species is not listed, please contact Riccardo Calandrelli at ." 28 | exit 29 | fi 30 | 31 | ### Create output directory 32 | checkMakeDirectory(){ 33 | if [ ! -e "$1" ]; then 34 | mkdir -p "$1" 35 | fi 36 | } 37 | checkMakeDirectory $directory 38 | 39 | echo "Start fend file generation: $(date)" 40 | cd $directory 41 | 42 | python2.7 $hictoolPath"HiCtool_build_enzyme_fastq.py" -e $restrictionEnzyme 43 | 44 | echo -n "Aligning restriction sites for "$restrictionEnzyme" ... " 45 | (bowtie2 -p $threads -k 10000000 -x $genomeIndex -U restriction_enzyme.fastq -S restrictionsites.sam) 2>restrictionsites_log.txt 46 | echo "Done!" 47 | 48 | echo -n "Converting sam file to bed file ... " 49 | samtools view -b -@ $threads restrictionsites.sam | bedtools bamtobed -i > restrictionsites.bed 50 | rm restrictionsites.sam 51 | echo "Done!" 52 | 53 | echo "End fend file generation: $(date)" 54 | echo "Your fend file is restrictionsites.bed" 55 | -------------------------------------------------------------------------------- /scripts/HiCtool_global_map_observed.sh: -------------------------------------------------------------------------------- 1 | while getopts h:i:b:s:p: option 2 | do 3 | case "${option}" 4 | in 5 | h) hictool=${OPTARG};; # the path to the HiCtool scripts. 6 | i) input_file=${OPTARG};; # Project object file in .hdf5 format. 7 | b) bin_size=${OPTARG};; # Bin size. 8 | s) species=${OPTARG};; # Species. 9 | p) threads=${OPTARG};; # Number of parallel threads to be used. 10 | esac 11 | done 12 | 13 | dir=$(dirname "${input_file}") 14 | 15 | ### Create output directory 16 | checkMakeDirectory(){ 17 | echo -e "checking directory: $1" 18 | if [ ! 
-e "$1" ]; then 19 | echo -e "\tmakedir $1" 20 | mkdir -p "$1" 21 | fi 22 | } 23 | 24 | start_time=$(date) 25 | 26 | output_dir=$dir"/observed_"$bin_size"/" # output directory 27 | checkMakeDirectory $output_dir 28 | 29 | python2.7 $hictool"HiCtool_global_map_rows.py" \ 30 | -i $input_file \ 31 | -o $output_dir \ 32 | -b $bin_size \ 33 | -s $species \ 34 | -c $hictool"chromSizes/" \ 35 | -p $threads 36 | 37 | while read chromosome size; do 38 | cat $output_dir"matrix_full_line_observed_"$chromosome".txt" >> $output_dir"HiCtool_observed_global_"$bin_size".txt" 39 | done < $hictool"chromSizes/"$species".chrom.sizes" 40 | 41 | echo "Start generating global observed matrix: "$start_time 42 | echo "End generating global observed matrix: $(date)" 43 | 44 | ### Split each matrix line by single contact matrices 45 | echo -n "Splitting global matrix into single tab separated matrices ..." 46 | while read chromosome_row size_row; do 47 | matrix_line=$output_dir"matrix_full_line_observed_"$chromosome_row".txt" 48 | k=0 49 | while read chromosome_col size_col; do 50 | start=`expr $k + 1` 51 | end=`expr $k + $size_col / $bin_size` 52 | cut -d$'\t' -f"$start"-"$end" $matrix_line > $output_dir"chr"$chromosome_row"_chr"$chromosome_col"_"$bin_size".txt" 53 | k=`expr $k + $size_col / $bin_size` 54 | done < $hictool"chromSizes/"$species".chrom.sizes" 55 | done < $hictool"chromSizes/"$species".chrom.sizes" 56 | echo "Done!" 57 | 58 | rm $output_dir"matrix_full_line"* 59 | -------------------------------------------------------------------------------- /scripts/HiCtool_hifive.py: -------------------------------------------------------------------------------- 1 | # Program to run HiFive functions. 
2 | 3 | # Usage: python2.7 HiCtool_hifive.py [-h] [options] 4 | # Options: 5 | # -h, --help show this help message and exit 6 | # -f FEND_FILE Fend file from preprocessing 7 | # --b1 BAM_FILE_1 BAM file 1 for the first read of the pairs 8 | # --b2 BAM_FILE_2 BAM file 2 for the second read of the pairs 9 | # -e RESTRICTION_ENZYME Restriction enzyme (or restriction enzymes passed like a Python list: [enzyme1,enzyme2]). Choose between: HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl. Arima kit uses a combination of MboI and Hinfl. 10 | # -m MODEL Model you wish to use for the following normalization procedure: either 'Yaffe-Tanay' or 'Hi-Corrector'. 11 | # --add_gc Set to 1 if you also wish to normalize for the GC content, 0 otherwise. GC content information has to be present in the fend file. 12 | # --add_mappability Set to 1 if you also wish to normalize for the mappability score, 0 otherwise. Mappability score information has to be present in the fend file. 13 | 14 | # Output files: 15 | # HiC_project_object.hdf5, HiC_project_object_with_distance_parameters.hdf5 and HiC_norm_binning.hdf5 if model='Yaffe-Tanay' 16 | # HiC_project_object.hdf5 if model='Hi-Corrector' 17 | 18 | from optparse import OptionParser 19 | import hifive 20 | import os.path 21 | from os import path 22 | 23 | parameters = {'fend_file': None, 24 | 'bam_file_1': None, 25 | 'bam_file_2': None, 26 | 'restriction_enzyme': None, 27 | 'model': None, 28 | 'add_gc': None, 29 | 'add_mappability': None} 30 | 31 | class hi_five: 32 | def __init__(self, parameters): 33 | self.run_hifive(parameters) 34 | 35 | def run_hifive(self, parameters): 36 | 37 | fend_file = parameters['fend_file'] 38 | bam_file_1 = parameters['bam_file_1'] 39 | bam_file_2 = parameters['bam_file_2'] 40 | model = parameters['model'] 41 | add_gc = bool(parameters['add_gc']) 42 | add_mappability = bool(parameters['add_mappability']) 43 | 44 | restriction_enzymes = map(str, parameters['restriction_enzyme'].strip('[]').split(',')) 45 | if 
len(restriction_enzymes) == 1: 46 | restriction_enzyme = restriction_enzymes[0] 47 | else: 48 | restriction_enzyme = ','.join(restriction_enzymes) 49 | 50 | # Run for both models 51 | if not os.path.isfile('HiC_project_object.hdf5'): 52 | fend = hifive.Fend('fend_object.hdf5', mode='w') 53 | fend.load_fends(fend_file, re_name=restriction_enzyme, format='bed') 54 | fend.save() 55 | 56 | # Creating a HiCData object 57 | data = hifive.HiCData('HiC_data_object.hdf5', mode='w') 58 | data.load_data_from_bam('fend_object.hdf5', 59 | [bam_file_1,bam_file_2], 60 | maxinsert=500, 61 | skip_duplicate_filtering=False) 62 | data.save() 63 | 64 | # Creating a HiC Project object 65 | hic = hifive.HiC('HiC_project_object.hdf5', 'w') 66 | hic.load_data('HiC_data_object.hdf5') 67 | hic.save() 68 | 69 | if model == 'Yaffe-Tanay': 70 | if not os.path.isfile('HiC_norm_binning.hdf5'): 71 | # Filtering HiC fends 72 | hic = hifive.HiC('HiC_project_object.hdf5') 73 | hic.filter_fends(mininteractions=1, mindistance=0, maxdistance=0) 74 | 75 | # Finding HiC distance function 76 | hic.find_distance_parameters(numbins=90, minsize=200, maxsize=0) 77 | hic.save('HiC_project_object_with_distance_parameters.hdf5') 78 | 79 | # Learning correction parameters using the binning algorithm 80 | my_model = ['len','distance'] 81 | if add_gc == True: 82 | my_model.append('gc') 83 | if add_mappability == True: 84 | my_model.append('mappability') 85 | my_num_bins = [20] * len(my_model) 86 | my_parameters = ['even'] * len(my_model) 87 | hic.find_binning_fend_corrections(max_iterations=1000, 88 | mindistance=500000, 89 | maxdistance=0, 90 | num_bins=my_num_bins, 91 | model=my_model, 92 | parameters=my_parameters, 93 | usereads='cis', 94 | learning_threshold=1.0) 95 | hic.save('HiC_norm_binning.hdf5') 96 | 97 | if __name__ == '__main__': 98 | 99 | usage = 'Usage: python2.7 HiCtool_hifive.py [-h] [options]' 100 | parser = OptionParser(usage = 'python2.7 %prog -f fend_file --b1 bam_file_1 --b2 bam_file_2 -e 
restriction_enzyme -m model') 101 | parser.add_option('-f', dest='fend_file', type='string', help='Fend file from preprocessing.') 102 | parser.add_option('--b1', dest='bam_file_1', type='string', help='BAM file 1 for the first read of the pairs.') 103 | parser.add_option('--b2', dest='bam_file_2', type='string', help='BAM file 2 for the second read of the pairs.') 104 | parser.add_option('-e', dest='restriction_enzyme', type='string', help='Restriction enzyme (or restriction enzymes passed like a Python list: [enzyme1,enzyme2]). Choose between: HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl. Arima kit uses both MboI and Hinfl.') 105 | parser.add_option('-m', dest='model', type='string', help='Model you wish to use for the following normalization procedure: either "Yaffe-Tanay" or "Hi-Corrector".') 106 | parser.add_option('--add_gc', dest='add_gc', type='int', help='Set to 1 if you wish to normalize also for the GC content, 0 otherwise. GC content information has to be present in the fend file.') 107 | parser.add_option('--add_mappability', dest='add_mappability', type='int', help='Set to 1 if you wish to normalize also for the mappability score, 0 otherwise. 
Mappability score information has to be present in the fend file.') 108 | (options, args) = parser.parse_args( ) 109 | 110 | if options.fend_file == None: 111 | parser.error('-h for help or provide the fend file!') 112 | else: 113 | pass 114 | if options.bam_file_1 == None: 115 | parser.error('-h for help or provide the bam file 1!') 116 | else: 117 | pass 118 | if options.bam_file_2 == None: 119 | parser.error('-h for help or provide the bam file 2!') 120 | else: 121 | pass 122 | if options.restriction_enzyme == None: 123 | parser.error('-h for help or provide the restriction enzyme(s)!') 124 | else: 125 | pass 126 | if options.model == None: 127 | parser.error('-h for help or provide the model: "Yaffe-Tanay" or "Hi-Corrector"!') 128 | else: 129 | pass 130 | 131 | parameters['fend_file'] = options.fend_file 132 | parameters['bam_file_1'] = options.bam_file_1 133 | parameters['bam_file_2'] = options.bam_file_2 134 | parameters['restriction_enzyme'] = options.restriction_enzyme 135 | parameters['model'] = options.model 136 | parameters['add_gc'] = options.add_gc 137 | parameters['add_mappability'] = options.add_mappability 138 | 139 | if parameters['model'] in ['Yaffe-Tanay', 'Hi-Corrector']: 140 | print "Running HiFive functions for the model " + parameters['model'] + " ..." 141 | else: 142 | parser.error("Please insert the correct model, Yaffe-Tanay or Hi-Corrector.") 143 | 144 | hi_five(parameters) 145 | -------------------------------------------------------------------------------- /scripts/HiCtool_normalize_global_matrix.sh: -------------------------------------------------------------------------------- 1 | while getopts h:i:b:m:q:u:s: option 2 | do 3 | case "${option}" 4 | in 5 | h) hictool=${OPTARG};; # the path to the HiCtool scripts. 6 | i) input_mat_file=${OPTARG};; # the observed global contact matrix in tab delimited format. 7 | b) bin_size=${OPTARG};; # bin size. 8 | m) total_mem=${OPTARG};; # The memory size. Its unit is Megabytes (MB). 
9 | q) max_iter=${OPTARG};; # maximum number of iterations performed in the algorithm. 10 | u) row_sum_after_norm=${OPTARG};; # Row sum after the normalization (if not declared, automatically calculated as avg_matrix * number of rows). 11 | s) species=${OPTARG};; # species. 12 | esac 13 | done 14 | 15 | dir=$(pwd) 16 | total_rows=`wc -l $input_mat_file | awk '{print $1}'` 17 | 18 | if [ -z $row_sum_after_norm ] 19 | then 20 | echo "Rowsum after normalization not declared. Using avg_matrix * number of rows." 21 | a=$(awk '{for(i=1;i<=NF;i++)x+=$i;print x}' $input_mat_file | tail -n 1) 22 | row_sum_after_norm=$(( $a / $total_rows )) 23 | else 24 | echo "Using rowsum after normalization equal to "$row_sum_after_norm"." 25 | fi 26 | 27 | checkMakeDirectory(){ 28 | echo -e "checking directory: $1" 29 | if [ ! -e "$1" ]; then 30 | echo -e "\tmakedir $1" 31 | mkdir -p "$1" 32 | fi 33 | } 34 | 35 | start_time=$(date) 36 | 37 | output_dir=$dir"/output_ic_mes" # output directory 38 | checkMakeDirectory $output_dir 39 | 40 | has_header_line=0 # input file doesn't have header line 41 | has_header_column=0 # input file doesn't have header column 42 | 43 | ### Calculate biases 44 | cmd=$hictool"Hi-Corrector1.2/bin/ic_mes" 45 | bias_factor_file="$output_dir/output.bias" # output file consists of a vector of bias factors 46 | log_file="$output_dir/output.log" # log file recording the verbose console output of the ic command 47 | 48 | echo "$cmd $input_mat_file $total_mem $total_rows $max_iter $has_header_line $has_header_column $bias_factor_file > $log_file" 49 | $cmd $input_mat_file $total_mem $total_rows $max_iter $has_header_line $has_header_column $bias_factor_file > $log_file 50 | 51 | ### Generate normalized contact matrices 52 | cmd=$hictool"Hi-Corrector1.2/bin/export_norm_data" 53 | normalized_matrix="$output_dir/output_normalized.txt" # output file containing the normalized global contact matrix 54 | 55 | echo "$cmd $input_mat_file $total_rows $has_header_line $has_header_column 
$total_mem $bias_factor_file $row_sum_after_norm $normalized_matrix" 56 | $cmd $input_mat_file $total_rows $has_header_line $has_header_column $total_mem $bias_factor_file $row_sum_after_norm $normalized_matrix 57 | 58 | echo "Start data normalization time: "$start_time 59 | echo "End data normalization time: $(date)" 60 | 61 | output_directory=$dir"/normalized_"$bin_size 62 | checkMakeDirectory $output_directory 63 | output_mat_file="$(echo "$(basename "$input_mat_file")" | sed s/observed/normalized/)" 64 | mv $normalized_matrix $output_directory"/$output_mat_file" 65 | 66 | 67 | ### Split normalized contact matrix in lines by chromosome 68 | echo -n "Splitting global matrix into single tab separated matrices ..." 69 | k=0 70 | while read chromosome size; do 71 | start=`expr $k + 1` 72 | end=`expr $k + $size / $bin_size` 73 | quit=`expr $end + 1` 74 | sed -n "$start,"$end"p;"$quit"q" $output_directory"/$output_mat_file" > $output_directory"/matrix_full_line_normalized_"$chromosome".txt" 75 | k=`expr $k + $size / $bin_size` 76 | done < $hictool"chromSizes/"$species".chrom.sizes" 77 | 78 | ### Split each matrix line by single contact matrices 79 | while read chromosome_row size_row; do 80 | matrix_line=$output_directory"/matrix_full_line_normalized_"$chromosome_row".txt" 81 | k=0 82 | while read chromosome_col size_col; do 83 | start=`expr $k + 1` 84 | end=`expr $k + $size_col / $bin_size` 85 | cut -d$'\t' -f"$start"-"$end" $matrix_line > $output_directory"/chr"$chromosome_row"_chr"$chromosome_col"_"$bin_size".txt" 86 | k=`expr $k + $size_col / $bin_size` 87 | done < $hictool"chromSizes/"$species".chrom.sizes" 88 | done < $hictool"chromSizes/"$species".chrom.sizes" 89 | echo "Done!" 
90 | 91 | rm $output_directory"/matrix_full_line"* 92 | -------------------------------------------------------------------------------- /scripts/HiCtool_pre_truncation.py: -------------------------------------------------------------------------------- 1 | # Perform pre-truncation on reads that contain potential ligation junctions. To be executed before the mapping step. 2 | 3 | # Usage: python2.7 HiCtool_pre_truncation.py [-h] [options] 4 | # Options: 5 | # -h, --help show this help message and exit 6 | # -i INPUTFILES Input fastq file or input fastq files passed like a Python list. Example: [file1.fastq,file2.fastq]. 7 | # -e RESTRICTION_ENZYMES Restriction enzyme (or restriction enzymes passed like a Python list: [enzyme1,enzyme2]). Choose between: HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl. Arima kit uses a combination of MboI and Hinfl. 8 | # -p THREADS Number of parallel threads to use. 9 | 10 | # Output files: 11 | # Fastq files with pre-truncated reads. 12 | # Log files with pre-truncation information. 
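The header above only states that reads containing potential ligation junctions are pre-truncated. As a rough illustration of that idea for a single read (a sketch, not the script's actual code path), the block below keeps the longer piece flanking the junction and restores the restriction site next to it. The helper `truncate_read` and the constants `MBOI` and `HINDIII` are hypothetical names introduced here; the sequences are copied from the script's `re_info` table. The sketch is written for Python 3, unlike the script itself, assumes the quality string has the same length as the read, and ignores the degenerate Hinfl site.

```python
import re

# Restriction site (rs) and ligation junction (lj) sequences, copied from
# the re_info table used by the script.
MBOI = {"rs": "GATC", "lj": "GATCGATC"}
HINDIII = {"rs": "AAGCTT", "lj": "AAGCTAGCTT"}


def truncate_read(seq, qual, enzyme):
    """Hypothetical helper: truncate one read at its first ligation junction.

    Keeps the longer of the two pieces flanking the junction and restores the
    restriction site on the junction side. Returns (sequence, quality).
    """
    match = re.search(enzyme["lj"], seq)
    if match is None:
        return seq, qual  # no junction: the read is left untouched
    rs = enzyme["rs"]
    left, right = seq[:match.start()], seq[match.end():]
    if len(left) >= len(right):
        # Keep the piece before the junction and append the restriction site.
        new_seq = left + rs
        return new_seq, qual[:len(new_seq)]
    # Keep the piece after the junction and prepend the restriction site.
    new_seq = rs + right
    return new_seq, qual[len(seq) - len(new_seq):]


if __name__ == "__main__":
    print(truncate_read("AAAAAAAAAAGATCGATCTT", "I" * 20, MBOI))
```

Ties between the two pieces are resolved in favour of the left piece here, which is a simplification chosen for the sketch.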
13 | 14 | from optparse import OptionParser 15 | import os 16 | import re 17 | from time import gmtime, strftime 18 | from multiprocessing import Pool 19 | 20 | parameters = {'inputFiles': None, 21 | 'restriction_enzymes': None, 22 | 'threads': None} 23 | 24 | def pre_truncation(input_fastq): 25 | 26 | re_info = dict() # rs: restriction site sequence; lj: ligation junction sequence 27 | re_info['HindIII']={'rs':'AAGCTT','lj':'AAGCTAGCTT'} 28 | re_info['MboI']={'rs':'GATC','lj':'GATCGATC'} 29 | re_info['DpnII']={'rs':'GATC','lj':'GATCGATC'} 30 | re_info['Sau3AI']={'rs':'GATC','lj':'GATCGATC'} 31 | re_info['BglII']={'rs':'AGATCT','lj':'AGATCGATCT'} 32 | re_info['NcoI']={'rs':'CCATGG','lj':'CCATGCATGG'} 33 | re_info['Hinfl']={'rs':'GANTC','lj':'GA[ACGT]TA[ACGT]TC'} 34 | 35 | filename = os.path.splitext(input_fastq)[0] 36 | with open (input_fastq, 'r') as infile: 37 | lines = infile.readlines() 38 | count = 0 # to count reads with potential ligation junction 39 | lengths = [] # to save the length of the pieces after truncation 40 | n_reads = len(lines)/4 41 | percents = {} 42 | for n in range(0,101,10)[1:]: 43 | percents[str(n)+'%'] = int(n_reads*n*0.01) 44 | for i in xrange(1,len(lines),4): # iteration over lines containing the sequence 45 | for key, value in percents.iteritems(): 46 | if i/4 == value: 47 | print (key + ' completed - ' + input_fastq) 48 | # Search for the ligation junction 49 | for j in restriction_enzymes: 50 | match = re.search(re_info[j]['lj'],lines[i]) 51 | if match: # ligation junction present in the read sequence 52 | count += 1 53 | line=lines[i][0:-1] # remove the \n char at the end of the sequence 54 | pieces = [] 55 | pieces.append(line[:match.start()]) 56 | pieces.append(line[match.end():]) 57 | max_length = max(len(x) for x in pieces) 58 | if j != 'Hinfl': 59 | re_seq = re_info[j]['rs'] 60 | else: 61 | re_seq = match.group()[:4] + 'C' 62 | lengths.append(max_length + len(re_seq)) 63 | piece = [x for x in pieces if len(x) == max_length][0] 
64 | start_index = re.search(piece,line).start() 65 | match_index = match.start() 66 | if start_index < match_index: 67 | piece = piece + re_seq 68 | lines[i] = piece + '\n' 69 | lines[i+2] = lines[i+2][:-1][start_index:start_index+len(piece)] + '\n' 70 | elif start_index > match_index: 71 | piece = re_seq + piece 72 | lines[i] = piece + '\n' 73 | lines[i+2] = lines[i+2][:-1][start_index-len(re_seq):start_index-len(re_seq)+len(piece)] + '\n' 74 | 75 | with open (filename + '.trunc.fastq' ,'w') as fout: 76 | for i in xrange(len(lines)): 77 | fout.write(lines[i]) 78 | print ('100% completed - ' + input_fastq) 79 | 80 | with open (filename + '_log.txt', 'w') as outlog: 81 | outlog.write(str(len(lines)/4) + "\t" + str(len(lines[1])-1) + "\t" + str(count) + "\n") 82 | 83 | 84 | if __name__ == '__main__': 85 | 86 | usage = 'Usage: python2.7 HiCtool_pre_truncation.py [-h] [options]' 87 | parser = OptionParser(usage = 'python2.7 %prog -i inputFiles -e restriction_enzymes -p threads') 88 | parser.add_option('-i', dest='inputFiles', type='string', help='Input fastq file or input fastq files passed like a Python list. Example: [file1.fastq,file2.fastq].') 89 | parser.add_option('-e', dest='restriction_enzymes', type='string', help='Restriction enzyme (or restriction enzymes passed like a Python list: [enzyme1,enzyme2]). Choose between: HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl. 
Arima kit uses both MboI and Hinfl.') 90 | parser.add_option('-p', dest='threads', type='int', help='Number of parallel threads to use.') 91 | (options, args) = parser.parse_args() 92 | 93 | if options.inputFiles == None: 94 | parser.error('-h for help or provide the input fastq file(s)!') 95 | else: 96 | pass 97 | if options.restriction_enzymes == None: 98 | parser.error('-h for help or provide the restriction enzyme(s)!') 99 | else: 100 | pass 101 | 102 | parameters['inputFiles'] = options.inputFiles 103 | parameters['restriction_enzymes'] = options.restriction_enzymes 104 | parameters['threads'] = options.threads 105 | 106 | re_info = dict() # rs: restriction site sequence; lj: ligation junction sequence 107 | re_info['HindIII']={'rs':'AAGCTT','lj':'AAGCTAGCTT'} 108 | re_info['MboI']={'rs':'GATC','lj':'GATCGATC'} 109 | re_info['DpnII']={'rs':'GATC','lj':'GATCGATC'} 110 | re_info['Sau3AI']={'rs':'GATC','lj':'GATCGATC'} 111 | re_info['BglII']={'rs':'AGATCT','lj':'AGATCGATCT'} 112 | re_info['NcoI']={'rs':'CCATGG','lj':'CCATGCATGG'} 113 | re_info['Hinfl']={'rs':'GANTC','lj':'GA[ACGT]TA[ACGT]TC'} 114 | 115 | inputFiles = map(str, parameters['inputFiles'].strip('[]').split(',')) 116 | restriction_enzymes = map(str, parameters['restriction_enzymes'].strip('[]').split(',')) 117 | threads = parameters['threads'] 118 | 119 | # Check that all the restriction enzymes are available 120 | for i in restriction_enzymes: 121 | if i not in re_info.keys(): 122 | parser.error(i + " is not among the available restriction enzymes (HindIII, MboI, DpnII, Sau3AI, BglII, NcoI, Hinfl)! 
Check the spelling or contact Riccardo Calandrelli at .") 123 | 124 | if threads > 1: 125 | if threads > len(inputFiles): 126 | threads = len(inputFiles) 127 | if len(restriction_enzymes) == 1: 128 | print ("Pre-truncation in parallel (restriction enzyme: " + restriction_enzymes[0] + ") using " + str(threads) + " threads ...") 129 | else: 130 | print ("Pre-truncation in parallel (restriction enzymes: " + ", ".join(restriction_enzymes) + ") using " + str(threads) + " threads ...") 131 | print ("Start: " + strftime("%Y-%m-%d %H:%M:%S", gmtime())) 132 | pool = Pool(processes=threads) 133 | pool.map(pre_truncation, inputFiles) 134 | print ("End: " + strftime("%Y-%m-%d %H:%M:%S", gmtime())) 135 | else: 136 | if len(restriction_enzymes) == 1: 137 | print ("Pre-truncation (restriction enzyme: " + restriction_enzymes[0] + ") using a single thread ...") 138 | else: 139 | print ("Pre-truncation (restriction enzymes: " + ", ".join(restriction_enzymes) + ") using a single thread ...") 140 | print ("Start: " + strftime("%Y-%m-%d %H:%M:%S", gmtime())) 141 | for i in inputFiles: 142 | pre_truncation(i) 143 | print ("End: " + strftime("%Y-%m-%d %H:%M:%S", gmtime())) 144 | -------------------------------------------------------------------------------- /scripts/HiCtool_utilities.py: -------------------------------------------------------------------------------- 1 | # HiCtool utility functions to work with HiCtool generated data. 2 | 3 | def save_matrix(a_matrix, output_file): 4 | """ 5 | Save an intra-chromosomal contact matrix in the HiCtool compressed format to txt file. 6 | 1) The upper-triangular part of the matrix is selected (including the 7 | diagonal). 8 | 2) Data are reshaped to form a vector. 9 | 3) All the consecutive zeros are replaced with a "0" followed by the 10 | number of times zeros are repeated consecutively. 11 | 4) Data are saved to a txt file. 
12 | Arguments: 13 | a_matrix (numpy matrix): input contact matrix to be saved 14 | output_file (str): output file name in txt format 15 | Output: 16 | txt file containing the formatted data 17 | """ 18 | import numpy as np 19 | 20 | n = len(a_matrix) 21 | iu = np.triu_indices(n) 22 | vect = a_matrix[iu].tolist() 23 | with open (output_file,'w') as fout: 24 | k = len(vect) 25 | i = 0 26 | count = 0 27 | flag = False # flag to set if the end of the vector has been reached 28 | while i < k and flag == False: 29 | if vect[i] == 0: 30 | count+=1 31 | if (i+count == k): 32 | w_out = str(0) + str(count) 33 | fout.write('%s\n' %w_out) 34 | flag = True 35 | break 36 | while vect[i+count] == 0 and flag == False: 37 | count+=1 38 | if (i+count == k): 39 | w_out = str(0) + str(count) 40 | fout.write('%s\n' %w_out) 41 | flag = True 42 | break 43 | if flag == False: 44 | w_out = str(0) + str(count) 45 | fout.write('%s\n' %w_out) 46 | i+=count 47 | count = 0 48 | else: 49 | fout.write('%s\n' %vect[i]) 50 | i+=1 51 | 52 | def load_matrix(input_file): 53 | """ 54 | Load an HiCtool compressed square (and symmetric) contact matrix from a txt file and parse it. 55 | Arguments: 56 | input_file (str): input file name in txt format (generated by the function 57 | "save_matrix"). 58 | Return: 59 | numpy array containing the parsed values stored in the input txt file to build a contact matrix. 60 | """ 61 | import numpy as np 62 | 63 | print "Loading " + input_file + "..." 
64 | with open (input_file,'r') as infile: 65 | matrix_vect = [] 66 | for i in infile: 67 | if i[0] == "0" and i[1] != ".": 68 | for k in xrange(int(i[1:-1])): 69 | matrix_vect.append(0) 70 | else: 71 | j = i[:-1] 72 | matrix_vect.append(float(j)) 73 | 74 | k = len(matrix_vect) 75 | matrix_size = int((-1+np.sqrt(1+8*k))/2) 76 | 77 | iu = np.triu_indices(matrix_size) 78 | output_matrix_1 = np.zeros((matrix_size,matrix_size)) # upper triangular plus the diagonal 79 | output_matrix_1[iu] = matrix_vect 80 | 81 | diag_matrix = np.diag(np.diag(output_matrix_1)) # diagonal 82 | output_matrix_2 = np.transpose(output_matrix_1) # lower triangular plus the diagonal 83 | output_matrix = output_matrix_1 + output_matrix_2 - diag_matrix 84 | print "Done!" 85 | return output_matrix 86 | 87 | def save_matrix_rectangular(a_matrix, output_file): 88 | """ 89 | Save an inter-chromosomal contact matrix in the HiCtool compressed format to txt file. 90 | 1) Data are reshaped to form a vector. 91 | 2) All the consecutive zeros are replaced with a "0" followed by the 92 | number of times zeros are repeated consecutively. 93 | 3) Data are saved to a txt file. 
94 | Arguments: 95 | a_matrix (numpy matrix): input contact matrix to be saved 96 | output_file (str): output file name in txt format 97 | Output: 98 | txt file containing the formatted data 99 | """ 100 | import numpy as np 101 | n_row = np.shape(a_matrix)[0] 102 | n_col = np.shape(a_matrix)[1] 103 | vect = np.reshape(a_matrix,[1,n_row*n_col]).tolist()[0] 104 | with open (output_file,'w') as fout: 105 | k = len(vect) 106 | i = 0 107 | count = 0 108 | flag = False # flag to set if the end of the vector has been reached 109 | while i < k and flag == False: 110 | if vect[i] == 0: 111 | count+=1 112 | if (i+count == k): 113 | w_out = str(0) + str(count) 114 | fout.write('%s\n' %w_out) 115 | flag = True 116 | break 117 | while vect[i+count] == 0 and flag == False: 118 | count+=1 119 | if (i+count == k): 120 | w_out = str(0) + str(count) 121 | fout.write('%s\n' %w_out) 122 | flag = True 123 | break 124 | if flag == False: 125 | w_out = str(0) + str(count) 126 | fout.write('%s\n' %w_out) 127 | i+=count 128 | count = 0 129 | else: 130 | fout.write('%s\n' %vect[i]) 131 | i+=1 132 | 133 | 134 | def load_matrix_rectangular(input_file, n_row, n_col): 135 | """ 136 | Load an HiCtool compressed rectangular contact matrix from a txt file and parse it. 137 | Arguments: 138 | input_file (str): input file name in txt format (generated by the function 139 | "save_matrix_rectangular") 140 | n_row (int): number of rows of the matrix. 141 | n_col (int): number of columns of the matrix. 142 | Return: 143 | output_matrix: numpy array the parsed values stored in the input txt file to build a contact matrix. 144 | """ 145 | import numpy as np 146 | 147 | print "Loading " + input_file + "..." 
148 | with open (input_file,'r') as infile: 149 | matrix_vect = [] 150 | for i in infile: 151 | if i[0] == "0" and i[1] != ".": 152 | for k in xrange(int(i[1:-1])): 153 | matrix_vect.append(0) 154 | else: 155 | j = i[:-1] 156 | matrix_vect.append(float(j)) 157 | 158 | output_matrix = np.reshape(np.array(matrix_vect),[n_row,n_col]) 159 | print "Done!" 160 | return output_matrix 161 | 162 | 163 | def save_matrix_tab(input_matrix, output_filename): 164 | """ 165 | Save a contact matrix in a txt file in a tab separated format. Columns are 166 | separated by tabs, rows are in different lines. 167 | Arguments: 168 | input_matrix (numpy matrix): input contact matrix to be saved 169 | output_filename (str): output file name in txt format 170 | Output: 171 | txt file containing the tab separated data 172 | """ 173 | with open (output_filename, 'w') as f: 174 | for i in xrange(len(input_matrix)): 175 | row = [str(j) for j in input_matrix[i]] 176 | f.write('\t'.join(row) + '\n') 177 | 178 | def load_matrix_tab(input_file): 179 | """ 180 | Load a contact matrix saved in a tab separated format using the function 181 | "save_matrix_tab". 182 | Arguments: 183 | input_file (str): input contact matrix to be loaded. 184 | Return: 185 | numpy array containing the parsed values stored in the input tab separated txt file to build a contact matrix. 186 | """ 187 | import numpy as np 188 | 189 | print "Loading " + input_file + "..." 190 | with open (input_file, 'r') as infile: 191 | lines = infile.readlines() 192 | temp = [] 193 | for line in lines: 194 | row = [float(i) for i in line.strip().split('\t')] 195 | temp.append(row) 196 | 197 | output_matrix = np.array(temp) 198 | print "Done!" 199 | return output_matrix 200 | 201 | 202 | def load_DI_values(input_file): 203 | """ 204 | Load a DI txt file generated with "calculate_chromosome_DI". 205 | Arguments: 206 | input_file (str): input file name in txt format. 207 | Return: 208 | List of the DI values. 
209 | """ 210 | import numpy as np 211 | 212 | fp = open(input_file,'r+') 213 | lines = fp.read().split('\n') 214 | lines = lines[:-1] 215 | di_values = (np.nan_to_num(np.array(map(float, lines)))).tolist() 216 | return di_values 217 | 218 | 219 | def load_hmm_states(input_file): 220 | """ 221 | Load an HMM txt file generated with "calculate_chromosome_hmm_states". 222 | Arguments: 223 | input_file (str): input file name in txt format. 224 | Returns: 225 | List of the HMM states. 226 | """ 227 | fp = open(input_file,'r+') 228 | lines = fp.read().split('\n') 229 | lines = lines[:-1] 230 | likelystates = map(int,lines) 231 | return likelystates 232 | 233 | 234 | def save_topological_domains(a_matrix, output_file): 235 | """ 236 | Function to save the topological domains coordinates to text file. 237 | Each topological domain coordinates (start and end) occupy one row and are 238 | tab separated. 239 | Arguments: 240 | a_matrix (numpy matrix): file to be saved with topological domains coordinates. 241 | output_file (str): output file name in txt format. 242 | Output: 243 | Tab separated txt file with topological domain start and end coordinates. 244 | """ 245 | def compile_row_string(a_row): 246 | return str(a_row).strip(']').strip('[').lstrip().replace(' ','\t') 247 | with open(output_file, 'w') as f: 248 | for row in a_matrix: 249 | f.write(compile_row_string(row)+'\n') 250 | 251 | 252 | def load_topological_domains(input_file): 253 | """ 254 | Function to load the topological domains coordinates from txt file. 255 | Arguments: 256 | input_file (str): input file name generated with "calculate_topological_domains" in txt format. 257 | Return: 258 | List of lists with topological domain coordinates. 259 | """ 260 | import csv 261 | print "Loading topological domain coordinates..." 
262 | with open(input_file, 'r') as f: 263 | reader = csv.reader(f, dialect='excel', delimiter='\t') 264 | topological_domains = [] 265 | for row in reader: 266 | row_int = [int(x) for x in row] 267 | topological_domains.append(row_int) 268 | print "Done!" 269 | return topological_domains 270 | -------------------------------------------------------------------------------- /scripts/chromSizes/dm6.chrom.sizes: -------------------------------------------------------------------------------- 1 | 2L 23513712 2 | 2R 25286936 3 | 3L 28110227 4 | 3R 32079331 5 | 4 1348131 6 | X 23542271 7 | Y 3667352 8 | -------------------------------------------------------------------------------- /scripts/chromSizes/hg19.chrom.sizes: -------------------------------------------------------------------------------- 1 | 1 249250621 2 | 2 243199373 3 | 3 198022430 4 | 4 191154276 5 | 5 180915260 6 | 6 171115067 7 | 7 159138663 8 | 8 146364022 9 | 9 141213431 10 | 10 135534747 11 | 11 135006516 12 | 12 133851895 13 | 13 115169878 14 | 14 107349540 15 | 15 102531392 16 | 16 90354753 17 | 17 81195210 18 | 18 78077248 19 | 19 59128983 20 | 20 63025520 21 | 21 48129895 22 | 22 51304566 23 | X 155270560 24 | Y 59373566 25 | -------------------------------------------------------------------------------- /scripts/chromSizes/hg38.chrom.sizes: -------------------------------------------------------------------------------- 1 | 1 248956422 2 | 2 242193529 3 | 3 198295559 4 | 4 190214555 5 | 5 181538259 6 | 6 170805979 7 | 7 159345973 8 | 8 145138636 9 | 9 138394717 10 | 10 133797422 11 | 11 135086622 12 | 12 133275309 13 | 13 114364328 14 | 14 107043718 15 | 15 101991189 16 | 16 90338345 17 | 17 83257441 18 | 18 80373285 19 | 19 58617616 20 | 20 64444167 21 | 21 46709983 22 | 22 50818468 23 | X 156040895 24 | Y 57227415 25 | -------------------------------------------------------------------------------- /scripts/chromSizes/mm10.chrom.sizes: 
-------------------------------------------------------------------------------- 1 | 1 195471971 2 | 2 182113224 3 | 3 160039680 4 | 4 156508116 5 | 5 151834684 6 | 6 149736546 7 | 7 145441459 8 | 8 129401213 9 | 9 124595110 10 | 10 130694993 11 | 11 122082543 12 | 12 120129022 13 | 13 120421639 14 | 14 124902244 15 | 15 104043685 16 | 16 98207768 17 | 17 94987271 18 | 18 90702639 19 | 19 61431566 20 | X 171031299 21 | Y 91744698 22 | -------------------------------------------------------------------------------- /scripts/chromSizes/mm9.chrom.sizes: -------------------------------------------------------------------------------- 1 | 1 197195432 2 | 2 181748087 3 | 3 159599783 4 | 4 155630120 5 | 5 152537259 6 | 6 149517037 7 | 7 152524553 8 | 8 131738871 9 | 9 124076172 10 | 10 129993255 11 | 11 121843856 12 | 12 121257530 13 | 13 120284312 14 | 14 125194864 15 | 15 103494974 16 | 16 98319150 17 | 17 95272651 18 | 18 90772031 19 | 19 61342430 20 | X 166650296 21 | Y 15902555 22 | -------------------------------------------------------------------------------- /scripts/chromSizes/susScr11.chrom.sizes: -------------------------------------------------------------------------------- 1 | 1 274330532 2 | 2 151935994 3 | 3 132848913 4 | 4 130910915 5 | 5 104526007 6 | 6 170843587 7 | 7 121844099 8 | 8 138966237 9 | 9 139512083 10 | 10 69359453 11 | 11 79169978 12 | 12 61602749 13 | 13 208334590 14 | 14 141755446 15 | 15 140412725 16 | 16 79944280 17 | 17 63494081 18 | 18 55982971 19 | X 125939595 20 | Y 43547828 21 | -------------------------------------------------------------------------------- /scripts/chromSizes/susScr3.chrom.sizes: -------------------------------------------------------------------------------- 1 | 1 315321322 2 | 2 162569375 3 | 3 144787322 4 | 4 143465943 5 | 5 111506441 6 | 6 157765593 7 | 7 134764511 8 | 8 148491826 9 | 9 153670197 10 | 10 79102373 11 | 11 87690581 12 | 12 63588571 13 | 13 218635234 14 | 14 153851969 15 | 15 157681621 16 | 16 
86898991
17 69701581
18 61220071
X 144288218
Y 1637716
--------------------------------------------------------------------------------
/tutorial/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/tutorial/.DS_Store
--------------------------------------------------------------------------------
/tutorial/HiCtool_compressed_format.md:
--------------------------------------------------------------------------------
# HiCtool compressed format

HiCtool contact map storage is documented in the Supplementary Material of [Calandrelli et al. (2018). GITAR: An open source tool for analysis and visualization of Hi-C data. *Genomics, proteomics & bioinformatics.*](https://www.sciencedirect.com/science/article/pii/S1672022918304339#s0055).

## Intra-chromosomal compressed format

Data are compressed by exploiting two properties of contact maps: they are symmetric (contacts between loci i and j are the same as those between loci j and i) and usually sparse, since most of the elements are zeros, and the sparsity increases as the bin size decreases. Given these two properties, there is no need to save mirrored data, and it is useful to “compress” the runs of zeros within the matrices. To accomplish this, we first select only the upper-triangular part of the contact matrix (including the diagonal) and reshape the data by rows to form a vector. We then replace each run of consecutive zeros in the vector with a “0” followed by the number of zeros in the run; all the non-zero elements are left as they are. Finally, the data are saved in a txt file.
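
The compression steps described above can be sketched in a few lines of pure Python. This is a minimal illustration and not the code shipped in `HiCtool_utilities.py`: the function name `compress_symmetric` and the toy 4 × 4 matrix are assumptions made for the example, and the matrix is a plain list of lists rather than a numpy matrix.

```python
def compress_symmetric(matrix):
    """Return the compressed lines for a square, symmetric matrix.

    Hypothetical helper for illustration only; `matrix` is a list of lists.
    """
    n = len(matrix)
    # Steps 1-2: keep the upper triangle (diagonal included), read row by row.
    vect = [matrix[i][j] for i in range(n) for j in range(i, n)]

    lines = []
    zeros = 0  # length of the current run of zeros
    for value in vect:
        if value == 0:
            zeros += 1
            continue
        if zeros:
            # Step 3: a run of zeros becomes "0" followed by the run length.
            lines.append("0" + str(zeros))
            zeros = 0
        lines.append(str(value))
    if zeros:
        # Flush a run of zeros that reaches the end of the vector.
        lines.append("0" + str(zeros))
    return lines


# Toy symmetric and sparse matrix (made-up values).
example = [
    [5, 0, 0, 2],
    [0, 0, 0, 0],
    [0, 0, 3, 0],
    [2, 0, 0, 1],
]
compressed = compress_symmetric(example)
print(compressed)
```

Step 4 would simply write each element of `compressed` on its own line of a txt file. Note that even a single isolated zero is written as `01`, so every line starting with `0` and not followed by a decimal point can be read back as a run of zeros.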

![](/figures/HiCtool_compression.png)

The figure shows a simplified example of the compression workflow, where the intra-chromosomal contact matrix is represented by a 4 × 4 symmetric and sparse matrix. (1) The upper-triangular part of the matrix is selected (including the diagonal); (2) data are reshaped to form a vector; (3) all the consecutive zeros are replaced with a “0” followed by the number of zeros that are repeated consecutively; (4) data are saved into a txt file.

### Comparison between full and compressed data for the entire human genome (only intra-chromosomal maps)

| Data type | Bin size (kb) | Storage usage txt format (MB) | Storage usage zip format (MB) | Saving time (min:sec) | Loading time (min:sec) |
|-----------|---------------|-------------------------------|-------------------------------|-----------------------|------------------------|
| Full      | 1000          | 6.1                           | 2.7                           | 00:01                 | 00:01                  |
| Optimized | 1000          | 2.9 (48%)                     | 1.4 (52%)                     | 00:01 (100%)          | 00:01 (100%)           |
| Full      | 100           | 341.4                         | 105.9                         | 01:07                 | 00:30                  |
| Optimized | 100           | 110.4 (32%)                   | 49.9 (47%)                    | 00:25 (37%)           | 00:17 (57%)            |
| Full      | 40            | 1566.7                        | 213                           | 06:18                 | 03:24                  |
| Optimized | 40            | 208.6 (13%)                   | 93.1 (44%)                    | 01:50 (29%)           | 01:14 (36%)            |
| Full      | 10            | 19,957.80                     | 451.3                         | 90:16                 | 85:41                  |
| Optimized | 10            | 393.6 (2%)                    | 177.1 (39%)                   | 26:31 (29%)           | 18:49 (22%)            |

Hardware: 2.9 GHz Intel Core i5, 16 GB of RAM. The percentages in parentheses give the optimized/full ratio at each resolution. kb, kilobase; MB, megabyte.

## Inter-chromosomal compressed format

The same compression workflow applies to the inter-chromosomal contact matrix, **except for the selection of the upper triangular matrix (step 1)**, which is skipped: the entire matrix is considered (since the matrix is rectangular).
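
Because the whole matrix is stored for inter-chromosomal maps, reading the file back requires the matrix shape to be supplied. The sketch below illustrates this with pure Python; it is not the HiCtool implementation (the shipped loader is `load_matrix_rectangular` in `HiCtool_utilities.py`), and the names `decompress` and `to_rectangular` as well as the sample lines are assumptions made for the example.

```python
def decompress(lines):
    """Expand compressed lines back into a flat list of floats.

    Hypothetical helper: a line starting with "0" that is not a decimal
    number (e.g. "03") encodes a run of zeros; anything else is a value.
    """
    values = []
    for line in lines:
        token = line.strip()
        if token.startswith("0") and not token.startswith("0."):
            values.extend([0.0] * int(token[1:]))
        else:
            values.append(float(token))
    return values


def to_rectangular(values, n_row, n_col):
    """Reshape a flat list into n_row rows of n_col elements each."""
    if len(values) != n_row * n_col:
        raise ValueError("vector length does not match the requested shape")
    return [values[r * n_col:(r + 1) * n_col] for r in range(n_row)]


# Made-up compressed content for a 2 x 3 inter-chromosomal matrix.
sample_lines = ["02", "4", "01", "7.5", "01"]
matrix = to_rectangular(decompress(sample_lines), n_row=2, n_col=3)
print(matrix)
```

The number of rows and columns cannot be recovered from the file alone, which is why they have to be passed explicitly when loading a rectangular matrix.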

--------------------------------------------------------------------------------
/tutorial/HiCtool_flowchart.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Zhong-Lab-UCSD/HiCtool/528e865ce0a3139bc17d07c78ced4e44a7b399d2/tutorial/HiCtool_flowchart.png
--------------------------------------------------------------------------------
/tutorial/HiCtool_utility_code.md:
--------------------------------------------------------------------------------
# HiCtool utility code

We provide a series of utility functions in [HiCtool_utilities.py](/scripts/HiCtool_utilities.py) that let you work with HiCtool-generated data in the Python environment, such as:

- Loading and saving HiCtool contact matrices (in either compressed or tab-separated format).
- Loading DI values or HMM states.
- Loading topological domain coordinates.

These functions allow you to use HiCtool data for additional analyses and then save your processed data back to file so that they can be plotted using HiCtool. You can also take contact matrices generated with other software, parse them into a numpy matrix, and save them with the saving functions so that they can be normalized or plotted, for example.

First, open your Python or iPython shell and execute the script:
```Python
execfile('HiCtool_utilities.py')
```
Then, call the function you need (see the docstring of each function inside the script for usage):

- ``save_matrix`` to save a square and symmetric contact matrix (intra-chromosomal or global matrix) in the HiCtool compressed format.
- ``load_matrix`` to load a square and symmetric contact matrix (intra-chromosomal or global matrix) in the HiCtool compressed format.
- ``save_matrix_rectangular`` to save a rectangular contact matrix (inter-chromosomal) in the HiCtool compressed format.
- ``load_matrix_rectangular`` to load a rectangular contact matrix (inter-chromosomal) in the HiCtool compressed format.
- ``save_matrix_tab`` to save any contact matrix in the tab-separated format.
- ``load_matrix_tab`` to load any contact matrix in the tab-separated format.
- ``load_DI_values`` to load DI values.
- ``load_hmm_states`` to load HMM states.
- ``save_topological_domains`` to save topological domain coordinates in tab-separated format.
- ``load_topological_domains`` to load topological domain coordinates in tab-separated format.

--------------------------------------------------------------------------------
/tutorial/ReadMe.md:
--------------------------------------------------------------------------------
# HiCtool Tutorial

The sections of this tutorial have to be followed in sequence. At the normalization step, choose the preferred method. Note that A/B compartment analysis can be performed only if the Yaffe-Tanay method is used, since it allows you to extract the O/E contact matrix and the Pearson correlation matrix. The Hi-Corrector method is usually preferred if you have deeply sequenced data, and it allows both intra- and inter-chromosomal analysis.

All the scripts used in the tutorial are inside the main folder [scripts](https://github.com/Zhong-Lab-UCSD/HiCtool/tree/master/scripts). In this tutorial, all the commands assume that you have downloaded the ``HiCtool-master`` folder inside your working directory; otherwise, update the paths accordingly when you run the commands.

![](/tutorial/HiCtool_flowchart.png)

## [1. Data preprocessing](/tutorial/data-preprocessing.md)
## 2. Data normalization and visualization
The **explicit-factor correction model of Yaffe and Tanay** is applied to normalize and visualize only intra-chromosomal contact data, and it takes into account explicit factors such as fragment length, GC content and mappability score. It has to be used to perform A/B compartment analysis.

The **matrix balancing approach of Hi-Corrector** is used to globally normalize and visualize intra- and inter-chromosomal contact maps. It does not consider specific factors for normalization and it is faster. Both methods are fine for TAD analysis.
- ### [2.1. Explicit-factor correction model of Yaffe and Tanay](/tutorial/normalization-yaffe-tanay.md)
- ### [2.2. Matrix balancing approach of Hi-Corrector](/tutorial/normalization-matrix-balancing.md)
## [3. A/B compartment analysis](/tutorial/compartment.md)
## [4. TAD analysis](/tutorial/tad-analysis.md)

***
## Supplementary information

- ### [HiCtool utility code](/tutorial/HiCtool_utility_code.md)
- ### [HiCtool contact matrix compressed format](/tutorial/HiCtool_compressed_format.md)

--------------------------------------------------------------------------------
/tutorial/compartment.md:
--------------------------------------------------------------------------------
# A/B compartment analysis

This section shows how to calculate the principal components (PCs) of the Pearson correlation matrix, which can be used to delineate A/B compartments in Hi-C data at low resolution (usually 1 Mb or 500 kb). It is possible to calculate both PC1 and PC2. Usually, the sign of the first eigenvector (PC1) indicates the compartment.

Note that comparing PC values (either PC1 or PC2) between experiments may not be appropriate. While PC1 usually correlates with active and inactive compartments, the nature of this association (i.e.
which PC sign (+,-) is associated with active or inactive regions) may differ among experiments. Instead, it is always recommended to compare the interaction profiles of a specific genomic locus towards the other loci.

In order to directly compare eigenvectors between two experiments, first make sure that the association between PC1 and active and inactive compartments is the same for both experiments. Usually, "A compartments" (positive PC values) are considered associated with gene-rich and active genomic regions, while "B compartments" (negative PC values) are associated with gene-poor and inactive regions. To make the sign of the eigenvector match this convention, you could calculate the correlation of the eigenvector (separately for each chromosome) with a gene density track at the same resolution (1 Mb for example). If the correlation is positive, the eigenvector is left as it is; otherwise the sign of its values should be flipped.

## Table of contents

1. [Calculating the principal component](#1-calculating-the-principal-component)
2. [Plotting the principal component](#2-plotting-the-principal-component)

## 1. Calculating the principal component

HiCtool allows you to calculate either the first (typically used) or the second principal component of the Pearson correlation matrix. To do so, the Pearson correlation matrix has to be calculated first as presented [here](/tutorial/normalization-yaffe-tanay.md#22-normalizing-enrichment-oe-data-and-calculating-the-pearson-correlation-matrix).

This is the code to calculate the first principal component:
```unix
python2.7 /HiCtool-master/scripts/HiCtool_compartment_analysis.py \
--action calculate_pc \
-c /HiCtool-master/scripts/chromSizes/ \
-b 1000000 \
-s hg38 \
--chr 6 \
--pc PC1
```

where:

- ``--action``: Action to perform (here ``calculate_pc``).
- ``-c``: Path to the folder ``chromSizes`` with trailing slash at the end ``/``.
- ``-b``: The bin size (resolution) for the analysis.
- ``-s``: Species name.
- ``--chr``: The chromosome to be used.
- ``--pc``: Which principal component to be returned (either ``PC1`` or ``PC2``).
- ``--flip``: Set to ``-1`` if you wish to flip PC values.

The output is a txt file with the values of the selected principal component, in this case ``chr6_1000000_PC1.txt`` inside the folder ``./yaffe_tanay_1000000/``.

The parameter ``--chr`` can also be used to pass multiple chromosomes at once (as a list between square brackets). Multi-processing is also available if your machine supports it, using the parameter ``-p`` followed by the number of processors to be used in parallel (for example ``-p 2``):
```unix
chromosomes=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y]

python2.7 /HiCtool-master/scripts/HiCtool_compartment_analysis.py \
--action calculate_pc \
-c /HiCtool-master/scripts/chromSizes/ \
-b 1000000 \
-s hg38 \
--chr $chromosomes \
--pc PC1
```

## 2. Plotting the principal component

The following code plots the principal component values as a barplot, which helps to visually delineate compartments. This plot is usually presented together with the Pearson correlation matrix.

```unix
python2.7 /HiCtool-master/scripts/HiCtool_compartment_analysis.py \
--action plot_pc \
-i ./yaffe_tanay_1000000/chr6_1000000_PC1.txt \
-c /HiCtool-master/scripts/chromSizes/ \
-b 1000000 \
-s hg38 \
--chr 6 \
--pc PC1 \
--plot_grid 0 \
--plot_axis 1
```

where:

- ``--action``: Action to perform (here ``plot_pc``).
- ``-i``: Input file with the principal component values calculated above.
- ``-c``: Path to the folder ``chromSizes`` with trailing slash at the end ``/``.
- ``-b``: The bin size (resolution) for the analysis.
- ``-s``: Species name.
- ``--chr``: The chromosome to be used.
- ``--pc``: Which principal component to plot (the same one you calculated above).
- ``--plot_grid``: Set to 1 if you wish to plot the grid.
- ``--plot_axis``: Set to 0 or remove this parameter if you do not wish to plot the axis. This may be useful to obtain a simpler plot that you may want to visualize together with the correlation matrix.


![](/figures/HiCtool_chr6_1mb_PC1.png)

--------------------------------------------------------------------------------
/tutorial/tad-analysis.md:
--------------------------------------------------------------------------------
# TAD analysis

This pipeline illustrates the procedure to calculate topologically associated domain (TAD) coordinates and to visualize TADs and other features. Topological domain coordinates should be calculated on the normalized data (as we do here).

**Note!** Contact data must be at 40 kb bin size to perform TAD analysis!

## Table of Contents

1. [Performing TAD analysis](#1-performing-tad-analysis)
2. [Plotting TADs and more](#2-plotting-tads-and-more)
   - [2.1. Plotting TADs on the heatmap](#21-plotting-tads-on-the-heatmap)
   - [2.2. Plotting DI values and HMM states](#22-plotting-di-values-and-hmm-states)

## 1. Performing TAD analysis

TAD coordinates, as well as DI values and true DI values (HMM states), are calculated as described by [Dixon et al. (2012)](http://www.nature.com/nature/journal/v485/n7398/abs/nature11082.html).

**Note!** Contact data must be at 40 kb bin size to perform TAD analysis!

To perform TAD analysis (calculating DI, HMM states and topological domain coordinates) we use the function ``full_tad_analysis`` of [HiCtool_TAD_analysis.py](/scripts/HiCtool_TAD_analysis.py), with each intra-chromosomal contact matrix as input file.

Here we take chromosome 6 as an example. For this case, you may have normalized the data either with the Hi-Corrector approach (``./normalized_40000/chr6_chr6_40000.txt``) or with the approach from Yaffe and Tanay (``./yaffe_tanay_40000/chr6_40000_normalized_fend.txt``). In the latter case, remember to set ``--tab_sep 0`` below.
```unix
python2.7 ./HiCtool-master/scripts/HiCtool_TAD_analysis.py \
--action full_tad_analysis \
-i ./normalized_40000/chr6_chr6_40000.txt \
-c ./HiCtool-master/scripts/chromSizes/ \
-s hg38 \
--isGlobal 0 \
--tab_sep 1 \
--chr 6 \
--data_type normalized
```
where:

- ``--action``: Action to perform (here ``full_tad_analysis``).
- ``-i``: Input contact matrix file.
- ``-c``: Path to the folder ``chromSizes`` with trailing slash at the end ``/``.
- ``-s``: Species name.
- ``--isGlobal``: 1 if the input matrix is a global matrix, 0 otherwise.
- ``--tab_sep``: 1 if the input matrix is in a tab separated format, 0 if it is in compressed format.
- ``--chr``: Chromosome or chromosomes (as a list between square brackets) on which to perform the TAD analysis.
- ``--data_type``: Data type to label your data, here ``normalized``.

This script will produce three output files inside the folder ``tad_analysis``:

- ``HiCtool_chr6_DI.txt`` which contains the DI values.
- ``HiCtool_chr6_hmm_states.txt`` which contains the HMM states extracted from the DI values.
- ``HiCtool_chr6_topological_domains.txt`` which contains topological domain coordinates in a tab separated format with two columns.
Each row is a topological domain: the first column is the start coordinate, the second column is the end coordinate.

**Note!** The end coordinate of each domain is saved as the start position of the last bin (40 kb) belonging to each domain.

To calculate the **topological domain coordinates for multiple chromosomes** you may use an approach like the following:
```unix
chromosomes=("1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "X" "Y")

for i in "${chromosomes[@]}"; do
python2.7 ./HiCtool-master/scripts/HiCtool_TAD_analysis.py \
--action full_tad_analysis \
-i "./normalized_40000/chr"$i"_chr"$i"_40000.txt" \
-c ./HiCtool-master/scripts/chromSizes/ \
-s hg38 \
--isGlobal 0 \
--tab_sep 1 \
--chr $i \
--data_type normalized
done
```

## 2. Plotting TADs and more

This section allows you to plot TADs over the heatmap, as well as DI values and HMM states.

### 2.1. Plotting TADs on the heatmap

To plot the topological domains on the heatmap, use the function ``plot_map`` of [HiCtool_global_map_analysis.py](/scripts/HiCtool_global_map_analysis.py) and pass the topological domain coordinates with the argument ``--topological_domains``. Here we plot the heatmap for chr6: 80,000,000-120,000,000. The color of the domains can be changed by using the parameter ``--domain_color``.

**Note!** To plot topological domains on the heatmap, the heatmap must have a bin size of 40 kb or lower!

```unix
python2.7 ./HiCtool-master/scripts/HiCtool_global_map_analysis.py \
--action plot_map \
-i ./normalized_40000/chr6_chr6_40000.txt \
-c ./HiCtool-master/scripts/chromSizes/ \
-b 40000 \
-s hg38 \
--isGlobal 0 \
--tab_sep 1 \
--data_type normalized \
--chr_row 6 \
--chr_col 6 \
--my_colormap [white,red] \
--cutoff_type percentile \
--cutoff 95 \
--max_color "#460000" \
--chr_row_coord [80000000,120000000] \
--chr_col_coord [80000000,120000000] \
--topological_domains ./tad_analysis/HiCtool_chr6_topological_domains.txt
```
![](/figures/HiCtool_chr6_chr6_40kb_normalized_domains_80000000_120000000.png)

where:
- ``--action``: Action to perform (here ``plot_map``).
- ``-i``: Input contact matrix file.
- ``-c``: Path to the folder ``chromSizes`` with trailing slash at the end ``/``.
- ``-b``: The bin size (resolution) for the analysis.
- ``-s``: Species name.
- ``--isGlobal``: 1 if the input matrix is a global matrix, 0 otherwise.
- ``--tab_sep``: 1 if the input matrix is in a tab separated format, 0 if it is in compressed format.
- ``--data_type``: Data type to label your data, for example: observed, normalized, etc.
- ``--chr_row``: Chromosome to plot on the rows of the heatmap.
- ``--chr_col``: Same as ``--chr_row`` since this is only for intra-chromosomal maps.
- ``--my_colormap``: Colormap to be used to plot the data. You can choose any colorbar at https://matplotlib.org/examples/color/colormaps_reference.html, or input a list of colors if you want a custom colorbar. Example: ``[white, red, black]``. Colors can also be specified in HEX format. Default: ``[white,red]``.
- ``--cutoff_type``: To select a type of cutoff (``percentile`` or ``contact``) or plot the full range of the data (not declared). Default: ``percentile``.
- ``--cutoff``: Maximum cutoff on the number of contacts for the colorbar, interpreted according to ``--cutoff_type``. Default: ``95``.
- ``--max_color``: Color of the bins with contact counts over ``--cutoff``. Default: ``#460000``.
- ``--chr_row_coord``: List of two integers with start and end coordinates for the chromosome on the rows.
- ``--chr_col_coord``: List of two integers with start and end coordinates for the chromosome on the columns.
- ``--topological_domains``: Topological domain coordinates file to visualize domains on the heatmap.

Zoom on a smaller region (chr6: 50,000,000-54,000,000):
```unix
python2.7 ./HiCtool-master/scripts/HiCtool_global_map_analysis.py \
--action plot_map \
-i ./normalized_40000/chr6_chr6_40000.txt \
-c ./HiCtool-master/scripts/chromSizes/ \
-b 40000 \
-s hg38 \
--isGlobal 0 \
--tab_sep 1 \
--data_type normalized \
--chr_row 6 \
--chr_col 6 \
--my_colormap [white,red] \
--cutoff_type percentile \
--cutoff 95 \
--max_color "#460000" \
--chr_row_coord [50000000,54000000] \
--chr_col_coord [50000000,54000000] \
--topological_domains ./tad_analysis/HiCtool_chr6_topological_domains.txt
```
![](/figures/HiCtool_chr6_chr6_40kb_normalized_domains_50000000_54000000.png)


### 2.2. Plotting DI values and HMM states

**Directionality Index (DI)**

Regions at the periphery of the topological domains are highly biased in their interaction frequencies: the most upstream portion of a topological domain is highly biased towards interacting downstream, and the downstream portion of a topological domain is highly biased towards interacting upstream.
To determine the directional bias at any given bin in the genome, the Directionality Index (DI) is used, which quantifies the degree of upstream or downstream bias of that bin.

This is the formula used to calculate the DI:

![](/figures/DI_formula.png)

where:

- *A* is the number of reads that map from a given 40 kb bin to the upstream 2 Mb.
- *B* is the number of reads that map from a given 40 kb bin to the downstream 2 Mb.
- *E* is the expected number of contacts for each bin, equal to (A+B)/2.

To compute the DI, we need the fend normalized contact data at a bin size of 40 kb. Since the bin size is 40 kb (written here as 40 kbpb, which stands for 40 kb per bin), the 2 Mb region used to detect upstream or downstream biases corresponds to 50 bins (2 Mb / 40 kbpb = 50 bins).

To **plot the DI values** use the function ``plot_chromosome_DI`` of [HiCtool_TAD_analysis.py](/scripts/HiCtool_TAD_analysis.py) as follows (in this case we plot DI values for chromosome 6, from 50 to 54 Mb):
```unix
python2.7 ./HiCtool-master/scripts/HiCtool_TAD_analysis.py \
--action plot_chromosome_DI \
-i ./tad_analysis/HiCtool_chr6_DI.txt \
-c ./HiCtool-master/scripts/chromSizes/ \
-s hg38 \
--chr 6 \
--full_chromosome 0 \
--coord [50000000,54000000]
```
where:
- ``--action``: Action to perform (here ``plot_chromosome_DI``).
- ``-i``: Input DI file.
- ``-c``: Path to the ``chromSizes`` folder, including the trailing slash ``/``.
- ``-b``: The bin size (resolution) for the analysis.
- ``-s``: Species name.
- ``--chr``: Chromosome.
- ``--full_chromosome``: 1 to plot the DI values for the entire chromosome, 0 otherwise (in which case ``--coord`` must be used).
- ``--coord``: List with start and end coordinates to plot the DI values.
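
As a rough illustration of the DI definition given in this section, the sketch below computes a DI value for each bin of a small intra-chromosomal contact matrix. It is not HiCtool's implementation: the function name ``directionality_index`` and the toy matrix are hypothetical, and the formula is the commonly used one from Dixon et al. (2012), which we assume matches the formula in the figure. A small window of 2 bins is used instead of 50 to keep the example readable.

```python
def directionality_index(matrix, window=50):
    """Return one DI value per bin (hypothetical helper, not part of HiCtool).

    matrix: square list of lists with non-negative intra-chromosomal contacts.
    window: number of bins upstream/downstream (2 Mb / 40 kb = 50 bins).
    """
    n = len(matrix)
    di = []
    for i in range(n):
        # A: contacts between bin i and up to `window` bins upstream
        a = float(sum(matrix[i][max(0, i - window):i]))
        # B: contacts between bin i and up to `window` bins downstream
        b = float(sum(matrix[i][i + 1:i + 1 + window]))
        # E: expected contacts under no directional bias
        e = (a + b) / 2.0
        if a == b:
            # No bias (this also covers bins without any contact)
            di.append(0.0)
            continue
        sign = 1.0 if b > a else -1.0
        di.append(sign * ((a - e) ** 2 / e + (b - e) ** 2 / e))
    return di


# Toy symmetric matrix with 4 bins (assumed data, for illustration only)
toy = [
    [0, 4, 2, 0],
    [4, 0, 1, 1],
    [2, 1, 0, 6],
    [0, 1, 6, 0],
]
di_values = directionality_index(toy, window=2)
print([round(x, 3) for x in di_values])  # [6.0, -0.667, 1.0, -7.0]
```

Positive values indicate a downstream bias (*B* > *A*) and negative values an upstream bias (*A* > *B*).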

![](/figures/HiCtool_chr6_DI.png)

**HMM states**

A Hidden Markov Model (HMM) based on the Directionality Index is used to identify biased states (true DI).

To calculate the true DI, we treat the observed DI values as the **emission sequence** and the Transition Matrix, Emission Matrix and initial State Sequence as unknown. There are **three emissions**, 1, 2 and 0, corresponding to a positive (1), negative (2) or zero (0) value of the observed DI. In our analysis, all DI values whose absolute value is under a threshold of 0.4 are assigned to the emission '0'. We first estimate the model parameters and then compute the most probable sequence of states using the Viterbi algorithm.

To **plot the DI values and HMM states** use the function ``plot_chromosome_DI`` of [HiCtool_TAD_analysis.py](/scripts/HiCtool_TAD_analysis.py) and add the parameter ``--input_file_hmm`` to input the HMM states file as follows:
```unix
python2.7 ./HiCtool-master/scripts/HiCtool_TAD_analysis.py \
--action plot_chromosome_DI \
-i ./tad_analysis/HiCtool_chr6_DI.txt \
-c ./HiCtool-master/scripts/chromSizes/ \
-s hg38 \
--chr 6 \
--full_chromosome 0 \
--coord [50000000,54000000] \
--input_file_hmm ./tad_analysis/HiCtool_chr6_hmm_states.txt
```

![](/figures/HiCtool_chr6_DI_HMM.png)

**TAD coordinates**

The true DI values allow us to infer the locations of the topological domains in the genome. A domain is initiated at the beginning of a single downstream biased HMM state (red color in the above figure). The domain is continuous throughout any consecutive downstream biased states. The domain ends when the last in a series of upstream biased states (green color in the above figure) is reached, at the end of that last upstream biased HMM state.
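
The emission encoding and the domain definition above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the code used by HiCtool: the function names ``di_to_emissions`` and ``call_domains`` and the per-bin state labels ``"D"`` (downstream biased), ``"U"`` (upstream biased) and ``"N"`` (no bias) are hypothetical, the HMM fitting and Viterbi decoding are not shown, and the additional checks for gaps between runs of states are not modelled.

```python
def di_to_emissions(di_values, threshold=0.4):
    """Encode observed DI values as emissions: 1 positive, 2 negative, 0 near zero."""
    emissions = []
    for di in di_values:
        if abs(di) < threshold:
            emissions.append(0)
        elif di > 0:
            emissions.append(1)
        else:
            emissions.append(2)
    return emissions


def call_domains(states, bin_size=40000):
    """Return [start, end] pairs from per-bin states (hypothetical labels 'D', 'U', 'N')."""
    domains = []
    start = None      # start coordinate of the currently open domain, if any
    seen_up = False   # True once the open domain has reached an upstream biased bin
    for i, state in enumerate(states):
        if seen_up and state != "U":
            # The series of upstream biased states has just ended:
            # close the domain at the end of the previous bin.
            domains.append([start, i * bin_size])
            start, seen_up = None, False
        if state == "D" and start is None:
            # First downstream biased bin: open a new domain
            start = i * bin_size
        elif state == "U" and start is not None:
            seen_up = True
    if seen_up:
        # The sequence ended inside a series of upstream biased states
        domains.append([start, len(states) * bin_size])
    return domains


emissions = di_to_emissions([6.0, -0.3, 0.39, -7.0, 0.4])
domains = call_domains(list("NDDDNUUNDDUUU"))
print(emissions)
print(domains)
```

In this sketch a domain stays open across unbiased bins that lie between its downstream and upstream biased parts, but is closed as soon as a series of upstream biased bins is interrupted.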

To calculate the topological domain coordinates, we first extract all the potential start and end coordinates according to the definition, and then evaluate a list of conditions to take into account the possible presence of gaps between series of positive or negative state values. The figure below shows a summary of the procedure:

![](/figures/HiCtool_topological_domains_flowchart.png)

--------------------------------------------------------------------------------