├── .github └── ISSUE_TEMPLATE │ ├── a-bug.md │ ├── feature-request.md │ ├── smudgeplot-inference-problem.md │ └── smudgeplot-interpretation.md ├── .gitignore ├── FAQ.md ├── LICENSE ├── Makefile ├── README.md ├── exec ├── centrality_plot.R ├── smudgeplot ├── smudgeplot.py └── smudgeplot_plot.R ├── playground ├── BGA_tutorial.md ├── DEVELOPMENT.md ├── alternative_fitting │ ├── README.md │ ├── alternative_plot_covA_covB.R │ ├── alternative_plotting.R │ ├── alternative_plotting_functions.R │ ├── alternative_plotting_testing.R │ └── pair_clustering.py ├── interactive_plot_strawberry_full_kmer_families_fooling_around.R ├── more_away_pairs.py ├── playground.R ├── playground.py └── popart.R ├── src_ploidyplot ├── PloidyPlot.c ├── gene_core.c ├── gene_core.h ├── libfastk.c ├── libfastk.h ├── matrix.c └── matrix.h └── tests ├── README.md └── run_smudge_version.sh /.github/ISSUE_TEMPLATE/a-bug.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: A bug 3 | about: If it looks like an error in the code 4 | labels: potential_problems 5 | 6 | --- 7 | 8 | **What did you do** 9 | 10 | Tell us about the problem. What version of the software did you use (`smudgeplot -v`)? What was the input (possibly with a few example lines)? What command did you run? What error output did you get? And what did you expect to see instead? 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature-request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Any ideas how to improve smudgeplot? 4 | title: feature request: [short description] 5 | labels: enhancement 6 | 7 | --- 8 | 9 | **Background** 10 | 11 | We assume you have a reason for proposing an improvement. If it has a biological or algorithmic motivation, give us some context to understand where the suggestion comes from...
12 | 13 | **Feature** 14 | 15 | What should the feature do? Be as detailed as possible here. Don't hesitate to write down examples of how the feature should operate. 16 | 17 | **Contribution** 18 | 19 | Do you have an idea how to implement the feature? Would you be willing to contribute to implementing it? -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/smudgeplot-inference-problem.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Smudgeplot inference problem 3 | about: When a suspicious smudgeplot suggests something is wrong 4 | 5 | --- 6 | 7 | **About your genome** 8 | 9 | Tell us about your genome, so we understand why the smudgeplot seems to be wrong. Please also include the evidence you have (karyotype, inSitu...). 10 | 11 | **smudgeplot** 12 | 13 | Please show us the command you used to generate the smudgeplot, and the smudgeplot itself if possible. Tell us what looks suspicious about the smudgeplot and how you expected it to look. 14 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/smudgeplot-interpretation.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Smudgeplot interpretation 3 | about: For interpretation problems of smudgeplot, send an issue by email if the data 4 | are sensitive 5 | 6 | --- 7 | 8 | I have trouble understanding my smudgeplot. I used the following command to generate it 9 | 10 | ``` 11 | smudgeplot plot -i kmer_pairs_coverages_2.tsv -o my_org -t "Figure 1a: genome structure of X. odoratum" -L 40 -k 19 12 | ``` 13 | 14 | and it looks like this: 15 | 16 | (add smudgeplot) 17 | 18 | Now, I already (know/have an indication of) the (genome size/number of chromosomes/ploidy/...) from (RADseq/flow cytometry/karyotypes/inSitu/...) data.
This does not make sense together with the smudge because (it predicts unexpected ploidy/shows only one smudge/...). 19 | 20 | How should I understand my smudgeplot? 21 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | data 2 | figures 3 | playground 4 | docs 5 | exec/PloidyPlot 6 | exec/hetmers 7 | *.o 8 | .DS_Store 9 | smu2text_smu 10 | -------------------------------------------------------------------------------- /FAQ.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | migrated to [wiki](https://github.com/tbenavi1/smudgeplot/wiki/FAQ) 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | 203 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # PATH for libraries is guessed 2 | CFLAGS = -O3 -Wall -Wextra -Wno-unused-result -fno-strict-aliasing 3 | 4 | ifndef INSTALL_PREFIX 5 | INSTALL_PREFIX = /usr/local 6 | endif 7 | 8 | HET_KMERS_INST = $(INSTALL_PREFIX)/bin/smudgeplot.py $(INSTALL_PREFIX)/bin/hetmers 9 | SMUDGEPLOT_INST = $(INSTALL_PREFIX)/bin/smudgeplot_plot.R $(INSTALL_PREFIX)/bin/centrality_plot.R 10 | 11 | .PHONY : install 12 | install : $(HET_KMERS_INST) $(SMUDGEPLOT_INST) $(CUTOFF_INST) 13 | 14 | $(INSTALL_PREFIX)/bin/% : exec/% 15 | install -C $< $(INSTALL_PREFIX)/bin 16 | 17 | exec/hetmers: src_ploidyplot/PloidyPlot.c src_ploidyplot/libfastk.c src_ploidyplot/libfastk.h src_ploidyplot/matrix.c src_ploidyplot/matrix.h 18 | gcc $(CFLAGS) -o $@ src_ploidyplot/PloidyPlot.c src_ploidyplot/libfastk.c src_ploidyplot/matrix.c -lpthread -lm 19 | 20 | 21 | 
.PHONY : clean 22 | clean : 23 | rm -f exec/hetmers 24 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Smudgeplot 2 | 3 | **_Version: 0.4.0 Arched_** 4 | 5 | **_Authors: [Gene W Myers](https://github.com/thegenemyers), [Kamil S. Jaron](https://github.com/KamilSJaron), and Tianyi Ma._** 6 | 7 | ### Install the whole thing 8 | 9 | This version of smudgeplot operates on FastK k-mer databases, so before installing smudgeplot, please install [FastK](https://github.com/thegenemyers/FASTK). The smudgeplot installation consists of one Python script, two R scripts, and a C backend that searches for all the k-mer pairs (hetmers) and needs to be compiled. 10 | 11 | #### Quick 12 | 13 | Assuming you have admin rights / can write to `/usr/local/bin`, you can simply run 14 | 15 | ```bash 16 | sudo make 17 | ``` 18 | That should do everything necessary to make smudgeplot fully operational. You can run `smudgeplot.py --help` to check that it worked. 19 | 20 | #### Custom installation location 21 | 22 | If you store your executables in a different directory, you can pass the `INSTALL_PREFIX` variable to make. The binaries are then added to `$INSTALL_PREFIX/bin`. For example 23 | 24 | ```bash 25 | make -s INSTALL_PREFIX=~ 26 | ``` 27 | 28 | will install smudgeplot to `~/bin/`. 29 | 30 | #### Manual installation 31 | 32 | Compiling the `C` executable 33 | 34 | ``` 35 | make exec/hetmers # this will compile the hetmers backend (the k-mer pair searching engine of PloidyPlot) 36 | ``` 37 | 38 | Now you can move the executables from the `exec` directory somewhere your system will see them (or, alternatively, you can add that directory to your `$PATH` variable).
39 | 40 | ``` 41 | install -C exec/smudgeplot.py /usr/local/bin 42 | install -C exec/hetmers /usr/local/bin 43 | install -C exec/smudgeplot_plot.R /usr/local/bin 44 | install -C exec/centrality_plot.R /usr/local/bin 45 | ``` 46 | 47 | ### Running this version on Saccharomyces data 48 | Requires ~2.1GB of space and `FastK` and `smudgeplot` installed. 49 | 50 | ```bash 51 | # download data 52 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_1.fastq.gz 53 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_2.fastq.gz 54 | 55 | # store them in a reasonable place 56 | mkdir data/Scer 57 | mv *fastq.gz data/Scer/ 58 | 59 | # run FastK to create a k-mer database 60 | FastK -v -t4 -k31 -M16 -T4 data/Scer/SRR3265401_[12].fastq.gz -Ndata/Scer/FastK_Table 61 | 62 | # Find all k-mer pairs in the dataset using the hetmers module 63 | smudgeplot.py hetmers -L 12 -t 4 -o data/Scer/kmerpairs --verbose data/Scer/FastK_Table 64 | # this generates the `data/Scer/kmerpairs_text.smu` file; 65 | # it's a flat file with three columns: covB, covA, and freq (the number of k-mer pairs with these respective coverages) 66 | 67 | # use the .smu file to infer ploidy and create a smudgeplot 68 | smudgeplot.py all -o data/Scer/trial_run data/Scer/kmerpairs_text.smu 69 | 70 | # check that a bunch of files are generated (3 pdfs, some summary tables, and logs) 71 | ls data/Scer/trial_run_* 72 | ``` 73 | 74 | The y-axis limit is 100 by default; you can specify the `ylim` argument to scale it differently 75 | 76 | ```bash 77 | smudgeplot.py all -o data/Scer/trial_run_ylim70 data/Scer/kmerpairs_text.smu -ylim 70 78 | ``` 79 | 80 | There is also a plotting module that requires the coverage and a list of smudges with their respective sizes in a tabular file. This plotting module does no inference and should be used only if you already know the right answers.
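The three-column `.smu` file described above is easy to inspect by hand. Below is a minimal sketch — not part of smudgeplot, with pandas and the toy values as assumptions — that loads such a file (columns covB, covA, freq, as in the comments above) and derives the coverage sum and minor coverage ratio of each k-mer pair:

```python
from io import StringIO
import pandas as pd

# toy stand-in for a kmerpairs_text.smu file (whitespace-separated, no header)
smu_text = "12 14 100\n20 21 250\n"
smu = pd.read_csv(StringIO(smu_text), sep=r"\s+", names=["covB", "covA", "freq"])

# the two smudgeplot axes: total coverage of the pair (CovA + CovB)
# and the minor coverage ratio (CovB / (CovA + CovB))
smu["total_cov"] = smu["covA"] + smu["covB"]
smu["minor_ratio"] = smu["covB"] / smu["total_cov"]
print(smu)
```

For a real file, replace `StringIO(smu_text)` with the path to `kmerpairs_text.smu`.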
81 | 82 | ### How smudgeplot works 83 | 84 | This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc. 85 | 86 | Smudgeplots are computed from raw reads (or, even better, from trimmed reads) and show the haplotype structure using heterozygous kmer pairs. For example (from an older version): 87 | 88 | ![smudgeexample](https://user-images.githubusercontent.com/8181573/45959760-f1032d00-c01a-11e8-8576-ff0512c33da9.png) 89 | 90 | Every haplotype structure has a unique smudge on the graph, and the heat of the smudge indicates how frequently that haplotype structure is represented in the genome compared to the other structures. The image above is an ideal case, where the sequencing coverage is sufficient to beautifully separate all the smudges, providing very strong and clear evidence of triploidy. 91 | 92 | This tool is planned to become a part of [GenomeScope](https://github.com/tbenavi1/genomescope2.0) in the near future. 93 | 94 | ### More about the use 95 | 96 | The input is a set of whole genome sequencing reads; the more coverage, the better. The method is designed to process big datasets, so don't hesitate to pool all single-end/paired-end libraries together. 97 | 98 | The workflow is automatic, but it's not fool-proof; it requires some decisions. Use this tool jointly with [GenomeScope](https://github.com/tbenavi1/genomescope2.0). The tutorials on our wiki are currently outdated (built for version 0.2.5) and will be updated by the 18th of October. 99 | 100 | Smudgeplot generates two plots, one with coloration on a log scale and the other on a linear scale. The legend indicates the approximate density of kmer pairs per tile. Note that a single polymorphism generates multiple heterozygous kmers.
As such, the reported numbers do not directly correspond to the number of variants. Instead, the actual number is approximately 1/k times the reported numbers, where k is the kmer size (the summary already reports the recalculated values). It's important to note that this process does not exhaustively attempt to find all of the heterozygous kmers in the genome. Instead, only a sufficient sample is obtained in order to identify the relative genome structure. This can also be read as the minimal number of heterozygous loci, provided the inference is correct. 101 | 102 | ### GenomeScope 103 | 104 | You can feed the kmer coverage histogram to GenomeScope. (Either run the [genomescope script](https://github.com/schatzlab/genomescope/blob/master/genomescope.R) or use the [web server](http://qb.cshl.edu/genomescope/).) 105 | 106 | ``` 107 | Rscript genomescope.R kmcdb_k21.hist [kmer_max] [verbose] 108 | ``` 109 | 110 | This script estimates the size, heterozygosity, and repetitive fraction of the genome. By inspecting the fitted model you can determine the location of the smallest peak after the error tail. Then, you can decide the low-end cutoff below which all kmers will be discarded as errors (circa 0.5 times the haploid kmer coverage) and the high-end cutoff above which all kmers will be discarded (circa 8.5 times the haploid kmer coverage). 111 | 112 | ## Frequently Asked Questions 113 | 114 | These are collected on [our wiki](https://github.com/KamilSJaron/smudgeplot/wiki/FAQ). Smudgeplot is not very demanding on computational resources, but make sure you check the [memory requirements](https://github.com/KamilSJaron/smudgeplot/wiki/smudgeplot-hetkmers#memory-requirements) before you extract kmer pairs (the `hetkmers` task). If you don't find an answer to your question in the FAQ, open an [issue](https://github.com/KamilSJaron/smudgeplot/issues/new/choose) or drop us an email. 115 | 116 | Check [projects](https://github.com/KamilSJaron/smudgeplot/projects) to see how the development goes.
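The cutoff rule of thumb from the GenomeScope section above — discard kmers below roughly 0.5 times and above roughly 8.5 times the haploid kmer coverage — can be sketched as a tiny helper. This is illustrative only (the function name is hypothetical and not part of smudgeplot); only the two multipliers come from the text:

```python
def suggest_cutoffs(haploid_cov: float) -> tuple[int, int]:
    """Suggest (low, high) kmer-coverage cutoffs from the haploid kmer
    coverage, using the rule-of-thumb multipliers quoted above:
    discard below ~0.5x (error tail) and above ~8.5x (high-copy repeats)."""
    low = max(1, round(0.5 * haploid_cov))
    high = round(8.5 * haploid_cov)
    return low, high

# e.g. with a haploid kmer coverage of 22x estimated by GenomeScope:
print(suggest_cutoffs(22))  # -> (11, 187)
```

In practice you would read the haploid coverage off the fitted GenomeScope model rather than hard-coding it.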
117 | 118 | ## Contributions 119 | 120 | This is definitely an open project; contributions are welcome. You can check some of the ideas for the future in [projects](https://github.com/KamilSJaron/smudgeplot/projects) and in the development [dev](https://github.com/KamilSJaron/smudgeplot/tree/dev) branch. The file [playground/DEVELOPMENT.md](playground/DEVELOPMENT.md) contains some development notes. The directory [playground](playground) contains some snippets, attempts, and other items of interest. 121 | 122 | ## Reference 123 | 124 | Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. *Nature Communications* **11**, 1432 (2020). https://doi.org/10.1038/s41467-020-14998-3 125 | 126 | ## Acknowledgements 127 | 128 | This [blogpost](http://www.everydayanalytics.ca/2014/09/5-ways-to-do-2d-histograms-in-r.html) by Myles Harrison largely inspired the visual output of smudgeplots. The colourblind-friendly colour theme was suggested by @ggrimes. We are grateful for the helpful comments of beta testers and pre-release chatters!
129 | -------------------------------------------------------------------------------- /exec/centrality_plot.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | args = commandArgs(trailingOnly=TRUE) 3 | input_name <- args[1] 4 | # input_name = '~/test/daAchMill1_centralities.txt' 5 | 6 | output_name <- gsub('.txt', '.pdf', input_name) 7 | tested_covs <- read.table(input_name, col.names = c('cov', 'centrality')) 8 | 9 | pdf(output_name) 10 | plot(tested_covs[, 'cov'], tested_covs[, 'centrality'], xlab = 'Coverage', ylab = 'Centrality [(theoretical_center - actual_center) / coverage ]', pch = 20) 11 | dev.off() -------------------------------------------------------------------------------- /exec/smudgeplot: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import sys 5 | import os 6 | from math import log 7 | from math import ceil 8 | import numpy as np 9 | from scipy.signal import argrelextrema 10 | 11 | version = '0.2.0' 12 | 13 | class parser(): 14 | def __init__(self): 15 | argparser = argparse.ArgumentParser( 16 | # description='Inference of ploidy and heterozygosity structure using whole genome sequencing data', 17 | usage='''smudgeplot <task> [options] \n 18 | tasks: cutoff Calculate meaningful values for lower/upper kmer histogram cutoff. 19 | hetkmers Calculate unique kmer pairs from a Jellyfish or KMC dump file.
20 | plot Generate 2d histogram; infer ploidy and plot a smudgeplot.\n\n''') 21 | argparser.add_argument('task', help='Task to execute; for task specific options execute smudgeplot <task> -h') 22 | argparser.add_argument('-v', '--version', action="store_true", default = False, help="print the version and exit") 23 | # print version is a special case 24 | if len(sys.argv) > 1: 25 | if sys.argv[1] in ['-v', '--version']: 26 | self.task = "version" 27 | return 28 | # the following line either prints help and dies, or assigns the task name to the task variable 29 | self.task = argparser.parse_args([sys.argv[1]]).task 30 | else: 31 | self.task = "" 32 | # if the task is known (i.e. defined in this file); 33 | if hasattr(self, self.task): 34 | # load arguments of that task 35 | getattr(self, self.task)() 36 | else: 37 | argparser.print_usage() 38 | print('"' + self.task + '" is not a valid task name') 39 | exit(1) 40 | 41 | def hetkmers(self): 42 | ''' 43 | Calculate unique kmer pairs from a Jellyfish or KMC dump file. 44 | ''' 45 | argparser = argparse.ArgumentParser(prog = 'smudgeplot hetkmers', 46 | description='Calculate unique kmer pairs from a Jellyfish or KMC dump file.') 47 | argparser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='Alphabetically sorted Jellyfish or KMC dump file (stdin).') 48 | argparser.add_argument('-o', help='The pattern used to name the output (kmerpairs).', default='kmerpairs') 49 | argparser.add_argument('-k', help='The length of the kmer.', default=21) 50 | argparser.add_argument('-t', help='Number of processes to use.', default = 4) 51 | argparser.add_argument('--middle', dest='middle', action='store_const', const = True, default = False, 52 | help='Get all kmer pairs one SNP away from each other (default: just the middle one).') 53 | self.arguments = argparser.parse_args(sys.argv[2:]) 54 | 55 | def plot(self): 56 | ''' 57 | Generate 2d histogram; infer ploidy and plot a smudgeplot.
58 | ''' 59 | argparser = argparse.ArgumentParser(prog = 'smudgeplot plot', description='Generate 2d histogram for smudgeplot') 60 | argparser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='Name of the input tsv file with coverages (default \"coverages_2.tsv\").') 61 | argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot') 62 | argparser.add_argument('-q', help='Remove kmer pairs with coverage over the specified quantile; (default none).', type=float, default=1) 63 | argparser.add_argument('-L', help='The lower boundary used when dumping kmers (default min(total_pair_cov) / 2).', type=int, default=0) 64 | argparser.add_argument('-n', help='The expected haploid coverage (default estimated from data).', type=int, default=0) 65 | argparser.add_argument('-t', '--title', help='name printed at the top of the smudgeplot (default none).', default='') 66 | argparser.add_argument('-m', '--method', help='The algorithm for annotation of smudges (default \'local_aggregation\')', default='local_aggregation') 67 | argparser.add_argument('-nbins', help='The number of nbins used for smudgeplot matrix (nbins x nbins) (default autodetection).', type=int, default=0) 68 | # argparser.add_argument('-k', help='The length of the kmer.', default=21) 69 | argparser.add_argument('-kmer_file', help='Name of the input file containing kmer sequences (assuming the same order as in the coverage file)', default = "") 70 | argparser.add_argument('--homozygous', action="store_true", default = False, help="Assume no heterozygosity in the genome - plotting a paralog structure; (default False).") 71 | self.arguments = argparser.parse_args(sys.argv[2:]) 72 | 73 | def cutoff(self): 74 | ''' 75 | Calculate meaningful values for lower/upper kmer histogram cutoff.
76 | ''' 77 | argparser = argparse.ArgumentParser(prog = 'smudgeplot cutoff', description='Calculate meaningful values for lower/upper kmer histogram cutoff.') 78 | argparser.add_argument('infile', type=argparse.FileType('r'), help='Name of the input kmer histogram file (default \"kmer.hist\").') 79 | argparser.add_argument('boundary', help='Which boundary to compute: L (lower, default) or U (upper)', default = 'L') 80 | self.arguments = argparser.parse_args(sys.argv[2:]) 81 | 82 | 83 | def round_up_nice(x): 84 | digits = ceil(log(x, 10)) 85 | if digits <= 1: 86 | multiplier = 10 ** (digits - 1) 87 | else: 88 | multiplier = 10 ** (digits - 2) 89 | return(ceil(x / multiplier) * multiplier) 90 | 91 | def cutoff(args): 92 | # kmer_hist = open("data/Mflo2/kmer.hist","r") 93 | kmer_hist = args.infile 94 | hist = np.array([int(line.split()[1]) for line in kmer_hist]) 95 | if args.boundary == "L": 96 | local_minima = argrelextrema(hist, np.less)[0][0] 97 | L = max(10, int(round(local_minima * 1.25))) 98 | print(L, end = '') 99 | else: 100 | # take the 99.8 quantile of kmers that occur more than once in the read set 101 | hist_rel_cumsum = np.cumsum(hist[1:]) / np.sum(hist[1:]) 102 | U = round_up_nice(np.argmax(hist_rel_cumsum > 0.998)) 103 | print(U, end = '') 104 | 105 | def main(): 106 | _parser = parser() 107 | 108 | print('Running smudgeplot v' + version) 109 | if _parser.task == "version": 110 | exit(0) 111 | 112 | print('Task: ' + _parser.task) 113 | 114 | if _parser.task == "cutoff": 115 | cutoff(_parser.arguments) 116 | 117 | # if _parser.task == "hetkmers": 118 | # hetkmers(_parser.arguments) 119 | # 120 | # if _parser.task == "plot": 121 | # call .R script 122 | # plot(_parser.arguments) 123 | 124 | print('Done!') 125 | exit(0) 126 | 127 | if __name__=='__main__': 128 | main() -------------------------------------------------------------------------------- /exec/smudgeplot.py: -------------------------------------------------------------------------------- 1 |
#!/usr/bin/env python3 2 | 3 | import argparse 4 | import sys 5 | import numpy as np 6 | from pandas import read_csv # type: ignore 7 | from pandas import DataFrame # type: ignore 8 | from numpy import arange 9 | from numpy import argmin 10 | from numpy import concatenate 11 | from os import system 12 | from math import log 13 | from math import ceil 14 | from statistics import fmean 15 | from collections import defaultdict 16 | # import matplotlib as mpl 17 | # import matplotlib.pyplot as plt 18 | # from matplotlib.pyplot import plot 19 | 20 | version = '0.4.0dev' 21 | 22 | ############################ 23 | # processing of user input # 24 | ############################ 25 | 26 | class parser(): 27 | def __init__(self): 28 | argparser = argparse.ArgumentParser( 29 | # description='Inference of ploidy and heterozygosity structure using whole genome sequencing data', 30 | usage='''smudgeplot <task> [options] \n 31 | tasks: cutoff Calculate meaningful values for lower kmer histogram cutoff. 32 | hetmers Calculate unique kmer pairs from a FastK k-mer database. 33 | peak_agregation Aggregates smudges using a local aggregation algorithm. 34 | plot Generate 2d histogram; infer ploidy and plot a smudgeplot. 35 | all Runs all the steps (with default options)\n\n''') 36 | # removing this for now; 37 | # extract Extract kmer pairs within specified coverage sum and minor coverage ratio ranges 38 | argparser.add_argument('task', help='Task to execute; for task specific options execute smudgeplot <task> -h') 39 | argparser.add_argument('-v', '--version', action="store_true", default = False, help="print the version and exit") 40 | # printing the version is a special case 41 | if len(sys.argv) > 1: 42 | if sys.argv[1] in ['-v', '--version']: 43 | self.task = "version" 44 | return 45 | # the following line either prints help and dies, or assigns the name of the task to the variable task 46 | self.task = argparser.parse_args([sys.argv[1]]).task 47 | else: 48 | self.task = "" 49 | # if the task is known (i.e. 
defined in this file); 50 | if hasattr(self, self.task): 51 | # load arguments of that task 52 | getattr(self, self.task)() 53 | else: 54 | argparser.print_usage() 55 | sys.stderr.write('"' + self.task + '" is not a valid task name\n') 56 | exit(1) 57 | 58 | def hetmers(self): 59 | ''' 60 | Calculate unique kmer pairs from a FastK k-mer database. 61 | ''' 62 | argparser = argparse.ArgumentParser(prog = 'smudgeplot hetmers', 63 | description='Calculate unique kmer pairs from a FastK k-mer database.') 64 | argparser.add_argument('infile', nargs='?', help='Input FastK database (.ktab) file.') 65 | argparser.add_argument('-L', help='Count threshold below which k-mers are considered erroneous', type=int) 66 | argparser.add_argument('-t', help='Number of threads (default 4)', type=int, default=4) 67 | argparser.add_argument('-o', help='The pattern used to name the output (kmerpairs).', default='kmerpairs') 68 | argparser.add_argument('-tmp', help='Directory where all temporary files will be stored (default .).', default='.') 69 | argparser.add_argument('--verbose', action="store_true", default = False, help='verbose mode') 70 | self.arguments = argparser.parse_args(sys.argv[2:]) 71 | 72 | def plot(self): 73 | ''' 74 | Generate 2d histogram; infer ploidy and plot a smudgeplot. 
75 | ''' 76 | argparser = argparse.ArgumentParser(prog = 'smudgeplot plot', description='Generate 2d histogram for smudgeplot') 77 | argparser.add_argument('infile', help='name of the input tsv file with coverages and frequencies') 78 | argparser.add_argument('smudgefile', help='name of the input tsv file with sizes of individual smudges') 79 | argparser.add_argument('n', help='The expected haploid coverage.', type=float) 80 | argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot') 81 | 82 | argparser = self.add_plotting_arguments(argparser) 83 | 84 | self.arguments = argparser.parse_args(sys.argv[2:]) 85 | 86 | def cutoff(self): 87 | ''' 88 | Calculate meaningful values for lower/upper kmer histogram cutoff. 89 | ''' 90 | argparser = argparse.ArgumentParser(prog = 'smudgeplot cutoff', description='Calculate meaningful values for lower/upper kmer histogram cutoff.') 91 | argparser.add_argument('infile', type=argparse.FileType('r'), help='Name of the input kmer histogram file (default \"kmer.hist\").') 92 | argparser.add_argument('boundary', help='Which boundary to compute: L (lower) or U (upper)') 93 | self.arguments = argparser.parse_args(sys.argv[2:]) 94 | 95 | def peak_agregation(self): 96 | ''' 97 | Aggregate smudges using a local aggregation algorithm. 
98 | ''' 99 | argparser = argparse.ArgumentParser() 100 | argparser.add_argument('infile', nargs='?', help='name of the input tsv file with coverages and frequencies.') 101 | argparser.add_argument('-nf', '-noise_filter', help='Do not aggregate k-mer pairs with frequency lower than this parameter into a smudge', type=int, default=50) 102 | argparser.add_argument('-d', '-distance', help='Manhattan distance at which k-mer pairs are considered neighboring for local aggregation purposes.', type=int, default=5) 103 | argparser.add_argument('--mask_errors', help='instead of reporting assignments to individual smudges, just remove all monotonically decreasing points from the error line', action="store_true", default = False) 104 | self.arguments = argparser.parse_args(sys.argv[2:]) 105 | 106 | def all(self): 107 | argparser = argparse.ArgumentParser() 108 | argparser.add_argument('infile', nargs='?', help='name of the input tsv file with coverages and frequencies.') 109 | argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot') 110 | argparser.add_argument('-cov_min', help='Minimal coverage to explore (default 6)', default=6, type = int) 111 | argparser.add_argument('-cov_max', help='Maximal coverage to explore (default 60)', default=60, type = int) 112 | argparser.add_argument('-cov', help='Define coverage instead of inferring it. 
Disables cov_min and cov_max.', default=0, type=int) 113 | 114 | argparser = self.add_plotting_arguments(argparser) 115 | 116 | self.arguments = argparser.parse_args(sys.argv[2:]) 117 | 118 | def add_plotting_arguments(self, argparser): 119 | argparser.add_argument('-c', '-cov_filter', help='Filter pairs with one of them having coverage below the specified threshold (default 0; disables parameter L)', type=int, default=0) 120 | argparser.add_argument('-t', '--title', help='name printed at the top of the smudgeplot (default none).', default='') 121 | argparser.add_argument('-ylim', help='The upper limit for the coverage sum (the y axis)', type = int, default=0) 122 | argparser.add_argument('-col_ramp', help='An R palette used for the plot (default "viridis", other sensible options are "magma", "mako" or "grey.colors" - recommended in combination with --invert_cols).', default='viridis') 123 | argparser.add_argument('--invert_cols', action="store_true", default = False, help="Invert the colour palette (default False).") 124 | return(argparser) 125 | 126 | def format_aguments_for_R_plotting(self): 127 | plot_args = "" 128 | if self.arguments.c != 0: 129 | plot_args += " -c " + str(self.arguments.c) 130 | if self.arguments.title: 131 | plot_args += " -t \"" + self.arguments.title + "\"" 132 | if self.arguments.ylim != 0: 133 | plot_args += " -ylim " + str(self.arguments.ylim) 134 | if self.arguments.col_ramp: 135 | plot_args += " -col_ramp \"" + self.arguments.col_ramp + "\"" 136 | if self.arguments.invert_cols: 137 | plot_args += " --invert_cols" 138 | return(plot_args) 139 | 140 | ############### 141 | # task cutoff # 142 | ############### 143 | 144 | # taken from https://stackoverflow.com/a/29614335 145 | def local_min(ys): 146 | return [i for i, y in enumerate(ys) 147 | if ((i == 0) or (ys[i - 1] >= y)) 148 | and ((i == len(ys) - 1) or (y < ys[i+1]))] 149 | 150 | def round_up_nice(x): 151 | digits = ceil(log(x, 10)) 152 | if digits <= 1: 153 | multiplier = 10 ** 
(digits - 1) 154 | else: 155 | multiplier = 10 ** (digits - 2) 156 | return(ceil(x / multiplier) * multiplier) 157 | 158 | def cutoff(args): 159 | # kmer_hist = open("data/Scer/kmc_k31.hist","r") 160 | kmer_hist = args.infile 161 | hist = [int(line.split()[1]) for line in kmer_hist] 162 | if args.boundary == "L": 163 | local_minima = local_min(hist)[0] 164 | L = max(10, int(round(local_minima * 1.25))) 165 | sys.stdout.write(str(L)) 166 | else: 167 | sys.stderr.write('Warning: We discourage using the original hetmer algorithm.\n\tThe updated (recommended) version does not take the argument U\n') 168 | # take the 99.8 quantile of kmers that occur more than once in the read set 169 | number_of_kmers = sum(hist[1:]) 170 | hist_rel_cumsum = [sum(hist[1:i+1]) / number_of_kmers for i in range(1, len(hist))] 171 | min(range(len(hist_rel_cumsum))) 172 | U = round_up_nice(min([i for i, q in enumerate(hist_rel_cumsum) if q > 0.998])) 173 | sys.stdout.write(str(U)) 174 | sys.stdout.flush() 175 | 176 | ######################## 177 | # task peak_agregation # 178 | ######################## 179 | 180 | def load_hetmers(smufile): 181 | cov_tab = read_csv(smufile, names = ['covB', 'covA', 'freq'], sep='\t') 182 | cov_tab = cov_tab.sort_values('freq', ascending = False) 183 | return(cov_tab) 184 | 185 | def local_agregation(cov_tab, distance, noise_filter, mask_errors): 186 | # generate a dictionary that gives us a frequency for each combination of coverages 187 | cov2freq = defaultdict(int) 188 | cov2peak = defaultdict(int) 189 | 190 | L = min(cov_tab['covB']) # important only when --mask_errors is on 191 | 192 | ### clustering 193 | next_peak = 1 194 | for idx, covB, covA, freq in cov_tab.itertuples(): 195 | cov2freq[(covA, covB)] = freq # build the frequency dictionary on the fly, because values that were not processed yet are never needed 196 | if freq < noise_filter: 197 | break 198 | highest_neigbour_coords = (0, 0) 199 | highest_neigbour_freq = 0 200 | # for each kmer pair I will 
retrieve all neighbours (Manhattan distance) 201 | for xA in range(covA - distance,covA + distance + 1): 202 | # for the explored A coverage in the neighborhood, we explore all possible B coordinates 203 | distanceB = distance - abs(covA - xA) 204 | for xB in range(covB - distanceB,covB + distanceB + 1): 205 | xB, xA = sorted([xA, xB]) # this is to make sure xB is smaller than xA 206 | # iterating only through those that were assigned already 207 | # and recording only the one with the highest frequency 208 | if cov2peak[(xA, xB)] and cov2freq[(xA, xB)] > highest_neigbour_freq: 209 | highest_neigbour_coords = (xA, xB) 210 | highest_neigbour_freq = cov2freq[(xA, xB)] 211 | if highest_neigbour_freq > 0: 212 | cov2peak[(covA, covB)] = cov2peak[(highest_neigbour_coords)] 213 | else: 214 | # print("new peak:", (covA, covB)) 215 | if mask_errors: 216 | if covB < L + distance: 217 | cov2peak[(covA, covB)] = 1 # error line 218 | else: 219 | cov2peak[(covA, covB)] = 0 # central smudges 220 | else: 221 | cov2peak[(covA, covB)] = next_peak # if I want to keep info about all locally aggregated smudges 222 | next_peak += 1 223 | return(cov2peak) 224 | 225 | def peak_agregation(args): 226 | ### load data 227 | cov_tab = load_hetmers(args.infile) 228 | 229 | cov2peak = local_agregation(cov_tab, args.d, args.nf, mask_errors = False) 230 | 231 | cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True) 232 | for idx, covB, covA, freq in cov_tab.itertuples(): 233 | peak = cov2peak[(covA, covB)] 234 | sys.stdout.write(f"{covB}\t{covA}\t{freq}\t{peak}\n") 235 | sys.stdout.flush() 236 | 237 | def get_smudge_container(cov_tab, cov, smudge_filter): 238 | smudge_container = dict() 239 | genomic_cov_tab = cov_tab[cov_tab['peak'] == 0] # this removes all the marked errors 240 | total_kmer_pairs = sum(genomic_cov_tab['freq']) 241 | 242 | for Bs in range(1,9): 243 | min_cov = 0 if Bs == 1 else cov * (Bs - 0.5) 244 | max_cov = cov * (Bs + 0.5) 245 | cov_tab_isoB = 
genomic_cov_tab.loc[(genomic_cov_tab["covB"] > min_cov) & (genomic_cov_tab["covB"] < max_cov)] # 246 | 247 | for As in range(Bs,(17 - Bs)): 248 | min_cov = 0 if As == 1 else cov * (As - 0.5) 249 | max_cov = cov * (As + 0.5) 250 | cov_tab_iso_smudge = cov_tab_isoB.loc[(cov_tab_isoB["covA"] > min_cov) & (cov_tab_isoB["covA"] < max_cov)] 251 | if sum(cov_tab_iso_smudge['freq']) / total_kmer_pairs > smudge_filter: 252 | # sys.stderr.write(f"{As}A{Bs}B: {sum(cov_tab_iso_smudge['freq']) / total_kmer_pairs}\n") 253 | smudge_container["A" * As + "B" * Bs] = cov_tab_iso_smudge 254 | return(smudge_container) 255 | 256 | def get_centrality(smudge_container, cov): 257 | centralities = list() 258 | freqs = list() 259 | for smudge in smudge_container.keys(): 260 | As = smudge.count('A') 261 | Bs = smudge.count('B') 262 | smudge_tab = smudge_container[smudge] 263 | kmer_in_the_smudge = sum(smudge_tab['freq']) 264 | freqs.append(kmer_in_the_smudge) 265 | # center as a mean 266 | # center_A = sum((smudge_tab['freq'] * smudge_tab['covA'])) / kmer_in_the_smudge 267 | # center_B = sum((smudge_tab['freq'] * smudge_tab['covB'])) / kmer_in_the_smudge 268 | # center as a mode 269 | center = smudge_tab.loc[smudge_tab['freq'].idxmax()] 270 | center_A = center['covA'] 271 | center_B = center['covB'] 272 | ## empirical distance to the edge 273 | # distA = min([abs(smudge_tab['covA'].max() - center['covA']), abs(center['covA'] - smudge_tab['covA'].min())]) 274 | # distB = min([abs(smudge_tab['covB'].max() - center['covB']), abs(center['covB'] - smudge_tab['covB'].min())]) 275 | ## theoretical distance to the edge 276 | # distA = min(abs(center['covA'] - (cov * (As - 0.5))), abs((cov * (As + 0.5)) - center['covA'])) 277 | # distB = min(abs(center['covB'] - (cov * (Bs - 0.5))), abs((cov * (Bs + 0.5)) - center['covB'])) 278 | ## theoretical relative distance to the center 279 | distA = abs((center_A - (cov * As)) / cov) 280 | distB = abs((center_B - (cov * Bs)) / cov) 281 | 282 | # sys.stderr.write(f"Processing: 
{As}A{Bs}B; with center: {distA}, {distB}\n") 283 | centrality = distA + distB 284 | centralities.append(centrality) 285 | 286 | if len(centralities) == 0: 287 | return(1) 288 | return(fmean(centralities, weights=freqs)) 289 | 290 | def test_coverage_range(cov_tab, min_c, max_c, smudge_size_cutoff = 0.02): 291 | # covs_to_test = range(min_c, max_c) 292 | covs_to_test = arange(min_c + 0.05, max_c + 0.05, 2) 293 | cov_centralities = list() 294 | for cov in covs_to_test: 295 | smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff) 296 | cov_centralities.append(get_centrality(smudge_container, cov)) 297 | 298 | best_coverage = covs_to_test[argmin(cov_centralities)] 299 | 300 | tenths_to_test = arange(best_coverage - 1.9, best_coverage + 1.9, 0.2) 301 | tenths_centralities = list() 302 | for cov in tenths_to_test: 303 | smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff) 304 | tenths_centralities.append(get_centrality(smudge_container, cov)) 305 | 306 | best_tenth = tenths_to_test[argmin(tenths_centralities)] 307 | sys.stderr.write(f"Best coverage to precision of one tenth: {round(best_tenth, 2)}\n") 308 | 309 | hundredths_to_test = list(arange(best_tenth - 0.19, best_tenth + 0.19, 0.01)) 310 | hundredths_centralities = list() 311 | for cov in hundredths_to_test: 312 | smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff) 313 | hundredths_centralities.append(get_centrality(smudge_container, cov)) 314 | 315 | final_cov = hundredths_to_test[argmin(hundredths_centralities)] 316 | just_to_be_sure_cov = final_cov/2 317 | 318 | hundredths_to_test.append(just_to_be_sure_cov) 319 | smudge_container = get_smudge_container(cov_tab, just_to_be_sure_cov, smudge_size_cutoff) 320 | hundredths_centralities.append(get_centrality(smudge_container, just_to_be_sure_cov)) 321 | 322 | final_cov = hundredths_to_test[argmin(hundredths_centralities)] 323 | sys.stderr.write(f"Best coverage to precision of one hundredth: 
{round(final_cov, 3)}\n") 324 | 325 | all_coverages = concatenate((covs_to_test, tenths_to_test, hundredths_to_test)) 326 | all_centralities = concatenate((cov_centralities, tenths_centralities, hundredths_centralities)) 327 | 328 | return(DataFrame({'coverage': all_coverages, 'centrality': all_centralities})) 329 | 330 | ##################### 331 | # the script itself # 332 | ##################### 333 | 334 | def main(): 335 | _parser = parser() 336 | 337 | sys.stderr.write('Running smudgeplot v' + version + "\n") 338 | if _parser.task == "version": 339 | exit(0) 340 | 341 | sys.stderr.write('Task: ' + _parser.task + "\n") 342 | 343 | if _parser.task == "cutoff": 344 | cutoff(_parser.arguments) 345 | 346 | if _parser.task == "hetmers": 347 | # PloidyPlot is expected to be installed in the system as well as the R library supporting it 348 | args = _parser.arguments 349 | plot_args = " -o" + str(args.o) 350 | plot_args += " -e" + str(args.L) 351 | plot_args += " -T" + str(args.t) 352 | if args.verbose: 353 | plot_args += " -v" 354 | if args.tmp != '.': 355 | plot_args += " -P" + args.tmp 356 | plot_args += " " + args.infile 357 | 358 | sys.stderr.write("Calling: hetmers (PloidyPlot kmer pair search) " + plot_args + "\n") 359 | system("hetmers " + plot_args) 360 | 361 | if _parser.task == "plot": 362 | # the plotting script is expected to be installed in the system as well as the R library supporting it 363 | args = _parser.arguments 364 | 365 | plot_args = f'-i "{args.infile}" -s "{args.smudgefile}" -n {args.n} -o "{args.o}" ' + _parser.format_aguments_for_R_plotting() 366 | 367 | sys.stderr.write("Calling: smudgeplot_plot.R " + plot_args + "\n") 368 | system("smudgeplot_plot.R " + plot_args) 369 | 370 | if _parser.task == "peak_agregation": 371 | peak_agregation(_parser.arguments) 372 | 373 | if _parser.task == "all": 374 | args = _parser.arguments 375 | 376 | sys.stderr.write("\nLoading data\n") 377 | cov_tab = load_hetmers(args.infile) 378 | # cov_tab = 
load_hetmers("data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt") 379 | 380 | sys.stderr.write("\nMasking errors using the local aggregation algorithm\n") 381 | cov2peak = local_agregation(cov_tab, distance = 1, noise_filter = 1000, mask_errors = True) 382 | cov_tab['peak'] = [cov2peak[(covA, covB)] for idx, covB, covA, freq in cov_tab.itertuples()] 383 | 384 | cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True) 385 | total_kmers = sum(cov_tab['freq']) 386 | genomic_kmers = sum(cov_tab[cov_tab['peak'] == 0]['freq']) 387 | total_error_kmers = sum(cov_tab[cov_tab['peak'] == 1]['freq']) 388 | error_fraction = total_error_kmers / total_kmers 389 | sys.stderr.write(f"Total kmers: {total_kmers}\n\tGenomic kmers: {genomic_kmers}\n\tSequencing errors: {total_error_kmers}\n\tFraction of errors: {round(total_error_kmers/total_kmers, 3)}") 390 | 391 | with open(args.o + "_masked_errors_smu.txt", 'w') as error_annotated_smu: 392 | error_annotated_smu.write("covB\tcovA\tfreq\tis_error\n") 393 | for idx, covB, covA, freq, is_error in cov_tab.itertuples(): 394 | error_annotated_smu.write(f"{covB}\t{covA}\t{freq}\t{is_error}\n") # might not be needed 395 | 396 | sys.stderr.write("\nInferring 1n coverage using the grid algorithm\n") 397 | 398 | smudge_size_cutoff = 0.001 # the fraction of all k-mer pairs a smudge needs to have to be considered a valid smudge 399 | 400 | if args.cov == 0: # coverage not specified by the user 401 | centralities = test_coverage_range(cov_tab, args.cov_min, args.cov_max, smudge_size_cutoff) 402 | np.savetxt(args.o + "_centralities.txt", np.around(centralities, decimals=6), fmt="%.4f", delimiter = '\t') 403 | # plot(centralities['coverage'], centralities['coverage']) 404 | 405 | if error_fraction < 0.75: 406 | cov = centralities['coverage'][argmin(centralities['centrality'])] 407 | else: 408 | sys.stderr.write(f"Too many errors observed: {error_fraction}, not trusting coverage inference\n") 409 | cov = 0 410 | 411 | sys.stderr.write(f"\nInferred coverage: 
{cov}\n") 412 | else: 413 | cov = args.cov 414 | 415 | final_smudges = get_smudge_container(cov_tab, cov, 0.03) 416 | # sys.stderr.write(str(final_smudges) + '\n') 417 | 418 | annotated_smudges = list(final_smudges.keys()) 419 | smudge_sizes = [round(sum(final_smudges[smudge]['freq']) / genomic_kmers, 4) for smudge in annotated_smudges] 420 | 421 | sys.stderr.write(f'Detected smudges / sizes ({args.o}_smudge_sizes.txt):\n') 422 | sys.stderr.write('\t' + str(annotated_smudges) + '\n') 423 | sys.stderr.write('\t' + str(smudge_sizes) + '\n') 424 | 425 | # saving smudge sizes 426 | smudge_table = DataFrame({'smudge': annotated_smudges, 'size': smudge_sizes}) 427 | np.savetxt(args.o + "_smudge_sizes.txt", smudge_table, fmt='%s', delimiter = '\t') 428 | 429 | sys.stderr.write("\nPlotting\n") 430 | 431 | system("centrality_plot.R " + args.o + "_centralities.txt") 432 | # Rscript playground/alternative_fitting/alternative_plotting_testing.R -i data/dicots/peak_agregation/$ToLID.cov_tab_peaks -o data/dicots/peak_agregation/$ToLID 433 | args = _parser.arguments 434 | 435 | plot_args = f'-i "{args.o}_masked_errors_smu.txt" -s "{args.o}_smudge_sizes.txt" -n {round(cov, 3)} -o "{args.o}" ' + _parser.format_aguments_for_R_plotting() 436 | 437 | sys.stderr.write("Calling: smudgeplot_plot.R " + plot_args + "\n") 438 | system("smudgeplot_plot.R " + plot_args) 439 | 440 | sys.stderr.write("\nDone!\n") 441 | exit(0) 442 | 443 | if __name__=='__main__': 444 | main() 445 | -------------------------------------------------------------------------------- /exec/smudgeplot_plot.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | suppressPackageStartupMessages(library("methods")) 4 | suppressPackageStartupMessages(library("argparse")) 5 | suppressPackageStartupMessages(library("viridis")) 6 | 7 | # suppressPackageStartupMessages(library("smudgeplot")) 8 | 9 | ################# 10 | ### functions ### 11 | 
################# 12 | # retirying the smudgeplot R package 13 | get_col_ramp <- function(.args, delay = 0){ 14 | colour_ramp <- eval(parse(text = paste0(.args$col_ramp,"(", 32 - delay, ")"))) 15 | if (.args$invert_cols){ 16 | colour_ramp <- rev(colour_ramp) 17 | } 18 | colour_ramp <- c(rep(colour_ramp[1], delay), colour_ramp) 19 | return(colour_ramp) 20 | } 21 | 22 | wtd.quantile <- function(x, q=0.25, weight=NULL) { 23 | o <- order(x) 24 | n <- sum(weight) 25 | order <- 1 + (n - 1) * q 26 | low <- pmax(floor(order), 1) 27 | high <- pmin(ceiling(order), n) 28 | low_contribution <- high - order 29 | allq <- approx(x=cumsum(weight[o])/sum(weight), y=x[o], xout = c(low, high)/n, method = "constant", 30 | f = 1, rule = 2)$y 31 | low_contribution * allq[1] + (1 - low_contribution) * allq[2] 32 | } 33 | 34 | wtd.iqr <- function(x, w=NULL) { 35 | wtd.quantile(x, q=0.75, weight=w) - wtd.quantile(x, q=0.25, weight=w) 36 | } 37 | 38 | plot_alt <- function(cov_tab, ylim, colour_ramp, log = F){ 39 | A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB'] 40 | cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2 41 | if (log){ 42 | cov_tab[, 'freq'] <- log10(cov_tab[, 'freq']) 43 | } 44 | cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))] 45 | 46 | # c(bottom, left, top, right) 47 | par(mar=c(4.8,4.8,1,1)) 48 | plot(NULL, xlim = c(0, 0.5), ylim = ylim, 49 | xlab = 'Normalized minor kmer coverage: B / (A + B)', 50 | ylab = 'Total coverage of the kmer pair: A + B', cex.lab = 1.4, bty = 'n') 51 | min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov'])) 52 | nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab) 53 | return(0) 54 | } 55 | 56 | plot_one_coverage <- function(cov, cov_tab){ 57 | cov_row_to_plot <- cov_tab[cov_tab[, 'total_pair_cov'] == cov, ] 58 | width <- 1 / (2 * cov) 59 | cov_row_to_plot$left <- cov_row_to_plot[, 'minor_variant_rel_cov'] - width 60 | cov_row_to_plot$right <- 
sapply(cov_row_to_plot[, 'minor_variant_rel_cov'], function(x){ min(0.5, x + width)}) 61 | apply(cov_row_to_plot, 1, plot_one_box, cov) 62 | } 63 | 64 | plot_one_box <- function(one_box_row, cov){ 65 | left <- as.numeric(one_box_row['left']) 66 | right <- as.numeric(one_box_row['right']) 67 | rect(left, cov - 0.5, right, cov + 0.5, col = one_box_row['col'], border = NA) 68 | } 69 | 70 | plot_isoA_line <- function (.covA, .L, .col = "black", .ymax = 250, .lwd, .lty) { 71 | min_covB <- .L # min(.cov_tab[, 'covB']) # should be L really 72 | max_covB <- .covA 73 | B_covs <- seq(min_covB, max_covB, length = 500) 74 | isoline_x <- B_covs/ (B_covs + .covA) 75 | isoline_y <- B_covs + .covA 76 | lines(isoline_x[isoline_y < .ymax], isoline_y[isoline_y < .ymax], lwd = .lwd, lty = .lty, col = .col) 77 | } 78 | 79 | plot_isoB_line <- function (.covB, .ymax, .col = "black", .lwd, .lty) { 80 | cov_range <- seq((2 * .covB) - 2, .ymax, length = 500) 81 | lines((.covB)/cov_range, cov_range, lwd = .lwd, lty = .lty, col = .col) 82 | } 83 | 84 | plot_iso_grid <- function(.cov, .L, .ymax, .col = 'black', .lwd = 2, .lty = 2){ 85 | for (i in 0:15){ 86 | cov <- (i + 0.5) * .cov 87 | plot_isoA_line(cov, .L = .L, .ymax = .ymax, .col, .lwd = .lwd, .lty = .lty) 88 | if (i < 8){ 89 | plot_isoB_line(cov, .ymax, .col, .lwd = .lwd, .lty = .lty) 90 | } 91 | } 92 | } 93 | 94 | plot_expected_haplotype_structure <- function(.n, .peak_sizes, 95 | .adjust = F, .cex = 1.3, xmax = 0.49){ 96 | .peak_sizes <- .peak_sizes[.peak_sizes[, 'size'] > 0.05, ] 97 | .peak_sizes[, 'ploidy'] <- nchar(.peak_sizes[, 'structure']) 98 | 99 | decomposed_struct <- strsplit(.peak_sizes[, 'structure'], '') 100 | .peak_sizes[, 'corrected_minor_variant_cov'] <- sapply(decomposed_struct, function(x){ sum(x == 'B') } ) / .peak_sizes[, 'ploidy'] 101 | .peak_sizes[, 'label'] <- reduce_structure_representation(.peak_sizes[, 'structure']) 102 | 103 | borercases <- .peak_sizes$corrected_minor_variant_cov == 0.5 104 | 105 | for(i in 
1:nrow(.peak_sizes)){ 106 | # xmax is in the middle of the last square in the 2d histogram, 107 | # which is too far from the edge, so I average it with 0.49 108 | # which will pull the label a bit toward the edge 109 | text( ifelse( borercases[i] & .adjust, (xmax + 0.49) / 2, .peak_sizes$corrected_minor_variant_cov[i]), 110 | .peak_sizes$ploidy[i] * .n, .peak_sizes[i, 'label'], 111 | offset = 0, cex = .cex, xpd = T, pos = ifelse( borercases[i] & .adjust, 2, 1)) 112 | } 113 | } 114 | 115 | reduce_structure_representation <- function(smudge_labels){ 116 | structures_to_adjust <- (sapply(smudge_labels, nchar) > 4) 117 | 118 | if ( any(structures_to_adjust) ) { 119 | decomposed_struct <- strsplit(smudge_labels[structures_to_adjust], '') 120 | As <- sapply(decomposed_struct, function(x){ sum(x == 'A') } ) 121 | Bs <- sapply(decomposed_struct, length) - As 122 | smudge_labels[structures_to_adjust] <- paste0(As, 'A', Bs, 'B') 123 | } 124 | return(smudge_labels) 125 | } 126 | 127 | plot_legend <- function(kmer_max, .colour_ramp, .log_scale = T){ 128 | par(mar=c(0,0,2,1)) 129 | plot.new() 130 | print_title <- ifelse(.log_scale, 'log k-mer pairs', 'k-mer pairs') 131 | title(print_title) 132 | for(i in 1:32){ 133 | rect(0,(i - 0.01) / 33, 0.5, (i + 0.99) / 33, col = .colour_ramp[i]) 134 | } 135 | # kmer_max <- max(smudge_container$dens) 136 | if( .log_scale == T ){ 137 | for(i in 0:6){ 138 | text(0.75, i / 6, rounding(10^(log10(kmer_max) * i / 6)), offset = 0) 139 | } 140 | } else { 141 | for(i in 0:6){ 142 | text(0.75, i / 6, rounding(kmer_max * i / 6), offset = 0) 143 | } 144 | } 145 | } 146 | 147 | rounding <- function(number){ 148 | if(number > 1000){ 149 | round(number / 1000) * 1000 150 | } else if (number > 100){ 151 | round(number / 100) * 100 152 | } else { 153 | round(number / 10) * 10 154 | } 155 | } 156 | 157 | ############# 158 | ## SETTING ## 159 | ############# 160 | 161 | parser <- ArgumentParser() 162 | parser$add_argument("--homozygous", action="store_true", 
default = F, 163 | help="Assume no heterozygosity in the genome - plotting a paralog structure; [default FALSE]") 164 | parser$add_argument("-i", "--input", default = "*_smu.txt", 165 | help="name of the input tsv file with coverages [default \"*_smu.txt\"]") 166 | parser$add_argument("-s", "--smudges", default = "not_specified", 167 | help="name of the input tsv file with annotated smudges and their respective sizes") 168 | parser$add_argument("-o", "--output", default = "smudgeplot", 169 | help="name pattern used for the output files (OUTPUT_smudgeplot.png, OUTPUT_summary.txt, OUTPUT_warnings.txt) [default \"smudgeplot\"]") 170 | parser$add_argument("-t", "--title", 171 | help="name printed at the top of the smudgeplot [default none]") 172 | parser$add_argument("-q", "--quantile_filt", type = "double", 173 | help="Remove kmer pairs with coverage over the specified quantile; [default none]") 174 | parser$add_argument("-n", "--n_cov", type = "double", 175 | help="the haploid coverage of the sequencing data [default inferred from data]") 176 | parser$add_argument("-c", "-cov_filter", type = "integer", 177 | help="Filter pairs with one of them having coverage below the specified threshold [default 0]") 178 | parser$add_argument("-ylim", type = "integer", 179 | help="The upper limit for the coverage sum (the y axis)") 180 | parser$add_argument("-col_ramp", default = "viridis", 181 | help="A colour ramp available in your R session [viridis]") 182 | parser$add_argument("--invert_cols", action="store_true", default = F, 183 | help="Set this flag to invert the colours of the smudgeplot (dark for high, light for low densities)") 184 | 185 | args <- parser$parse_args() 186 | 187 | colour_ramp_log <- get_col_ramp(args, 16) # create palette for the log plots 188 | colour_ramp <- get_col_ramp(args) # create palette for the linear plots 189 | 190 | if ( !file.exists(args$input) ) { 191 | stop("The input file was not found. 
Please use --help to get help", call.=FALSE) 192 | } 193 | 194 | cov_tab <- read.table(args$input, header = T) # col.names = c('covB', 'covA', 'freq', 'is_error'), 195 | smudge_tab <- read.table(args$smudges, col.names = c('structure', 'size')) 196 | 197 | # total coverage of the kmer pair 198 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 'covA'] + cov_tab[, 'covB'] 199 | # calculate the relative coverage of the minor allele 200 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 'covB'] / cov_tab[, 'total_pair_cov'] 201 | 202 | ##### coverage filtering 203 | 204 | if ( !is.null(args$c) ){ 205 | threshold <- args$c 206 | low_cov_filt <- cov_tab[, 'covA'] < threshold | cov_tab[, 'covB'] < threshold 207 | # smudge_warn(args$output, "Removing", sum(cov_tab[low_cov_filt, 'freq']), 208 | # "kmer pairs for which one of the pair had coverage below", 209 | # threshold, paste0("(Specified by argument -c ", args$c, ")")) 210 | cov_tab <- cov_tab[!low_cov_filt, ] 211 | # smudge_warn(args$output, "Processing", sum(cov_tab[, 'freq']), "kmer pairs") 212 | } 213 | 214 | ##### quantile filtering 215 | if ( !is.null(args$q) ){ 216 | # quantile filtering (remove the top q%, it's not really informative) 217 | threshold <- wtd.quantile(cov_tab[, 'total_pair_cov'], args$q, cov_tab[, 'freq']) 218 | high_cov_filt <- cov_tab[, 'total_pair_cov'] < threshold 219 | # smudge_warn(args$output, "Removing", sum(cov_tab[!high_cov_filt, 'freq']), 220 | # "kmer pairs with coverage higher than", 221 | # threshold, paste0("(", args$q, " quantile)")) 222 | cov_tab <- cov_tab[high_cov_filt, ] 223 | } 224 | 225 | cov <- args$n_cov 226 | if (cov == wtd.quantile(cov_tab[, 'total_pair_cov'], 0.95, cov_tab[, 'freq'])){ 227 | ylim <- c(min(cov_tab[, 'total_pair_cov']), max(cov_tab[, 'total_pair_cov'])) 228 | } else { 229 | ylim <- c(min(cov_tab[, 'total_pair_cov']) - 1, # or 0? 
230 | min(max(100, 10*cov), max(cov_tab[, 'total_pair_cov']))) 231 | } 232 | 233 | xlim <- c(0, 0.5) 234 | error_fraction <- sum(cov_tab[, 'is_error'] * cov_tab[, 'freq']) / sum(cov_tab[, 'freq']) * 100 235 | error_string <- paste("err =", round(error_fraction, 1), "%") 236 | cov_string <- paste0("1n = ", cov) 237 | 238 | if (!is.null(args$ylim)){ # if ylim is specified, set the boundary by the argument instead 239 | ylim[2] <- args$ylim 240 | } 241 | 242 | fig_title <- ifelse(length(args$title) == 0, NA, args$title[1]) 243 | # histogram_bins = max(30, args$nbins) 244 | 245 | ########## 246 | # LINEAR # 247 | ########## 248 | pdf(paste0(args$output,'_smudgeplot.pdf')) 249 | 250 | # layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3)) 251 | layout(matrix(c(4,2,1,3), 2, 2, byrow=T), c(3,1), c(1,3)) 252 | # 1 smudge plot 253 | plot_alt(cov_tab, ylim, colour_ramp_log) 254 | if (cov > 0){ 255 | plot_expected_haplotype_structure(cov, smudge_tab, T, xmax = 0.49) 256 | } 257 | 258 | 259 | # 4 legend 260 | plot_legend(max(cov_tab[, 'freq']), colour_ramp, F) 261 | 262 | ### add annotation 263 | # print smudge sizes 264 | plot.new() 265 | if (cov > 0){ 266 | legend('topleft', bty = 'n', reduce_structure_representation(smudge_tab[,'structure']), cex = 1.1) 267 | legend('top', bty = 'n', legend = round(smudge_tab[,2], 2), cex = 1.1) 268 | legend('bottomleft', bty = 'n', legend = c(cov_string, error_string), cex = 1.1) 269 | } else { 270 | legend('bottomleft', bty = 'n', legend = error_string, cex = 1.1) 271 | } 272 | 273 | plot.new() 274 | mtext(bquote(italic(.(fig_title))), side=3, adj=0.1, line=-3, cex = 1.6) 275 | 276 | 277 | dev.off() 278 | 279 | ############ 280 | # log plot # 281 | ############ 282 | 283 | pdf(paste0(args$output,'_smudgeplot_log10.pdf')) 284 | 285 | layout(matrix(c(4,2,1,3), 2, 2, byrow=T), c(3,1), c(1,3)) 286 | # cov_tab[, 'freq'] <- log10(cov_tab[, 'freq']) 287 | # 1 smudge plot 288 | plot_alt(cov_tab, ylim, colour_ramp_log, log = T) 289 | 290 | if 
(cov > 0){ 291 | plot_expected_haplotype_structure(cov, smudge_tab, T, xmax = 0.49) 292 | } 293 | 294 | # 4 legend 295 | plot_legend(max(cov_tab[, 'freq']), colour_ramp_log, T) 296 | 297 | # print smudge sizes 298 | plot.new() 299 | if (cov > 0){ 300 | legend('topleft', bty = 'n', reduce_structure_representation(smudge_tab[,'structure']), cex = 1.1) 301 | legend('top', bty = 'n', legend = round(smudge_tab[,2], 2), cex = 1.1) 302 | legend('bottomleft', bty = 'n', legend = c(cov_string, error_string), cex = 1.1) 303 | } else { 304 | legend('bottomleft', bty = 'n', legend = error_string, cex = 1.1) 305 | } 306 | 307 | 308 | plot.new() 309 | mtext(bquote(italic(.(fig_title))), side=3, adj=0.1, line=-3, cex = 1.6) 310 | 311 | dev.off() -------------------------------------------------------------------------------- /playground/BGA_tutorial.md: -------------------------------------------------------------------------------- 1 | ## Smudgeplot 2 | 3 | Have you ever sequenced something not-well studied? Something that might show strange genomic signatures? Smudgeplot is a visualisation technique for whole-genome sequencing reads from a single individual. The technique is based on the idea of het-mers. Het-mers are k-mer pairs that are exactly one nucleotide away from each other, while forming a unique pair in the sequencing dataset. These k-mer pairs are assumed to mostly represent the two alleles of a heterozygous site, but they can also capture pairs of imperfect paralogs, or sequencing errors paired up with a homozygous genomic k-mer. Nevertheless, the ploidy predicted by smudgeplot is simply the ploidy with the highest number of k-mer pairs (whether that is a reasonable estimate must be evaluated for each individual case!). 4 | 5 | 6 | 7 | ### Installing the software 8 | 9 | Open gitpod and install the development version of smudgeplot (branch sploidyplot) & FastK.
10 | 11 | ``` 12 | mkdir src bin && cd src # create directories for source code & binaries 13 | git clone -b sploidyplot https://github.com/KamilSJaron/smudgeplot 14 | git clone https://github.com/thegenemyers/FastK 15 | ``` 16 | 17 | Now `make` installs the smudgeplot R package, compiles the C kernel for searching for k-mer pairs, and copies all the executables to `/workspace/bin/` (which will be our dedicated spot for executables). 18 | 19 | ``` 20 | cd smudgeplot && make -s INSTALL_PREFIX=/workspace && cd .. 21 | cd FastK && make FastK Histex 22 | install -c Histex FastK /workspace/bin/ 23 | ``` 24 | 25 | **8 Datasets** 26 | 27 | | Species name | SRA/ENA ID | 28 | | --- | --- | 29 | | Pseudoloma neurophilia | SRR926312 | 30 | | Tubulinosema ratisbonensis | ERR3154977 | 31 | | Nosema ceranae | SRR17317293 | 32 | | Nematocida ausubeli | SRR058692 | 33 | | Nematocida ausubeli | SRR350188 | 34 | | Hamiltosporidium tvaerminnensis | SRR16954898 | 35 | | Encephalitozoon hellem | SRR14017862 | 36 | | Agmasoma penaei | SRR926341 | 37 | 38 | TODO: get URLs for them 39 | 40 | Finally, once your session is running, start downloading the data. For example: 41 | 42 | ``` 43 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR926/SRR926341/SRR926341_[12].fastq.gz 44 | ``` 45 | 46 | ### Constructing a database 47 | 48 | The whole process operates on raw or trimmed sequencing reads. From those we generate a k-mer database using [FastK](https://github.com/thegenemyers/FASTK). FastK is currently the fastest k-mer counter out there and the only one supported by the latest version of smudgeplot*. This database contains an index of all the k-mers and their coverages in the sequencing read set. Within this set the user must choose a threshold for excluding low-frequency k-mers that will be considered errors. That choice is not too difficult to make by looking at the k-mer spectrum. Of all the retained k-mers we find all the het-mers. Then we plot a 2d histogram.
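The mapping from a het-mer pair to a point in that 2d histogram is simple enough to sketch in a few lines (a toy illustration of the coordinates only, not the actual FastK/PloidyPlot code; `hetmer_point` is a made-up name):

```python
from collections import Counter

def hetmer_point(cov1, cov2):
    """Map the coverages of a het-mer pair to smudgeplot coordinates:
    (minor coverage B / total coverage A + B, total coverage A + B)."""
    cov_b, cov_a = sorted([cov1, cov2])  # B is the less covered k-mer of the pair
    total = cov_a + cov_b
    return round(cov_b / total, 2), total

# bin a few example pairs into the 2d histogram
pairs = [(30, 10), (10, 30), (40, 20), (22, 18)]
histogram = Counter(hetmer_point(a, b) for a, b in pairs)
print(histogram)
```

In a diploid, most pairs pile up around B / (A + B) = 0.5 at the diploid coverage; higher ploidies produce additional smudges at 1/3, 1/4 and so on.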
49 | 50 | 51 | *Note: Previous versions of smudgeplot (up to 2.5.0) operated on k-mer "dumps", flat text files you can generate with any k-mer counter you like. As you can imagine, text files are very inefficient to operate on. The new version operates directly on the optimised k-mer database instead. 52 | 53 | ``` 54 | FastK -v -t4 -k31 -M16 -T4 *.fastq.gz -NSRR8495097 55 | ``` 56 | 57 | 20' 58 | 59 | 23:24 60 | 61 | one file also takes ~20'; it's mostly a function of the number of k-mers, we could maybe speed it up by choosing a higher -t? 62 | 63 | ### Getting the k-mer spectra out of it 64 | 65 | ``` 66 | Histex -G SRR8495097 > SRR8495097_k31.hist 67 | 68 | | GeneScopeFK -o data/Pvir1/GenomeScopeFK/ -k 17 69 | 70 | 71 | # Run PloidyPlot to find all k-mer pairs in the dataset 72 | PloidyPlot -e12 -k -v -T4 -odata/Scer/kmerpairs data/Scer/FastK_Table 73 | # this generates the `data/Scer/kmerpairs_text.smu` file; 74 | # it's a flat file with three columns: covB, covA and freq (the number of k-mer pairs with these respective coverages) 75 | 76 | # use the .smu file to infer ploidy and create a smudgeplot 77 | smudgeplot.py plot -n 15 -t Saccharomyces -o data/Scer/trial_run data/Scer/kmerpairs_text.smu 78 | ``` 79 | -------------------------------------------------------------------------------- /playground/DEVELOPMENT.md: -------------------------------------------------------------------------------- 1 | # STANDARDS 2 | 3 | - spaces around operators 4 | - snake_case for variables and functions 5 | - camelCase for classes and methods 6 | - verbose naming is more important than detailed documentation 7 | - R code is tested using `testthat` and python code in the `dev` branch using `unittest` 8 | 9 | ## versioning 10 | 11 | - we try to keep the `master` branch clean (i.e. production-ready code). 12 | - the `dev` branch should also be working code; however, mistakes are permitted here.
Most of the development is done in sub-branches of the `dev` branch; once a feature is implemented, it should be merged to `dev`. I like to use `--no-ff` to keep a record of what was developed where (this might be a bad practice; if you know why, let me know). 13 | - One rule I would like to keep is this: there must be an incubation period of at least 72 hours between commits being merged into `dev` and `dev` being merged into `master`. The reason is simple: if there are any mistakes that were not spotted, there is a chance to catch them. Also, it takes a while to run all the tests and such (the travis.ci testing is not working yet, but I hope it will be quite soon). 14 | 15 | ## language 16 | 17 | The future is a `C` backend based on [FastK](https://github.com/thegenemyers/FASTK), inference and plotting in `R`, and a `python` user interface. -------------------------------------------------------------------------------- /playground/alternative_fitting/README.md: -------------------------------------------------------------------------------- 1 | # Sploidyplot 2 | 3 | The goal is to have a smudge inference based on an explicit model. I hoped for a model that would make a lot of sense - based on negative binomials. 4 | 5 | Gene - made his own EM algorithm, I think.
I could not decipher it. 6 | Richard and Tianyi - made me an EM algorithm that works simply on coverages A and B, also using normal distributions 7 | 8 | 9 | ## alternative plotting 10 | 11 | A minimalist attempt: 12 | 13 | ```R 14 | minidata <- daArtCamp1[daArtCamp1[, 'total_pair_cov'] < 20, ] 15 | coverages_to_plot <- unique(minidata[, 'total_pair_cov']) 16 | number_of_coverages_to_plot <- length(coverages_to_plot) 17 | mini_ylim <- c(5, 20) 18 | L <- 4 19 | cols <- c(rgb(1,0,0, 0.5), rgb(1,1,0, 0.5), rgb(0,1,1, 0.5), rgb(1,0,1, 0.5), rgb(0,1,0, 0.5), rgb(0,0,1, 0.5)) 20 | # plot(1:6, pch = 20, cex = 5, col = cols) 21 | 22 | plot_dot_smudgeplot(minidata, rep('black', 32), xlim, mini_ylim, cex = 3) 23 | points((L - 1) / coverages_to_plot, coverages_to_plot, cex = 3, pch = 20, col = 'blue') 24 | 25 | for( cov in 8:19){ 26 | rect(0, cov - 0.5, 0.5, cov + 0.5, col = NA, border = 'black') 27 | width <- 1 / (2 * cov) 28 | min_ratio <- L / cov 29 | rect(min_ratio - width, cov - 0.5, min(0.5, min_ratio + width), cov + 0.5, col = sample(cols, 1)) 30 | } 31 | 32 | text(rep(0.05, number_of_coverages_to_plot), 8:19, 8:19) 33 | ``` 34 | 35 | This is a more serious attempt that does not really work. 36 | 37 | Alternative local aggregation: 38 | 39 | ```bash 40 | for ToLID in daAchMill1 daAchPtar1 daAdoMosc1 daAjuCham1 daAjuRept1 daArcMinu1 daArtCamp1 daArtMari1 daArtVulg1 daAtrBela1; do 41 | python3 playground/alternative_fitting/pair_clustering.py data/dicots/smu_files/$ToLID.k31_ploidy.smu.txt --mask_errors > data/dicots/peak_agregation/$ToLID.cov_tab_peaks 42 | Rscript playground/alternative_fitting/alternative_plotting_testing.R -i data/dicots/peak_agregation/$ToLID.cov_tab_peaks -o data/dicots/peak_agregation/$ToLID 43 | done 44 | ``` 45 | 46 | This worked well. The aggregation produced beautiful blocks, largely of the same shape, as noticed by Richard. He suggested we should fix their shape and fit only a single parameter - coverage.
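The single-parameter idea can be sketched as a 1-D grid search: score each candidate 1n coverage by how well the observed k-mer pairs sit on integer multiples of it, and keep the best candidate (a toy illustration with made-up names `centrality_score` and `fit_coverage`; the package's actual centrality statistic is different):

```python
def centrality_score(cov, pairs):
    """Score a candidate 1n coverage by how close each k-mer pair's
    covA and covB sit to integer multiples of that coverage.
    `pairs` is a list of (covB, covA, freq) tuples."""
    score = 0.0
    for cov_b, cov_a, freq in pairs:
        # distance of each coverage to the nearest multiple of cov, relative to cov
        dist_a = abs(cov_a / cov - round(cov_a / cov))
        dist_b = abs(cov_b / cov - round(cov_b / cov))
        score += freq * (1 - dist_a - dist_b)
    return score

def fit_coverage(pairs, lo, hi, step=0.1):
    """Grid search over candidate coverages, keeping the best-scoring one."""
    candidates = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return max(candidates, key=lambda c: centrality_score(c, pairs))

# toy data: a 1A1B smudge at 20 + 20 and a 2A1B smudge at 40 + 20
pairs = [(20, 20, 100), (20, 40, 60)]
print(round(fit_coverage(pairs, 15, 25), 1))  # → 20.0
```

The search window matters: on the full window the score is also perfect at integer divisors of the true coverage, which is exactly the ploidy ambiguity the README warns about.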
47 | 48 | ```R 49 | smudge_tab <- read.table('data/dicots/peak_agregation/daArtMari1.cov_tab_peaks', col.names = c('covB', 'covA', 'freq', 'smudge')) 50 | 51 | all_smudges <- unique(smudge_tab[, 'smudge']) 52 | all_smudge_sizes <- sapply(all_smudges, function(x){ sum(smudge_tab[smudge_tab[, 'smudge'] == x, 'freq']) }) 53 | 54 | # plot(sort(all_smudge_sizes, decreasing = T) / sum(all_smudge_sizes), ylim = c(0, 1)) 55 | # sort(all_smudge_sizes, decreasing = T) / sum(all_smudge_sizes) > 0.02 56 | # 2% of data sounds reasonable 57 | 58 | smudges <- all_smudges[all_smudge_sizes / sum(all_smudge_sizes) > 0.02 & all_smudges != 0] 59 | smudge_sizes <- all_smudge_sizes[all_smudge_sizes / sum(all_smudge_sizes) > 0.02 & all_smudges != 0] 60 | 61 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2] 62 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov'] 63 | 64 | 65 | smudge_tab_reduced <- smudge_tab[smudge_tab[, 'smudge'] %in% smudges, ] 66 | source('playground/alternative_fitting/alternative_plotting_functions.R') 67 | 68 | per_smudge_cov_list <- lapply(smudges, function(x){ smudge_tab_reduced[smudge_tab_reduced$smudge == x, ] }) 69 | names(per_smudge_cov_list) <- smudges 70 | 71 | cov_sum_summary <- sapply(per_smudge_cov_list, function(x){ summary(x[, 'total_pair_cov']) } ) 72 | rel_cov_summary <- sapply(per_smudge_cov_list, function(x){ summary(x[, 'minor_variant_rel_cov']) } ) 73 | 74 | colnames(cov_sum_summary) <- colnames(rel_cov_summary) <- smudges 75 | 76 | data.frame(smudges = smudges, total_pair_cov = round(cov_sum_summary[4, ], 1), minor_variant_rel_cov = round(rel_cov_summary[4, ], 3)) 77 | 78 | head(per_smudge_cov_list[['2']]) 79 | 80 | one_smudge <- per_smudge_cov_list[['2']] 81 | one_smudge[one_smudge[, 'minor_variant_rel_cov'] == 0.5, ] 82 | 83 | table(one_smudge[, 'covB']) 84 | (one_smudge[, 'minor_variant_rel_cov']) 85 | 86 | 87 | 88 | # cov_range <- seq((2 * .L) - 2, max_cov_pair, length = 500) 89 | #
lines((.L - 1)/cov_range, cov_range, lwd = 2.5, lty = 2, 90 | 91 | plot_peakmap(smudge_tab_reduced, xlim = c(0, 0.5), ylim = c(0, 300)) 92 | 93 | plot_seq_error_line(smudge_tab, 4) 94 | plot_seq_error_line(smudge_tab, 13) 95 | plot_seq_error_line(smudge_tab, 48) 96 | plot_seq_error_line(smudge_tab, 80) 97 | 98 | one_smudge <- per_smudge_cov_list[['4']] 99 | min(one_smudge[ ,'total_pair_cov']) 100 | 101 | one_smudge[one_smudge[ ,'total_pair_cov'] == 61, ] 102 | one_smudge <- one_smudge[order(one_smudge[, 'minor_variant_rel_cov']), ] 103 | 104 | right_part_of_the_smudge <- one_smudge[one_smudge[ ,'minor_variant_rel_cov'] > 0.2131147, ] 105 | 106 | all_minor_var_rel_covs <- sort(unique(round(right_part_of_the_smudge[, 'minor_variant_rel_cov'], 2))) 107 | corresponding_min_cov_sums <- sapply(all_minor_var_rel_covs, function(x){ min(right_part_of_the_smudge[round(right_part_of_the_smudge[, 'minor_variant_rel_cov'], 2) == x, 'total_pair_cov']) } ) 108 | 109 | lines(all_minor_var_rel_covs, corresponding_min_cov_sums, lwd = 3, lty = 3, col = 'red') 110 | 111 | subtract_line <- function(rel_cov, cov_tab){ 112 | approx_rel_cov = round(rel_cov, 2) 113 | band_covs = round(cov_tab[, 'minor_variant_rel_cov'], 2) == approx_rel_cov 114 | cov_tab[band_covs, ][which.min(cov_tab[band_covs, 'total_pair_cov']), ] 115 | } 116 | 117 | edge_points <- t(sapply(all_minor_var_rel_covs, subtract_line, right_part_of_the_smudge)) 118 | total_pair_cov <- sapply(1:29, function(x){edge_points[[x,5]]}) 119 | minor_variant_rel_cov <- sapply(1:29, function(x){edge_points[[x,6]]}) 120 | lm(total_pair_cov ~ minor_variant_rel_cov + I(minor_variant_rel_cov^2)) 121 | 122 | plot_isoA_line <- function (.covA, .cov_tab, .col = "black") { 123 | min_covB <- min(.cov_tab[, 'covB']) # should be L really 124 | max_covB <- .covA 125 | B_covs <- seq(min_covB, max_covB, length = 500) 126 | lines(B_covs/ (B_covs + .covA), B_covs + .covA, lwd = 2.5, lty = 2, 127 | col = .col) 128 | } 129 | 130 | plot_isoA_line(48, 
smudge_tab, 'blue') 131 | plot_isoA_line(79, smudge_tab, 'blue') 132 | plot_isoA_line(110, smudge_tab, 'blue') 133 | plot_isoA_line(141, smudge_tab, 'blue') 134 | plot_isoA_line(172, smudge_tab, 'blue') 135 | 136 | ``` 137 | 138 | HA, looks great! Let's plot it on the background... 139 | 140 | ```R 141 | library(smudgeplot) 142 | source('playground/alternative_fitting/alternative_plotting_functions.R') 143 | colour_ramp <- viridis(32) 144 | 145 | 146 | smudge_tab <- read.table('data/dicots/peak_agregation/daAchMill1.cov_tab_errors', col.names = c('covB', 'covA', 'freq', 'is_error')) 147 | # smudge_tab <- read.table('data/dicots/peak_agregation/daArtMari1.cov_tab_peaks', col.names = c('covB', 'covA', 'freq', 'smudge')) 148 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2] 149 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov'] 150 | cov = 31.1 # this is from GenomeScope this time 151 | 152 | plot_alt(smudge_tab[smudge_tab[, 'is_error'] != 0, ], c(0, 100), colour_ramp, T) 153 | plot_alt(smudge_tab[smudge_tab[, 'is_error'] != 1, ], c(0, 100), colour_ramp, T) 154 | plot_alt(smudge_tab, c(0, 100), colour_ramp, T) 155 | plot_iso_grid(31.1, 4, 100) 156 | plot_smudge_labels(18.1, 100) 157 | # .peak_points, .peak_sizes, .min_kmerpair_cov, .max_kmerpair_cov, col = "red" 158 | dev.off() 159 | 160 | plot_iso_grid() 161 | 162 | plot_smudge_labels(cov, 240) 163 | text(0.49, cov / 2, "2err", offset = 0, cex = 1.3, xpd = T, pos = 2) 164 | ``` 165 | 166 | Say we will test ploidy up to 16 (capturing up to octoploid paralogs). 
That makes 167 | 168 | ```R 169 | smudge_tab_with_err <- read.table('data/dicots/peak_agregation/daAchMill1.cov_tab_errors', col.names = c('covB', 'covA', 'freq', 'is_error')) 170 | 171 | smudge_filtering_threshold <- 0.01 # at least 1% of genomic kmers 172 | colour_ramp <- viridis(32) 173 | 174 | # # error band, done on non filtered data 175 | # smudge_tab[, 'edgepoint'] <- F 176 | # smudge_tab[smudge_tab[, 'covB'] < L + 3, 'edgepoint'] <- T 177 | # plot_alt(smudge_tab[smudge_tab[, 'edgepoint'], ], c(0, 500), colour_ramp, T) 178 | 179 | cov <- 19.55 # this will be unknown 180 | L <- min(smudge_tab_with_err[, 'covB']) 181 | smudge_tab <- smudge_tab_with_err[smudge_tab_with_err[, 'is_error'] == 0, ] 182 | genomic_kmer_pairs <- sum(smudge_tab[ ,'freq']) 183 | 184 | plot_alt(smudge_tab, c(0, 300), colour_ramp, T) 185 | 186 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2] 187 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov'] 188 | 189 | plot_alt(smudge_tab, c(0, 300), colour_ramp) 190 | plot_all_smudge_labels(cov, 300) 191 | dev.off() 192 | 193 | #### isolating all smudges given cov 194 | 195 | # total_genomic_kmer_pairs <- sum(smudge_tab[, 'freq']) 196 | 197 | # plot_alt(smudge_container[[1]], c(0, 300), colour_ramp, T) 198 | # looks good! 199 | 200 | # two functions need to be sourced from the smudgeplot package here 201 | covs_to_test <- seq(10.05, 60.05, by = 0.1) 202 | centrality_grid <- sapply(covs_to_test, run_replicate, smudge_tab, smudge_filtering_threshold) 203 | covs_to_test[which.max(centrality_grid)] 204 | 205 | sapply(c(21.71, 21.72, 21.73), run_replicate, smudge_tab, smudge_filtering_threshold) 206 | # 21.72 is our winner!
207 | 208 | tested_covs <- test_grid_of_coverages(smudge_tab, smudge_filtering_threshold, min_to_explore, max_to_explore) 209 | plot(tested_covs[, 'cov'], tested_covs[, 'centrality']) 210 | 211 | ``` 212 | 213 | Fixing the main package: 214 | 215 | ```bash 216 | for smu_file in data/dicots/smu_files/*.k31_ploidy.smu.txt; do 217 | ToLID=$(basename $smu_file .k31_ploidy.smu.txt); 218 | smudgeplot.py all $smu_file -o data/dicots/grid_fits/$ToLID 219 | done 220 | 221 | 222 | ``` 223 | 224 | 225 | ## Homopolymer compressed testing 226 | 227 | Datasets with lots of errors. Saccharomyces will do. 228 | 229 | ``` 230 | FastK -v -c -t4 -k31 -M16 -T4 data/Scer/SRR3265401_[12].fastq.gz -Ndata/Scer/FastK_Table_hc 231 | hetmers -e4 -k -v -T4 -odata/Scer/kmerpairs_hc data/Scer/FastK_Table_hc 232 | 233 | smudgeplot.py hetmers -L 4 -t 4 -o data/Scer/kmerpairs_default_e --verbose data/Scer/FastK_Table 234 | 235 | smudgeplot.py all -o data/Scer/homopolymer_e4_wo data/Scer/kmerpairs_default_e_text.smu 236 | 237 | smudgeplot.py all -o data/Scer/homopolymer_e4_with data/Scer/kmerpairs_hc_text.smu 238 | ``` 239 | 240 | ## Other 241 | 242 | 243 | ### .smu to smu.txt 244 | 245 | For the legacy `.smu` files, we have a converter to flat files.
246 | 247 | ```bash 248 | gcc src_ploidyplot/smu2text_smu.c -o exec/smu2text_smu 249 | exec/smu2text_smu data/ddSalArbu1/ddSalArbu1.k31_ploidy.smu | less 250 | ``` -------------------------------------------------------------------------------- /playground/alternative_fitting/alternative_plot_covA_covB.R: -------------------------------------------------------------------------------- 1 | library(ggplot2) 2 | 3 | plot_unsquared_smudgeplot <- function(cov_tab, colour_ramp, xlim, ylim){ 4 | # this is the adjustment for plotting 5 | # cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] <- cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] * 2 6 | cov_tab$col = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))] 7 | 8 | plot(NULL, xlim = xlim, ylim = ylim, 9 | xlab = 'covA', 10 | ylab = 'covB', cex.lab = 1.4) 11 | 12 | ## This might bite me in the a.., instead of taking L as an argument, I guess it from the data 13 | # L = floor(min(cov_tab_daAchMill1[, 'total_pair_cov']) / 2) 14 | ggplot(cov_tab, aes(x=covA, y=covB, weight = weight)) + 15 | geom_bin2d() + 16 | theme_bw() 17 | 18 | } 19 | 20 | real_clean <- read.table('data/Fiin/kmerpairs_k51_text.smu', col.names = c('covA', 'covB', 'freq')) 21 | real_clean$weight <- real_clean$freq / sum(real_clean$freq) 22 | 23 | # plot(real_clean[, 'covA'], real_clean[, 'covB']) 24 | 25 | xlim <- range(real_clean[, 'covA']) 26 | ylim <- range(real_clean[, 'covB']) 27 | 28 | library(smudgeplot) 29 | args <- list() 30 | args$col_ramp <- 'viridis' 31 | args$invert_cols <- F 32 | colour_ramp <- get_col_ramp(args) 33 | real_clean$col <- colour_ramp[1 + round(31 * real_clean$freq / max(real_clean$freq))] 34 | 35 | plot(NULL, xlim = xlim, ylim = ylim, 36 | xlab = 'covA', 37 | ylab = 'covB', cex.lab = 1.4) 38 | 39 | ggplot(real_clean, aes(x=covA, y=covB, weight = weight)) + 40 | geom_bin2d() + 41 | theme_bw() 42 | 43 | head(real_clean) 44 | 45 | 46 | # plotSquare <- function(row){ 47 | # rect(as.numeric(row['covA']) - 0.5, 
as.numeric(row['covB']) - 0.5, as.numeric(row['covA']) + 0.5, as.numeric(row['covB']) + 0.5, col = row['col'], border = NA) 48 | # } 49 | # apply(real_clean, 1, plotSquare) 50 | -------------------------------------------------------------------------------- /playground/alternative_fitting/alternative_plotting.R: -------------------------------------------------------------------------------- 1 | # HM Revenue and custom Tax Return form; 2 | # Needs to be 2023 Jan to 2024 Jan 3 | 4 | 5 | 6 | library(smudgeplot) 7 | source('playground/alternative_fitting/alternative_plotting_functions.R') 8 | 9 | cov_tab_daAchMill1 <- read.table('data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt', col.names = c('covB', 'covA', 'freq')) 10 | ylim = c(0, 250) 11 | 12 | xlim = c(0, 0.5) 13 | 14 | 15 | 16 | cov_tab_daAchMill1[, 'total_pair_cov'] <- cov_tab_daAchMill1[, 1] + cov_tab_daAchMill1[, 2] 17 | cov_tab_daAchMill1[, 'minor_variant_rel_cov'] <- cov_tab_daAchMill1[, 1] / cov_tab_daAchMill1[, 'total_pair_cov'] 18 | 19 | args <- list() 20 | args$col_ramp <- 'viridis' 21 | args$invert_cols <- F 22 | colour_ramp <- get_col_ramp(args) 23 | cols = colour_ramp[1 + round(31 * cov_tab_daAchMill1$freq / max(cov_tab_daAchMill1$freq))] 24 | 25 | # solving this "density" problem: (cov1, cov1) pairs have twice lower probability than (cov1, cov2) pairs; we need to multiply these points, but it needs to be somehow corrected for the fit / summaries 26 | 27 | plot_dot_smudgeplot(cov_tab_daAchMill1, colour_ramp, xlim, ylim) 28 | 29 | plot_unsquared_smudgeplot(cov_tab_daAchMill1, colour_ramp, xlim, ylim) 30 | 31 | # plots the line where there will be nothing 32 | plot_seq_error_line(cov_tab_daAchMill1, .col = 'red') 33 | 34 | head(cov_tab_daAchMill1[order(cov_tab_daAchMill1[,'total_pair_cov']), ], 12) 35 | colour_ramp 36 | 3 / 8:13 37 | 38 | #### 39 | 40 | cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T) 41 | head(cov_tab_Fiin_ideal) 42 | 43 | xlim = c(0, 0.5) 44 | ylim =
c(0, max(cov_tab_Fiin_ideal[, 'total_pair_cov'])) 45 | 46 | plot_dot_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim) 47 | 48 | plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim) 49 | -------------------------------------------------------------------------------- /playground/alternative_fitting/alternative_plotting_functions.R: -------------------------------------------------------------------------------- 1 | plot_alt <- function(cov_tab, ylim, colour_ramp, logscale = F){ 2 | A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB'] 3 | cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2 4 | if (logscale){ 5 | cov_tab[, 'freq'] <- log10(cov_tab[, 'freq']) 6 | } 7 | cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))] 8 | 9 | plot(NULL, xlim = c(0, 0.5), ylim = ylim, 10 | xlab = 'Normalized minor kmer coverage: B / (A + B)', 11 | ylab = 'Total coverage of the kmer pair: A + B', cex.lab = 1.4) 12 | min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov'])) 13 | nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab) 14 | return(0) 15 | } 16 | 17 | plot_one_coverage <- function(cov, cov_tab){ 18 | cov_row_to_plot <- cov_tab[cov_tab[, 'total_pair_cov'] == cov, ] 19 | width <- 1 / (2 * cov) 20 | cov_row_to_plot$left <- cov_row_to_plot[, 'minor_variant_rel_cov'] - width 21 | cov_row_to_plot$right <- sapply(cov_row_to_plot[, 'minor_variant_rel_cov'], function(x){ min(0.5, x + width)}) 22 | apply(cov_row_to_plot, 1, plot_one_box, cov) 23 | } 24 | 25 | plot_one_box <- function(one_box_row, cov){ 26 | left <- as.numeric(one_box_row['left']) 27 | right <- as.numeric(one_box_row['right']) 28 | rect(left, cov - 0.5, right, cov + 0.5, col = one_box_row['col'], border = NA) 29 | } 30 | 31 | plot_dot_smudgeplot <- function(cov_tab, colour_ramp, xlim, ylim, background_col = 'grey', cex = 0.4){ 32 | # this is the adjustment for plotting 33 | cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] <- 
cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] * 2 34 | cov_tab$col = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))] 35 | 36 | plot(NULL, xlim = xlim, ylim = ylim, xlab = 'Normalized minor kmer coverage: B / (A + B)', 37 | ylab = 'Total coverage of the kmer pair: A + B') 38 | rect(xlim[1], ylim[1], xlim[2], ylim[2], col = background_col, border = NA) 39 | points(cov_tab[, 'minor_variant_rel_cov'], cov_tab[, 'total_pair_cov'], col = cov_tab$col, pch = 20, cex = cex) 40 | } 41 | 42 | plot_peakmap <- function(cov_tab, xlim, ylim, background_col = 'grey', cex = 2){ 43 | # this is the adjustment for plotting 44 | plot(NULL, xlim = xlim, ylim = ylim, xlab = 'Normalized minor kmer coverage: B / (A + B)', 45 | ylab = 'Total coverage of the kmer pair: A + B') 46 | points(cov_tab[, 'minor_variant_rel_cov'], cov_tab[, 'total_pair_cov'], col = cov_tab$smudge, pch = 20, cex = cex) 47 | legend('bottomleft', col = 1:8, pch = 20, title = 'smudge', legend = 1:8) 48 | } 49 | 50 | plot_seq_error_line <- function (.cov_tab, .L = NA, .col = "black") { 51 | if (is.na(.L)) { 52 | .L <- min(.cov_tab[, "covB"]) 53 | } 54 | max_cov_pair <- max(.cov_tab[, "total_pair_cov"]) 55 | cov_range <- seq((2 * .L) - 2, max_cov_pair, length = 500) 56 | lines((.L - 1)/cov_range, cov_range, lwd = 2.5, lty = 2, 57 | col = .col) 58 | } 59 | 60 | plot_isoA_line <- function (.covA, .L, .col = "black", .ymax = 250, .lwd, .lty) { 61 | min_covB <- .L # min(.cov_tab[, 'covB']) # should be L really 62 | max_covB <- .covA 63 | B_covs <- seq(min_covB, max_covB, length = 500) 64 | isoline_x <- B_covs/ (B_covs + .covA) 65 | isoline_y <- B_covs + .covA 66 | lines(isoline_x[isoline_y < .ymax], isoline_y[isoline_y < .ymax], lwd = .lwd, lty = .lty, col = .col) 67 | } 68 | 69 | plot_isoB_line <- function (.covB, .ymax, .col = "black", .lwd, .lty) { 70 | cov_range <- seq((2 * .covB) - 2, .ymax, length = 500) 71 | lines((.covB)/cov_range, cov_range, lwd = .lwd, lty = .lty, col = .col) 72 | } 73 | 74 | 
plot_iso_grid <- function(.cov, .L, .ymax, .col = 'black', .lwd = 2, .lty = 2){ 75 | for (i in 0:15){ 76 | cov <- (i + 0.5) * .cov 77 | plot_isoA_line(cov, .L = .L, .ymax = .ymax, .col, .lwd = .lwd, .lty = .lty) 78 | if (i < 8){ 79 | plot_isoB_line(cov, .ymax, .col, .lwd = .lwd, .lty = .lty) 80 | } 81 | } 82 | } 83 | 84 | plot_smudge_labels <- function(cov_est, ymax, xmax = 0.49, .cex = 1.3, .L = 4){ 85 | for (As in 1:(floor(ymax / cov_est) - 1)){ 86 | label <- paste0(As, "Aerr") 87 | text(.L / (As * cov_est), (As * cov_est) + .L, label, 88 | offset = 0, cex = .cex, xpd = T, pos = ifelse(As == 1, 3, 4)) 89 | } 90 | for (ploidy in 2:floor(ymax / cov_est)){ 91 | for (Bs in 1:floor(ploidy / 2)){ 92 | As = ploidy - Bs 93 | label <- paste0(As, "A", Bs, "B") 94 | text(ifelse(As == Bs, (xmax + 0.49)/2, Bs / ploidy), ploidy * cov_est, label, 95 | offset = 0, cex = .cex, xpd = T, 96 | pos = ifelse(As == Bs, 2, 1)) 97 | } 98 | } 99 | } 100 | 101 | create_smudge_container <- function(cov, cov_tab, smudge_filtering_threshold){ 102 | smudge_container <- list() 103 | total_genomic_kmers <- sum(cov_tab[ , 'freq']) 104 | 105 | for (Bs in 1:8){ 106 | cov_tab_isoB <- cov_tab[cov_tab[ , 'covB'] > cov * ifelse(Bs == 1, 0, Bs - 0.5) & cov_tab[ , 'covB'] < cov * (Bs + 0.5), ] 107 | # cov_tab_isoB[, 'Bs'] <- Bs 108 | cov_tab_isoB[, 'As'] <- round(cov_tab_isoB[, 'covA'] / cov) # these are the individual smudge cutouts given the coverage 109 | cov_tab_isoB[cov_tab_isoB[, 'As'] == 0, 'As'] = 1 110 | for( As in Bs:(16 - Bs)){ 111 | cov_tab_one_smudge <- cov_tab_isoB[cov_tab_isoB[, 'As'] == As, ] 112 | if (sum(cov_tab_one_smudge[, 'freq']) / total_genomic_kmers > smudge_filtering_threshold){ 113 | label <- paste0(As, "A", Bs, "B") 114 | smudge_container[[label]] <- cov_tab_one_smudge[,-which(names(cov_tab_one_smudge) %in% c('is_error', 'As'))] 115 | } 116 | } 117 | } 118 | return(smudge_container) 119 | } --------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plotting_testing.R: -------------------------------------------------------------------------------- 1 | library(smudgeplot) 2 | library(argparse) 3 | source('playground/alternative_fitting/alternative_plotting_functions.R') 4 | 5 | parser <- ArgumentParser() 6 | parser$add_argument("-i", "-infile", help="Input file") 7 | parser$add_argument("-o", "-outfile", help="Output file") 8 | args <- parser$parse_args() 9 | 10 | args$col_ramp <- 'viridis' 11 | args$invert_cols <- F 12 | 13 | # cov_tab_daAchMill1 <- read.table('data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt', col.names = c('covB', 'covA', 'freq')) 14 | # cov_tab <- read.table(args$file, col.names = c('covB', 'covA', 'freq')) 15 | # args <- list() 16 | # args$i <- 'data/ddSalArbu1/ddSalArbu1.k31_ploidy_converted.smu_with_peaks.txt' 17 | # args$o <- 'data/ddSalArbu1/smudge_with_peaks' 18 | 19 | cov_tab <- read.table(args$i, col.names = c('covB', 'covA', 'freq','peak')) 20 | 21 | xlim = c(0, 0.5) 22 | ylim = c(0, 300) 23 | 24 | 25 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 1] + cov_tab[, 2] 26 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 1] / cov_tab[, 'total_pair_cov'] 27 | 28 | colour_ramp <- viridis(32) 29 | colour_ramp_log <- get_col_ramp(args, 16) 30 | # cols = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))] 31 | 32 | # solving this "density" problem; cov1 cov1; have twice lower probability than cov1 cov2; we need to multiply these points, but it needs to be somehow corrected for the fit / summaries 33 | 34 | plot_dot_smudgeplot(cov_tab, colour_ramp, xlim, ylim) 35 | 36 | pdf(paste0(args$o, '_background.pdf')) 37 | plot_alt(cov_tab, ylim, colour_ramp, F) 38 | dev.off() 39 | 40 | pdf(paste0(args$o, '_log_background.pdf')) 41 | plot_alt(cov_tab, ylim, colour_ramp, T) 42 | dev.off() 43 | 44 | pdf(paste0(args$o, '_peaks.pdf')) 45 | plot_peakmap(cov_tab, xlim = xlim, ylim = ylim) 46 | dev.off() 47 | 48 | # plots the line where there will be 
nothing 49 | # plot_seq_error_line(cov_tab, .col = 'red') 50 | 51 | # head(cov_tab[order(cov_tab[,'total_pair_cov']), ], 12) 52 | # colour_ramp 53 | # 3 / 8:13 54 | 55 | #### 56 | 57 | # cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T) 58 | # head(cov_tab_Fiin_ideal) 59 | 60 | # xlim = c(0, 0.5) 61 | # ylim = c(0, max(cov_tab_Fiin_ideal[, 'total_pair_cov'])) 62 | 63 | # pdf('data/Fiin/idealised/straw_plot_test3.pdf') 64 | # plot_dot_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim) 65 | # dev.off() 66 | 67 | # pdf('data/Fiin/idealised/straw_plot_test2.pdf') 68 | # plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim) 69 | # dev.off() 70 | 71 | # # testing the packaged version 72 | 73 | # library(smudgeplot) 74 | # args <- list() 75 | # args$col_ramp <- 'viridis' 76 | # args$invert_cols <- F 77 | # colour_ramp <- get_col_ramp(args) 78 | 79 | # cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T) 80 | 81 | # pdf('data/Fiin/idealised/straw_plot_test.pdf') 82 | # plot_alt(cov_tab_Fiin_ideal, c(50, 700), colour_ramp) 83 | # dev.off() 84 | 85 | # source('playground/alternative_fitting/alternative_plotting_functions.R') 86 | 87 | # pdf('data/Fiin/idealised/straw_plot_test2.pdf') 88 | # plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, c(0, 0.5), c(50, 700)) 89 | # dev.off() 90 | 91 | -------------------------------------------------------------------------------- /playground/alternative_fitting/pair_clustering.py: -------------------------------------------------------------------------------- 1 | 2 | # cov2freq = defaultdict(covA, covB) -> freq 3 | # cov2peak = dict(covA, covB) -> peak 4 | # dict(peak) -> summit (if relevant) 5 | # import numpy as np 6 | 7 | import argparse 8 | from pandas import read_csv # type: ignore 9 | from collections import defaultdict 10 | from statistics import mean 11 | # import
matplotlib.pyplot as plt 12 | 13 | #### 14 | 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument('infile', nargs='?', help='name of the input tsv file with coverages and frequencies.') 17 | parser.add_argument('-nf', '-noise_filter', help='Do not aggregate k-mer pairs with frequency lower than this parameter into smudges', type=int, default=50) 18 | parser.add_argument('-d', '-distance', help='Manhattan distance at which k-mer pairs are considered neighboring for the local aggregation purposes.', type=int, default=5) 19 | parser.add_argument('--mask_errors', help='instead of reporting assignments to individual smudges, just remove all monotonically decreasing points from the error line', action="store_true", default = False) 20 | args = parser.parse_args() 21 | 22 | ### what should be arguments at some point 23 | # smu_file = 'data/ddSalArbu1/ddSalArbu1.k31_ploidy_converted.smu.txt' 24 | # distance = 5 25 | # noise_filter = 100 26 | 27 | smu_file = args.infile 28 | distance = args.d 29 | noise_filter = args.nf 30 | 31 | ### load data 32 | # cov_tab = np.loadtxt(smu_file, dtype=int) 33 | cov_tab = read_csv(smu_file, names = ['covB', 'covA', 'freq'], sep='\t') 34 | cov_tab = cov_tab.sort_values('freq', ascending = False) 35 | L = min(cov_tab['covB']) # important only when --mask_errors is on 36 | 37 | # generate a dictionary that gives us a frequency for each combination of coverages 38 | cov2freq = defaultdict(int) 39 | cov2peak = defaultdict(int) 40 | # for idx, covB, covA, freq in cov_tab.itertuples(): 41 | # cov2freq[(covA, covB)] = freq 42 | # I can create this one when I iterate through the data though 43 | 44 | # plt.hist(means) 45 | # plt.hist([x for x in means if x < 100 and x > -100]) 46 | # plt.show() 47 | 48 | ### clustering 49 | next_peak = 1 50 | for idx, covB, covA, freq in cov_tab.itertuples(): 51 | cov2freq[(covA, covB)] = freq # with this I can get rid of lines 23 24 that pre-make this table 52 | if freq < noise_filter: 53 | break 54 |
highest_neighbour_coords = (0, 0) 55 | highest_neighbour_freq = 0 56 | # for each k-mer pair retrieve all neighbours (within the Manhattan distance) 57 | for xA in range(covA - distance,covA + distance + 1): 58 | # for each explored A coverage in the neighbourhood, explore all possible B coordinates 59 | distanceB = distance - abs(covA - xA) 60 | for xB in range(covB - distanceB,covB + distanceB + 1): 61 | xB, xA = sorted([xA, xB]) # this is to make sure xB is smaller than xA 62 | # iterating only through those that were assigned already 63 | # and recording only the one with the highest frequency 64 | if cov2peak[(xA, xB)] and cov2freq[(xA, xB)] > highest_neighbour_freq: 65 | highest_neighbour_coords = (xA, xB) 66 | highest_neighbour_freq = cov2freq[(xA, xB)] 67 | if highest_neighbour_freq > 0: 68 | cov2peak[(covA, covB)] = cov2peak[highest_neighbour_coords] 69 | else: 70 | # print("new peak:", (covA, covB)) 71 | if args.mask_errors: 72 | if covB < L + args.d: 73 | cov2peak[(covA, covB)] = 1 # error line 74 | else: 75 | cov2peak[(covA, covB)] = 0 # central smudges 76 | else: 77 | cov2peak[(covA, covB)] = next_peak # keeps info about all locally aggregated smudges 78 | next_peak += 1 79 | 80 | cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True) 81 | for idx, covB, covA, freq in cov_tab.itertuples(): 82 | print(covB, covA, freq, cov2peak[(covA, covB)]) 83 | # if idx > 20: 84 | # break 85 | -------------------------------------------------------------------------------- /playground/interactive_plot_strawberry_full_kmer_families_fooling_around.R: -------------------------------------------------------------------------------- 1 | library("methods") 2 | library("argparse") 3 | library("smudgeplot") 4 | # library("hexbin") 5 | 6 | # preprocessing 7 | # to simply get the number of members per family (exploration) 8 | # cat data/strawberry_iinumae/kmer_counts_L109_U.tsv | cut -f 1 > data/strawberry_iinumae/kmer_counts_L109_U_family_members.tsv 9 | # awk '{row_sum = 0; row_max =
0; row_min = 10000; for (i=2; i <= NF; i++){ row_sum += $i; if ($i > row_max){row_max = $i} if ($i < row_min){row_min = $i} } print row_sum "\t" row_min "\t" row_max }' data/strawberry_iinumae/kmer_counts_L109_U.tsv > data/strawberry_iinumae/kmer_counts_L109_U_sums_min_max.tsv 10 | # (exploration) 11 | # 12 | # 13 | 14 | args <- ArgumentParser()$parse_args() 15 | args$homozygous <- F 16 | args$input <- 'data/Fiin/kmerpairs_k51_text.smu' 17 | args$output = './data/Fiin/testrun' 18 | args$title = 'F. iinumae' 19 | args$nbins <- 40 20 | args$L <- NULL 21 | args$n_cov <- NULL 22 | args$k <- 21 23 | 24 | -------------------------------------------------------------------------------- /playground/more_away_pairs.py: -------------------------------------------------------------------------------- 1 | def get_2away_pairs(local_index_to_kmer, k): 2 | """local_index_to_kmer is a dictionary where the value is a kmer portion, and the key is the index of the original kmer in which the kmer portion is found. get_2away_pairs returns a list of pairs where each pair of indices corresponds to a pair of kmers different in exactly two bases.""" 3 | 4 | #These are the base cases for the recursion. If k==1, the kmers obviously can't differ in exactly two bases, so return an empty list. if k==2, return every pair of indices where the kmers at those indices differ at exactly two bases. 
5 | if k == 1: 6 | return [] 7 | if k == 2: 8 | return [(i, j) for (i,j) in combinations(local_index_to_kmer, 2) if local_index_to_kmer[i][0] != local_index_to_kmer[j][0] and local_index_to_kmer[i][1] != local_index_to_kmer[j][1]] 9 | 10 | #Get the two halves of the kmer 11 | k_L = k//2 12 | k_R = k-k_L 13 | 14 | #initialize dictionaries in which the key is the hash of half of the kmer, and the value is a list of indices of the kmers with that same hash 15 | kmer_L_hashes = defaultdict(list) 16 | kmer_R_hashes = defaultdict(list) 17 | 18 | #initialize pairs, which will be returned by get_2away_pairs 19 | pairs = [] 20 | 21 | #initialize dictionaries containing the left halves and the right halves (since we will have to check cases where the left half differs by 1 and the right half differs by 1) 22 | local_index_to_kmer_L = {} 23 | local_index_to_kmer_R = {} 24 | 25 | #for each kmer, calculate its left hash and right hash, then add its index to the corresponding entries of the dictionary 26 | for i, kmer in local_index_to_kmer.items(): 27 | kmer_L = kmer[:k_L] 28 | kmer_R = kmer[k_L:] 29 | local_index_to_kmer_L[i] = kmer_L 30 | local_index_to_kmer_R[i] = kmer_R 31 | kmer_L_hashes[kmer_to_int(kmer_L)] += [i] 32 | kmer_R_hashes[kmer_to_int(kmer_R)] += [i] 33 | 34 | #for each left hash shared by multiple kmers, find the list of pairs whose right half differs by 2 (i.e. if the left half matches, recurse on the right half). 35 | for kmer_L_hash_indices in kmer_L_hashes.values(): #same in first half 36 | if len(kmer_L_hash_indices) > 1: 37 | pairs += get_2away_pairs({kmer_L_hash_index:local_index_to_kmer[kmer_L_hash_index][k_L:] for kmer_L_hash_index in kmer_L_hash_indices}, k_R) #differ by 2 in right half 38 | 39 | #for each right hash shared by multiple kmers, find the list of pairs whose left half differs by 2 (i.e. if the right half matches, recurse on the left half).
40 | for kmer_R_hash_indices in kmer_R_hashes.values(): #same in second half 41 | if len(kmer_R_hash_indices) > 1: 42 | pairs += get_2away_pairs({kmer_R_hash_index:local_index_to_kmer[kmer_R_hash_index][:k_L] for kmer_R_hash_index in kmer_R_hash_indices}, k_L) #differ by 2 in left half 43 | 44 | #Find matching pairs where the left half is one away, and the right half is one away 45 | possible_pairs_L = set(get_1away_pairs(local_index_to_kmer_L,k_L)) 46 | possible_pairs_R = set(get_1away_pairs(local_index_to_kmer_R,k_R)) 47 | pairs += list(possible_pairs_L.intersection(possible_pairs_R)) 48 | return(pairs) 49 | 50 | 51 | ###This code has not been cleaned... needs to be edited!!! 52 | def get_3away_pairs(kmers): 53 | """kmers is a list of kmers. get_3away_pairs returns a list of pairs where each pair of kmers is different in exactly three bases.""" 54 | k = len(kmers[0]) 55 | if k == 1 or k==2: 56 | return [] 57 | if k == 3: 58 | return [pair for pair in combinations(kmers, 2) if pair[0][0] != pair[1][0] and pair[0][1] != pair[1][1] and pair[0][2] != pair[1][2]] 59 | k_L = k//2 60 | k_R = k-k_L 61 | kmer_L_hashes = defaultdict(list) 62 | kmer_R_hashes = defaultdict(list) 63 | pairs = [] 64 | kmers_L = [] 65 | kmers_R = [] 66 | for i, kmer in enumerate(kmers): 67 | kmer_L = kmer[:k_L] 68 | kmer_R = kmer[k_L:] 69 | #print(kmer_L) 70 | #print(kmer_R) 71 | kmers_L.append(kmer_L) 72 | kmers_R.append(kmer_R) 73 | kmer_L_hashes[kmer_to_int(kmer_L)] += [i] 74 | kmer_R_hashes[kmer_to_int(kmer_R)] += [i] 75 | for kmer_L_hash in kmer_L_hashes.values(): #same in first half 76 | if len(kmer_L_hash) > 1: 77 | kmer_L = kmers[kmer_L_hash[0]][:k_L] #first half 78 | pairs += [tuple(kmer_L + kmer for kmer in pair) for pair in get_3away_pairs([kmers[i][k_L:] for i in kmer_L_hash])] #differ by 3 in second half 79 | for kmer_R_hash in kmer_R_hashes.values(): #same in second half 80 | if len(kmer_R_hash) > 1: 81 | kmer_R = kmers[kmer_R_hash[0]][k_L:] #second half 82 | #print(kmer_R) 83 | 
pairs += [tuple(kmer + kmer_R for kmer in pair) for pair in get_3away_pairs([kmers[i][:k_L] for i in kmer_R_hash])] #differ by 3 in first half 84 | possible_pairs = [] 85 | possible_pairs_L = get_1away_pairs(kmers_L) 86 | possible_pairs_R = get_2away_pairs(kmers_R) 87 | #print(kmers_L) 88 | #print(kmers_R) 89 | #print(possible_pairs_L) 90 | #print(possible_pairs_R) 91 | for possible_pair_L in possible_pairs_L: 92 | for possible_pair_R in possible_pairs_R: 93 | possible_kmer1 = possible_pair_L[0]+possible_pair_R[0] 94 | possible_kmer2 = possible_pair_L[1]+possible_pair_R[1] 95 | if possible_kmer1 in kmers and possible_kmer2 in kmers: 96 | pairs += [(possible_kmer1, possible_kmer2)] 97 | possible_pairs = [] 98 | possible_pairs_L = get_2away_pairs(kmers_L) 99 | possible_pairs_R = get_1away_pairs(kmers_R) 100 | for possible_pair_L in possible_pairs_L: 101 | for possible_pair_R in possible_pairs_R: 102 | possible_kmer1 = possible_pair_L[0]+possible_pair_R[0] 103 | possible_kmer2 = possible_pair_L[1]+possible_pair_R[1] 104 | if possible_kmer1 in kmers and possible_kmer2 in kmers: 105 | pairs += [(possible_kmer1, possible_kmer2)] 106 | return(pairs) -------------------------------------------------------------------------------- /playground/playground.R: -------------------------------------------------------------------------------- 1 | files <- c('data/Avag1/coverages_2.tsv', 2 | 'data/Lcla1/Lcla1_pairs_coverages_2.tsv', 3 | 'data/Mflo2/coverages_2.tsv', 4 | 'data/Rvar1/Rvar1_pairs_coverages_2.tsv', 5 | 'data/Ps791/Ps791_pairs_coverages_2.tsv', 6 | 'data/Aric1/Aric1_pairs_coverages_2.tsv', 7 | "data/Rmag1/Rmag1_pairs_coverages_2.tsv") 8 | 9 | ### 10 | library(smudgeplot) 11 | args <- list() 12 | args$input <- 'data/Mflo2/Mflo2_coverages_2.tsv' 13 | args$output <- "figures/Mflo2_v0.1.0" 14 | args$nbins <- 40 15 | args$kmer_size <- 21 16 | args$homozygous <- F 17 | 18 | # args <- list() 19 | # args$input <- 'data/rice/SRR1919013_k21_l35_u500_coverages.tsv' 20 | # 
args$output <- "data/rice/smudge" 21 | # args$nbins <- 40 22 | # args$kmer_size <- 21 23 | # args$homozygous <- F 24 | 25 | ### 26 | 27 | i <- 7 28 | n <- NA 29 | cov <- read.table(args$input) 30 | 31 | # run bits of smudgeplot_plot.R to get k, and peak summary 32 | 33 | filter <- total_pair_cov < 350 34 | total_pair_cov_filt <- total_pair_cov[filter] 35 | minor_variant_rel_cov_filt <- minor_variant_rel_cov[filter] 36 | 37 | ymax <- max(total_pair_cov_filt) 38 | ymin <- min(total_pair_cov_filt) 39 | 40 | # the lims trick will make sure that the last column of squares will have the same width as the other squares 41 | smudge_container <- get_smudge_container(minor_variant_rel_cov, total_pair_cov, .nbins = 40) 42 | 43 | x <- seq(xlim[1], ((nbins - 1) / nbins) * xlim[2], length = nbins) 44 | y <- c(seq(ylim[1] - 0.1, ((nbins - 1) / nbins) * ylim[2], length = nbins), ylim[2]) 45 | 46 | .peak_points <- peak_points 47 | .smudge_container <- smudge_container 48 | .total_pair_cov <- total_pair_cov 49 | .treshold <- 0.05 50 | fig_title <- 'test' 51 | 52 | image(smudge_container, col = colour_ramp) 53 | # contour(x.bin, y.bin, freq2D, add=TRUE, col=rgb(1,1,1,.7)) 54 | 55 | ####### 56 | # PLOT 57 | ####### 58 | 59 | library(plotly) 60 | packageVersion('plotly') 61 | 62 | p <- plot_ly(x = k_toplot$x, y = k_toplot$y, z = k_toplot$z) %>% add_surface() 63 | htmlwidgets::saveWidget(p, "Ps791_smudge_surface.html") 64 | # Create a shareable link to your chart 65 | # Set up API credentials: https://plot.ly/r/getting-started 66 | chart_link = api_create(p, filename="Ps791_smudge_surface-2") 67 | chart_link 68 | 69 | layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3)) 70 | # 1 smudge plot 71 | plot_smudgeplot(k_toplot, n, colour_ramp) 72 | plot_expected_haplotype_structure(n, peak_sizes, T) 73 | # annotate_peaks(peak_points, ymin, ymax) 74 | # annotate_summits(peak_points, peak_sizes, ymin, ymax, 'black') 75 | # TODO fix plot_seq_error_line(total_pair_cov) 76 | # 2,3 hist 77 | # 
TODO rescale histogram axis by the scale of the smudgeplot 78 | plot_histograms(minor_variant_rel_cov, total_pair_cov) 79 | # 4 legend 80 | plot_legend(k_toplot, total_pair_cov, colour_ramp) 81 | 82 | # findInterval(c(0.1, 0.2, 0.33, 0.5), seq(0, 0.5, length = 41)) 83 | 84 | ########################################################## 85 | ## TEST 86 | ## idea here was to propagate from the highest point and expand the peak till it's monotonic 87 | # starting_point <- which(dens_m == max(dens_m), arr.ind = TRUE) 88 | # starting_val <- dens_m[starting_point] 89 | # peak_points <- data.frame(x = starting_point[,2], y = starting_point[,1], value = starting_val) 90 | # 91 | # points_to_explore <- get_neibours(starting_val) 92 | # val_to_comp <- starting_val 93 | # 94 | # for(one_point in 1:nrow(points_to_explore)){ 95 | # one_point <- points_to_explore[one_point,] 96 | # point_val <- dens_m[t(one_point)] 97 | # if(point_val < val_to_comp){ 98 | # peak_points <- rbind(peak_points, 99 | # data.frame(x = one_point[2], y = one_point[1], value = point_val)) 100 | # } 101 | # } 102 | # 103 | # get_neibours <- function(point){ 104 | # neibours_vec <- matrix(rep(starting_point,8) + c(-1,-1,0,-1,1,-1,-1,0,+1,0,-1,1,0,1,1,1), 105 | # ncol = 2, byrow = T) 106 | # neibours_vec[rowSums(neibours_vec <= 30 & neibours_vec >= 1) == 2,] 107 | # } 108 | # 109 | 110 | ########################## 111 | ### ALTERNATIVE PLOTTING ### 112 | ########################## 113 | # library('spatialfil') 114 | # msnFit(high_cov_filt, minor_variant_rel_cov) 115 | 116 | ## alternative plotting 117 | # library(hexbin) # honeycomb plot 118 | # h <- hexbin(df) 119 | # # h@count <- sqrt(h@count) 120 | # plot(h, colramp=rf) 121 | # gplot.hexbin(h, colorcut=10, colramp=rf) 122 | 123 | 124 | ### TEST plot lines at expected coverages 125 | # 126 | # for(i in 2:6){ 127 | # lines(c(0, 0.6), rep(i * n, 2), lwd = 1.4) 128 | # text(0.1, i * n, paste0(i,'x'), pos = 3) 129 | # } 130 | 131 | # FUTURE - wrapper 132 | #
smudgeplot <- function(.k, .minor_variant_rel_cov, .total_pair_cov, .n, 133 | # .sqrt_scale = T, .cex = 1.4, .fig_title = NA){ 134 | # if( .sqrt_scale == T ){ 135 | # # to display densities on a square root scale (a bit like a log scale but less aggressive) 136 | # .k$z <- sqrt(.k$z) 137 | # } 138 | # 139 | # pal <- brewer.pal(11,'Spectral') 140 | # rf <- colorRampPalette(rev(pal[3:11])) 141 | # colour_ramp <- rf(32) 142 | # 143 | # layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3)) 144 | # 145 | # # 2D HISTOGRAM 146 | # plot_smudgeplot(...) 147 | # 148 | # # 1D histogram - minor_variant_rel_cov on top 149 | # plot_histogram(...) 150 | # 151 | # # 1D histogram - total pair coverage - right 152 | # plot_histogram(...) 153 | # 154 | # # LEGEND (top-right corner) 155 | # plot_legend(...) 156 | # 157 | # } 158 | -------------------------------------------------------------------------------- /playground/playground.py: -------------------------------------------------------------------------------- 1 | #----- 2 | # What I tried to make plots work 3 | # https://matplotlib.org/faq/howto_faq.html#generate-images-without-having-a-window-appear 4 | import matplotlib 5 | matplotlib.use('Agg') 6 | import matplotlib.pyplot as plt 7 | #------- 8 | 9 | #Load the particular dumps file you wish 10 | #These were created using jellyfish dump -c -L lower -U upper SRR_k21.jf > SRR_k21.dumps 11 | dumps_file = 'ERR2135445.dumps' #aric1 -L 20 -U 350 12 | dumps_file = 'SRR801084_k21.dumps' #avag1 -L 30 -U 300 13 | dumps_file = 'SRR4242457_k21.dumps' #mare2 -L 13 -U 132 14 | dumps_file = 'SRR4242472_k21.dumps' #ment1 -L 50 -U 350 15 | dumps_file = 'SRR4242474_k21.dumps' #mflo2 -L 60 -U 400 16 | dumps_file = 'SRR4242467_k21.dumps' #minc3 -L 25 -U 210 17 | dumps_file = 'SRR4242462_k21.dumps' #mjav2 -L 80 -U 600 18 | dumps_file = 'ERR2135453_k21.dumps' #rmac1 -L 100 -U 700 19 | dumps_file = 'ERR2135451_k21.dumps' #rmag1 -L 60 -U 500 20 | 21 | # dumps_file = 'kmers_dump_L120_U1500.tsv' 22 |
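The `dumps_file` paths above refer to the two-column text output of `jellyfish dump -c`, one `KMER COUNT` pair per line (as the comment notes). A minimal sketch of loading such a file into a dict of counts; `load_dumps` is an illustrative helper, not part of this script:

```python
# Illustrative loader (not in the original script): parse a
# `jellyfish dump -c` text file with one "KMER COUNT" pair per line.
def load_dumps(path):
    kmer_counts = {}
    with open(path) as fh:
        for line in fh:
            kmer, count = line.split()
            kmer_counts[kmer] = int(count)
    return kmer_counts
```

Counts loaded this way are what the pairing code later sums into the `coverages_*` lists plotted below.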
23 | 24 | 25 | 26 | 27 | plt.hist(coverages_2, bins = 1000) 28 | plt.savefig('coverages_2_hist.png') 29 | plt.close() 30 | 31 | # then plot a histogram of the coverages 32 | 33 | plt.hist(coverages_3, bins = 1000) 34 | plt.savefig('coverages_3_hist.png') 35 | plt.close() 36 | 37 | #n, bins, patches = plt.hist(coverages_3, bins = 1000) 38 | #bins[np.argmax(n)] 39 | 40 | #save families_4 to a pickle file, then plot a histogram of the coverages 41 | 42 | plt.hist(coverages_4, bins = 1000) 43 | plt.savefig('coverages_4_hist.png') 44 | plt.close() 45 | 46 | 47 | plt.hist(coverages_5, bins = 1000) 48 | plt.savefig('coverages_5_hist.png') 49 | plt.close() 50 | 51 | #save families_6 to a pickle file, then plot a histogram of the coverages 52 | plt.hist(coverages_6, bins = 1000) 53 | plt.savefig('coverages_6_hist.png') 54 | plt.close() 55 | 56 | ###some code to load previously saved pickle files 57 | # test_kmers = pickle.load(open('test_kmers.p', 'rb')) 58 | # test_coverages = pickle.load(open('test_coverages.p', 'rb')) 59 | G = pickle.load(open('G.p', 'rb')) 60 | component_lengths = pickle.load(open('component_lengths.p', 'rb')) 61 | families_2 = pickle.load(open('families_2.p', 'rb')) 62 | coverages_2 = pickle.load(open('coverages_2.p', 'rb')) 63 | # one_away_pair = pickle.load(open('one_away_pairs.p', 'rb')) 64 | 65 | # perhaps faster way how to calculate coverages_2 66 | # coverages_2 = [test_coverages[cov_i1] + test_coverages[cov_i2] for cov_i1, cov_i2 in families_2] 67 | 68 | 69 | #----- 70 | # for coverage in coverages_2: 71 | # 72 | 73 | ###Everything below this is just scratch work 74 | #f = open('ERR2135445_l20_u100.fa', 'r') 75 | #g = open('new.fa', 'w') 76 | #for line in f: 77 | # if line.startswith('>'): 78 | # g.write('>' + str(int(line[1:-1])+10000) + '\n') 79 | # else: 80 | # g.write(line) 81 | #f.close() 82 | #g.close() 83 | 84 | #get_3away_pairs(['AAAAAAAA', 'AACTAAGA', 'AACAATGA', 'AAAAATCG']) 85 | 86 | 87 | #get_1away_pairs(['AAA', 'AAC']) 88 | 89 | 
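The commented-out "faster way" earlier in this file builds `coverages_2` with a single comprehension: each 2-member family is a pair of k-mer indices, and its pair coverage is the sum of the two members' counts. A tiny self-contained illustration (the `test_coverages` and `families_2` values below are made up, standing in for the pickled objects):

```python
# Made-up stand-ins for the pickled objects `test_coverages` / `families_2`.
test_coverages = {0: 115, 1: 67, 2: 156, 3: 166}  # k-mer index -> count
families_2 = [(0, 1), (2, 3)]                     # 2-member families as index pairs

# pair coverage = sum of the two member counts
coverages_2 = [test_coverages[i] + test_coverages[j] for i, j in families_2]
print(coverages_2)  # [182, 322]
```

Summing plain dict lookups per pair avoids the repeated `df2.iloc[...]` row lookups used in the scratch variant, which is why the comment calls it faster.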
#kmers = [''.join([random.choice('ACGT') for _ in range(20)]) for _ in range(10)] 90 | 91 | #df2 = df[:1000000] 92 | 93 | #for pair in pairs: 94 | # #f.write(str(df2[df2[0] == pair[0]].iloc[0,1])+'\n') 95 | # #f.write(str(df2[df2[0] == pair[1]].iloc[0,1])+'\n') 96 | # [x[1] for x in pairs if x[0] == pair[0]]+[x[0] for x in pairs if x[1] == pair[0]] 97 | # a = df2[df2[0] == pair[0]].iloc[0,1]/89.2 98 | # b = df2[df2[0] == pair[1]].iloc[0,1]/89.2 99 | # f.write(str((a, b, a+b))+'\n') 100 | 101 | #Counter([min([Counter(pair[0])[x] for x in ['A', 'C', 'G', 'T']]) for pair in pairs]) 102 | 103 | #high_complexity_pairs = [pair for pair in pairs if min([Counter(pair[0])[x] for x in ['A', 'C', 'G', 'T']])==5] 104 | 105 | #for hcpair in high_complexity_pairs: 106 | # f.write(str(df2[df2[0] == hcpair[0]])) 107 | # f.write(str(df2[df2[0] == hcpair[1]])) 108 | 109 | #570620 TAAAATAATTTTTTTCTTAAA 115 110 | #878881 TAAAATAATTTTTTTCTAAAA 67 111 | #182 112 | 113 | #526664 AATTACCATTCAACCAGTTTC 156 114 | #922303 AATTACCATTCAACCAGATTC 166 115 | #322 116 | 117 | #394517 AAGAGAAAAGAAAAAAGTAAT 180 118 | #788086 AAGAGAAAAGAAAAAAGAAAT 180 119 | #360 120 | 121 | #420665 AAAAAAAAGTGTTTTACTTTG 119 122 | #946878 AAAAAAAAGTGTTTTACTCTG 95 123 | #214 124 | 125 | #594426 ACAAAATATTACCTTTATCTA 117 126 | #768315 ACAAAATATTACCTTTATTTA 152 127 | #269 128 | 129 | #536269 ACAGATTGGCTTGTTTGAGCC 103 130 | #711261 ACAGATTGGCTTGTTTGAACC 99 131 | 132 | #383862 ATTTCATTTGTTAGAAAAAAA 139 133 | #907248 ATTTCATTTGTTAGAAAAGAA 162 134 | 135 | #438051 TCAACAGAAAATAATGGAGCA 152 136 | #962365 TCAACAGAAAATAATGGAACA 143 137 | 138 | #425231 AAAAAAAAACGAAAAAATTTT 15 139 | #734086 AAAAAAAAACGAAAAAAATTT 18 140 | 141 | #607197 AAAAAAAAACACGACATGTTT 154 142 | #783001 AAAAAAAAACACGACATGCTT 134 143 | 144 | 145 | #test_kmers = {i:kmer for (i, kmer) in enumerate(kmers[:100000])} 146 | 147 | #members = [x[0] for x in one_away_pairs] + [x[1] for x in one_away_pairs] + [x[0] for x in two_away_pairs] + [x[1] for x in 
two_away_pairs] 148 | #G = nx.Graph() 149 | #for one_away_pair in one_away_pairs: 150 | # G.add_edge(*one_away_pair) 151 | #for two_away_pair in two_away_pairs: 152 | # G.add_edge(*two_away_pair) 153 | 154 | #component_lengths = [len(x) for x in nx.connected_components(G)] 155 | #Counter(component_lengths) 156 | #families = [list(x) for x in nx.connected_components(G) if len(x) == 2] 157 | #coverages = [df2.iloc[families[i][0], 1]+df2.iloc[families[i][1], 1] for i in range(len(families))] 158 | #plt.hist(coverages, bins = 100) 159 | #plt.savefig('coverages_hist.png') 160 | #plt.close() 161 | 162 | #families_3 = [list(x) for x in nx.connected_components(G) if len(x) == 3] 163 | #coverages_3 = [df2.iloc[families_3[i][0], 1]+df2.iloc[families_3[i][1], 1] for i in range(len(families_3))] 164 | #plt.hist(coverages_3, bins = 100) 165 | #plt.savefig('coverages_3_hist.png') 166 | #plt.close() 167 | -------------------------------------------------------------------------------- /playground/popart.R: -------------------------------------------------------------------------------- 1 | library(smudgeplot) 2 | 3 | args <- ArgumentParser()$parse_args() 4 | args$output <- "data/Scer/sploidyplot_test" 5 | args$nbins <- 40 6 | args$kmer_size <- 21 7 | args$homozygous <- FALSE 8 | args$L <- c() 9 | args$col_ramp <- 'viridis' 10 | args$invert_cols <- TRUE 11 | 12 | cov_tab <- read.table("data/Scer/PloidyPlot3_text.smu", col.names = c('covB', 'covA', 'freq'), skip = 2) #nolint 13 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 1] + cov_tab[, 2] 14 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 1] / cov_tab[, 'total_pair_cov'] 15 | 16 | cov_tab_1n_est <- round(estimate_1n_coverage_1d_subsets(cov_tab), 1) 17 | 18 | xlim <- c(0, 0.5) 19 | # max(total_pair_cov); 10*draft_n 20 | ylim <- c(0, 150) 21 | nbins <- 40 22 | 23 | smudge_container <- get_smudge_container(cov_tab, nbins, xlim, ylim) 24 | smudge_container$z <- smudge_container$dens 25 | 26 | plot_popart <- function(cov_tab, ylim, 
colour_ramp){ 27 | A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB'] 28 | cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2 29 | cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))] 30 | 31 | plot(NULL, xlim = c(0.1, 0.5), ylim = ylim, xaxt = "n", yaxt = "n", xlab = '', ylab = '', bty = 'n') 32 | min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov'])) 33 | nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab) 34 | return(0) 35 | } 36 | 37 | par(mfrow = c(2, 5)) 38 | 39 | args$col_ramp <- "viridis" 40 | args$invert_cols <- FALSE 41 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11) 42 | # plot_smudgeplot(smudge_container, 15.5, colour_ramp) 43 | plot_popart(cov_tab, c(20, 120), colour_ramp) 44 | 45 | args$invert_cols <- TRUE 46 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11) 47 | # plot_smudgeplot(smudge_container, 15.5, colour_ramp) 48 | plot_popart(cov_tab, c(20, 120), colour_ramp) 49 | 50 | 51 | for (ramp in c("grey.colors", "magma", "plasma", "mako", "inferno", "rocket", "heat.colors", "cm.colors")){ 52 | args$col_ramp <- ramp 53 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11) 54 | plot_popart(cov_tab, c(20, 120), colour_ramp) 55 | } 56 | 57 | -------------------------------------------------------------------------------- /src_ploidyplot/PloidyPlot.c: -------------------------------------------------------------------------------- 1 | /****************************************************************************************** 2 | * 3 | * PloidyPlot: A C-backed tool for quickly finding hetmers: 4 | * unique k-mer pairs different by exactly one nucleotide 5 | * 6 | * Author: Gene Myers 7 | * Date : May, 2021 8 | * Reduced to the k-mer pair search by Kamil Jaron in August, 2023 9 | * 10 | ********************************************************************************************/ 11 | 12 | #include
13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | 21 | #undef SOLO_CHECK 22 | 23 | #undef DEBUG_GENERAL 24 | #undef DEBUG_RECURSION 25 | #undef DEBUG_THREADS 26 | #undef DEBUG_BOUNDARY 27 | #undef DEBUG_SCAN 28 | #undef DEBUG_BIG_SCAN 29 | 30 | #include "libfastk.h" 31 | #include "matrix.h" 32 | 33 | static char *Usage[] = { " [-v] [-T] [-P]", 34 | " [-o] [-e] [.ktab]" 35 | }; 36 | 37 | static int VERBOSE; 38 | static int NTHREADS; // At most 64 allowed 39 | 40 | #ifdef SOLO_CHECK 41 | 42 | static uint8 *CENT; 43 | static int64 CIDX; 44 | 45 | #endif 46 | 47 | static int ETHRESH; // Error threshold 48 | 49 | #define SMAX 1000 // Max. value of CovA+CovB 50 | #define FMAX 500 // Max. value of min(CovA,CovB) 51 | 52 | static int BLEVEL; // <= 4 53 | static int BWIDTH; // = 4^BLEVEL 54 | 55 | static int64 MEMORY_LIMIT = 0x100000000ll; // 4GB 56 | static int64 Cache_Size; // Divided evenly amont the threads 57 | 58 | static int KMER; 59 | static int KBYTE; 60 | static int TBYTE; 61 | 62 | static int PASS1; 63 | 64 | /**************************************************************************************** 65 | * 66 | * Print & compare utilities 67 | * 68 | *****************************************************************************************/ 69 | 70 | #define COUNT_PTR(p) ((uint16 *) (p+KBYTE)) 71 | 72 | static char dna[4] = { 'a', 'c', 'g', 't' }; 73 | 74 | static char *fmer[256], _fmer[1280]; 75 | 76 | static void setup_fmer_table() 77 | { char *t; 78 | int i, l3, l2, l1, l0; 79 | 80 | i = 0; 81 | t = _fmer; 82 | for (l3 = 0; l3 < 4; l3++) 83 | for (l2 = 0; l2 < 4; l2++) 84 | for (l1 = 0; l1 < 4; l1++) 85 | for (l0 = 0; l0 < 4; l0++) 86 | { fmer[i] = t; 87 | *t++ = dna[l3]; 88 | *t++ = dna[l2]; 89 | *t++ = dna[l1]; 90 | *t++ = dna[l0]; 91 | *t++ = 0; 92 | i += 1; 93 | } 94 | } 95 | 96 | static void print_hap(uint8 *seq, int len, int half) 97 | { static int firstime = 1; 98 | int i, b, h, k; 99 | 100 | if 
(firstime) 101 | { firstime = 0; 102 | setup_fmer_table(); 103 | } 104 | 105 | h = half >> 2; 106 | b = len >> 2; 107 | for (i = 0; i < h; i++) 108 | printf("%s",fmer[seq[i]]); 109 | k = 6; 110 | for (i = h << 2; k >= 0; i++) 111 | { if (i == half) 112 | printf("%c",dna[(seq[h] >> k) & 0x3]-32); 113 | else 114 | printf("%c",dna[(seq[h] >> k) & 0x3]); 115 | k -= 2; 116 | } 117 | for (i = h+1; i < b; i++) 118 | printf("%s",fmer[seq[i]]); 119 | k = 6; 120 | for (i = b << 2; i < len; i++) 121 | { printf("%c",dna[(seq[b] >> k) & 0x3]); 122 | k -= 2; 123 | } 124 | } 125 | 126 | static inline int mycmp(uint8 *a, uint8 *b, int n) 127 | { while (n-- > 0) 128 | { if (*a++ != *b++) 129 | return (a[-1] < b[-1] ? -1 : 1); 130 | } 131 | return (0); 132 | } 133 | 134 | 135 | /**************************************************************************************** 136 | * 137 | * Find Het Pairs by merging 4 lists: 138 | * Out of core: stream indices from fng[a] to end[a] for a in [0,3] 139 | * In core: table pointer from ptr[a] to eptr[a] for a in [0,3] 140 | * 141 | *****************************************************************************************/ 142 | 143 | typedef struct 144 | { int level; // position of variation 145 | int64 *bound; // 17-element array of next level transition points 146 | // Stream scans (big & small): 147 | Kmer_Stream *fng[4]; // 4 streams for scan 148 | int64 end[4]; // index of end of each scan 149 | // Small scans: 150 | int tid; // thread assigned to this subtree 151 | uint8 *cache; // in-core cache for subtrees that fit 152 | // In-core scans: 153 | int64 cidx; // Table index of 1st cache entry 154 | uint8 *ept[4]; // Each list is in [ptr[a],eptr[a]) 155 | uint8 *ptr[4]; // in steps of TBYTES 156 | // Plot: 157 | int64 **plot; // Accumulate A+B, B/(A+B) pairs here 158 | 159 | #ifdef SOLO_CHECK 160 | uint8 *cptr; 161 | #endif 162 | } TP; 163 | 164 | static uint8 *Pair; // Incidence array (# of table entries) 165 | 166 | static uint8 Prefix[4] = 
{ 0x3f, 0x0f, 0x03, 0x00 }; 167 | static uint8 Shift[4] = { 6, 4, 2, 0 }; 168 | 169 | static void *analysis_thread_1(void *args) 170 | { TP *parm = (TP *) args; 171 | int64 *end = parm->end; 172 | Kmer_Stream **fng = parm->fng; 173 | int level = parm->level; 174 | int64 *bound = parm->bound; 175 | 176 | int ll = ((level+1)>>2); 177 | int ls = Shift[(level+1)&0x3]; 178 | 179 | int mask = Prefix[level&0x3]; 180 | int offs = (level >> 2) + 1; 181 | int rem = KBYTE - offs; 182 | 183 | uint8 *ent[4]; 184 | int lst[4]; 185 | int in[4], itop; 186 | int cnt[4]; 187 | int mc, hc; 188 | uint8 *mr, *hr; 189 | int a, i, x; 190 | 191 | for (a = 0; a < 4; a++) 192 | { ent[a] = Current_Entry(fng[a],NULL); 193 | lst[a] = (ent[a][ll] >> ls) & 0x3; 194 | if (fng[a]->cidx < end[a]) 195 | { x = (ent[a][ll] >> ls) & 0x3; 196 | while (x > lst[a]) 197 | bound[(a<<2)+(++(lst[a]))] = fng[a]->cidx; 198 | } 199 | } 200 | 201 | #ifdef DEBUG_SCAN 202 | for (a = 0; a < 4; a++) 203 | { printf(" %c %10lld: ",dna[a],fng[a]->cidx); 204 | print_hap(ent[a],KMER,level); 205 | } 206 | printf("\n"); 207 | #endif 208 | 209 | while (1) 210 | { for (a = 0; a < 4; a++) 211 | if (fng[a]->cidx < end[a]) 212 | break; 213 | if (a >= 4) 214 | break; 215 | 216 | mr = ent[a]+offs; 217 | mc = mr[-1] & mask; 218 | in[0] = a; 219 | itop = 1; 220 | for (a++; a < 4; a++) 221 | if (fng[a]->cidx < end[a]) 222 | { hr = ent[a]+offs; 223 | hc = hr[-1] & mask; 224 | if (hc == mc) 225 | { int v = mycmp(hr,mr,rem); 226 | if (v == 0) 227 | in[itop++] = a; 228 | else if (v < 0) 229 | { mc = hc; 230 | mr = hr; 231 | in[0] = a; 232 | itop = 1; 233 | } 234 | } 235 | else if (hc < mc) 236 | { mc = hc; 237 | mr = hr; 238 | in[0] = a; 239 | itop = 1; 240 | } 241 | } 242 | 243 | if (itop > 1) 244 | { cnt[0] = *((uint16 *) (ent[in[0]]+KBYTE)); 245 | for (i = 1; i < itop; i++) 246 | { cnt[i] = *((uint16 *) (ent[in[i]]+KBYTE)); 247 | for (a = 0; a < i; a++) 248 | { x = cnt[a]+cnt[i]; 249 | if (x <= SMAX) 250 | { Pair[fng[in[i]]->cidx] += 
1; 251 | Pair[fng[in[a]]->cidx] += 1; 252 | } 253 | } 254 | } 255 | } 256 | 257 | for (i = 0; i < itop; i++) 258 | { Kmer_Stream *t; 259 | 260 | a = in[i]; 261 | t = fng[a]; 262 | #ifdef DEBUG_SCAN 263 | if (i == 0) printf("\n"); 264 | printf(" %c %10lld: ",dna[a&0x3],fng[a]->cidx); 265 | print_hap(ent[a],KMER,level); 266 | printf("\n"); 267 | #endif 268 | Next_Kmer_Entry(t); 269 | if (t->cidx < end[a]) 270 | { Current_Entry(t,ent[a]); 271 | x = (ent[a][ll] >> ls) & 0x3; 272 | while (x > lst[a]) 273 | bound[(a<<2)+(++(lst[a]))] = t->cidx; 274 | } 275 | } 276 | 277 | #ifdef DEBUG_SCAN 278 | printf("\n"); 279 | for (a = 0; a < 4; a++) 280 | { printf(" %c %10lld: ",dna[a],fng[a]->cidx); 281 | print_hap(ent[a],KMER,level); 282 | } 283 | printf("\n"); 284 | #endif 285 | } 286 | 287 | for (a = 0; a < 4; a++) 288 | free(ent[a]); 289 | 290 | return (NULL); 291 | } 292 | 293 | static void *analysis_thread_2(void *args) 294 | { TP *parm = (TP *) args; 295 | int64 *end = parm->end; 296 | Kmer_Stream **fng = parm->fng; 297 | int level = parm->level; 298 | int64 *bound = parm->bound; 299 | int64 **plot = parm->plot; 300 | 301 | int ll = ((level+1)>>2); 302 | int ls = Shift[(level+1)&0x3]; 303 | 304 | int mask = Prefix[level&0x3]; 305 | int offs = (level >> 2) + 1; 306 | int rem = KBYTE - offs; 307 | 308 | uint8 *ent[4]; 309 | int lst[4]; 310 | int in[4], itop; 311 | int cnt[4]; 312 | int mc, hc; 313 | uint8 *mr, *hr; 314 | int a, i, x; 315 | 316 | for (a = 0; a < 4; a++) 317 | { ent[a] = Current_Entry(fng[a],NULL); 318 | lst[a] = (ent[a][ll] >> ls) & 0x3; 319 | if (fng[a]->cidx < end[a]) 320 | { x = (ent[a][ll] >> ls) & 0x3; 321 | while (x > lst[a]) 322 | bound[(a<<2)+(++(lst[a]))] = fng[a]->cidx; 323 | } 324 | } 325 | 326 | #ifdef DEBUG_SCAN 327 | for (a = 0; a < 4; a++) 328 | { printf(" %c %10lld: ",dna[a],fng[a]->cidx); 329 | print_hap(ent[a],KMER,level); 330 | } 331 | printf("\n"); 332 | #endif 333 | 334 | while (1) 335 | { for (a = 0; a < 4; a++) 336 | if (fng[a]->cidx < 
end[a]) 337 | break; 338 | if (a >= 4) 339 | break; 340 | 341 | mr = ent[a]+offs; 342 | mc = mr[-1] & mask; 343 | in[0] = a; 344 | itop = 1; 345 | for (a++; a < 4; a++) 346 | if (fng[a]->cidx < end[a]) 347 | { hr = ent[a]+offs; 348 | hc = hr[-1] & mask; 349 | if (hc == mc) 350 | { int v = mycmp(hr,mr,rem); 351 | if (v == 0) 352 | in[itop++] = a; 353 | else if (v < 0) 354 | { mc = hc; 355 | mr = hr; 356 | in[0] = a; 357 | itop = 1; 358 | } 359 | } 360 | else if (hc < mc) 361 | { mc = hc; 362 | mr = hr; 363 | in[0] = a; 364 | itop = 1; 365 | } 366 | } 367 | 368 | if (itop > 1) 369 | #ifdef SOLO_CHECK 370 | { for (i = 0; i < itop; i++) 371 | if (mycmp(ent[in[i]],CENT,KBYTE) == 0) 372 | for (a = 0; a < itop; a++) 373 | if (a != i) 374 | { printf(" "); 375 | print_hap(ent[in[a]],KMER,level); 376 | printf(": %d\n",*((uint16 *) (ent[in[a]]+KBYTE))); 377 | } 378 | } 379 | #else 380 | { cnt[0] = *((uint16 *) (ent[in[0]]+KBYTE)); 381 | for (i = 1; i < itop; i++) 382 | { cnt[i] = *((uint16 *) (ent[in[i]]+KBYTE)); 383 | if (Pair[fng[in[i]]->cidx] <= 1) 384 | for (a = 0; a < i; a++) 385 | { x = cnt[a]+cnt[i]; 386 | if (x <= SMAX && Pair[fng[in[a]]->cidx] <= 1) 387 | { if (cnt[a] < cnt[i]) 388 | plot[x][cnt[a]] += 1; 389 | else 390 | plot[x][cnt[i]] += 1; 391 | } 392 | } 393 | } 394 | } 395 | #endif 396 | 397 | for (i = 0; i < itop; i++) 398 | { Kmer_Stream *t; 399 | 400 | a = in[i]; 401 | t = fng[a]; 402 | #ifdef DEBUG_SCAN 403 | if (i == 0) printf("\n"); 404 | printf(" %c %10lld: ",dna[a&0x3],fng[a]->cidx); 405 | print_hap(ent[a],KMER,level); 406 | printf("\n"); 407 | #endif 408 | Next_Kmer_Entry(t); 409 | if (t->cidx < end[a]) 410 | { Current_Entry(t,ent[a]); 411 | x = (ent[a][ll] >> ls) & 0x3; 412 | while (x > lst[a]) 413 | bound[(a<<2)+(++(lst[a]))] = t->cidx; 414 | } 415 | } 416 | 417 | #ifdef DEBUG_SCAN 418 | printf("\n"); 419 | for (a = 0; a < 4; a++) 420 | { printf(" %c %10lld: ",dna[a],fng[a]->cidx); 421 | print_hap(ent[a],KMER,level); 422 | } 423 | printf("\n"); 424 | 
#endif 425 | } 426 | 427 | for (a = 0; a < 4; a++) 428 | free(ent[a]); 429 | 430 | return (NULL); 431 | } 432 | 433 | static void *analysis_in_core_1(void *args) 434 | { TP *parm = (TP *) args; 435 | uint8 **ptr = parm->ptr; 436 | uint8 **ept = parm->ept; 437 | int level = parm-> level; 438 | uint8 **bound = (uint8 **) (parm->bound); 439 | uint8 *cache = parm->cache; 440 | int64 aidx = parm->cidx; 441 | 442 | int ll = ((level+1)>>2); 443 | int ls = Shift[(level+1)&0x3]; 444 | 445 | int mask = Prefix[level&0x3]; 446 | int offs = (level >> 2) + 1; 447 | int rem = KBYTE - offs; 448 | 449 | int lst[4]; 450 | int in[4], itop; 451 | int cnt[4]; 452 | int mc, hc; 453 | uint8 *mr, *hr; 454 | int a, i, x; 455 | 456 | for (a = 0; a < 4; a++) 457 | { lst[a] = 0; 458 | if (ptr[a] < ept[a]) 459 | { x = (ptr[a][ll] >> ls) & 0x3; 460 | while (x > lst[a]) 461 | bound[(a<<2)+(++(lst[a]))] = ptr[a]; 462 | } 463 | } 464 | 465 | #ifdef DEBUG_SCAN 466 | for (a = 0; a < 4; a++) 467 | { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 468 | print_hap(ptr[a],KMER,level); 469 | } 470 | printf("\n"); 471 | #endif 472 | 473 | while (1) 474 | { for (a = 0; a < 4; a++) 475 | if (ptr[a] < ept[a]) 476 | break; 477 | if (a >= 4) 478 | break; 479 | 480 | mr = ptr[a]+offs; 481 | mc = mr[-1] & mask; 482 | in[0] = a; 483 | itop = 1; 484 | for (a++; a < 4; a++) 485 | if (ptr[a] < ept[a]) 486 | { hr = ptr[a]+offs; 487 | hc = hr[-1] & mask; 488 | if (hc == mc) 489 | { int v = mycmp(hr,mr,rem); 490 | if (v == 0) 491 | in[itop++] = a; 492 | else if (v < 0) 493 | { mc = hc; 494 | mr = hr; 495 | in[0] = a; 496 | itop = 1; 497 | } 498 | } 499 | else if (hc < mc) 500 | { mc = hc; 501 | mr = hr; 502 | in[0] = a; 503 | itop = 1; 504 | } 505 | } 506 | 507 | if (itop > 1) 508 | { cnt[0] = *((uint16 *) (ptr[in[0]]+KBYTE)); 509 | for (i = 1; i < itop; i++) 510 | { cnt[i] = *((uint16 *) (ptr[in[i]]+KBYTE)); 511 | for (a = 0; a < i; a++) 512 | { x = cnt[a]+cnt[i]; 513 | if (x <= SMAX) 514 | { Pair[aidx + 
(ptr[in[i]]-cache)/TBYTE] += 1; 515 | Pair[aidx + (ptr[in[a]]-cache)/TBYTE] += 1; 516 | } 517 | } 518 | } 519 | } 520 | 521 | for (i = 0; i < itop; i++) 522 | { a = in[i]; 523 | #ifdef DEBUG_SCAN 524 | if (i == 0) printf("\n"); 525 | printf("%c %10ld: ",dna[a&0x3],(ptr[a]-parm->cache)/TBYTE); 526 | print_hap(ptr[a],KMER,level); 527 | printf("\n"); 528 | #endif 529 | ptr[a] += TBYTE; 530 | if (ptr[a] < ept[a]) 531 | { x = (ptr[a][ll] >> ls) & 0x3; 532 | while (x > lst[a]) 533 | bound[(a<<2)+(++(lst[a]))] = ptr[a]; 534 | } 535 | } 536 | 537 | #ifdef DEBUG_SCAN 538 | for (a = 0; a < 4; a++) 539 | { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 540 | print_hap(ptr[a],KMER,level); 541 | } 542 | printf("\n"); 543 | #endif 544 | } 545 | 546 | return (NULL); 547 | } 548 | 549 | static void *analysis_in_core_2(void *args) 550 | { TP *parm = (TP *) args; 551 | uint8 **ptr = parm->ptr; 552 | uint8 **ept = parm->ept; 553 | int level = parm-> level; 554 | uint8 **bound = (uint8 **) (parm->bound); 555 | int64 **plot = parm->plot; 556 | uint8 *cache = parm->cache; 557 | int64 aidx = parm->cidx; 558 | 559 | int ll = ((level+1)>>2); 560 | int ls = Shift[(level+1)&0x3]; 561 | 562 | int mask = Prefix[level&0x3]; 563 | int offs = (level >> 2) + 1; 564 | int rem = KBYTE - offs; 565 | 566 | int lst[4]; 567 | int in[4], itop; 568 | int cnt[4]; 569 | int mc, hc; 570 | uint8 *mr, *hr; 571 | int a, i, x; 572 | 573 | for (a = 0; a < 4; a++) 574 | { lst[a] = 0; 575 | if (ptr[a] < ept[a]) 576 | { x = (ptr[a][ll] >> ls) & 0x3; 577 | while (x > lst[a]) 578 | bound[(a<<2)+(++(lst[a]))] = ptr[a]; 579 | } 580 | } 581 | 582 | #ifdef DEBUG_SCAN 583 | for (a = 0; a < 4; a++) 584 | { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 585 | print_hap(ptr[a],KMER,level); 586 | } 587 | printf("\n"); 588 | #endif 589 | 590 | while (1) 591 | { for (a = 0; a < 4; a++) 592 | if (ptr[a] < ept[a]) 593 | break; 594 | if (a >= 4) 595 | break; 596 | 597 | mr = ptr[a]+offs; 598 | mc = mr[-1] & 
mask; 599 | in[0] = a; 600 | itop = 1; 601 | for (a++; a < 4; a++) 602 | if (ptr[a] < ept[a]) 603 | { hr = ptr[a]+offs; 604 | hc = hr[-1] & mask; 605 | if (hc == mc) 606 | { int v = mycmp(hr,mr,rem); 607 | if (v == 0) 608 | in[itop++] = a; 609 | else if (v < 0) 610 | { mc = hc; 611 | mr = hr; 612 | in[0] = a; 613 | itop = 1; 614 | } 615 | } 616 | else if (hc < mc) 617 | { mc = hc; 618 | mr = hr; 619 | in[0] = a; 620 | itop = 1; 621 | } 622 | } 623 | 624 | if (itop > 1) 625 | #ifdef SOLO_CHECK 626 | { for (i = 0; i < itop; i++) 627 | if (mycmp(ptr[in[i]],CENT,KBYTE) == 0) 628 | for (a = 0; a < itop; a++) 629 | if (a != i) 630 | { printf(" "); 631 | print_hap(ptr[in[a]],KMER,level); 632 | printf(": %d\n",*((uint16 *) (ptr[in[a]]+KBYTE))); 633 | } 634 | } 635 | #else 636 | { cnt[0] = *((uint16 *) (ptr[in[0]]+KBYTE)); 637 | for (i = 1; i < itop; i++) 638 | { cnt[i] = *((uint16 *) (ptr[in[i]]+KBYTE)); 639 | if (Pair[aidx+(ptr[in[i]]-cache)/TBYTE] <= 1) 640 | for (a = 0; a < i; a++) 641 | { x = cnt[a]+cnt[i]; 642 | if (x <= SMAX && Pair[aidx+(ptr[in[a]]-cache)/TBYTE] <= 1) 643 | { if (cnt[a] < cnt[i]) 644 | plot[x][cnt[a]] += 1; 645 | else 646 | plot[x][cnt[i]] += 1; 647 | } 648 | } 649 | } 650 | } 651 | #endif 652 | 653 | for (i = 0; i < itop; i++) 654 | { a = in[i]; 655 | #ifdef DEBUG_SCAN 656 | if (i == 0) printf("\n"); 657 | printf("%c %10ld: ",dna[a&0x3],(ptr[a]-parm->cache)/TBYTE); 658 | print_hap(ptr[a],KMER,level); 659 | printf("\n"); 660 | #endif 661 | ptr[a] += TBYTE; 662 | if (ptr[a] < ept[a]) 663 | { x = (ptr[a][ll] >> ls) & 0x3; 664 | while (x > lst[a]) 665 | bound[(a<<2)+(++(lst[a]))] = ptr[a]; 666 | } 667 | } 668 | 669 | #ifdef DEBUG_SCAN 670 | for (a = 0; a < 4; a++) 671 | { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 672 | print_hap(ptr[a],KMER,level); 673 | } 674 | printf("\n"); 675 | #endif 676 | } 677 | 678 | return (NULL); 679 | } 680 | 681 | 682 | /**************************************************************************************** 
683 | * 684 | * Find Het Pairs in top level nodes (level < BLEVEL <= 3) by paneling 4 merge intervals 685 | * with all threads. 686 | * 687 | *****************************************************************************************/ 688 | 689 | static uint8 *Divpt; 690 | 691 | static void big_window(int64 *adiv, int level, TP *parm) 692 | { int64 bound[17]; 693 | int a; 694 | 695 | #ifdef DEBUG_GENERAL 696 | printf("Doing big %d: %lld\n",level,adiv[4]-adiv[0]); fflush(stdout); 697 | #endif 698 | 699 | { int64 e; 700 | uint8 lm; 701 | int t, ls; 702 | Kmer_Stream *T; 703 | #ifndef DEBUG_THREADS 704 | pthread_t threads[NTHREADS]; 705 | #endif 706 | 707 | T = parm[0].fng[0]; 708 | 709 | ls = Shift[level]; // level < BLEVEL 710 | lm = 0xff ^ (0x3<cidx; 727 | } 728 | } 729 | for (a = 0; a < 4; a++) 730 | parm[NTHREADS-1].end[a] = adiv[a+1]; 731 | 732 | #ifdef DEBUG_BIG_SCAN 733 | for (a = 0; a < 4; a++) 734 | for (t = 0; t < NTHREADS; t++) 735 | { printf("%c/%d %10lld: ",dna[a],t,parm[t].fng[a]->cidx); 736 | Current_Entry(parm[t].fng[a],Divpt); 737 | print_hap(Divpt,KMER,level); 738 | printf("\n"); 739 | } 740 | #endif 741 | 742 | for (a = 0; a < 16; a += 4) 743 | { bound[a] = adiv[a>>2]; 744 | for (t = 1; t < 4; t++) 745 | bound[a+t] = -1; 746 | } 747 | 748 | if (PASS1) 749 | #ifdef DEBUG_THREADS 750 | { for (t = 0; t < NTHREADS; t++) 751 | analysis_thread_1(parm+t); 752 | #else 753 | { for (t = 1; t < NTHREADS; t++) 754 | pthread_create(threads+t,NULL,analysis_thread_1,parm+t); 755 | analysis_thread_1(parm); 756 | for (t = 1; t < NTHREADS; t++) 757 | pthread_join(threads[t],NULL); 758 | #endif 759 | } 760 | else 761 | #ifdef DEBUG_THREADS 762 | { for (t = 0; t < NTHREADS; t++) 763 | analysis_thread_2(parm+t); 764 | #else 765 | { for (t = 1; t < NTHREADS; t++) 766 | pthread_create(threads+t,NULL,analysis_thread_2,parm+t); 767 | analysis_thread_2(parm); 768 | for (t = 1; t < NTHREADS; t++) 769 | pthread_join(threads[t],NULL); 770 | #endif 771 | } 772 | 773 | for (a = 0; 
a < 16; a += 4) 774 | for (t = 1; t < 4; t++) 775 | if (bound[a+t] < 0) 776 | bound[a+t] = adiv[(a>>2)+1]; 777 | bound[16] = adiv[4]; 778 | 779 | #ifdef DEBUG_BOUNDARY 780 | T = parm[0].fng[0]; 781 | for (a = 0; a <= 16; a++) 782 | { if (a > 0) 783 | { GoTo_Kmer_Index(T,bound[a]-1); 784 | Current_Entry(T,Divpt); 785 | printf("%c %10lld: ",dna[a&0x3],T->cidx); 786 | print_hap(Divpt,KMER,level+1); 787 | printf("\n"); 788 | } 789 | if (a < 16) 790 | { GoTo_Kmer_Index(T,bound[a]); 791 | Current_Entry(T,Divpt); 792 | printf("%c %10lld: ",dna[a&0x3],T->cidx); 793 | print_hap(Divpt,KMER,level+1); 794 | printf("\n"); 795 | } 796 | } 797 | #endif 798 | } 799 | 800 | level += 1; 801 | if (level < BLEVEL) 802 | for (a = 0; a < 16; a += 4) 803 | big_window(bound+a,level,parm); 804 | } 805 | 806 | 807 | /**************************************************************************************** 808 | * 809 | * Find Het Pairs for a lower level node by list merging 810 | * 811 | *****************************************************************************************/ 812 | 813 | static void in_core_recursion(uint8 **aptr, int level, TP *parm) 814 | { uint8 *bound[17]; 815 | int a; 816 | 817 | #ifdef SOLO_CHECK 818 | if (aptr[0] <= parm->cptr && parm->cptr < aptr[4]) 819 | { printf("Inside %ld-%ld (%d %d)\n", 820 | (aptr[0]-parm->cache)/TBYTE,(aptr[4]-parm->cache)/TBYTE,parm->tid,level); 821 | printf(" "); 822 | print_hap(aptr[0],KMER,level); 823 | printf(" : "); 824 | print_hap(aptr[4],KMER,level); 825 | printf("\n"); 826 | } 827 | #endif 828 | 829 | if (aptr[4]-aptr[0] <= TBYTE) return; 830 | 831 | #ifdef DEBUG_RECURSION 832 | printf("Doing in-core %d: %ld\n",level,(aptr[4]-aptr[0])/TBYTE); 833 | #endif 834 | 835 | { int t; 836 | 837 | parm->ptr[0] = aptr[0]; 838 | for (a = 1; a < 4; a++) 839 | parm->ptr[a] = parm->ept[a-1] = aptr[a]; 840 | parm->ept[3] = aptr[4]; 841 | 842 | parm->level = level; 843 | parm->bound = (int64 *) bound; 844 | 845 | for (a = 0; a < 16; a += 4) 846 | { 
bound[a] = aptr[a>>2]; 847 | for (t = 1; t < 4; t++) 848 | bound[a+t] = NULL; 849 | } 850 | bound[16] = aptr[4]; 851 | 852 | if (PASS1) 853 | analysis_in_core_1(parm); 854 | else 855 | analysis_in_core_2(parm); 856 | 857 | for (a = 0; a < 16; a += 4) 858 | for (t = 1; t < 4; t++) 859 | if (bound[a+t] == NULL) 860 | bound[a+t] = bound[a+4]; 861 | 862 | #ifdef DEBUG_BOUNDARY 863 | { uint8 *cache = parm->cache; 864 | 865 | if (level+1 < KMER) 866 | for (a = 0; a <= 16; a++) 867 | { if (a > 0) 868 | { printf("%c %10ld: ",dna[a&0x3],(bound[a]-cache)/TBYTE); 869 | print_hap(bound[a],KMER,level+1); 870 | printf("\n"); 871 | } 872 | if (a < 16) 873 | { printf("%c %c %10ld: ",dna[a>>2],dna[a&0x3],(bound[a]-cache)/TBYTE); 874 | print_hap(bound[a],KMER,level+1); 875 | printf(" - "); 876 | } 877 | } 878 | } 879 | #endif 880 | } 881 | 882 | level += 1; 883 | if (level < KMER) 884 | for (a = 0; a < 16; a += 4) 885 | in_core_recursion(bound+a,level,parm); 886 | } 887 | 888 | static pthread_mutex_t TMUTEX; 889 | static pthread_cond_t TCOND; 890 | 891 | static int *Tstack; 892 | static int Tavail; 893 | 894 | static void small_recursion(int64 *adiv, int level, TP *parm) 895 | { 896 | #ifdef DEBUG_RECURSION 897 | printf("Doing small %d: %lld [%lld-%lld]\n",level,adiv[4]-adiv[0],adiv[0],adiv[4]); 898 | #endif 899 | 900 | if (adiv[4]-adiv[0] < Cache_Size) 901 | { uint8 *aptr[5]; 902 | 903 | { uint8 *C = parm->cache; 904 | Kmer_Stream *T; 905 | int64 i; 906 | int a; 907 | 908 | #ifdef SOLO_CHECK 909 | if (adiv[0] <= CIDX && CIDX < adiv[4]) 910 | { printf("Heading in %lld-%lld (%d %d)\n",adiv[0],adiv[4],parm->tid,level); 911 | parm->cptr = parm->cache + (CIDX - adiv[0])*TBYTE; 912 | } 913 | else 914 | parm->cptr = NULL; 915 | #endif 916 | 917 | T = parm->fng[0]; 918 | GoTo_Kmer_Index(T,adiv[0]); 919 | parm->cidx = T->cidx; 920 | for (i = 0; T->cidx < adiv[4]; i++) 921 | { Current_Entry(T,C); 922 | C += TBYTE; 923 | Next_Kmer_Entry(T); 924 | } 925 | for (a = 1; a <= 4; a++) 926 | aptr[a] 
= parm->cache + (adiv[a]-adiv[0])*TBYTE; 927 | aptr[0] = parm->cache; 928 | } 929 | 930 | in_core_recursion(aptr,level,parm); 931 | 932 | return; 933 | } 934 | 935 | { int64 bound[17]; 936 | int a; 937 | 938 | { int t; 939 | 940 | for (a = 0; a < 4; a++) 941 | { parm->end[a] = adiv[a+1]; 942 | GoTo_Kmer_Index(parm->fng[a],adiv[a]); 943 | } 944 | parm->level = level; 945 | parm->bound = bound; 946 | 947 | for (a = 0; a < 16; a += 4) 948 | { bound[a] = adiv[a>>2]; 949 | for (t = 1; t < 4; t++) 950 | bound[a+t] = -1; 951 | } 952 | 953 | if (PASS1) 954 | analysis_thread_1(parm); 955 | else 956 | analysis_thread_2(parm); 957 | 958 | for (a = 0; a < 16; a += 4) 959 | for (t = 1; t < 4; t++) 960 | if (bound[a+t] < 0) 961 | bound[a+t] = bound[a+4]; 962 | bound[16] = adiv[4]; 963 | 964 | #ifdef DEBUG_BOUNDARY 965 | { Kmer_Stream *T; 966 | 967 | T = parm->fng[0]; 968 | if (level+1 < KMER) 969 | for (a = 0; a <= 16; a++) 970 | { if (a > 0) 971 | { GoTo_Kmer_Index(T,bound[a]-1); 972 | Current_Entry(T,Divpt); 973 | printf("%c %10lld: ",dna[a&0x3],T->cidx); 974 | print_hap(Divpt,KMER,level+1); 975 | printf("\n"); 976 | } 977 | if (a < 16) 978 | { GoTo_Kmer_Index(T,bound[a]); 979 | Current_Entry(T,Divpt); 980 | printf("%c %10lld: ",dna[a&0x3],T->cidx); 981 | print_hap(Divpt,KMER,level+1); 982 | printf("\n"); 983 | } 984 | } 985 | } 986 | #endif 987 | } 988 | 989 | level += 1; 990 | if (level < KMER) 991 | for (a = 0; a < 16; a += 4) 992 | small_recursion(bound+a,level,parm); 993 | } 994 | } 995 | 996 | static void *small_window(void *args) 997 | { TP *parm = (TP *) args; 998 | int tid = parm->tid; 999 | 1000 | int64 adiv[5]; 1001 | 1002 | { uint8 divpt[TBYTE]; 1003 | Kmer_Stream *T; 1004 | int a, x; 1005 | 1006 | x = parm->level; 1007 | T = parm->fng[0]; 1008 | 1009 | for (a = 0; a < KBYTE; a++) 1010 | divpt[a] = 0; 1011 | for (a = 0; a < 4; a++) 1012 | { if (BLEVEL == 4) 1013 | { divpt[0] = x; 1014 | divpt[1] = (a<<6); 1015 | } 1016 | else 1017 | divpt[0] = (((x<<2) | a) << 
(6-2*BLEVEL)); 1018 | GoTo_Kmer_Entry(T,divpt); 1019 | adiv[a] = T->cidx; 1020 | } 1021 | if (++x < BWIDTH) 1022 | { divpt[0] = (x << (8-2*BLEVEL)); 1023 | divpt[1] = 0; 1024 | GoTo_Kmer_Entry(T,divpt); 1025 | adiv[4] = T->cidx; 1026 | } 1027 | else 1028 | adiv[4] = T->nels; 1029 | } 1030 | 1031 | small_recursion(adiv,BLEVEL,parm); 1032 | 1033 | pthread_mutex_lock(&TMUTEX); 1034 | Tstack[Tavail++] = tid; 1035 | pthread_mutex_unlock(&TMUTEX); 1036 | 1037 | pthread_cond_signal(&TCOND); 1038 | 1039 | return (NULL); 1040 | } 1041 | 1042 | /**************************************************************************************** 1043 | * 1044 | * Main 1045 | * 1046 | *****************************************************************************************/ 1047 | 1048 | static char template[15] = "._SPAIR.XXXX"; 1049 | 1050 | #ifdef SOLO_CHECK 1051 | 1052 | static uint8 code[128] = 1053 | { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1054 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1055 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1056 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1057 | 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1058 | 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1059 | 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1060 | 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; 1061 | 1062 | static void compress_norm(char *s, int len, uint8 *t) 1063 | { int i; 1064 | char c, d, e; 1065 | char *s0, *s1, *s2, *s3; 1066 | 1067 | s0 = s; 1068 | s1 = s0+1; 1069 | s2 = s1+1; 1070 | s3 = s2+1; 1071 | 1072 | c = s0[len]; 1073 | d = s1[len]; 1074 | e = s2[len]; 1075 | s0[len] = s1[len] = s2[len] = 'a'; 1076 | 1077 | for (i = 0; i < len; i += 4) 1078 | *t++ = ((code[(int) s0[i]] << 6) | (code[(int) s1[i]] << 4) 1079 | | (code[(int) s2[i]] << 2) | code[(int) s3[i]] ); 1080 | 1081 | s0[len] = c; 1082 | s1[len] = d; 1083 | s2[len] = e; 1084 | } 1085 | 1086 | #endif 1087 | 1088 | static uint8 comp[128] = 1089 | { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 1090 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1091 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1092 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1093 | 0, 3, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1094 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1095 | 0, 3, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1096 | 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; 1097 | 1098 | static void compress_comp(char *s, int len, uint8 *t) 1099 | { int i; 1100 | char c, d, e; 1101 | char *s0, *s1, *s2, *s3; 1102 | 1103 | s0 = s; 1104 | s1 = s0-1; 1105 | s2 = s1-1; 1106 | s3 = s2-1; 1107 | 1108 | c = s1[0]; 1109 | d = s2[0]; 1110 | e = s3[0]; 1111 | s1[0] = s2[0] = s3[0] = 't'; 1112 | 1113 | for (i = len-1; i >= 0; i -= 4) 1114 | *t++ = ((comp[(int) s0[i]] << 6) | (comp[(int) s1[i]] << 4) 1115 | | (comp[(int) s2[i]] << 2) | comp[(int) s3[i]] ); 1116 | 1117 | s1[0] = c; 1118 | s2[0] = d; 1119 | s3[0] = e; 1120 | } 1121 | 1122 | static void examine_table(Kmer_Stream *T, int *trim, int *sym) 1123 | { 1124 | // Histogram of middle 100M counts and see if trimmed to ETHRESH 1125 | 1126 | { int64 hist[0x8000]; 1127 | int64 frst, last; 1128 | int hbyte = T->hbyte; 1129 | int i, nz; 1130 | 1131 | for (i = 0; i < 0x8000; i++) 1132 | hist[i] = 0; 1133 | 1134 | if (T->nels+3 < 100000000) 1135 | { frst = 0; 1136 | last = T->nels; 1137 | } 1138 | else 1139 | { frst = T->nels/2 - 50000000; 1140 | last = T->nels/2 + 50000000; 1141 | } 1142 | 1143 | for (GoTo_Kmer_Index(T,frst); T->cidx < last; Next_Kmer_Entry(T)) 1144 | hist[*((int16 *) (T->csuf+hbyte))] += 1; 1145 | 1146 | for (nz = 1; hist[nz] == 0; nz++) 1147 | ; 1148 | if (nz < ETHRESH) 1149 | *trim = 0; 1150 | else 1151 | *trim = 1; 1152 | } 1153 | 1154 | // Walk to a non-palindromic k-mer and see if its complement is in T 1155 | 1156 | { int64 sidx; 1157 | char *seq; 1158 | uint8 *cmp; 1159 | int kmer; 1160 | 1161 | kmer = T->kmer; 1162 | 1163 | sidx = 1; 1164 | GoTo_Kmer_Index(T,sidx); 1165 | seq = 
Current_Kmer(T,NULL); 1166 | cmp = Current_Entry(T,NULL); 1167 | while (1) 1168 | { compress_comp(seq,kmer,cmp); 1169 | if (GoTo_Kmer_Entry(T,cmp)) 1170 | { if (T->cidx != sidx) 1171 | { *sym = 1; 1172 | break; 1173 | } 1174 | } 1175 | else 1176 | { *sym = 0; 1177 | break; 1178 | } 1179 | sidx += 1; 1180 | seq = Current_Kmer(T,seq); 1181 | } 1182 | free(cmp); 1183 | free(seq); 1184 | } 1185 | } 1186 | 1187 | int main(int argc, char *argv[]) 1188 | { Kmer_Stream *T; 1189 | char *input; 1190 | char *troot; 1191 | int64 **PLOT; 1192 | int bypass; 1193 | 1194 | char *SORT_PATH; 1195 | char *OUT; 1196 | char *SRC; 1197 | 1198 | // Process command line arguments 1199 | 1200 | (void) print_hap; 1201 | 1202 | { int i, j, k; 1203 | int flags[128]; 1204 | char *eptr; 1205 | 1206 | ARG_INIT("PloidyPlot"); 1207 | OUT = NULL; 1208 | ETHRESH = 4; 1209 | NTHREADS = 4; 1210 | SORT_PATH = "/tmp"; 1211 | 1212 | j = 1; 1213 | for (i = 1; i < argc; i++) 1214 | if (argv[i][0] == '-') 1215 | switch (argv[i][1]) 1216 | { default: 1217 | ARG_FLAGS("vklfs") 1218 | break; 1219 | case 'e': 1220 | ARG_POSITIVE(ETHRESH,"Error-mer threshold") 1221 | break; 1222 | case 'o': 1223 | if (OUT != NULL) // free any previous -o value before replacing it (free on == NULL was a no-op) 1224 | free(OUT); 1225 | OUT = Strdup(argv[i]+2,"Allocating name"); 1226 | if (OUT == NULL) 1227 | exit (1); 1228 | break; 1229 | case 'P': 1230 | SORT_PATH = argv[i]+2; 1231 | break; 1232 | case 'T': 1233 | ARG_POSITIVE(NTHREADS,"Number of threads") 1234 | if (NTHREADS > 64) 1235 | { fprintf(stderr,"%s: Warning, only 64 threads will be used\n",Prog_Name); 1236 | NTHREADS = 64; 1237 | } 1238 | break; 1239 | } 1240 | else 1241 | argv[j++] = argv[i]; 1242 | argc = j; 1243 | 1244 | VERBOSE = flags['v']; 1245 | 1246 | #ifdef SOLO_CHECK 1247 | if (argc != 3) 1248 | #else 1249 | if (argc != 2) 1250 | #endif 1251 | { fprintf(stderr,"\nUsage: %s %s\n",Prog_Name,Usage[0]); 1252 | fprintf(stderr," %*s %s\n",(int) strlen(Prog_Name),"",Usage[1]); 1253 | fprintf(stderr,"\n"); 1254 | fprintf(stderr," -o: root name 
for output plots\n"); 1255 | fprintf(stderr," default is root path of argument\n"); 1256 | fprintf(stderr,"\n"); 1257 | fprintf(stderr," -e: count threshold below which k-mers are considered erroneous\n"); 1258 | fprintf(stderr," -v: verbose mode\n"); 1259 | fprintf(stderr," -T: number of threads to use\n"); 1260 | // This P argument does not work properly, only some of the files are saved where it claims it does 1261 | fprintf(stderr," -P: Place all temporary files in directory -P.\n"); 1262 | exit (1); 1263 | } 1264 | 1265 | SRC = argv[1]; 1266 | if (OUT == NULL) 1267 | OUT = Root(argv[1],".ktab"); 1268 | 1269 | troot = mktemp(template); 1270 | } 1271 | 1272 | // If appropriately named het-mer table found then ask if reuse 1273 | 1274 | { FILE *f; 1275 | int a; 1276 | 1277 | bypass = 0; 1278 | f = fopen(Catenate(OUT,".smu","",""),"r"); 1279 | if (f != NULL) 1280 | { fprintf(stdout,"\n Found het-table %s.smu, use it? ",OUT); 1281 | fflush(stdout); 1282 | while ((a = getc(stdin)) != '\n') 1283 | if (a == 'y' || a == 'Y') 1284 | bypass = 1; 1285 | 1286 | if (bypass) 1287 | { PLOT = Malloc(sizeof(int64 *)*(SMAX+1),"Allocating thread working memory"); 1288 | PLOT[0] = Malloc(sizeof(int64)*(SMAX+1)*(FMAX+1),"Allocating plot"); 1289 | for (a = 1; a <= SMAX; a++) 1290 | PLOT[a] = PLOT[a-1] + (FMAX+1); 1291 | fread(PLOT[0],sizeof(int64),(SMAX+1)*(FMAX+1),f); 1292 | } 1293 | 1294 | fclose(f); 1295 | } 1296 | } 1297 | 1298 | // Open input table and see if it needs conditioning 1299 | 1300 | { char *command; 1301 | char *tname; 1302 | int symm, trim; 1303 | 1304 | tname = Malloc(strlen(SRC) + strlen(troot) + 10,"Allocating strings"); 1305 | command = Malloc(strlen(SRC) + strlen(troot) + 100,"Allocating strings"); 1306 | if (tname == NULL || command == NULL) 1307 | exit (1); 1308 | 1309 | T = Open_Kmer_Stream(SRC); 1310 | if (T == NULL) 1311 | { fprintf(stderr,"%s: Cannot open k-mer table %s\n",Prog_Name,SRC); 1312 | exit (1); 1313 | } 1314 | 1315 | KMER = T->kmer; 1316 | 
KBYTE = T->kbyte; 1317 | TBYTE = T->tbyte; 1318 | 1319 | if (bypass) 1320 | goto skip_build; 1321 | 1322 | examine_table(T,&trim,&symm); 1323 | 1324 | if (VERBOSE) 1325 | { fprintf(stderr,"\n The input table is"); 1326 | if (trim) 1327 | if (symm) 1328 | fprintf(stderr," trimmed and symmetric\n"); 1329 | else 1330 | fprintf(stderr," trimmed but not symmetric\n"); 1331 | else 1332 | if (symm) 1333 | fprintf(stderr," untrimmed yet symmetric\n"); 1334 | else 1335 | fprintf(stderr," untrimmed and not symmetric\n"); 1336 | } 1337 | 1338 | sprintf(tname,"%s",SRC); 1339 | input = NULL; 1340 | 1341 | // Trim source table to k-mers with counts >= ETHRESH if needed 1342 | 1343 | if (!trim) 1344 | { if (VERBOSE) 1345 | { fprintf(stderr,"\n Trimming k-mers in table with count < %d\n",ETHRESH); 1346 | fflush(stderr); 1347 | } 1348 | 1349 | sprintf(command,"Logex -T%d '%s.trim=A[%d-]' %s",NTHREADS,troot,ETHRESH,tname); 1350 | if (system(command) != 0) 1351 | { fprintf(stderr,"%s: Something went wrong with command:\n %s\n",Prog_Name,command); 1352 | exit (1); 1353 | } 1354 | 1355 | sprintf(tname,"%s.trim",troot); 1356 | } 1357 | 1358 | // Make the (relevant) table symmetric if it is not 1359 | 1360 | if (!symm) 1361 | { if (VERBOSE) 1362 | { if (trim) 1363 | fprintf(stderr,"\n Making table symmetric\n"); 1364 | else 1365 | fprintf(stderr,"\n Making trimmed table symmetric\n"); 1366 | fflush(stderr); 1367 | } 1368 | 1369 | sprintf(command,"Symmex -T%d -P%s %s %s.symx",NTHREADS,SORT_PATH,tname,troot); 1370 | 1371 | if (system(command) != 0) 1372 | { fprintf(stderr,"%s: Something went wrong with command:\n %s\n",Prog_Name,command); 1373 | exit (1); 1374 | } 1375 | 1376 | if (!trim) 1377 | { sprintf(command,"Fastrm %s.trim",troot); 1378 | if (system(command) != 0) 1379 | { fprintf(stderr,"%s: Something went wrong with command:\n %s\n", 1380 | Prog_Name,command); 1381 | exit (1); 1382 | } 1383 | } 1384 | 1385 | sprintf(tname,"%s.symx",troot); 1386 | } 1387 | 1388 | // input is the 
name of the relevant conditioned table, unless the original => NULL 1389 | 1390 | free(command); 1391 | if (!(symm && trim)) 1392 | { input = tname; 1393 | Free_Kmer_Stream(T); 1394 | T = Open_Kmer_Stream(input); 1395 | } 1396 | else 1397 | free(tname); 1398 | } 1399 | 1400 | if (VERBOSE) 1401 | { fprintf(stderr,"\n Starting to count covariant pairs\n"); 1402 | fflush(stderr); 1403 | } 1404 | 1405 | BLEVEL = 1; 1406 | BWIDTH = 4; 1407 | while (4*NTHREADS > BWIDTH) 1408 | { BWIDTH *= 4; 1409 | BLEVEL += 1; 1410 | } 1411 | 1412 | Cache_Size = (MEMORY_LIMIT/NTHREADS)/TBYTE; 1413 | 1414 | #ifdef SOLO_CHECK 1415 | if ((int) strlen(argv[2]) != KMER) // argc == 3 here, so the query k-mer is argv[2]; argv[3] is argv[argc] 1416 | { fprintf(stderr,"%s: string is not of length %d\n",Prog_Name,KMER); 1417 | exit (1); 1418 | } 1419 | CENT = Current_Entry(T,NULL); 1420 | compress_norm(argv[2],KMER,CENT); 1421 | if (GoTo_Kmer_Entry(T,CENT) < 0) 1422 | { fprintf(stderr,"%s: string is not in table\n",Prog_Name); 1423 | exit (1); 1424 | } 1425 | printf("%s: %d\n",argv[2],Current_Count(T)); 1426 | CIDX = T->cidx; 1427 | #endif 1428 | 1429 | #ifdef DEBUG_GENERAL 1430 | printf("Threads = %d BL = %d(%d) Cache = %lld\n",NTHREADS,BLEVEL,BWIDTH,Cache_Size); 1431 | #endif 1432 | 1433 | { TP parm[NTHREADS]; 1434 | int a, t; 1435 | int64 **plot; 1436 | uint8 *cache; 1437 | 1438 | for (t = 0; t < NTHREADS; t++) 1439 | { plot = Malloc(sizeof(int64 *)*(SMAX+1),"Allocating thread working memory"); 1440 | plot[0] = Malloc(sizeof(int64)*(SMAX+1)*(FMAX+1),"Allocating plot"); 1441 | for (a = 1; a <= SMAX; a++) 1442 | plot[a] = plot[a-1] + (FMAX+1); 1443 | bzero(plot[0],sizeof(int64)*(SMAX+1)*(FMAX+1)); 1444 | parm[t].plot = plot; 1445 | } 1446 | 1447 | parm[0].fng[0] = T; 1448 | for (t = 0; t < NTHREADS; t++) 1449 | for (a = 0; a < 4; a++) 1450 | if (a+t > 0) 1451 | parm[t].fng[a] = Clone_Kmer_Stream(T); 1452 | 1453 | Divpt = Current_Entry(T,NULL); 1454 | Pair = Malloc(sizeof(uint8)*T->nels,"Allocating pair table"); 1455 | cache = Malloc(MEMORY_LIMIT,"Allocating cache 
buffer"); 1456 | if (Pair == NULL || cache == NULL) 1457 | exit (1); 1458 | 1459 | bzero(Pair,T->nels); 1460 | 1461 | for (PASS1 = 1; PASS1 >= 0; PASS1--) 1462 | { 1463 | // Analyze the top levels, each threaded, up to level, BLEVEL, where 1464 | // the number of nodes is greater than the number of threads 1465 | // by a factor of 4 or more 1466 | 1467 | { int64 adiv[5]; 1468 | 1469 | for (t = 0; t < KBYTE; t++) 1470 | Divpt[t] = 0; 1471 | 1472 | adiv[0] = 0; 1473 | for (a = 1; a < 4; a++) 1474 | { Divpt[0] = (a << 6); 1475 | GoTo_Kmer_Entry(T,Divpt); 1476 | adiv[a] = T->cidx; 1477 | } 1478 | adiv[4] = T->nels; 1479 | 1480 | big_window(adiv,0,parm); 1481 | } 1482 | 1483 | // Assign a thread to each subtree at level BLEVEL until all are done 1484 | 1485 | { pthread_t threads[NTHREADS]; 1486 | int tstack[NTHREADS]; 1487 | int x; 1488 | 1489 | Tstack = tstack; 1490 | 1491 | for (t = 0; t < NTHREADS; t++) 1492 | { Tstack[t] = t; 1493 | parm[t].tid = t; 1494 | parm[t].cache = cache + t*(MEMORY_LIMIT/NTHREADS); 1495 | } 1496 | Tavail = NTHREADS; 1497 | 1498 | pthread_mutex_init(&TMUTEX,NULL); 1499 | pthread_cond_init(&TCOND,NULL); 1500 | 1501 | for (x = 0; x < BWIDTH; x++) 1502 | { pthread_mutex_lock(&TMUTEX); 1503 | 1504 | while (Tavail <= 0) // re-check the predicate after each wakeup: cond_wait may return spuriously 1505 | pthread_cond_wait(&TCOND,&TMUTEX); 1506 | 1507 | t = Tstack[--Tavail]; 1508 | 1509 | #ifdef DEBUG_GENERAL 1510 | printf("Launching %d on thread %d\n",x,t); 1511 | #endif 1512 | 1513 | pthread_mutex_unlock(&TMUTEX); 1514 | 1515 | parm[t].level = x; 1516 | 1517 | pthread_create(threads+t,NULL,small_window,parm+t); 1518 | } 1519 | 1520 | pthread_mutex_lock(&TMUTEX); 1521 | while (Tavail < NTHREADS) 1522 | pthread_cond_wait(&TCOND,&TMUTEX); 1523 | pthread_mutex_unlock(&TMUTEX); 1524 | } 1525 | } 1526 | 1527 | free(cache); 1528 | free(Pair); 1529 | free(Divpt); 1530 | 1531 | { char *command; 1532 | int64 *plot0, *plott; 1533 | int i; 1534 | 1535 | for (t = NTHREADS-1; t >= 0; t--) 1536 | for (a = 3; a >= 0; a--) 1537 | if (a+t > 0) 
1538 | Free_Kmer_Stream(parm[t].fng[a]); 1539 | Free_Kmer_Stream(T); 1540 | 1541 | for (t = 1; t < NTHREADS; t++) 1542 | for (i = 0; i <= SMAX; i++) 1543 | { plot0 = parm[0].plot[i]; 1544 | plott = parm[t].plot[i]; 1545 | for (a = 0; a <= FMAX; a++) 1546 | plot0[a] += plott[a]; 1547 | } 1548 | 1549 | for (t = NTHREADS-1; t >= 1; t--) 1550 | { free(parm[t].plot[0]); 1551 | free(parm[t].plot); 1552 | } 1553 | 1554 | PLOT = parm[0].plot; 1555 | 1556 | if (input != NULL) 1557 | { command = Malloc(strlen(input)+100,"Allocating strings"); 1558 | if (command == NULL) 1559 | exit (1); 1560 | sprintf(command,"Fastrm %s",input); 1561 | if (system(command) != 0) 1562 | { fprintf(stderr,"%s: Something went wrong with command:\n %s\n",Prog_Name,command); 1563 | exit (1); 1564 | } 1565 | free(command); 1566 | free(input); 1567 | } 1568 | } 1569 | } 1570 | 1571 | #ifdef SOLO_CHECK 1572 | 1573 | exit (0); 1574 | 1575 | #endif 1576 | 1577 | if (VERBOSE) 1578 | { fprintf(stderr,"\n Count complete, plotting\n"); 1579 | fflush(stderr); 1580 | } 1581 | 1582 | skip_build: 1583 | 1584 | fprintf(stderr,"\n About to save stuff\n"); 1585 | 1586 | FILE *f; 1587 | int a, i; 1588 | 1589 | f = fopen(Catenate(OUT,"_text.smu","",""),"w"); 1590 | fprintf(stderr,"\n Saving stuff\n"); 1591 | 1592 | // fprintf(f, "// %dx%d matrix, the i'th number in the j'th row give the number of hetmer pairs (a,b)\n", SMAX,FMAX); 1593 | // fprintf(f, "// s.t. 
count(a)+count(b) = j+1 and min(count(a),count(b)) = i+1.\n"); 1594 | for (a = 0; a <= SMAX; a++) 1595 | { 1596 | for (i = 0; i < FMAX; i++) 1597 | if (PLOT[a][i] > 0) 1598 | { 1599 | fprintf(f,"%i\t%i\t%lld\n",i,a-i,PLOT[a][i]); 1600 | } 1601 | } 1602 | fclose(f); 1603 | 1604 | free(OUT); 1605 | 1606 | Catenate(NULL,NULL,NULL,NULL); 1607 | Numbered_Suffix(NULL,0,NULL); 1608 | free(Prog_Name); 1609 | 1610 | exit (0); 1611 | } 1612 | -------------------------------------------------------------------------------- /src_ploidyplot/gene_core.c: -------------------------------------------------------------------------------- 1 | #include <stdio.h> 2 | #include <stdlib.h> 3 | #include <string.h> 4 | #include <strings.h> 5 | #include <unistd.h> 6 | #include <limits.h> 7 | #include <fcntl.h> 8 | 9 | #include "gene_core.h" 10 | 11 | /******************************************************************************************* 12 | * 13 | * GENERAL UTILITIES 14 | * 15 | ********************************************************************************************/ 16 | 17 | char *Prog_Name; 18 | 19 | void *Malloc(int64 size, char *mesg) 20 | { void *p; 21 | 22 | if ((p = malloc(size)) == NULL) 23 | { if (mesg == NULL) 24 | fprintf(stderr,"%s: Out of memory\n",Prog_Name); 25 | else 26 | fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg); 27 | } 28 | return (p); 29 | } 30 | 31 | void *Realloc(void *p, int64 size, char *mesg) 32 | { if (size <= 0) 33 | size = 1; 34 | if ((p = realloc(p,size)) == NULL) 35 | { if (mesg == NULL) 36 | fprintf(stderr,"%s: Out of memory\n",Prog_Name); 37 | else 38 | fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg); 39 | } 40 | return (p); 41 | } 42 | 43 | char *Strdup(char *name, char *mesg) 44 | { char *s; 45 | 46 | if (name == NULL) 47 | return (NULL); 48 | if ((s = strdup(name)) == NULL) 49 | { if (mesg == NULL) 50 | fprintf(stderr,"%s: Out of memory\n",Prog_Name); 51 | else 52 | fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg); 53 | } 54 | return (s); 55 | } 56 | 57 | char *Strndup(char *name, int 
len, char *mesg) 58 | { char *s; 59 | 60 | if (name == NULL) 61 | return (NULL); 62 | if ((s = strndup(name,len)) == NULL) 63 | { if (mesg == NULL) 64 | fprintf(stderr,"%s: Out of memory\n",Prog_Name); 65 | else 66 | fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg); 67 | } 68 | return (s); 69 | } 70 | 71 | char *PathTo(char *name) 72 | { char *path, *find; 73 | 74 | if (name == NULL) 75 | return (NULL); 76 | if ((find = rindex(name,'/')) != NULL) 77 | path = Strndup(name,find-name,"Extracting path from"); 78 | else 79 | path = Strdup(".","Allocating default path"); 80 | return (path); 81 | } 82 | 83 | char *Root(char *name, char *suffix) 84 | { char *path, *find, *dot; 85 | int epos; 86 | 87 | if (name == NULL) 88 | return (NULL); 89 | find = rindex(name,'/'); 90 | if (find == NULL) 91 | find = name; 92 | else 93 | find += 1; 94 | if (suffix == NULL) 95 | { dot = strrchr(find,'.'); 96 | path = Strndup(find,dot-find,"Extracting root from"); 97 | } 98 | else 99 | { epos = strlen(find); 100 | epos -= strlen(suffix); 101 | if (epos > 0 && strcasecmp(find+epos,suffix) == 0) 102 | path = Strndup(find,epos,"Extracting root from"); 103 | else 104 | path = Strdup(find,"Allocating root"); 105 | } 106 | return (path); 107 | } 108 | 109 | char *Catenate(char *path, char *sep, char *root, char *suffix) 110 | { static char *cat = NULL; 111 | static int max = -1; 112 | int len; 113 | 114 | if (path == NULL || root == NULL || sep == NULL || suffix == NULL) 115 | { free(cat); 116 | max = -1; 117 | return (NULL); 118 | } 119 | len = strlen(path); 120 | len += strlen(sep); 121 | len += strlen(root); 122 | len += strlen(suffix); 123 | if (len > max) 124 | { max = ((int) (1.2*len)) + 100; 125 | if ((cat = (char *) realloc(cat,max+1)) == NULL) 126 | { fprintf(stderr,"%s: Out of memory (Making path name for %s)\n",Prog_Name,root); 127 | return (NULL); 128 | } 129 | } 130 | sprintf(cat,"%s%s%s%s",path,sep,root,suffix); 131 | return (cat); 132 | } 133 | 134 | char 
*Numbered_Suffix(char *left, int num, char *right) 135 | { static char *suffix = NULL; 136 | static int max = -1; 137 | int len; 138 | 139 | if (left == NULL || right == NULL) 140 | { free(suffix); 141 | max = -1; 142 | return (NULL); 143 | } 144 | len = strlen(left); 145 | len += strlen(right) + 40; 146 | if (len > max) 147 | { max = ((int) (1.2*len)) + 100; 148 | if ((suffix = (char *) realloc(suffix,max+1)) == NULL) 149 | { fprintf(stderr,"%s: Out of memory (Making number suffix for %d)\n",Prog_Name,num); 150 | return (NULL); 151 | } 152 | } 153 | sprintf(suffix,"%s%d%s",left,num,right); 154 | return (suffix); 155 | } 156 | 157 | 158 | #define COMMA ',' 159 | 160 | // Print big integers with commas/periods for better readability 161 | 162 | void Print_Number(int64 num, int width, FILE *out) 163 | { if (width == 0) 164 | { if (num < 1000ll) 165 | fprintf(out,"%lld",num); 166 | else if (num < 1000000ll) 167 | fprintf(out,"%lld%c%03lld",num/1000ll,COMMA,num%1000ll); 168 | else if (num < 1000000000ll) 169 | fprintf(out,"%lld%c%03lld%c%03lld",num/1000000ll, 170 | COMMA,(num%1000000ll)/1000ll,COMMA,num%1000ll); 171 | else 172 | fprintf(out,"%lld%c%03lld%c%03lld%c%03lld",num/1000000000ll, 173 | COMMA,(num%1000000000ll)/1000000ll, 174 | COMMA,(num%1000000ll)/1000ll,COMMA,num%1000ll); 175 | } 176 | else 177 | { if (num < 1000ll) 178 | fprintf(out,"%*lld",width,num); 179 | else if (num < 1000000ll) 180 | { if (width <= 4) 181 | fprintf(out,"%lld%c%03lld",num/1000ll,COMMA,num%1000ll); 182 | else 183 | fprintf(out,"%*lld%c%03lld",width-4,num/1000ll,COMMA,num%1000ll); 184 | } 185 | else if (num < 1000000000ll) 186 | { if (width <= 8) 187 | fprintf(out,"%lld%c%03lld%c%03lld",num/1000000ll,COMMA,(num%1000000ll)/1000ll, 188 | COMMA,num%1000ll); 189 | else 190 | fprintf(out,"%*lld%c%03lld%c%03lld",width-8,num/1000000ll,COMMA,(num%1000000ll)/1000ll, 191 | COMMA,num%1000ll); 192 | } 193 | else 194 | { if (width <= 12) 195 | 
fprintf(out,"%lld%c%03lld%c%03lld%c%03lld",num/1000000000ll,COMMA, 196 | (num%1000000000ll)/1000000ll,COMMA, 197 | (num%1000000ll)/1000ll,COMMA,num%1000ll); 198 | else 199 | fprintf(out,"%*lld%c%03lld%c%03lld%c%03lld",width-12,num/1000000000ll,COMMA, 200 | (num%1000000000ll)/1000000ll,COMMA, 201 | (num%1000000ll)/1000ll,COMMA,num%1000ll); 202 | } 203 | } 204 | } 205 | 206 | // Return the number of symbols to print num, base 10 (without commas as above) 207 | 208 | int Number_Digits(int64 num) 209 | { int digit; 210 | 211 | if (num == 0) 212 | return (1); 213 | if (num < 0) 214 | { num = -num; 215 | digit = 1; 216 | } 217 | else 218 | digit = 0; 219 | while (num >= 1) 220 | { num /= 10; 221 | digit += 1; 222 | } 223 | return (digit); 224 | } 225 | 226 | 227 | /******************************************************************************************* 228 | * 229 | * READ AND ARROW COMPRESSION/DECOMPRESSION UTILITIES 230 | * 231 | ********************************************************************************************/ 232 | 233 | // Compress read into 2-bits per base (from [0-3] per byte representation 234 | 235 | void Compress_Read(int len, char *s) 236 | { int i; 237 | char c, d; 238 | char *s0, *s1, *s2, *s3; 239 | 240 | s0 = s; 241 | s1 = s0+1; 242 | s2 = s1+1; 243 | s3 = s2+1; 244 | 245 | c = s1[len]; 246 | d = s2[len]; 247 | s0[len] = s1[len] = s2[len] = 0; 248 | 249 | for (i = 0; i < len; i += 4) 250 | *s++ = (char ) ((s0[i] << 6) | (s1[i] << 4) | (s2[i] << 2) | s3[i]); 251 | 252 | s1[len] = c; 253 | s2[len] = d; 254 | } 255 | 256 | // Uncompress read form 2-bits per base into [0-3] per byte representation 257 | 258 | void Uncompress_Read(int len, char *s) 259 | { int i, tlen, byte; 260 | char *s0, *s1, *s2, *s3; 261 | char *t; 262 | 263 | s0 = s; 264 | s1 = s0+1; 265 | s2 = s1+1; 266 | s3 = s2+1; 267 | 268 | tlen = (len-1)/4; 269 | 270 | t = s+tlen; 271 | for (i = tlen*4; i >= 0; i -= 4) 272 | { byte = *t--; 273 | s0[i] = (char) ((byte >> 6) & 0x3); 274 
| s1[i] = (char) ((byte >> 4) & 0x3); 275 | s2[i] = (char) ((byte >> 2) & 0x3); 276 | s3[i] = (char) (byte & 0x3); 277 | } 278 | s[len] = 4; 279 | } 280 | 281 | // Convert read in [0-3] representation to ascii representation (end with '\n') 282 | 283 | void Lower_Read(char *s) 284 | { static char letter[4] = { 'a', 'c', 'g', 't' }; 285 | 286 | for ( ; *s != 4; s++) 287 | *s = letter[(int) *s]; 288 | *s = '\0'; 289 | } 290 | 291 | void Upper_Read(char *s) 292 | { static char letter[4] = { 'A', 'C', 'G', 'T' }; 293 | 294 | for ( ; *s != 4; s++) 295 | *s = letter[(int) *s]; 296 | *s = '\0'; 297 | } 298 | 299 | void Letter_Arrow(char *s) 300 | { static char letter[4] = { '1', '2', '3', '4' }; 301 | 302 | for ( ; *s != 4; s++) 303 | *s = letter[(int) *s]; 304 | *s = '\0'; 305 | } 306 | 307 | // Convert read in ascii representation to [0-3] representation (end with 4) 308 | 309 | void Number_Read(char *s) 310 | { static char number[128] = 311 | { 0, 0, 0, 0, 0, 0, 0, 0, 312 | 0, 0, 0, 0, 0, 0, 0, 0, 313 | 0, 0, 0, 0, 0, 0, 0, 0, 314 | 0, 0, 0, 0, 0, 0, 0, 0, 315 | 0, 0, 0, 0, 0, 0, 0, 0, 316 | 0, 0, 0, 0, 0, 0, 0, 0, 317 | 0, 0, 0, 0, 0, 0, 0, 0, 318 | 0, 0, 0, 0, 0, 0, 0, 0, 319 | 0, 0, 0, 1, 0, 0, 0, 2, 320 | 0, 0, 0, 0, 0, 0, 0, 0, 321 | 0, 0, 0, 0, 3, 0, 0, 0, 322 | 0, 0, 0, 0, 0, 0, 0, 0, 323 | 0, 0, 0, 1, 0, 0, 0, 2, 324 | 0, 0, 0, 0, 0, 0, 0, 0, 325 | 0, 0, 0, 0, 3, 0, 0, 0, 326 | 0, 0, 0, 0, 0, 0, 0, 0, 327 | }; 328 | 329 | for ( ; *s != '\0'; s++) 330 | *s = number[(int) *s]; 331 | *s = 4; 332 | } 333 | 334 | void Number_Arrow(char *s) 335 | { static char arrow[128] = 336 | { 3, 3, 3, 3, 3, 3, 3, 3, 337 | 3, 3, 3, 3, 3, 3, 3, 3, 338 | 3, 3, 3, 3, 3, 3, 3, 3, 339 | 3, 3, 3, 3, 3, 3, 3, 3, 340 | 3, 3, 3, 3, 3, 3, 3, 3, 341 | 3, 3, 3, 3, 3, 3, 3, 3, 342 | 3, 0, 1, 2, 3, 3, 3, 3, 343 | 3, 3, 3, 3, 3, 3, 3, 3, 344 | 3, 3, 3, 3, 3, 3, 3, 2, 345 | 3, 3, 3, 3, 3, 3, 3, 3, 346 | 3, 3, 3, 3, 3, 3, 3, 3, 347 | 3, 3, 3, 3, 3, 3, 3, 3, 348 | 3, 3, 3, 3, 3, 3, 3, 3, 349 | 3, 
3, 3, 3, 3, 3, 3, 3, 350 | 3, 3, 3, 3, 3, 3, 3, 3, 351 | 3, 3, 3, 3, 3, 3, 3, 3, 352 | }; 353 | 354 | for ( ; *s != '\0'; s++) 355 | *s = arrow[(int) *s]; 356 | *s = 4; 357 | } 358 | 359 | void Change_Read(char *s) 360 | { static char change[128] = 361 | { 0, 0, 0, 0, 0, 0, 0, 0, 362 | 0, 0, 0, 0, 0, 0, 0, 0, 363 | 0, 0, 0, 0, 0, 0, 0, 0, 364 | 0, 0, 0, 0, 0, 0, 0, 0, 365 | 0, 0, 0, 0, 0, 0, 0, 0, 366 | 0, 0, 0, 0, 0, 0, 0, 0, 367 | 0, 0, 0, 0, 0, 0, 0, 0, 368 | 0, 0, 0, 0, 0, 0, 0, 0, 369 | 0, 'a', 0, 'c', 0, 0, 0, 'g', 370 | 0, 0, 0, 0, 0, 0, 0, 0, 371 | 0, 0, 0, 0, 't', 0, 0, 0, 372 | 0, 0, 0, 0, 0, 0, 0, 0, 373 | 0, 'A', 0, 'C', 0, 0, 0, 'G', 374 | 0, 0, 0, 0, 0, 0, 0, 0, 375 | 0, 0, 0, 0, 'T', 0, 0, 0, 376 | 0, 0, 0, 0, 0, 0, 0, 0, 377 | }; 378 | 379 | for ( ; *s != '\0'; s++) 380 | *s = change[(int) *s]; 381 | } 382 | -------------------------------------------------------------------------------- /src_ploidyplot/gene_core.h: -------------------------------------------------------------------------------- 1 | #ifndef _CORE 2 | 3 | #define _CORE 4 | 5 | #include 6 | 7 | /******************************************************************************************* 8 | * 9 | * MY STANDARD TYPE DECLARATIONS 10 | * 11 | ********************************************************************************************/ 12 | 13 | typedef unsigned char uint8; 14 | typedef unsigned short uint16; 15 | typedef unsigned int uint32; 16 | typedef unsigned long long uint64; 17 | typedef signed char int8; 18 | typedef signed short int16; 19 | typedef signed int int32; 20 | typedef signed long long int64; 21 | typedef float float32; 22 | typedef double float64; 23 | 24 | /******************************************************************************************* 25 | * 26 | * MACROS TO HELP PARSE COMMAND LINE 27 | * 28 | ********************************************************************************************/ 29 | 30 | extern char *Prog_Name; // Name of program, available 
everywhere 31 | 32 | #define ARG_INIT(name) \ 33 | Prog_Name = Strdup(name,""); \ 34 | for (i = 0; i < 128; i++) \ 35 | flags[i] = 0; 36 | 37 | #define ARG_FLAGS(set) \ 38 | for (k = 1; argv[i][k] != '\0'; k++) \ 39 | { if (index(set,argv[i][k]) == NULL) \ 40 | { fprintf(stderr,"%s: -%c is an illegal option\n",Prog_Name,argv[i][k]); \ 41 | exit (1); \ 42 | } \ 43 | flags[(int) argv[i][k]] = 1; \ 44 | } 45 | 46 | #define ARG_POSITIVE(var,name) \ 47 | var = strtol(argv[i]+2,&eptr,10); \ 48 | if (*eptr != '\0' || argv[i][2] == '\0') \ 49 | { fprintf(stderr,"%s: -%c '%s' argument is not an integer\n", \ 50 | Prog_Name,argv[i][1],argv[i]+2); \ 51 | exit (1); \ 52 | } \ 53 | if (var <= 0) \ 54 | { fprintf(stderr,"%s: %s must be positive (%d)\n",Prog_Name,name,var); \ 55 | exit (1); \ 56 | } 57 | 58 | #define ARG_NON_NEGATIVE(var,name) \ 59 | var = strtol(argv[i]+2,&eptr,10); \ 60 | if (*eptr != '\0' || argv[i][2] == '\0') \ 61 | { fprintf(stderr,"%s: -%c '%s' argument is not an integer\n", \ 62 | Prog_Name,argv[i][1],argv[i]+2); \ 63 | exit (1); \ 64 | } \ 65 | if (var < 0) \ 66 | { fprintf(stderr,"%s: %s must be non-negative (%d)\n",Prog_Name,name,var); \ 67 | exit (1); \ 68 | } 69 | 70 | #define ARG_REAL(var) \ 71 | var = strtod(argv[i]+2,&eptr); \ 72 | if (*eptr != '\0' || argv[i][2] == '\0') \ 73 | { fprintf(stderr,"%s: -%c '%s' argument is not a real number\n", \ 74 | Prog_Name,argv[i][1],argv[i]+2); \ 75 | exit (1); \ 76 | } 77 | 78 | /******************************************************************************************* 79 | * 80 | * MEMORY ALLOCATION,FILE HANDLING, AND PRETTY PRINTING UTILITIES 81 | * 82 | ********************************************************************************************/ 83 | 84 | // The following general utilities return NULL if any of their input pointers are NULL, or if they 85 | // could not perform their function (in which case they also print an error to stderr). 
86 | 87 | void *Malloc(int64 size, char *mesg); // Guarded versions of malloc, realloc 88 | void *Realloc(void *object, int64 size, char *mesg); // and strdup, that output "mesg" to 89 | char *Strdup(char *string, char *mesg); // stderr if out of memory 90 | char *Strndup(char *string, int len, char *mesg); // stderr if out of memory 91 | 92 | char *PathTo(char *path); // Return path portion of file name "path" 93 | char *Root(char *path, char *suffix); // Return the root name, excluding suffix, of "path" 94 | 95 | // Catenate returns concatenation of path.sep.root.suffix in a *temporary* buffer 96 | // Numbered_Suffix returns concatenation of left..right in a *temporary* buffer 97 | 98 | char *Catenate(char *path, char *sep, char *root, char *suffix); 99 | char *Numbered_Suffix(char *left, int num, char *right); 100 | 101 | void Print_Number(int64 num, int width, FILE *out); // Print readable big integer 102 | int Number_Digits(int64 num); // Return # of digits in printed number 103 | 104 | /******************************************************************************************* 105 | * 106 | * ROUTINES FOR HANDLING DNA AND ARROW STRINGS 107 | * 108 | ********************************************************************************************/ 109 | 110 | #define COMPRESSED_LEN(len) (((len)+3) >> 2) 111 | 112 | void Compress_Read(int len, char *s); // Compress read in-place into 2-bit form 113 | void Uncompress_Read(int len, char *s); // Uncompress read in-place into numeric form 114 | void Print_Read(char *s, int width); 115 | 116 | void Lower_Read(char *s); // Convert read from numbers to lowercase letters (0-3 to acgt) 117 | void Upper_Read(char *s); // Convert read from numbers to uppercase letters (0-3 to ACGT) 118 | void Number_Read(char *s); // Convert read from letters to numbers 119 | void Change_Read(char *s); // Convert read from one case to the other 120 | 121 | void Letter_Arrow(char *s); // Convert arrow pw's from numbers to uppercase letters (0-3 
to 1234) 122 | void Number_Arrow(char *s); // Convert arrow pw string from letters to numbers 123 | 124 | #endif // _CORE 125 | -------------------------------------------------------------------------------- /src_ploidyplot/libfastk.h: -------------------------------------------------------------------------------- 1 | /******************************************************************************************* 2 | * 3 | * C library routines to access and operate upon FastK histogram, k-mer tables, and profiles 4 | * 5 | * Author: Gene Myers 6 | * Date : November 2020 7 | * 8 | *******************************************************************************************/ 9 | 10 | #ifndef _LIBFASTK 11 | #define _LIBFASTK 12 | 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | #include 23 | #include 24 | #include 25 | 26 | #include "gene_core.h" 27 | 28 | // HISTOGRAM 29 | 30 | typedef struct 31 | { int kmer; // Histogram is for k-mers of this length 32 | int unique; // 1 => count of unique k-mers, 0 => count of k-mer instances 33 | int low; // Histogram is for range [low,hgh] 34 | int high; 35 | int64 *hist; // hist[i] for i in [low,high] = # of k-mers occuring i times 36 | } Histogram; 37 | 38 | Histogram *Load_Histogram(char *name); 39 | void Modify_Histogram(Histogram *H, int low, int high, int unique); 40 | int Write_Histogram(char *name, Histogram *H); 41 | void Free_Histogram(Histogram *H); 42 | 43 | 44 | // K-MER TABLE 45 | 46 | typedef struct 47 | { int kmer; // Kmer length 48 | int minval; // The minimum count of a k-mer in the table 49 | int64 nels; // # of unique, sorted k-mers in the table 50 | 51 | void *private[7]; // Private fields 52 | } Kmer_Table; 53 | 54 | Kmer_Table *Load_Kmer_Table(char *name, int cut_off); 55 | void Free_Kmer_Table(Kmer_Table *T); 56 | 57 | char *Fetch_Kmer(Kmer_Table *T, int64 i, char *seq); 58 | int Fetch_Count(Kmer_Table *T, int64 i); 59 | 60 | 
int64 Find_Kmer(Kmer_Table *T, char *kseq); 61 | 62 | 63 | // K-MER STREAM 64 | 65 | typedef struct 66 | { int kmer; // Kmer length 67 | int minval; // The minimum count of a k-mer in the stream 68 | int64 nels; // # of elements in entire table 69 | // Current position 70 | int64 cidx; // current element index 71 | uint8 *csuf; // current element suffix 72 | int cpre; // current element prefix 73 | // Other useful parameters 74 | int ibyte; // # of bytes in prefix 75 | int kbyte; // Kmer encoding in bytes 76 | int tbyte; // Kmer+count entry in bytes 77 | int hbyte; // Kmer suffix in bytes (= kbyte - ibyte) 78 | int pbyte; // Kmer,count suffix in bytes (= tbyte - ibyte) 79 | 80 | void *private[10]; // Private fields 81 | } Kmer_Stream; 82 | 83 | Kmer_Stream *Open_Kmer_Stream(char *name); 84 | Kmer_Stream *Clone_Kmer_Stream(Kmer_Stream *S); 85 | void Free_Kmer_Stream(Kmer_Stream *S); 86 | 87 | void First_Kmer_Entry(Kmer_Stream *S); 88 | void Next_Kmer_Entry(Kmer_Stream *S); 89 | 90 | char *Current_Kmer(Kmer_Stream *S, char *seq); 91 | int Current_Count(Kmer_Stream *S); 92 | uint8 *Current_Entry(Kmer_Stream *S, uint8 *seq); 93 | 94 | void GoTo_Kmer_Index(Kmer_Stream *S, int64 idx); 95 | int GoTo_Kmer_String(Kmer_Stream *S, char *seq); 96 | int GoTo_Kmer_Entry(Kmer_Stream *S, uint8 *entry); 97 | 98 | 99 | // PROFILES 100 | 101 | typedef struct 102 | { int kmer; // Kmer length 103 | int nparts; // # of threads/parts for the profiles 104 | int nreads; // total # of reads in data set 105 | int64 *nbase; // nbase[i] for i in [0,nparts) = id of last read in part i + 1 106 | int64 *index; // index[i] for i in [0,nreads) = offset in relevant part of 107 | // compressed profile for read i 108 | void *private[4]; // Private fields 109 | } Profile_Index; 110 | 111 | Profile_Index *Open_Profiles(char *name); 112 | Profile_Index *Clone_Profiles(Profile_Index *P); 113 | 114 | void Free_Profiles(Profile_Index *P); 115 | 116 | int Fetch_Profile(Profile_Index *P, int64 id, int plen, 
uint16 *profile); 117 | 118 | #endif // _LIBFASTK 119 | -------------------------------------------------------------------------------- /src_ploidyplot/matrix.c: -------------------------------------------------------------------------------- 1 | /*****************************************************************************************\ 2 | * * 3 | * Matrix inversion, determinants, and linear equations via LU-decomposition * 4 | * * 5 | * Author: Gene Myers * 6 | * Date : April 2007 * 7 | * Mod : June 2008 -- Added TDG's and Cubic Spline to enable snakes and curves * 8 | * Dec 2008 -- Refined TDG's and cubic splines to Decompose/Solve paradigm * 9 | * * 10 | \*****************************************************************************************/ 11 | 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | 18 | #include "gene_core.h" 19 | #include "matrix.h" 20 | 21 | #define TINY 1.0e-20 22 | 23 | /**************************************************************************************** 24 | * * 25 | * LU-FACTORIZATION SYSTEM SOLVER * 26 | * * 27 | ****************************************************************************************/ 28 | 29 | 30 | // M is a square double matrix where the row index moves the fastest. 31 | // LU_Decompose takes M and produces an LU factorization of M that 32 | // can then be used to rapidly solve the system for given right hand sides 33 | // and to compute M's determinant. The return value is NULL if the matrix 34 | // is nonsingular. If the matrix appears unstable (had to use a very nearly 35 | // zero pivot) then the integer pointed at by stable will be zero, and 36 | // non-zero otherwise. M is subsumed and effectively destroyed by the routine. 
37 | 38 | LU_Factor *LU_Decompose(Double_Matrix *M, int *stable) 39 | { LU_Factor *F; 40 | int n, i, j; 41 | int *p, sign; 42 | double *v; 43 | double *avec[1001], **a; 44 | 45 | n = M->n; 46 | 47 | if (n > 1000) 48 | a = Malloc(sizeof(double)*n,"Allocating LU Factor work space"); 49 | else 50 | a = avec; 51 | F = Malloc(sizeof(LU_Factor),"Allocating LU Factor"); 52 | p = Malloc((sizeof(int) + sizeof(double))*n,"Allocating LU Factor"); 53 | if (a == NULL || F == NULL || p == NULL) 54 | exit (1); 55 | 56 | v = (double *) (p+n); 57 | 58 | p[0] = 0; 59 | a[0] = M->m; 60 | for (i = 1; i < n; i++) 61 | { a[i] = a[i-1] + n; 62 | p[i] = i; 63 | } 64 | 65 | *stable = 1; 66 | sign = 1; 67 | for (i = 0; i < n; i++) // Find the scale factors for each row in v. 68 | { double b, f, *r; 69 | 70 | r = a[i]; 71 | b = 0.; 72 | for (j = 0; j < n; j++) 73 | { f = fabs(r[j]); 74 | if (f > b) 75 | b = f; 76 | } 77 | if (b == 0.0) 78 | { free(p); 79 | free(F); 80 | if (n > 1000) 81 | free(a); 82 | return (NULL); 83 | } 84 | v[i] = 1./b; 85 | } 86 | 87 | for (j = 0; j < n; j++) // For each column 88 | { double b, s, *r; 89 | int k, w; 90 | 91 | for (i = 0; i < j; i++) // Determine U 92 | { r = a[i]; 93 | s = r[j]; 94 | for (k = 0; k < i; k++) 95 | s -= r[k]*a[k][j]; 96 | r[j] = s; 97 | } 98 | 99 | b = -1.; 100 | w = j; 101 | for (i = j; i < n; i++) // Determine L without dividing by pivot, in order to 102 | { r = a[i]; // determine who the pivot should be. 103 | s = r[j]; 104 | for (k = 0; k < j; k++) 105 | s -= r[k]*a[k][j]; 106 | r[j] = s; 107 | 108 | s = v[i]*fabs(s); // Update best pivot seen thus far 109 | if (s > b) 110 | { b = s; 111 | w = i; 112 | } 113 | } 114 | 115 | if (w != j) // Pivot if necessary 116 | { r = a[w]; 117 | a[w] = a[j]; 118 | a[j] = r; 119 | k = p[w]; 120 | p[w] = p[j]; 121 | p[j] = k; 122 | sign = -sign; 123 | v[w] = v[j]; 124 | } 125 | 126 | if (fabs(a[j][j]) < TINY) // Complete column of L by dividing by selected pivot 127 | { if (a[j][j] < 0.) 
128 | a[j][j] = -TINY; 129 | else 130 | a[j][j] = TINY; 131 | *stable = 0; 132 | } 133 | b = 1./a[j][j]; 134 | for (i = j+1; i < n; i++) 135 | a[i][j] *= b; 136 | } 137 | 138 | #ifdef DEBUG_LU 139 | { int i, j; 140 | 141 | printf("\nLU Decomposition\n"); 142 | for (i = 0; i < n; i++) 143 | { printf(" %2d: ",p[i]); 144 | for (j = 0; j < n; j++) 145 | printf(" %8g",a[i][j]); 146 | printf("\n"); 147 | } 148 | } 149 | #endif 150 | 151 | if (n > 1000) 152 | free(a); 153 | 154 | F->sign = sign; 155 | F->perm = p; 156 | F->lu_mat = M; 157 | return (F); 158 | } 159 | 160 | 161 | // Display LU factorization F to specified file 162 | 163 | void Show_LU_Product(FILE *file, LU_Factor *F) 164 | { int n, i, j, k; 165 | int *p; 166 | double u, **a, *d; 167 | 168 | n = F->lu_mat->n; 169 | d = F->lu_mat->m; 170 | p = F->perm; 171 | a = (double **) (p+n); 172 | 173 | for (i = 0; i < n; i++) 174 | a[i] = d + p[i]*n; 175 | 176 | fprintf(file,"\nLU Product:\n"); 177 | for (i = 0; i < n; i++) 178 | { for (j = 0; j < i; j++) 179 | { u = 0.; 180 | for (k = 0; k <= j; k++) 181 | u += a[i][k] * a[k][j]; 182 | fprintf(file," %g",u); 183 | } 184 | for (j = i; j < n; j++) 185 | { u = a[i][j]; 186 | for (k = 0; k < i; k++) 187 | u += a[i][k] * a[k][j]; 188 | fprintf(file," %g",u); 189 | } 190 | fprintf(file,"\n"); 191 | } 192 | } 193 | 194 | 195 | // Given rhs vector B and LU-factorization F, solve the system of equations 196 | // and return the result in B. 197 | // To invert M = L*U given the LU-decomposition, simply call LU_Solve with 198 | // b = [ 0^k-1 1 0^n-k] to get the k'th column of the inverse matrix. 
199 | 200 | Double_Vector *LU_Solve(Double_Vector *B, LU_Factor *F) 201 | { double *x; 202 | int n, i, j; 203 | int *p; 204 | double *a, *b, s, *r; 205 | 206 | n = F->lu_mat->n; 207 | a = F->lu_mat->m; 208 | p = F->perm; 209 | b = B->m; 210 | x = (double *) (p+n); 211 | 212 | for (i = 0; i < n; i++) 213 | { r = a + p[i]*n; 214 | s = b[p[i]]; 215 | for (j = 0; j < i; j++) 216 | s -= r[j] * x[j]; 217 | x[i] = s; 218 | } 219 | 220 | for (i = n; i-- > 0; ) 221 | { r = a + p[i]*n; 222 | s = x[i]; 223 | for (j = i+1; j < n; j++) 224 | s -= r[j] * b[j]; 225 | b[i] = s/r[i]; 226 | } 227 | 228 | return (B); 229 | } 230 | 231 | 232 | // Transpose a matrix M in-place and as a convenience return a pointer to it 233 | 234 | Double_Matrix *Transpose_Matrix(Double_Matrix *M) 235 | { int n; 236 | double *a; 237 | int p, q; 238 | int i, j; 239 | 240 | n = M->n; 241 | a = M->m; 242 | 243 | p = 0; 244 | for (j = 0; j < n; j++) // Transpose the result 245 | { q = j; 246 | for (i = 0; i < j; i++) 247 | { double x = a[p]; 248 | a[p++] = a[q]; 249 | a[q] = x; 250 | q += n; 251 | } 252 | p += (n-j); 253 | } 254 | 255 | return (M); 256 | } 257 | 258 | 259 | // Generate the right inverse of the matrix that gave rise to the LU factorization f. 260 | // That is for matrix A, return matrix A^-1 s.t. A * A^-1 = I. If transpose is non-zero 261 | // then the transpose of the right inverse is returned. 
262 | 263 | Double_Matrix *LU_Invert(LU_Factor *F, int transpose) 264 | { int n, i, j; 265 | Double_Matrix *M, G; 266 | double *m, *g; 267 | 268 | n = F->lu_mat->n; 269 | 270 | M = Malloc(sizeof(Double_Matrix),"Allocating matrix"); 271 | m = Malloc(sizeof(double)*n*n,"Allocating matrix"); 272 | if (M == NULL || m == NULL) 273 | exit (1); 274 | 275 | M->n = n; 276 | M->m = m; 277 | G.n = n; 278 | 279 | g = m; 280 | for (i = 0; i < n; i++) // Find the inverse of each column in the 281 | { G.m = g; 282 | for (j = 0; j < n; j++) 283 | g[j] = 0.; 284 | g[i] = 1.; 285 | LU_Solve(&G,F); 286 | g += n; 287 | } 288 | 289 | if (!transpose) 290 | Transpose_Matrix(M); 291 | 292 | return (M); 293 | } 294 | 295 | 296 | // Given an LU-factorization F, return the value of the determinant of the 297 | // original matrix. 298 | 299 | double LU_Determinant(LU_Factor *F) 300 | { int i, n; 301 | int *p; 302 | double *a, det; 303 | 304 | n = F->lu_mat->n; 305 | a = F->lu_mat->m; 306 | p = F->perm; 307 | 308 | det = F->sign; 309 | for (i = 0; i < n; i++) 310 | det *= a[p[i]*n+i]; 311 | return (det); 312 | } 313 | -------------------------------------------------------------------------------- /src_ploidyplot/matrix.h: -------------------------------------------------------------------------------- 1 | /*****************************************************************************************\ 2 | * * 3 | * Matrix inversion, determinants, and linear equations via LU-decomposition * 4 | * * 5 | * Author: Gene Myers * 6 | * Date : April 2007 * 7 | * Mod : June 2008 -- Added TDG's and Cubic Spline to enable snakes and curves * 8 | * * 9 | \*****************************************************************************************/ 10 | 11 | #ifndef _MATRIX_LIB 12 | 13 | #define _MATRIX_LIB 14 | 15 | typedef struct 16 | { int n; 17 | double *m; 18 | } Double_Matrix; 19 | 20 | typedef Double_Matrix Double_Vector; 21 | 22 | typedef struct 23 | { Double_Matrix *lu_mat; // LU decomposion: L is below 
the diagonal and U is on and above it 24 | int *perm; // Permutation of the original rows of m due to pivoting 25 | int sign; // Sign to apply to the determinant due to pivoting (+1 or -1) 26 | } LU_Factor; 27 | 28 | LU_Factor *LU_Decompose(Double_Matrix *M, int *stable); 29 | void Show_LU_Product(FILE *file, LU_Factor *F); 30 | Double_Vector *LU_Solve(Double_Vector *B, LU_Factor *F); 31 | Double_Matrix *Transpose_Matrix(Double_Matrix *M); 32 | Double_Matrix *LU_Invert(LU_Factor *F, int transpose); 33 | double LU_Determinant(LU_Factor *F); 34 | 35 | #endif 36 | -------------------------------------------------------------------------------- /tests/README.md: -------------------------------------------------------------------------------- 1 | ### Manual tests 2 | 3 | This is a place for manual tests that have not been automated (yet?). 4 | Don't forget to re-install the package/script before execution. Something like 5 | 6 | ``` 7 | make install INSTALL_PREFIX=~ 8 | ``` 9 | 10 | should do the job. 11 | 12 | #### interface tests 13 | 14 | ##### data prep 15 | 16 | Download `SRR3265401` - a nice tetraploid Saccharomyces run I use often for testing. 17 | 18 | ##### smudgeplot plot 19 | 20 | Defaults: 21 | 22 | ``` 23 | smudgeplot.py all data/Scer/kmerpairs_default_e2_text.smu -o data/Scer/240918_trial 24 | ``` 25 | 26 | Testing parameters: 27 | 28 | ``` 29 | smudgeplot.py all data/Scer/kmerpairs_default_e2_text.smu -o data/Scer/240918_trial_params -t "Species 1" -c 20 -ylim 80 -col_ramp magma --invert_cols 30 | ``` 31 | 32 | ##### smudgeplot hetkmers 33 | 34 | Two different methods to extract homologous kmers. 35 | 36 | TODO 37 | 38 | ##### Dicots 39 | 40 | This is a large dataset of the first 540 dicot genomes sequenced by the Tree of Life. Some of them are complete, some have insufficient coverage or otherwise QC-failed data. 
The idea here is to be able to tell those apart, get reasonable defaults so the generated plot is meaningful for a reasonable number (i.e. nearly all) of them. 41 | 42 | ``` 43 | time ./exec/smudgeplot.py plot data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt -o data/dicots/alt_plots/daAchMill1 --alt_plot -q 0.9 44 | ``` 45 | 46 | ```bash 47 | for smu in data/dicots/smu_files/*.smu.txt; do 48 | species=$(basename $smu) 49 | echo $species $smu 50 | time ./exec/smudgeplot.py plot $smu -o data/dicots/alt_plots/$species --alt_plot -q 0.9 51 | done 52 | 53 | for smu in data/dicots/smu_files/*.smu.txt; do 54 | species=$(basename $smu .smu.txt) 55 | echo $species $smu 56 | time ./exec/smudgeplot.py plot $smu -c 10 -o data/dicots/alt_plots_c10/$species --alt_plot -q 0.9 57 | done 58 | 59 | for smu in $(ls data/dicots/smu_files/*.smu.txt | head -20); do 60 | species=$(basename $smu .smu.txt) 61 | echo $species $smu 62 | smudgeplot.py all $smu -t $species -o data/dicots/automated_smudgeplots/$species 63 | done 64 | ``` -------------------------------------------------------------------------------- /tests/run_smudge_version.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | smudgeplot.py -v 2> version 4 | version=$(cat version | cut -f 3 -d ' ') 5 | rm version 6 | echo "testing $version" 7 | 8 | outdir=figures/$version 9 | mkdir -p $outdir 10 | rm $outdir/* 11 | 12 | Rscript install.R 13 | install -C exec/smudgeplot.py /usr/local/bin 14 | install -C exec/smudgeplot_plot.R /usr/local/bin 15 | 16 | for sp in "Aric1" "Avag1" "Mflo2" "Rmag1"; do 17 | smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version" -t "$sp $version" data/$sp/*coverages_2.tsv 18 | done 19 | 20 | for sp in "Ps791" "Rvar1"; do 21 | smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version" -t "$sp $version" data/$sp/*coverages_2.tsv -nbins 15 22 | done 23 | 24 | sp="Lcla1" 25 | 26 | smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version"_homozyg 
-t "$sp $version" --homozygous data/$sp/*coverages_2.tsv 27 | 28 | smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version" -t "$sp $version" data/$sp/*coverages_2.tsv 29 | --------------------------------------------------------------------------------
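The version-capture idiom at the top of `run_smudge_version.sh` can be tried in isolation. The version line below is a made-up stand-in for whatever `smudgeplot.py -v` actually prints to stderr — only the third space-separated field matters to `cut`:

```bash
#!/bin/bash
set -e

# Stand-in for: smudgeplot.py -v 2> version
# (hypothetical output format; the real tool's line may differ)
echo "Running smudgeplot v0.2.5" > version

version=$(cut -f 3 -d ' ' version)   # third space-separated field
rm version
echo "testing $version"

# Per-version output directory, as in the test script
outdir=figures/$version
mkdir -p "$outdir"
```

Note that `cut -f 3 -d ' '` is position-sensitive: if the tool ever changes the wording of its version line, the extracted field shifts, so the `echo "testing $version"` line doubles as a sanity check before any plots are generated.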