├── .github
    └── ISSUE_TEMPLATE
    │   ├── a-bug.md
    │   ├── feature-request.md
    │   ├── smudgeplot-inference-problem.md
    │   └── smudgeplot-interpretation.md
├── .gitignore
├── FAQ.md
├── LICENSE
├── Makefile
├── README.md
├── exec
    ├── centrality_plot.R
    ├── smudgeplot
    ├── smudgeplot.py
    └── smudgeplot_plot.R
├── playground
    ├── BGA_tutorial.md
    ├── DEVELOPMENT.md
    ├── alternative_fitting
    │   ├── README.md
    │   ├── alternative_plot_covA_covB.R
    │   ├── alternative_plotting.R
    │   ├── alternative_plotting_functions.R
    │   ├── alternative_plotting_testing.R
    │   └── pair_clustering.py
    ├── interactive_plot_strawberry_full_kmer_families_fooling_around.R
    ├── more_away_pairs.py
    ├── playground.R
    ├── playground.py
    └── popart.R
├── src_ploidyplot
    ├── PloidyPlot.c
    ├── gene_core.c
    ├── gene_core.h
    ├── libfastk.c
    ├── libfastk.h
    ├── matrix.c
    └── matrix.h
└── tests
    ├── README.md
    └── run_smudge_version.sh


/.github/ISSUE_TEMPLATE/a-bug.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: A bug
 3 | about: If it looks like an error in the code
 4 | labels: potential_problems
 5 | 
 6 | ---
 7 | 
 8 | **What did you do**
 9 | 
10 | Tell us about the problem. what is the version of the software you used (smudgeplot -v)? What was the input (possibly with a few example lines)? What is the command you run? What is the error output you get? And what you have expected to see instead?
11 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature-request.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Feature request
 3 | about: Any ideas how to improve smudgeplot?
 4 | title: feature request: [short descrition]
 5 | labels: enhancement
 6 | 
 7 | ---
 8 | 
 9 | **Background**
10 | 
11 | I suppose you have a reason why you propose us an improvement. If it has a biological or algorithmical motivation, give us something to undestand where the suggestion comes from...
12 | 
13 | **Feature**
14 | 
15 | What do you think the feature should do. Be as detailed as possilbe here. Don't hesitate to write down examples of how the feature should operate.
16 | 
17 | **Contribution**
18 | 
19 | Do you have an idea how to implement the feature? Would you be willing to contribute to get the feature?


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/smudgeplot-inference-problem.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Smudgeplot inference problem
 3 | about: When there is a suspicious smudgeplot suggesting something wrong
 4 | 
 5 | ---
 6 | 
 7 | **About your genome**
 8 | 
 9 | Tell us about your genome, so we understand why the smudgeplot seems to be wrong. Also please include the evidence you have (karyotype, inSitu...).
10 | 
11 | **smudgeplot**
12 | 
13 | Show us please the command you have used to generate the smudgeplot and the smudgeplot if possible. Tell us what looks suspicious on the smudgeplot and how do you expect it to look like?
14 | 


--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/smudgeplot-interpretation.md:
--------------------------------------------------------------------------------
 1 | ---
 2 | name: Smudgeplot interpretation
 3 | about: For interpretation problems of smudgeplot, send an issue by email if the data
 4 |   are sensitive
 5 | 
 6 | ---
 7 | 
 8 | I have troubles understanding my smudgeplot. I have used follwing command to generate it
 9 | 
10 | ```
11 | smudgeplot plot -i kmer_pairs_coverages_2.tsv -o my_org -t "Figure 1a: genome structure of X. odoratum" -L 40 -k 19
12 | ```
13 | 
14 | and it look like this:
15 | 
16 | (add smudgeplot)
17 | 
18 | Now, I (know/have indication) already of (genome size/number of chromosomes/ploidy/...) from (RADseq/flow cytometry/karyotypes/inSitu/...) data. This does not make sense together with the smudge because (it predicts unexpected ploidy/shows only one smudge/...).
19 | 
20 | How should I understand my smudgeplot?
21 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
 1 | data
 2 | figures
 3 | playground
 4 | docs
 5 | exec/PloidyPlot
 6 | exec/hetmers
 7 | *.o
 8 | .DS_Store
 9 | smu2text_smu
10 | 


--------------------------------------------------------------------------------
/FAQ.md:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | migrated to [wiki](https://github.com/tbenavi1/smudgeplot/wiki/FAQ)
4 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "{}"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright {yyyy} {name of copyright owner}
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 
203 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | # PATH for libraries is guessed
 2 | CFLAGS = -O3 -Wall -Wextra -Wno-unused-result -fno-strict-aliasing
 3 | 
 4 | ifndef INSTALL_PREFIX
 5 |     INSTALL_PREFIX = /usr/local
 6 | endif
 7 | 
 8 | HET_KMERS_INST = $(INSTALL_PREFIX)/bin/smudgeplot.py $(INSTALL_PREFIX)/bin/hetmers
 9 | SMUDGEPLOT_INST = $(INSTALL_PREFIX)/bin/smudgeplot_plot.R $(INSTALL_PREFIX)/bin/centrality_plot.R
10 | 
11 | .PHONY : install
12 | install : $(HET_KMERS_INST) $(SMUDGEPLOT_INST) $(CUTOFF_INST) 
13 | 
14 | $(INSTALL_PREFIX)/bin/% : exec/%
15 | 	install -C $< $(INSTALL_PREFIX)/bin
16 | 
17 | exec/hetmers: src_ploidyplot/PloidyPlot.c src_ploidyplot/libfastk.c src_ploidyplot/libfastk.h src_ploidyplot/matrix.c src_ploidyplot/matrix.h
18 | 	gcc $(CFLAGS) -o $@ src_ploidyplot/PloidyPlot.c src_ploidyplot/libfastk.c src_ploidyplot/matrix.c -lpthread -lm
19 | 
20 | 
21 | .PHONY : clean
22 | clean :
23 | 	rm -f exec/hetmers
24 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Smudgeplot 
  2 | 
  3 | <font size ="4">**_Version: 0.4.0 Arched_**</font>
  4 | 
  5 | <font size ="4">**_Authors: [Gene W Myers](https://github.com/thegenemyers) and [Kamil S. Jaron](https://github.com/KamilSJaron), Tianyi Ma._**</font>
  6 | 
  7 | ### Install the whole thing
  8 |  
  9 | This version of smudgeplot operates on FastK k-mer databases. So, before installing smudgeplot, please install [FastK](https://github.com/thegenemyers/FASTK). The smudgeplot installation consist of one python, two R scripts and the C-backend to search for all the k-mer pairs (hetmers) that needs to be compilet.
 10 | 
 11 | #### Quick
 12 | 
 13 | Assuming you have admin right / can write to `/usr/local/bin`, you can simply run
 14 | 
 15 | ```bash
 16 | sudo make
 17 | ```
 18 | That should do everything necesarry to make smudgeplot fully operational. You can run `smudgeplot.py --help` to see if it worked.
 19 | 
 20 | #### Custom installation location
 21 | 
 22 | If there is a different directory where you store your executables, you can specify `INSTALL_PREFIX` variable to make. The binaries are then added to `$INSTALL_PREFIX/bin`. For example
 23 | 
 24 | ```bash
 25 | make -s INSTALL_PREFIX=~
 26 | ```
 27 | 
 28 | will install smudgeplot to `~/bin/`.
 29 | 
 30 | #### Manual installation
 31 | 
 32 | Compiling the `C` executable
 33 | 
 34 | ```
 35 | make exec/hetmers # this will compile hetmers (kmer pair searching engine of PloidyPlot) backend
 36 | ```
 37 | 
 38 | Now you can move all three files from the `exec` directory somewhere your system will see it (or alternativelly, you can add that directory to `$PATH` variable).
 39 | 
 40 | ```
 41 | install -C exec/smudgeplot.py /usr/local/bin
 42 | install -C exec/hetmers /usr/local/bin
 43 | install -C exec/smudgeplot_plot.R /usr/local/bin
 44 | install -C exec/centrality_plot.R /usr/local/bin
 45 | ```
 46 | 
 47 | ### Runing this version on Sacharomyces data
 48 | Requires ~2.1GB of space and `FastK` and `smudgeplot` installed.
 49 | 
 50 | ```bash
 51 | # download data
 52 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_1.fastq.gz
 53 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_2.fastq.gz
 54 | 
 55 | # sort them in a reasonable place
 56 | mkdir data/Scer
 57 | mv *fastq.gz data/Scer/
 58 | 
 59 | # run FastK to create a k-mer database
 60 | FastK -v -t4 -k31 -M16 -T4 data/Scer/SRR3265401_[12].fastq.gz -Ndata/Scer/FastK_Table
 61 | 
 62 | # Find all k-mer pairs in the dataset using hetmer module
 63 | smudgeplot.py hetmers -L 12 -t 4 -o data/Scer/kmerpairs --verbose data/Scer/FastK_Table
 64 | # this now generated `data/Scer/kmerpairs_text.smu` file;
 65 | # it's a flat file with three columns; covB, covA and freq (the number of k-mer pairs with these respective coverages)
 66 | 
 67 | # use the .smu file to infer ploidy and create smudgeplot
 68 | smudgeplot.py all -o data/Scer/trial_run data/Scer/kmerpairs_text.smu
 69 | 
 70 | # check that bunch files are generated (3 pdfs; some summary tables and logs)
 71 | ls data/Scer/trial_run_*
 72 | ```
 73 | 
 74 | The y-axis scaling is by default 100, one can spcify argument `ylim` to scale it differently
 75 | 
 76 | ```bash
 77 | smudgeplot.py all -o data/Scer/trial_run_ylim70 data/Scer/kmerpairs_text.smu -ylim 70
 78 | ```
 79 | 
 80 | There is also a plotting module that requires the coverage and a list of smudges and their respective sizes listed in a tabular file. This plotting module does not inference and should be used only if you know the right answers already. 
 81 | 
 82 | ### How smudgeplot works
 83 | 
 84 | This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
 85 | 
 86 | Smudgeplots are computed from raw or even better from trimmed reads and show the haplotype structure using heterozygous kmer pairs. For example (of an older version):
 87 | 
 88 | ![smudgeexample](https://user-images.githubusercontent.com/8181573/45959760-f1032d00-c01a-11e8-8576-ff0512c33da9.png)
 89 | 
 90 | Every haplotype structure has a unique smudge on the graph and the heat of the smudge indicates how frequently the haplotype structure is represented in the genome compared to the other structures. The image above is an ideal case, where the sequencing coverage is sufficient to beautifully separate all the smudges, providing very strong and clear evidence of triploidy.
 91 | 
 92 | This tool is planned to be a part of [GenomeScope](https://github.com/tbenavi1/genomescope2.0) in the near future.
 93 | 
 94 | ### More about the use
 95 | 
 96 | The input is a set of whole genome sequencing reads, the more coverage the better. The method is designed to process big datasets, don't hesitate to pull all single-end/pair-end libraries together.
 97 | 
 98 | The workflow is automatic, but it's not fool-proof. It requires some decisions. Use this tool joinlty with [GenomeScope](https://github.com/tbenavi1/genomescope2.0). The tutorials on our wiki are currently outdated (build for version 0.2.5), and will be updated by 18th of October. 
 99 | 
100 | Smudgeplot generates two plots, one with coloration on a log scale and the other on a linear scale. The legend indicates approximate kmer pairs per tile densities. Note that a single polymorphism generates multiple heterozygous kmers. As such, the reported numbers do not directly correspond to the number of variants. Instead, the actual number is approximately 1/k times the reported numbers, where k is the kmer size (in summary already recalculated). It's important to note that this process does not exhaustively attempt to find all of the heterozygous kmers from the genome. Instead, only a sufficient sample is obtained in order to identify relative genome structure. You can also report the minimal number of loci that are heterozygous if the inference is correct.
101 | 
102 | ### GenomeScope
103 | 
104 | You can feed the kmer coverage histogram to GenomeScope. (Either run the [genomescope script](https://github.com/schatzlab/genomescope/blob/master/genomescope.R) or use the [web server](http://qb.cshl.edu/genomescope/))
105 | 
106 | ```
107 | Rscript genomescope.R kmcdb_k21.hist <k-mer_length> <read_length> <output_dir> [kmer_max] [verbose]
108 | ```
109 | 
110 | This script estimates the size, heterozygosity, and repetitive fraction of the genome. By inspecting the fitted model you can determine the location of the smallest peak after the error tail. Then, you can decide the low end cutoff below which all kmers will be discarded as errors (cca 0.5 times the haploid kmer coverage), and the high end cutoff above which all kmers will be discarded (cca 8.5 times the haploid kmer coverage).
111 | 
112 | ## Frequently Asked Questions
113 | 
114 | Are collected on [our wiki](https://github.com/KamilSJaron/smudgeplot/wiki/FAQ). Smudgeplot does not demand much on computational resources, but make sure you check [memory requirements](https://github.com/KamilSJaron/smudgeplot/wiki/smudgeplot-hetkmers#memory-requirements) before you extract kmer pairs (`hetkmers` task). If you don't find an answer for your question in FAQ, open an [issue](https://github.com/KamilSJaron/smudgeplot/issues/new/choose) or drop us an email.
115 | 
116 | Check [projects](https://github.com/KamilSJaron/smudgeplot/projects) to see how the development goes.
117 | 
118 | ## Contributions
119 | 
120 | This is definitely an open project, contributions are welcome. You can check some of the ideas for the future in [projects](https://github.com/KamilSJaron/smudgeplot/projects) and in the development [dev](https://github.com/KamilSJaron/smudgeplot/tree/dev) branch. The file [playground/DEVELOPMENT.md](playground/DEVELOPMENT.md) contains some development notes. The directory [playground](playground) contains some snippets, attempts, and other items of interest.
121 | 
122 | ## Reference
123 | 
124 | Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. *Nature Communications* **11**, 1432 (2020). https://doi.org/10.1038/s41467-020-14998-3
125 | 
126 | ## Acknowledgements
127 | 
128 | This [blogpost](http://www.everydayanalytics.ca/2014/09/5-ways-to-do-2d-histograms-in-r.html) by Myles Harrison has largely inspired the visual output of smudgeplots. The colourblind friendly colour theme was suggested by @ggrimes. Grateful for helpful comments of beta testers and pre-release chatters!
129 | 


--------------------------------------------------------------------------------
/exec/centrality_plot.R:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env Rscript
 2 | args = commandArgs(trailingOnly=TRUE)
 3 | input_name <- args[1]
 4 | # input_name = '~/test/daAchMill1_centralities.txt'
 5 | 
 6 | output_name <- gsub('.txt', '.pdf', input_name)
 7 | tested_covs <- read.table(input_name, col.names = c('cov', 'centrality'))
 8 | 
 9 | pdf(output_name)
10 |     plot(tested_covs[, 'cov'], tested_covs[, 'centrality'], xlab = 'Coverage', ylab = 'Centrality [(theoretical_center - actual_center) / coverage ]', pch = 20)
11 | dev.off()


--------------------------------------------------------------------------------
/exec/smudgeplot:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | import argparse
  4 | import sys
  5 | import os
  6 | from math import log
  7 | from math import ceil
  8 | import numpy as np
  9 | from scipy.signal import argrelextrema
 10 | 
 11 | version = '0.2.0'
 12 | 
 13 | class parser():
 14 |     def __init__(self):
 15 |         argparser = argparse.ArgumentParser(
 16 |             # description='Inference of ploidy and heterozygosity structure using whole genome sequencing data',
 17 |             usage='''smudgeplot <task> [options] \n
 18 | tasks: cutoff    Calculate meaningful values for lower/upper kmer histogram cutoff.
 19 |        hetkmers  Calculate unique kmer pairs from a Jellyfish or KMC dump file.
 20 |        plot      Generate 2d histogram; infere ploidy and plot a smudgeplot.\n\n''')
 21 |         argparser.add_argument('task', help='Task to execute; for task specific options execute smudgeplot <task> -h')
 22 |         argparser.add_argument('-v', '--version', action="store_true", default = False, help="print the version and exit")
 23 |         # print version is a special case
 24 |         if len(sys.argv) > 1:
 25 |             if sys.argv[1] in ['-v', '--version']:
 26 |                 self.task = "version"
 27 |                 return
 28 |             # the following line either prints help and die; or assign the name of task to variable task
 29 |             self.task = argparser.parse_args([sys.argv[1]]).task
 30 |         else:
 31 |             self.task = ""
 32 |         # if the task is known (i.e. defined in this file);
 33 |         if hasattr(self, self.task):
 34 |             # load arguments of that task
 35 |             getattr(self, self.task)()
 36 |         else:
 37 |             argparser.print_usage()
 38 |             print('"' + self.task + '" is not a valid task name')
 39 |             exit(1)
 40 | 
 41 |     def hetkmers(self):
 42 |         '''
 43 |         Calculate unique kmer pairs from a Jellyfish or KMC dump file.
 44 |         '''
 45 |         argparser = argparse.ArgumentParser(prog = 'smudgeplot hetkmers',
 46 |             description='Calculate unique kmer pairs from a Jellyfish or KMC dump file.')
 47 |         argparser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='Alphabetically sorted Jellyfish or KMC dump file (stdin).')
 48 |         argparser.add_argument('-o', help='The pattern used to name the output (kmerpairs).', default='kmerpairs')
 49 |         argparser.add_argument('-k', help='The length of the kmer.', default=21)
 50 |         argparser.add_argument('-t', help='Number of processes to use.', default = 4)
 51 |         argparser.add_argument('--middle', dest='middle', action='store_const', const = True, default = False,
 52 |                           help='Get all kmer pairs one SNP away from each other (default: just the middle one).')
 53 |         self.arguments = argparser.parse_args(sys.argv[2:])
 54 | 
 55 |     def plot(self):
 56 |         '''
 57 |         Generate 2d histogram; infer ploidy and plot a smudgeplot.
 58 |         '''
 59 |         argparser = argparse.ArgumentParser(prog = 'smudgeplot plot', description='Generate 2d histogram for smudgeplot')
 60 |         argparser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='name of the input tsv file with covarages (default \"coverages_2.tsv\")."')
 61 |         argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot')
 62 |         argparser.add_argument('-q', help='Remove kmer pairs with coverage over the specified quantile; (default none).', type=float, default=1)
 63 |         argparser.add_argument('-L', help='The lower boundary used when dumping kmers (default min(total_pair_cov) / 2).', type=int, default=0)
 64 |         argparser.add_argument('-n', help='The expected haploid coverage (default estimated from data).', type=int, default=0)
 65 |         argparser.add_argument('-t', '--title', help='name printed at the top of the smudgeplot (default none).', default='')
 66 |         argparser.add_argument('-m', '-method', help='The algorithm for annotation of smudges (default \'local_aggregation\')', default='local_aggregation')
 67 |         argparser.add_argument('-nbins', help='The number of nbins used for smudgeplot matrix (nbins x nbins) (default autodetection).', type=int, default=0)
 68 |         # argparser.add_argument('-k', help='The length of the kmer.', default=21)
 69 |         argparser.add_argument('-kmer_file', help='Name of the input files containing kmer seuqences (assuming the same order as in the coverage file)', default = "")
 70 |         argparser.add_argument('--homozygous', action="store_true", default = False, help="Assume no heterozygosity in the genome - plotting a paralog structure; (default False).")
 71 |         self.arguments = argparser.parse_args(sys.argv[2:])
 72 | 
 73 |     def cutoff(self):
 74 |         '''
 75 |         Calculate meaningful values for lower/upper kmer histogram cutoff.
 76 |         '''
 77 |         argparser = argparse.ArgumentParser(prog = 'smudgeplot cutoff', description='Calculate meaningful values for lower/upper kmer histogram cutoff.')
 78 |         argparser.add_argument('infile', type=argparse.FileType('r'), help='Name of the input kmer histogram file (default \"kmer.hist\")."')
 79 |         argparser.add_argument('boundary', help='Which bounary to compute L (lower, default) or U (upper)', default = 'L')
 80 |         self.arguments = argparser.parse_args(sys.argv[2:])
 81 | 
 82 | 
 83 | def round_up_nice(x):
 84 |     digits = ceil(log(x, 10))
 85 |     if digits <= 1:
 86 |         multiplier = 10 ** (digits - 1)
 87 |     else:
 88 |         multiplier = 10 ** (digits - 2)
 89 |     return(ceil(x / multiplier) * multiplier)
 90 | 
 91 | def cutoff(args):
 92 |     # kmer_hist = open("data/Mflo2/kmer.hist","r")
 93 |     kmer_hist = args.infile
 94 |     hist = np.array([int(line.split()[1]) for line in kmer_hist])
 95 |     if args.boundary == "L":
 96 |         local_minima = argrelextrema(hist, np.less)[0][0]
 97 |         L = max(10, int(round(local_minima * 1.25)))
 98 |         print(L, end = '')
 99 |     else:
100 |         # take 99.8 quantile of kmers that are more than one in the read set
101 |         hist_rel_cumsum = np.cumsum(hist[1:]) / np.sum(hist[1:])
102 |         U = round_up_nice(np.argmax(hist_rel_cumsum > 0.998))
103 |         print(U, end = '')
104 | 
105 | def main():
106 |     _parser = parser()
107 | 
108 |     print('Running smudgeplot v' + version)
109 |     if _parser.task == "version":
110 |         exit(0)
111 | 
112 |     print('Task: ' + _parser.task)
113 | 
114 |     if _parser.task == "cutoff":
115 |         cutoff(_parser.arguments)
116 | 
117 |     # if _parser.task == "hetkmers":
118 |     #     hetkmers(_parser.arguments)
119 |     #
120 |     # if _parser.task == "plot":
121 |     # call .R script
122 |     #     plot(_parser.arguments)
123 | 
124 |     print('Done!')
125 |     exit(0)
126 | 
127 | if __name__=='__main__':
128 |     main()


--------------------------------------------------------------------------------
/exec/smudgeplot.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python3
  2 | 
  3 | import argparse
  4 | import sys
  5 | import numpy as np
  6 | from pandas import read_csv # type: ignore
  7 | from pandas import DataFrame # type: ignore
  8 | from numpy import arange
  9 | from numpy import argmin
 10 | from numpy import concatenate
 11 | from os import system
 12 | from math import log
 13 | from math import ceil
 14 | from statistics import fmean
 15 | from collections import defaultdict
 16 | # import matplotlib as mpl
 17 | # import matplotlib.pyplot as plt
 18 | # from matplotlib.pyplot import plot
 19 | 
 20 | version = '0.4.0dev'
 21 | 
 22 | ############################
 23 | # processing of user input #
 24 | ############################
 25 | 
 26 | class parser():
 27 |     def __init__(self):
 28 |         argparser = argparse.ArgumentParser(
 29 |             # description='Inference of ploidy and heterozygosity structure using whole genome sequencing data',
 30 |             usage='''smudgeplot <task> [options] \n
 31 | tasks: cutoff           Calculate meaningful values for lower kmer histogram cutoff.
 32 |        hetmers          Calculate unique kmer pairs from a FastK k-mer database.
 33 |        peak_agregation  Agregates smudges using local agregation algorithm.
 34 |        plot             Generate 2d histogram; infere ploidy and plot a smudgeplot.
 35 |        all              Runs all the steps (with default options)\n\n''')
 36 |         # removing this for now;
 37 |         #        extract   Extract kmer pairs within specified coverage sum and minor covrage ratio ranges
 38 |         argparser.add_argument('task', help='Task to execute; for task specific options execute smudgeplot <task> -h')
 39 |         argparser.add_argument('-v', '--version', action="store_true", default = False, help="print the version and exit")
 40 |         # print version is a special case
 41 |         if len(sys.argv) > 1:
 42 |             if sys.argv[1] in ['-v', '--version']:
 43 |                 self.task = "version"
 44 |                 return
 45 |             # the following line either prints help and die; or assign the name of task to variable task
 46 |             self.task = argparser.parse_args([sys.argv[1]]).task
 47 |         else:
 48 |             self.task = ""
 49 |         # if the task is known (i.e. defined in this file);
 50 |         if hasattr(self, self.task):
 51 |             # load arguments of that task
 52 |             getattr(self, self.task)()
 53 |         else:
 54 |             argparser.print_usage()
 55 |             sys.stderr.write('"' + self.task + '" is not a valid task name\n')
 56 |             exit(1)
 57 | 
 58 |     def hetmers(self):
 59 |         '''
 60 |         Calculate unique kmer pairs from a Jellyfish or KMC dump file.
 61 |         '''
 62 |         argparser = argparse.ArgumentParser(prog = 'smudgeplot hetkmers',
 63 |             description='Calculate unique kmer pairs from FastK k-mer database.')
 64 |         argparser.add_argument('infile', nargs='?', help='Input FastK database (.ktab) file.')
 65 |         argparser.add_argument('-L', help='Count threshold below which k-mers are considered erroneous', type=int)
 66 |         argparser.add_argument('-t', help='Number of threads (default 4)', type=int, default=4)
 67 |         argparser.add_argument('-o', help='The pattern used to name the output (kmerpairs).', default='kmerpairs')
 68 |         argparser.add_argument('-tmp', help='Directory where all temporary files will be stored (default /tmp).', default='.')
 69 |         argparser.add_argument('--verbose', action="store_true", default = False, help='verbose mode')
 70 |         self.arguments = argparser.parse_args(sys.argv[2:])
 71 | 
 72 |     def plot(self):
 73 |         '''
 74 |         Generate 2d histogram; infer ploidy and plot a smudgeplot.
 75 |         '''
 76 |         argparser = argparse.ArgumentParser(prog = 'smudgeplot plot', description='Generate 2d histogram for smudgeplot')
 77 |         argparser.add_argument('infile', help='name of the input tsv file with covarages and frequencies')
 78 |         argparser.add_argument('smudgefile', help='name of the input tsv file with sizes of individual smudges')
 79 |         argparser.add_argument('n', help='The expected haploid coverage.', type=float)
 80 |         argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot')
 81 |         
 82 |         argparser = self.add_plotting_arguments(argparser)
 83 | 
 84 |         self.arguments = argparser.parse_args(sys.argv[2:])
 85 | 
 86 |     def cutoff(self):
 87 |         '''
 88 |         Calculate meaningful values for lower/upper kmer histogram cutoff.
 89 |         '''
 90 |         argparser = argparse.ArgumentParser(prog = 'smudgeplot cutoff', description='Calculate meaningful values for lower/upper kmer histogram cutoff.')
 91 |         argparser.add_argument('infile', type=argparse.FileType('r'), help='Name of the input kmer histogram file (default \"kmer.hist\")."')
 92 |         argparser.add_argument('boundary', help='Which bounary to compute L (lower) or U (upper)')
 93 |         self.arguments = argparser.parse_args(sys.argv[2:])
 94 | 
 95 |     def peak_agregation(self):
 96 |         '''
 97 |         Extract kmer pairs within specified coverage sum and minor covrage ratio ranges.
 98 |         '''
 99 |         argparser = argparse.ArgumentParser()
100 |         argparser.add_argument('infile', nargs='?', help='name of the input tsv file with covarages and frequencies.')
101 |         argparser.add_argument('-nf', '-noise_filter', help='Do not agregate into smudge k-mer pairs with frequency lower than this parameter', type=int, default=50)
102 |         argparser.add_argument('-d', '-distance', help='Manthattan distance of k-mer pairs that are considered neioboring for the local agregation purposes.', type=int, default=5)
103 |         argparser.add_argument('--mask_errors', help='instead of reporting assignments to individual smudges, just remove all monotonically decreasing points from the error line', action="store_true", default = False)
104 |         self.arguments = argparser.parse_args(sys.argv[2:])
105 | 
106 |     def all(self):
107 |         argparser = argparse.ArgumentParser()
108 |         argparser.add_argument('infile', nargs='?', help='name of the input tsv file with covarages and frequencies.')
109 |         argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot')
110 |         argparser.add_argument('-cov_min', help='Minimal coverage to explore (default 6)', default=6, type = int)
111 |         argparser.add_argument('-cov_max', help='Maximal coverage to explore (default 50)', default=60, type = int)
112 |         argparser.add_argument('-cov', help='Define coverage instead of infering it. Disables cov_min and cov_max.', default=0, type=int)
113 | 
114 |         argparser = self.add_plotting_arguments(argparser)
115 | 
116 |         self.arguments = argparser.parse_args(sys.argv[2:])
117 |     
118 |     def add_plotting_arguments(self, argparser):
119 |         argparser.add_argument('-c', '-cov_filter', help='Filter pairs with one of them having coverage bellow specified threshold (default 0; disables parameter L)', type=int, default=0)
120 |         argparser.add_argument('-t', '--title', help='name printed at the top of the smudgeplot (default none).', default='')
121 |         argparser.add_argument('-ylim', help='The upper limit for the coverage sum (the y axis)', type = int, default=0)
122 |         argparser.add_argument('-col_ramp', help='An R palette used for the plot (default "viridis", other sensible options are "magma", "mako" or "grey.colors" - recommended in combination with --invert_cols).', default='viridis')
123 |         argparser.add_argument('--invert_cols', action="store_true", default = False, help="Revert the colour palette (default False).")
124 |         return(argparser)
125 |     
126 |     def format_aguments_for_R_plotting(self):
127 |         plot_args = ""
128 |         if self.arguments.c != 0:
129 |             plot_args += " -c " + str(self.arguments.c)
130 |         if self.arguments.title:
131 |             plot_args += " -t \"" + self.arguments.title + "\""
132 |         if self.arguments.ylim != 0:
133 |             plot_args += " -ylim " + str(self.arguments.ylim)
134 |         if self.arguments.col_ramp:
135 |             plot_args += " -col_ramp \"" + self.arguments.col_ramp + "\""
136 |         if self.arguments.invert_cols:
137 |             plot_args += " --invert_cols"
138 |         return(plot_args)
139 | 
140 | ###############
141 | # task cutoff #
142 | ###############
143 | 
144 | # taken from https://stackoverflow.com/a/29614335
145 | def local_min(ys):
146 |     return [i for i, y in enumerate(ys)
147 |             if ((i == 0) or (ys[i - 1] >= y))
148 |             and ((i == len(ys) - 1) or (y < ys[i+1]))]
149 | 
150 | def round_up_nice(x):
151 |     digits = ceil(log(x, 10))
152 |     if digits <= 1:
153 |         multiplier = 10 ** (digits - 1)
154 |     else:
155 |         multiplier = 10 ** (digits - 2)
156 |     return(ceil(x / multiplier) * multiplier)
157 | 
158 | def cutoff(args):
159 |     # kmer_hist = open("data/Scer/kmc_k31.hist","r")
160 |     kmer_hist = args.infile
161 |     hist = [int(line.split()[1]) for line in kmer_hist]
162 |     if args.boundary == "L":
163 |         local_minima = local_min(hist)[0]
164 |         L = max(10, int(round(local_minima * 1.25)))
165 |         sys.stdout.write(str(L))
166 |     else:
167 |         sys.stderr.write('Warning: We discourage using the original hetmer algorithm.\n\tThe updated (recommended) version does not take the argument U\n')
168 |         # take 99.8 quantile of kmers that are more than one in the read set
169 |         number_of_kmers = sum(hist[1:])
170 |         hist_rel_cumsum = [sum(hist[1:i+1]) / number_of_kmers for i in range(1, len(hist))]
171 |         min(range(len(hist_rel_cumsum))) 
172 |         U = round_up_nice(min([i for i, q in enumerate(hist_rel_cumsum) if q > 0.998]))
173 |         sys.stdout.write(str(U))
174 |     sys.stdout.flush()
175 | 
176 | ########################
177 | # task peak_agregation #
178 | ########################
179 | 
180 | def load_hetmers(smufile):
181 |     cov_tab = read_csv(smufile, names = ['covB', 'covA', 'freq'], sep='\t')
182 |     cov_tab = cov_tab.sort_values('freq', ascending = False)
183 |     return(cov_tab)
184 | 
185 | def local_agregation(cov_tab, distance, noise_filter, mask_errors):
186 |     # generate a dictionary that gives us for each combination of coverages a frequency
187 |     cov2freq = defaultdict(int)
188 |     cov2peak = defaultdict(int)
189 | 
190 |     L = min(cov_tab['covB']) # important only when --mask_errors is on
191 | 
192 |     ### clustering
193 |     next_peak = 1
194 |     for idx, covB, covA, freq in cov_tab.itertuples():
195 |         cov2freq[(covA, covB)] = freq # a make a frequency dictionary on the fly, because I don't need any value that was not processed yet
196 |         if freq < noise_filter:
197 |             break
198 |         highest_neigbour_coords = (0, 0)
199 |         highest_neigbour_freq = 0
200 |         # for each kmer pair I will retrieve all neibours (Manhattan distance)
201 |         for xA in range(covA - distance,covA + distance + 1):
202 |             # for explored A coverage in neiborhood, we explore all possible B coordinates
203 |             distanceB = distance - abs(covA - xA)
204 |             for xB in range(covB - distanceB,covB + distanceB + 1):
205 |                 xB, xA = sorted([xA, xB]) # this is to make sure xB is smaller than xA
206 |                 # iterating only though those that were assigned already
207 |                 # and recroding only the one with highest frequency
208 |                 if cov2peak[(xA, xB)] and cov2freq[(xA, xB)] > highest_neigbour_freq:
209 |                     highest_neigbour_coords = (xA, xB)
210 |                     highest_neigbour_freq = cov2freq[(xA, xB)]
211 |         if highest_neigbour_freq > 0:
212 |             cov2peak[(covA, covB)] = cov2peak[(highest_neigbour_coords)]
213 |         else:
214 |             # print("new peak:", (covA, covB))
215 |             if mask_errors:
216 |                 if covB < L + distance:
217 |                     cov2peak[(covA, covB)] = 1 # error line
218 |                 else:
219 |                     cov2peak[(covA, covB)] = 0 # central smudges
220 |             else:
221 |                 cov2peak[(covA, covB)] = next_peak # if I want to keep info about all locally agregated smudges
222 |                 next_peak += 1
223 |     return(cov2peak)
224 | 
225 | def peak_agregation(args):
226 |     ### load data
227 |     cov_tab = load_hetmers(args.infile)
228 | 
229 |     cov2peak = local_agregation(cov_tab, args.d, args.nf, mask_errors = False)
230 | 
231 |     cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True)
232 |     for idx, covB, covA, freq in cov_tab.itertuples():
233 |         peak = cov2peak[(covA, covB)]
234 |         sys.stdout.write(f"{covB}\t{covA}\t{freq}\t{peak}\n")
235 |     sys.stdout.flush()
236 | 
237 | def get_smudge_container(cov_tab, cov, smudge_filter):
238 |     smudge_container = dict()
239 |     genomic_cov_tab = cov_tab[cov_tab['peak'] == 0] # this removed all the marked errors
240 |     total_kmer_pairs = sum(genomic_cov_tab['freq'])
241 | 
242 |     for Bs in range(1,9):
243 |         min_cov = 0 if Bs == 1 else cov * (Bs - 0.5)
244 |         max_cov = cov * (Bs + 0.5)
245 |         cov_tab_isoB = genomic_cov_tab.loc[(genomic_cov_tab["covB"] > min_cov) & (genomic_cov_tab["covB"] < max_cov)] #  
246 | 
247 |         for As in range(Bs,(17 - Bs)):
248 |             min_cov = 0 if As == 1 else cov * (As - 0.5)
249 |             max_cov = cov * (As + 0.5)
250 |             cov_tab_iso_smudge = cov_tab_isoB.loc[(cov_tab_isoB["covA"] > min_cov) & (cov_tab_isoB["covA"] < max_cov)]
251 |             if sum(cov_tab_iso_smudge['freq']) / total_kmer_pairs > smudge_filter:
252 |                 # sys.stderr.write(f"{As}A{Bs}B: {sum(cov_tab_iso_smudge['freq']) / total_kmer_pairs}\n")
253 |                 smudge_container["A" * As + "B" * Bs] = cov_tab_iso_smudge
254 |     return(smudge_container)
255 | 
256 | def get_centrality(smudge_container, cov):
257 |     centralities = list()
258 |     freqs = list()
259 |     for smudge in smudge_container.keys():
260 |         As = smudge.count('A')
261 |         Bs = smudge.count('B')
262 |         smudge_tab = smudge_container[smudge]
263 |         kmer_in_the_smudge = sum(smudge_tab['freq'])
264 |         freqs.append(kmer_in_the_smudge)
265 |         # center as a a mean
266 |         # center_A = sum((smudge_tab['freq'] * smudge_tab['covA'])) / kmer_in_the_smudge
267 |         # center_B = sum((smudge_tab['freq'] * smudge_tab['covB'])) / kmer_in_the_smudge
268 |         # center as a mode 
269 |         center = smudge_tab.loc[smudge_tab['freq'].idxmax()]
270 |         center_A = center['covA']
271 |         center_B = center['covB']
272 |         ## emprical to edge
273 |         # distA = min([abs(smudge_tab['covA'].max() - center['covA']), abs(center['covA'] - smudge_tab['covA'].min())])
274 |         # distB = min([abs(smudge_tab['covB'].max() - center['covB']), abs(center['covB'] - smudge_tab['covB'].min())])
275 |         ## theoretical to edge
276 |         # distA = min(abs(center['covA'] - (cov * (As - 0.5))), abs((cov * (As + 0.5)) - center['covA']))
277 |         # distB = min(abs(center['covB'] - (cov * (Bs - 0.5))), abs((cov * (Bs + 0.5)) - center['covB']))
278 |         ## theoretical relative distance to the center
279 |         distA = abs((center_A - (cov * As)) / cov)
280 |         distB = abs((center_B - (cov * Bs)) / cov)
281 | 
282 |         # sys.stderr.write(f"Processing: {As}A{Bs}B; with center: {distA}, {distB}\n")
283 |         centrality = distA + distB
284 |         centralities.append(centrality)
285 | 
286 |     if len(centralities) == 0:
287 |         return(1)
288 |     return(fmean(centralities, weights=freqs))
289 | 
290 | def test_coverage_range(cov_tab, min_c, max_c, smudge_size_cutoff = 0.02):
291 |     # covs_to_test = range(min_c, max_c)
292 |     covs_to_test = arange(min_c + 0.05, max_c + 0.05, 2)
293 |     cov_centralities = list()
294 |     for cov in covs_to_test:
295 |         smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff)
296 |         cov_centralities.append(get_centrality(smudge_container, cov))
297 | 
298 |     best_coverage = covs_to_test[argmin(cov_centralities)]
299 | 
300 |     tenths_to_test = arange(best_coverage - 1.9, best_coverage + 1.9, 0.2)
301 |     tenths_centralities = list()
302 |     for cov in tenths_to_test:
303 |         smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff)
304 |         tenths_centralities.append(get_centrality(smudge_container, cov))
305 | 
306 |     best_tenth = tenths_to_test[argmin(tenths_centralities)]
307 |     sys.stderr.write(f"Best coverage to precsion of one tenth: {round(best_tenth, 2)}\n")  
308 | 
309 |     hundredths_to_test = list(arange(best_tenth - 0.19, best_tenth + 0.19, 0.01))
310 |     hundredths_centralities = list()
311 |     for cov in hundredths_to_test:
312 |         smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff)
313 |         hundredths_centralities.append(get_centrality(smudge_container, cov))
314 | 
315 |     final_cov = hundredths_to_test[argmin(hundredths_centralities)]
316 |     just_to_be_sure_cov = final_cov/2
317 | 
318 |     hundredths_to_test.append(just_to_be_sure_cov)
319 |     smudge_container = get_smudge_container(cov_tab, just_to_be_sure_cov, smudge_size_cutoff)
320 |     hundredths_centralities.append(get_centrality(smudge_container, just_to_be_sure_cov))
321 | 
322 |     final_cov = hundredths_to_test[argmin(hundredths_centralities)]
323 |     sys.stderr.write(f"Best coverage to precision of one hundreth: {round(final_cov, 3)}\n")  
324 | 
325 |     all_coverages = concatenate((covs_to_test, tenths_to_test, hundredths_to_test))
326 |     all_centralities = concatenate((cov_centralities, tenths_centralities, hundredths_centralities))
327 | 
328 |     return(DataFrame({'coverage': all_coverages, 'centrality': all_centralities}))
329 | 
330 | #####################
331 | # the script itself #
332 | #####################
333 | 
334 | def main():
335 |     _parser = parser()
336 | 
337 |     sys.stderr.write('Running smudgeplot v' + version + "\n")
338 |     if _parser.task == "version":
339 |         exit(0)
340 | 
341 |     sys.stderr.write('Task: ' + _parser.task + "\n")
342 | 
343 |     if _parser.task == "cutoff":
344 |         cutoff(_parser.arguments)
345 | 
346 |     if _parser.task == "hetmers":
347 |         # PloidyPlot is expected ot be installed in the system as well as the R library supporting it
348 |         args = _parser.arguments
349 |         plot_args = " -o" + str(args.o)
350 |         plot_args += " -e" + str(args.L)
351 |         plot_args += " -T" + str(args.t)
352 |         if args.verbose:
353 |             plot_args += " -v"
354 |         if args.tmp != '.':
355 |             plot_args += " -P" + args.tmp
356 |         plot_args += " " + args.infile
357 | 
358 |         sys.stderr.write("Calling: hetmers (PloidyPlot kmer pair search) " + plot_args + "\n")
359 |         system("hetmers " + plot_args)
360 | 
361 |     if _parser.task == "plot":
362 |         # the plotting script is expected ot be installed in the system as well as the R library supporting it
363 |         args = _parser.arguments
364 |         
365 |         plot_args = f'-i "{args.infile}" -s "{args.smudgefile}" -n {args.n} -o "{args.o}" ' + _parser.format_aguments_for_R_plotting()
366 | 
367 |         sys.stderr.write("Calling: smudgeplot_plot.R " + plot_args + "\n")
368 |         system("smudgeplot_plot.R " + plot_args)
369 | 
370 |     if _parser.task == "peak_agregation":
371 |         peak_agregation(_parser.arguments)
372 | 
373 |     if _parser.task == "all":
374 |         args = _parser.arguments
375 | 
376 |         sys.stderr.write("\nLoading data\n")
377 |         cov_tab = load_hetmers(args.infile)
378 |         # cov_tab = load_hetmers("data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt")
379 | 
380 |         sys.stderr.write("\nMasking errors using local agregation algorithm\n")
381 |         cov2peak = local_agregation(cov_tab, distance = 1, noise_filter = 1000, mask_errors = True)
382 |         cov_tab['peak'] = [cov2peak[(covA, covB)] for idx, covB, covA, freq in cov_tab.itertuples()]
383 | 
384 |         cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True)
385 |         total_kmers = sum(cov_tab['freq'])
386 |         genomic_kmers = sum(cov_tab[cov_tab['peak'] == 0]['freq'])
387 |         total_error_kmers = sum(cov_tab[cov_tab['peak'] == 1]['freq'])
388 |         error_fraction = total_error_kmers / total_kmers
389 |         sys.stderr.write(f"Total kmers: {total_kmers}\n\tGenomic kmers: {genomic_kmers}\n\tSequencing errors: {total_error_kmers}\n\tFraction or errors: {round(total_error_kmers/total_kmers, 3)}")
390 | 
391 |         with open(args.o + "_masked_errors_smu.txt", 'w') as error_annotated_smu:
392 |             error_annotated_smu.write("covB\tcovA\tfreq\tis_error\n")
393 |             for idx, covB, covA, freq, is_error in cov_tab.itertuples():
394 |                 error_annotated_smu.write(f"{covB}\t{covA}\t{freq}\t{is_error}\n") # might not be needed
395 | 
396 |         sys.stderr.write("\nInfering 1n coverage using grid algorihm\n")
397 | 
398 |         smudge_size_cutoff = 0.001 # this is % of all k-mer pairs smudge needs to have to be considered a valid smudge
399 | 
400 |         if args.cov == 0: # not specified user coverage
401 |             centralities = test_coverage_range(cov_tab, args.cov_min, args.cov_max, smudge_size_cutoff)
402 |             np.savetxt(args.o + "_centralities.txt", np.around(centralities, decimals=6), fmt="%.4f", delimiter = '\t')
403 |             # plot(centralities['coverage'], centralities['coverage'])
404 | 
405 |             if error_fraction < 0.75:
406 |                 cov = centralities['coverage'][argmin(centralities['centrality'])]
407 |             else:
408 |                 sys.stderr.write(f"Too many errors observed: {error_fraction}, not trusting coverage inference\n")
409 |                 cov = 0
410 | 
411 |             sys.stderr.write(f"\nInferred coverage: {cov}\n")
412 |         else:
413 |             cov = args.cov
414 | 
415 |         final_smudges = get_smudge_container(cov_tab, cov, 0.03)
416 |         # sys.stderr.write(str(final_smudges) + '\n')
417 | 
418 |         annotated_smudges = list(final_smudges.keys())
419 |         smudge_sizes = [round(sum(final_smudges[smudge]['freq']) / genomic_kmers, 4) for smudge in annotated_smudges]
420 |         
421 |         sys.stderr.write(f'Detected smudges / sizes ({args.o} + "_smudge_sizes.txt):"\n')
422 |         sys.stderr.write('\t' + str(annotated_smudges) + '\n')
423 |         sys.stderr.write('\t' + str(smudge_sizes) + '\n')
424 | 
425 |         # saving smudge sizes
426 |         smudge_table = DataFrame({'smudge': annotated_smudges, 'size': smudge_sizes})
427 |         np.savetxt(args.o + "_smudge_sizes.txt", smudge_table, fmt='%s', delimiter = '\t')
428 | 
429 |         sys.stderr.write("\nPlotting\n")
430 | 
431 |         system("centrality_plot.R " + args.o + "_centralities.txt")
432 |         # Rscript playground/alternative_fitting/alternative_plotting_testing.R -i data/dicots/peak_agregation/$ToLID.cov_tab_peaks -o data/dicots/peak_agregation/$ToLID
433 |         args = _parser.arguments
434 |         
435 |         plot_args = f'-i "{args.o}_masked_errors_smu.txt" -s "{args.o}_smudge_sizes.txt" -n {round(cov, 3)} -o "{args.o}" ' + _parser.format_aguments_for_R_plotting()
436 | 
437 |         sys.stderr.write("Calling: smudgeplot_plot.R " + plot_args + "\n")
438 |         system("smudgeplot_plot.R " + plot_args) 
439 | 
440 |     sys.stderr.write("\nDone!\n")
441 |     exit(0)
442 | 
443 | if __name__=='__main__':
444 |     main()
445 | 


--------------------------------------------------------------------------------
/exec/smudgeplot_plot.R:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env Rscript
  2 | 
  3 | suppressPackageStartupMessages(library("methods"))
  4 | suppressPackageStartupMessages(library("argparse"))
  5 | suppressPackageStartupMessages(library("viridis"))
  6 | 
  7 | # suppressPackageStartupMessages(library("smudgeplot"))
  8 | 
  9 | #################
 10 | ### funcitons ###
 11 | #################
 12 | # retirying the smudgeplot R package
 13 | get_col_ramp <- function(.args, delay = 0){
 14 |     colour_ramp <- eval(parse(text = paste0(.args$col_ramp,"(", 32 - delay, ")")))
 15 |     if (.args$invert_cols){
 16 |         colour_ramp <- rev(colour_ramp)
 17 |     }
 18 |     colour_ramp <- c(rep(colour_ramp[1], delay), colour_ramp)
 19 |     return(colour_ramp)
 20 | }
 21 | 
 22 | wtd.quantile <- function(x, q=0.25, weight=NULL) {
 23 |   o <- order(x)
 24 |   n <- sum(weight)
 25 |   order <- 1 + (n - 1) * q
 26 |   low  <- pmax(floor(order), 1)
 27 |   high <- pmin(ceiling(order), n)
 28 |   low_contribution <- high - order
 29 |   allq <- approx(x=cumsum(weight[o])/sum(weight), y=x[o], xout = c(low, high)/n, method = "constant",
 30 |       f = 1, rule = 2)$y
 31 |   low_contribution * allq[1] + (1 - low_contribution) * allq[2]
 32 | }
 33 | 
 34 | wtd.iqr <- function(x, w=NULL) {
 35 |   wtd.quantile(x, q=0.75, weight=w) - wtd.quantile(x, q=0.25, weight=w)
 36 | }
 37 | 
 38 | plot_alt <- function(cov_tab, ylim, colour_ramp, log = F){
 39 |     A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB']
 40 |     cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2
 41 |     if (log){
 42 |         cov_tab[, 'freq'] <- log10(cov_tab[, 'freq'])
 43 |     }
 44 |     cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))]
 45 | 
 46 |     # c(bottom, left, top, right)
 47 |     par(mar=c(4.8,4.8,1,1))
 48 |     plot(NULL, xlim = c(0, 0.5), ylim = ylim,
 49 |          xlab = 'Normalized minor kmer coverage: B / (A + B)',
 50 |          ylab = 'Total coverage of the kmer pair: A + B', cex.lab = 1.4, bty = 'n')
 51 |     min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov']))
 52 |     nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab)
 53 |     return(0)
 54 | }
 55 | 
 56 | plot_one_coverage <- function(cov, cov_tab){
 57 |     cov_row_to_plot <- cov_tab[cov_tab[, 'total_pair_cov'] == cov, ]
 58 |     width <- 1 / (2 * cov)
 59 |     cov_row_to_plot$left <- cov_row_to_plot[, 'minor_variant_rel_cov'] - width
 60 |     cov_row_to_plot$right <- sapply(cov_row_to_plot[, 'minor_variant_rel_cov'], function(x){ min(0.5, x + width)})
 61 |     apply(cov_row_to_plot, 1, plot_one_box, cov)
 62 | }
 63 | 
 64 | plot_one_box <- function(one_box_row, cov){
 65 |     left <- as.numeric(one_box_row['left'])
 66 |     right <- as.numeric(one_box_row['right'])
 67 |     rect(left, cov - 0.5, right, cov + 0.5, col = one_box_row['col'], border = NA)
 68 | }
 69 | 
 70 | plot_isoA_line <- function (.covA, .L, .col = "black", .ymax = 250, .lwd, .lty) {
 71 |     min_covB <- .L # min(.cov_tab[, 'covB']) # should be L really
 72 |     max_covB <- .covA
 73 |     B_covs <- seq(min_covB, max_covB, length = 500)
 74 |     isoline_x <- B_covs/ (B_covs + .covA)
 75 |     isoline_y <- B_covs + .covA
 76 |     lines(isoline_x[isoline_y < .ymax], isoline_y[isoline_y < .ymax], lwd = .lwd, lty = .lty, col = .col)
 77 | }
 78 | 
 79 | plot_isoB_line <- function (.covB, .ymax, .col = "black", .lwd, .lty) {
 80 |     cov_range <- seq((2 * .covB) - 2, .ymax, length = 500)
 81 |     lines((.covB)/cov_range, cov_range, lwd = .lwd, lty = .lty, col = .col)
 82 | }
 83 | 
 84 | plot_iso_grid <- function(.cov, .L, .ymax, .col = 'black', .lwd = 2, .lty = 2){
 85 |     for (i in 0:15){
 86 |         cov <- (i + 0.5) * .cov
 87 |         plot_isoA_line(cov, .L = .L, .ymax = .ymax, .col, .lwd = .lwd, .lty = .lty)
 88 |         if (i < 8){
 89 |             plot_isoB_line(cov, .ymax, .col, .lwd = .lwd, .lty = .lty)
 90 |         }
 91 |     }
 92 | }
 93 | 
 94 | plot_expected_haplotype_structure <- function(.n, .peak_sizes,
 95 |                                               .adjust = F, .cex = 1.3, xmax = 0.49){
 96 |     .peak_sizes <- .peak_sizes[.peak_sizes[, 'size'] > 0.05, ]
 97 |     .peak_sizes[, 'ploidy'] <- nchar(.peak_sizes[, 'structure'])
 98 | 
 99 |     decomposed_struct <- strsplit(.peak_sizes[, 'structure'], '')
100 |     .peak_sizes[, 'corrected_minor_variant_cov'] <- sapply(decomposed_struct, function(x){ sum(x == 'B') } ) / .peak_sizes[, 'ploidy']
101 |     .peak_sizes[, 'label'] <- reduce_structure_representation(.peak_sizes[, 'structure'])
102 | 
103 |     borercases <- .peak_sizes$corrected_minor_variant_cov == 0.5
104 | 
105 |     for(i in 1:nrow(.peak_sizes)){
106 |         # xmax is in the middle of the last square in the 2d histogram,
107 |         # which is too far from the edge, so I average it with 0.49
108 |         # witch will pull the label bit to the edge
109 |         text( ifelse( borercases[i] & .adjust, (xmax + 0.49) / 2, .peak_sizes$corrected_minor_variant_cov[i]),
110 |              .peak_sizes$ploidy[i] * .n, .peak_sizes[i, 'label'],
111 |              offset = 0, cex = .cex, xpd = T, pos = ifelse( borercases[i] & .adjust, 2, 1))
112 |     }
113 | }
114 | 
115 | reduce_structure_representation <- function(smudge_labels){
116 |     structures_to_adjust <- (sapply(smudge_labels, nchar) > 4)
117 | 
118 |     if ( any(structures_to_adjust) ) {
119 |         decomposed_struct <- strsplit(smudge_labels[structures_to_adjust], '')
120 |         As <- sapply(decomposed_struct, function(x){ sum(x == 'A') } )
121 |         Bs <- sapply(decomposed_struct, length) - As
122 |         smudge_labels[structures_to_adjust] <- paste0(As, 'A', Bs, 'B')
123 |     }
124 |     return(smudge_labels)
125 | }
126 | 
127 | plot_legend <- function(kmer_max, .colour_ramp, .log_scale = T){
128 |     par(mar=c(0,0,2,1))
129 |     plot.new()
130 |     print_title <- ifelse(.log_scale, 'log kmers pairs', 'kmers pairs')
131 |     title(print_title)
132 |     for(i in 1:32){
133 |         rect(0,(i - 0.01) / 33, 0.5, (i + 0.99) / 33, col = .colour_ramp[i])
134 |     }
135 |     # kmer_max <- max(smudge_container$dens)
136 |     if( .log_scale == T ){
137 |         for(i in 0:6){
138 |             text(0.75, i / 6, rounding(10^(log10(kmer_max) * i / 6)), offset = 0)
139 |         }
140 |     } else {
141 |         for(i in 0:6){
142 |             text(0.75, i / 6, rounding(kmer_max * i / 6), offset = 0)
143 |         }
144 |     }
145 | }
146 | 
147 | rounding <- function(number){
148 |     if(number > 1000){
149 |         round(number / 1000) * 1000
150 |     } else if (number > 100){
151 |         round(number / 100) * 100
152 |     } else {
153 |         round(number / 10) * 10
154 |     }
155 | }
156 | 
157 | #############
158 | ## SETTING ##
159 | #############
160 | 
161 | parser <- ArgumentParser()
162 | parser$add_argument("--homozygous", action="store_true", default = F,
163 |                     help="Assume no heterozygosity in the genome - plotting a paralog structure; [default FALSE]")
164 | parser$add_argument("-i", "--input", default = "*_smu.txt",
165 |                     help="name of the input tsv file with covarages [default \"*_smu.txt\"]")
166 | parser$add_argument("-s", "--smudges", default = "not_specified",
167 |                     help="name of the input tsv file with annotated smudges and their respective sizes")
168 | parser$add_argument("-o", "--output", default = "smudgeplot",
169 |                     help="name pattern used for the output files (OUTPUT_smudgeplot.png, OUTPUT_summary.txt, OUTPUT_warrnings.txt) [default \"smudgeplot\"]")
170 | parser$add_argument("-t", "--title",
171 |                     help="name printed at the top of the smudgeplot [default none]")
172 | parser$add_argument("-q", "--quantile_filt", type = "double",
173 |                     help="Remove kmer pairs with coverage over the specified quantile; [default none]")
174 | parser$add_argument("-n", "--n_cov", type = "double",
175 |                     help="the haploid coverage of the sequencing data [default inference from data]")
176 | parser$add_argument("-c", "-cov_filter", type = "integer",
177 |                     help="Filter pairs with one of them having coverage bellow specified threshold [default 0]")
178 | parser$add_argument("-ylim", type = "integer", 
179 |                     help="The upper limit for the coverage sum (the y axis)")
180 | parser$add_argument("-col_ramp", default = "viridis",
181 |                     help="A colour ramp available in your R session [viridis]")
182 | parser$add_argument("--invert_cols", action="store_true", default = F,
183 |                     help="Set this flag to invert colorus of Smudgeplot (dark for high, light for low densities)")
184 |  
185 | args <- parser$parse_args()
186 | 
187 | colour_ramp_log <- get_col_ramp(args, 16) # create palette for the log plots
188 | colour_ramp <- get_col_ramp(args) # create palette for the linear plots
189 | 
190 | if ( !file.exists(args$input) ) {
191 |     stop("The input file not found. Please use --help to get help", call.=FALSE)
192 | }
193 | 
194 | cov_tab <- read.table(args$input, header = T) # col.names = c('covB', 'covA', 'freq', 'is_error'), 
195 | smudge_tab <- read.table(args$smudges, col.names = c('structure', 'size'))
196 | 
197 | # total covarate of the kmer pair
198 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 'covA'] + cov_tab[, 'covB']
199 | # calcualte relative coverage of the minor allele
200 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 'covB'] / cov_tab[, 'total_pair_cov']
201 | 
202 | ##### coverage filtering
203 | 
204 | if ( !is.null(args$c) ){
205 |     threshold <- args$c
206 |     low_cov_filt <- cov_tab[, 'covA'] < threshold | cov_tab[, 'covB'] < threshold 
207 |     # smudge_warn(args$output, "Removing", sum(cov_tab[low_cov_filt, 'freq']), 
208 |     #             "kmer pairs for which one of the pair had coverage below",
209 |     #             threshold, paste0("(Specified by argument -c ", args$c, ")"))
210 |     cov_tab <- cov_tab[!low_cov_filt, ]
211 |     # smudge_warn(args$output, "Processing", sum(cov_tab[, 'freq']), "kmer pairs")    
212 | }
213 | 
214 | ##### quantile filtering
215 | if ( !is.null(args$q) ){
216 |     # quantile filtering (remove top q%, it's not really informative)    
217 |     threshold <- wtd.quantile(cov_tab[, 'total_pair_cov'], args$q, cov_tab[, 'freq'])
218 |     high_cov_filt <- cov_tab[, 'total_pair_cov'] < threshold
219 |     # smudge_warn(args$output, "Removing", sum(cov_tab[!high_cov_filt, 'freq']), 
220 |     #             "kmer pairs with coverage higher than",
221 |     #             threshold, paste0("(", args$q, " quantile)"))
222 |     cov_tab <- cov_tab[high_cov_filt, ]
223 | }
224 | 
225 | cov <- args$n_cov
226 | if (cov == wtd.quantile(cov_tab[, 'total_pair_cov'], 0.95, cov_tab[, 'freq'])){
227 |     ylim <- c(min(cov_tab[, 'total_pair_cov']), max(cov_tab[, 'total_pair_cov']))
228 | } else {
229 |     ylim <- c(min(cov_tab[, 'total_pair_cov']) - 1, # or 0?
230 |           min(max(100, 10*cov), max(cov_tab[, 'total_pair_cov'])))
231 | }
232 | 
233 | xlim <- c(0, 0.5)
234 | error_fraction <- sum(cov_tab[, 'is_error'] * cov_tab[, 'freq']) / sum(cov_tab[, 'freq']) * 100
235 | error_string <- paste("err =", round(error_fraction, 1), "%")
236 | cov_string <- paste0("1n = ", cov)
237 | 
238 | if (!is.null(args$ylim)){ # if ylim is specified, set the boundary by the argument instead
239 |     ylim[2] <- args$ylim
240 | }
241 | 
242 | fig_title <- ifelse(length(args$title) == 0, NA, args$title[1])
243 | # histogram_bins = max(30, args$nbins)
244 | 
245 | ##########
246 | # LINEAR #
247 | ##########
248 | pdf(paste0(args$output,'_smudgeplot.pdf'))
249 | 
250 | # layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
251 | layout(matrix(c(4,2,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
252 | # 1 smudge plot
253 | plot_alt(cov_tab, ylim, colour_ramp_log)
254 | if (cov > 0){
255 |     plot_expected_haplotype_structure(cov, smudge_tab, T, xmax = 0.49)
256 | }
257 | 
258 | 
259 | # 4 legend
260 | plot_legend(max(cov_tab[, 'freq']), colour_ramp, F)
261 | 
262 | ### add annotation
263 | # print smudge sizes
264 | plot.new()
265 | if (cov > 0){
266 |     legend('topleft', bty = 'n', reduce_structure_representation(smudge_tab[,'structure']), cex = 1.1)
267 |     legend('top', bty = 'n', legend = round(smudge_tab[,2], 2), cex = 1.1)
268 |     legend('bottomleft', bty = 'n', legend = c(cov_string, error_string), cex = 1.1)
269 | } else {
270 |     legend('bottomleft', bty = 'n', legend = error_string, cex = 1.1)
271 | }
272 | 
273 | plot.new()
274 | mtext(bquote(italic(.(fig_title))), side=3, adj=0.1, line=-3, cex = 1.6)
275 | 
276 | 
277 | dev.off()
278 | 
279 | ############
280 | # log plot #
281 | ############
282 | 
283 | pdf(paste0(args$output,'_smudgeplot_log10.pdf'))
284 | 
285 | layout(matrix(c(4,2,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
286 | # cov_tab[, 'freq'] <- log10(cov_tab[, 'freq'])
287 | # 1 smudge plot
288 | plot_alt(cov_tab, ylim, colour_ramp_log, log = T)
289 | 
290 | if (cov > 0){
291 |     plot_expected_haplotype_structure(cov, smudge_tab, T, xmax = 0.49)
292 | }
293 | 
294 | # 4 legend
295 | plot_legend(max(cov_tab[, 'freq']), colour_ramp_log, T)
296 | 
297 | # print smudge sizes
298 | plot.new()
299 | if (cov > 0){
300 |     legend('topleft', bty = 'n', reduce_structure_representation(smudge_tab[,'structure']), cex = 1.1)
301 |     legend('top', bty = 'n', legend = round(smudge_tab[,2], 2), cex = 1.1)
302 |     legend('bottomleft', bty = 'n', legend = c(cov_string, error_string), cex = 1.1)
303 | } else {
304 |     legend('bottomleft', bty = 'n', legend = error_string, cex = 1.1)
305 | }
306 | 
307 | 
308 | plot.new()
309 | mtext(bquote(italic(.(fig_title))), side=3, adj=0.1, line=-3, cex = 1.6)
310 | 
311 | dev.off()


--------------------------------------------------------------------------------
/playground/BGA_tutorial.md:
--------------------------------------------------------------------------------
 1 | ## Smudgeplot
 2 | 
 3 | Have you ever sequenced something not-well studied? Something that might show strange genomic signatures? Smudgeplot is a visualisation technique for whole-genome sequencing reads from a single individual. The visualisation techique is based on the idea of het-mers. Het-mers are k-mer pairs that are exactly one nucleotide pair away from each other, while forming a unique pair in the sequencing dataset. These k-mers are assumed to be mostly representing two alleles of a heterozygous, but potentially can also show pairing of imperfect paralogs, or sequencing errors paired up with a homozygous genomic k-mer. Nevertheless, the predicted ploidy by smudgeplot is simply the ploidy with the highest number of k-mer pairs (if a reasonable estimate must be evaluated for each individual case!).
 4 | 
 5 | 
 6 | 
 7 | ### Installing the software
 8 | 
 9 | Open gitpod. And install the development version of smudgeplot (branch sploidyplot) & FastK. 
10 | 
11 | ```
12 | mkdir src bin && cd src # create directories for source code & binaries
13 | git clone -b sploidyplot https://github.com/KamilSJaron/smudgeplot
14 | git clone https://github.com/thegenemyers/FastK
15 | ```
16 | 
17 | Now smudgeplot make install smudgeplot R package, compiles the C kernel for searching for k-mer pairs and copy all the executables to `workspace/bin/` (which will be our dedicated spot for executables). 
18 | 
19 | ```
20 | cd smudgeplot && make -s INSTALL_PREFIX=/workspace && cd ..
21 | cd FastK && make FastK Histex
22 | install -c Histex FastK /workspace/bin/
23 | ```
24 | 
25 | 
26 | ** 8 Datasets **
27 | 
28 | Species name	SRA/ENA ID
29 | Pseudoloma neurophilia	SRR926312
30 | Tubulinosema ratisbonensis	ERR3154977
31 | Nosema ceranae	SRR17317293
32 | Nematocida ausubeli	SRR058692
33 | Nematocida ausubeli	SRR350188
34 | Hamiltosporidium tvaerminnensis	SRR16954898
35 | Encephalitozoon hellem	SRR14017862
36 | Agmasoma penaei	SRR926341
37 | 
38 | TODO: get them urls
39 | 
40 | Finally, if your session is running; Start downloading the data; For example:
41 | 
42 | ```
43 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR926/SRR926341/SRR926341_[12].fastq.gz
44 | ```
45 | 
46 | ### Constructing a database
47 | 
48 | The whole process operates with raw, or trimmed sequencing reads. From those we generate a k-mer database using [FastK](https://github.com/thegenemyers/FASTK). FastK is currently the fastest k-mer counter out there and the only supported by the lastest version of smudgeplot*. This database contains an index of all the k-mers and their coverages in the sequencing readset. Within this set user must chose a theshold for excluding low frequencing k-mers that will be considered errors. That choice is not too difficult to make by looking at the k-mer spectra. Of all the retained k-mers we find all the het-mers. Then we plot a 2d histogram. 
49 | 
50 | 
51 | *Note: The previous versions of smudgeplot (up to 2.5.0) were operating on k-mer "dumps" flat files you can generate with any counter you like. You can imagine that text files are very inefficient to operate on. The new version is operating directly on the optimised k-mer database instead.
52 | 
53 | ```
54 | FastK -v -t4 -k31 -M16 -T4 *.fastq.gz -NSRR8495097
55 | ```
56 | 
57 | 20'
58 | 
59 | 23:24
60 | 
61 | one file is also ~20', it's mostly function of the number of k-mers, we could speed it up by chosing higher t maybe?
62 | 
63 | ### Getting the k-mer spectra out of it
64 | 
65 | ```
66 | Histex -G SRR8495097 > SRR8495097_k31.hist
67 | 
68 | | GeneScopeFK -o data/Pvir1/GenomeScopeFK/ -k 17
69 | 
70 | 
71 | # Run PloidyPlot to find all k-mer pairs in the dataset
72 | PloidyPlot -e12 -k -v -T4 -odata/Scer/kmerpairs data/Scer/FastK_Table
73 | # this now generated `data/Scer/kmerpairs_text.smu` file;
74 | # it's a flat file with three columns; covB, covA and freq (the number of k-mer pairs with these respective coverages)
75 | 
76 | # use the .smu file to infer ploidy and create smudgeplot
77 | smudgeplot.py plot -n 15 -t Sacharomyces -o data/Scer/trial_run data/Scer/kmerpairs_text.smu 
78 | ```
79 | 


--------------------------------------------------------------------------------
/playground/DEVELOPMENT.md:
--------------------------------------------------------------------------------
 1 | # STANDARDS
 2 | 
 3 | - spaces around operators
 4 | - snake_case for variables and functions
 5 | - camelCase for classes and methods
 6 | - verbose naming is more important than a detailed documentation
 7 | - R code is tested using `testthat` and python code in `dev` branch using `unittest`
 8 | 
 9 | ## versioning
10 | 
11 | - we try to keep `master` branch clean (i.e. production ready code).
12 | - `dev` branch should be also a working code, however here mistakes are permitted. Most of the development is done in subbranches of `dev` branch, once a feature is implemented, it should be merged to `dev`. I like to use `--no-ff` to keep a record, what was developed where (this might be a bad practice and if you know why, let me know).
13 | - One rule I would like to keep is this. There must be at least 72 hours incubation period between commits merged into `dev` and merging `dev` into `master`. The reason is simple, if there are any mistakes that were not spotted, there is a change to catch them. Also it takes while to run all the tests and stuff (the travis.ci testing is not working yet, but I hope it will quite soon).
14 | 
15 | ## language
16 | 
17 | The future is `C` backend based on [FastK](https://github.com/thegenemyers/FASTK), inference and plotting in `R` and `python` user interface.


--------------------------------------------------------------------------------
/playground/alternative_fitting/README.md:
--------------------------------------------------------------------------------
  1 | # Sploidyplot
  2 | 
  3 | The goal is to have a smudge inference based on an explicit model. I hoped for a model that would make a lot of sense - based on negative binomials.
  4 | 
  5 | Gene - made his own EM algorithm, I think. I could not decipher it
  6 | Richard and Tianyi - made me an EM algorithm that work on simply coverage A and B; also using normal distributions
  7 | 
  8 | 
  9 | ## alternative plotting
 10 | 
 11 | A minimalist attempt
 12 | 
 13 | ```R
 14 | minidata <- daArtCamp1[daArtCamp1[, 'total_pair_cov'] < 20, ]
 15 | coverages_to_plot <- unique(minidata[, 'total_pair_cov'])
 16 | number_of_coverages_to_plot <- length(coverages_to_plot)
 17 | mini_ylim <- c(5, 20)
 18 | L <- 4
 19 | cols <- c(rgb(1,0,0, 0.5), rgb(1,1,0, 0.5), rgb(0,1,1, 0.5), rgb(1,0,1, 0.5), rgb(0,1,0, 0.5), rgb(0,0,1, 0.5))
 20 | # plot(1:6, pch = 20, cex = 5, col = cols)
 21 | 
 22 | plot_dot_smudgeplot(minidata, rep('black', 32), xlim, mini_ylim, cex = 3)
 23 | points((L - 1) / coverages_to_plot, coverages_to_plot, cex = 3, pch = 20, col = 'blue')
 24 | 
 25 | for( cov in 8:19){
 26 |     rect(0, cov - 0.5, 0.5, cov + 0.5, col = NA, border = 'black')
 27 |     width <- 1 / (2 * cov)
 28 |     min_ratio <- L / cov
 29 |     rect(min_ratio - width, cov - 0.5, min(0.5, min_ratio + width), cov + 0.5, col = sample(cols, 1))    
 30 | }
 31 | 
 32 | text(rep(0.05, number_of_coverages_to_plot), 8:19, 8:19)
 33 | ```
 34 | 
 35 | This is more serious attempt that does not really work.
 36 | 
 37 | Alternative local aggregation
 38 | 
 39 | ```bash
 40 | for ToLID in daAchMill1 daAchPtar1 daAdoMosc1 daAjuCham1 daAjuRept1 daArcMinu1 daArtCamp1 daArtMari1 daArtVulg1 daAtrBela1; do
 41 |     python3 playground/alternative_fitting/pair_clustering.py data/dicots/smu_files/$ToLID.k31_ploidy.smu.txt --mask_errors > data/dicots/peak_agregation/$ToLID.cov_tab_peaks
 42 |     Rscript playground/alternative_fitting/alternative_plotting_testing.R -i data/dicots/peak_agregation/$ToLID.cov_tab_peaks -o data/dicots/peak_agregation/$ToLID
 43 | done
 44 | ```
 45 | 
 46 | This worked well. The agregation produced beautiful blocks, and vastly of the same shape as noticed by Richard. He suggested we should fix their shape and fit only a single parameter - coverage.
 47 | 
 48 | ```R
 49 | smudge_tab <- read.table('data/dicots/peak_agregation/daArtMari1.cov_tab_peaks', col.names = c('covB', 'covA', 'freq', 'smudge'))
 50 | 
 51 | all_smudges <- unique(smudge_tab[, 'smudge'])
 52 | all_smudge_sizes <- sapply(all_smudges, function(x){ sum(smudge_tab[smudge_tab[, 'smudge'] == x, 'freq']) })
 53 | 
 54 | # plot(sort(all_smudge_sizes, decreasing = T) / sum(all_smudge_sizes), ylim = c(0, 1))
 55 | # sort(all_smudge_sizes, decreasing = T) / sum(all_smudge_sizes) > 0.02
 56 | # 2% of data soiunds reasonable
 57 | 
 58 | smudges <- all_smudges[all_smudge_sizes / sum(all_smudge_sizes) > 0.02 & all_smudges != 0]
 59 | smudge_sizes <- all_smudge_sizes[all_smudge_sizes / sum(all_smudge_sizes) > 0.02 & all_smudges != 0]
 60 | 
 61 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2]
 62 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov']
 63 | 
 64 | 
 65 | smudge_tab_reduced <- smudge_tab[smudge_tab[, 'smudge'] %in% smudges, ]
 66 | source('playground/alternative_fitting/alternative_plotting_functions.R')
 67 | 
 68 | per_smudge_cov_list <- lapply(smudges, function(x){ smudge_tab_reduced[smudge_tab_reduced$smudge == x, ] })
 69 | names(per_smudge_cov_list) <- smudges
 70 | 
 71 | cov_sum_summary <- sapply(per_smudge_cov_list, function(x){ summary(x[, 'total_pair_cov']) } )
 72 | rel_cov_summary <- sapply(per_smudge_cov_list, function(x){ summary(x[, 'minor_variant_rel_cov']) } )
 73 | 
 74 | colnames(cov_sum_summary) <- colnames(rel_cov_summary) <- smudges
 75 | 
 76 | data.frame(smudges = smudges, total_pair_cov = round(cov_sum_summary[4, ], 1), minor_variant_rel_cov = round(rel_cov_summary[4, ], 3))
 77 | 
 78 | head(per_smudge_cov_list[['2']])
 79 | 
 80 | one_smudge <- per_smudge_cov_list[['2']]
 81 | one_smudge[one_smudge[, 'minor_variant_rel_cov'] == 0.5, ]
 82 | 
 83 | table(one_smudge[, 'covB'])
 84 | (one_smudge[, 'minor_variant_rel_cov'])
 85 | 
 86 | 
 87 | 
 88 |     # cov_range <- seq((2 * .L) - 2, max_cov_pair, length = 500)
 89 |     # lines((.L - 1)/cov_range, cov_range, lwd = 2.5, lty = 2, 
 90 | 
 91 | plot_peakmap(smudge_tab_reduced, xlim = c(0, 0.5), ylim = c(0, 300))
 92 | 
 93 | plot_seq_error_line(smudge_tab, 4)
 94 | plot_seq_error_line(smudge_tab, 13)
 95 | plot_seq_error_line(smudge_tab, 48)
 96 | plot_seq_error_line(smudge_tab, 80)
 97 | 
 98 | one_smudge <- per_smudge_cov_list[['4']]
 99 | min(one_smudge[ ,'total_pair_cov'])
100 | 
101 | one_smudge[one_smudge[ ,'total_pair_cov'] == 61, ]
102 | one_smudge <- one_smudge[order(one_smudge[, 'minor_variant_rel_cov']), ]
103 | 
104 | right_part_of_the_smudge <- one_smudge[one_smudge[ ,'minor_variant_rel_cov'] > 0.2131147, ]
105 | 
106 | all_minor_var_rel_covs <- sort(unique(round(right_part_of_the_smudge[, 'minor_variant_rel_cov'], 2)))
107 | corresponding_min_cov_sums <- sapply(all_minor_var_rel_covs, function(x){ min(right_part_of_the_smudge[round(right_part_of_the_smudge[, 'minor_variant_rel_cov'], 2) == x, 'total_pair_cov']) } )
108 | 
109 | lines(all_minor_var_rel_covs, corresponding_min_cov_sums, lwd = 3, lty = 3, col = 'red')
110 | 
111 | subtract_line <- function(rel_cov, cov_tab){
112 |     approx_rel_cov = round(rel_cov, 2)
113 |     band_covs = round(cov_tab[, 'minor_variant_rel_cov'], 2) == approx_rel_cov
114 |     cov_tab[band_covs, ][which.min(cov_tab[band_covs, 'total_pair_cov']), ]
115 | }
116 | 
117 | edge_points <- t(sapply(all_minor_var_rel_covs, subtract_line, right_part_of_the_smudge))
118 | total_pair_cov <- sapply(1:29, function(x){edge_points[[x,5]]})
119 | minor_variant_rel_cov <- sapply(1:29, function(x){edge_points[[x,6]]})
120 | lm(total_pair_cov ~ minor_variant_rel_cov + I(minor_variant_rel_cov^2))
121 | 
122 | plot_isoA_line <- function (.covA, .cov_tab, .col = "black") {
123 |     min_covB <- min(.cov_tab[, 'covB']) # should be L really
124 |     max_covB <- .covA
125 |     B_covs <- seq(min_covB, max_covB, length = 500)
126 |     lines(B_covs/ (B_covs + .covA), B_covs + .covA, lwd = 2.5, lty = 2, 
127 |         col = .col)
128 | }
129 | 
130 | plot_isoA_line(48, smudge_tab, 'blue')
131 | plot_isoA_line(79, smudge_tab, 'blue')
132 | plot_isoA_line(110, smudge_tab, 'blue')
133 | plot_isoA_line(141, smudge_tab, 'blue')
134 | plot_isoA_line(172, smudge_tab, 'blue')
135 | 
136 | ```
137 | 
138 | HA, looks great! Let's plot it on the background...
139 | 
140 | ```R
141 | library(smudgeplot)
142 | source('playground/alternative_fitting/alternative_plotting_functions.R')
143 | colour_ramp <- viridis(32)
144 | 
145 | 
146 | smudge_tab <- read.table('data/dicots/peak_agregation/daAchMill1.cov_tab_errors', col.names = c('covB', 'covA', 'freq', 'is_error'))
147 | # smudge_tab <- read.table('data/dicots/peak_agregation/daArtMari1.cov_tab_peaks', col.names = c('covB', 'covA', 'freq', 'smudge'))
148 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2]
149 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov']
150 | cov = 31.1 # this is from GenomeScope this time
151 | 
152 | plot_alt(smudge_tab[smudge_tab[, 'is_error'] != 0, ], c(0, 100), colour_ramp, T) 
153 | plot_alt(smudge_tab[smudge_tab[, 'is_error'] != 1, ], c(0, 100), colour_ramp, T)
154 | plot_alt(smudge_tab, c(0, 100), colour_ramp, T)
155 | plot_iso_grid(31.1, 4, 100)
156 | plot_smudge_labels(18.1, 100)
157 | # .peak_points, .peak_sizes, .min_kmerpair_cov, .max_kmerpair_cov, col = "red"
158 | dev.off()
159 | 
160 | plot_iso_grid()
161 | 
162 | plot_smudge_labels(cov, 240)
163 | text(0.49, cov / 2, "2err", offset = 0, cex = 1.3, xpd = T, pos = 2)
164 | ```
165 | 
166 | Say we will test ploidy up to 16 (capturing up to octoploid paralogs). That makes
167 | 
168 | ```R
169 | smudge_tab_with_err <- read.table('data/dicots/peak_agregation/daAchMill1.cov_tab_errors', col.names = c('covB', 'covA', 'freq', 'is_error'))
170 | 
171 | smudge_filtering_threshold <- 0.01 # at least 1% of genomic kmers
172 | colour_ramp <- viridis(32)
173 | 
174 | # # error band, done on non filtered data
175 | # smudge_tab[, 'edgepoint'] <- F
176 | # smudge_tab[smudge_tab[, 'covB'] < L + 3, 'edgepoint'] <- T
177 | # plot_alt(smudge_tab[smudge_tab[, 'edgepoint'], ], c(0, 500), colour_ramp, T)
178 | 
179 | cov <- 19.55 # this will be unknown
180 | L <- min(smudge_tab_with_err[, 'covB'])
181 | smudge_tab <- smudge_tab_with_err[smudge_tab_with_err[, 'is_error'] == 0, ]
182 | genomic_kmer_pairs <- sum(smudge_tab[ ,'freq'])
183 | 
184 | plot_alt(smudge_tab, c(0, 300), colour_ramp, T)
185 | 
186 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2]
187 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov']
188 | 
189 | plot_alt(smudge_tab, c(0, 300), colour_ramp) 
190 | plot_all_smudge_labels(cov, 300)
191 | dev.off()
192 | 
193 | #### isolating all smudges given cov
194 | 
195 | # total_genomic_kmer_pairs <- sum(smudge_tab[, 'freq'])
196 | 
197 | # plot_alt(smudge_container[[1]], c(0, 300), colour_ramp, T) 
198 | # looks good!
199 | 
200 | # two functions need to be sources from the smudgeplot package here
201 | covs_to_test <- seq(10.05, 60.05, by = 0.1)
202 | centrality_grid <- sapply(covs_to_test, run_replicate, smudge_tab, smudge_filtering_threshold)
203 | covs_to_test[which.max(centrality_grid)]
204 | 
205 | sapply(c(21.71, 21.72, 21.73), run_replicate, smudge_tab, smudge_filtering_threshold)
206 | # 21.72 is our winner!
207 | 
208 | tested_covs <- test_grid_of_coverages(smudge_tab, smudge_filtering_threshold, min_to_explore, max_to_explore)
209 | plot(tested_covs[, 'cov'], tested_covs[, 'centrality'])
210 | 
211 | ```
212 | 
213 | Fixing the main package
214 | 
215 | ```bash
216 | for smu_file in  data/dicots/smu_files/*.k31_ploidy.smu.txt; do 
217 |     ToLID=$(basename $smu_file .k31_ploidy.smu.txt); 
218 |     smudgeplot.py all $smu_file -o data/dicots/grid_fits/$ToLID
219 | done
220 | 
221 | 
222 | ```
223 | 
224 | 
225 | ## Homopolymer compressed testing
226 | 
227 | Datasets with lots of errors. Sacharomyces will do.
228 | 
229 | ```
230 | FastK -v -c -t4 -k31 -M16 -T4 data/Scer/SRR3265401_[12].fastq.gz -Ndata/Scer/FastK_Table_hc
231 | hetmers -e4 -k -v -T4 -odata/Scer/kmerpairs_hc data/Scer/FastK_Table_hc
232 | 
233 | smudgeplot.py hetmers -L 4 -t 4 -o data/Scer/kmerpairs_default_e --verbose data/Scer/FastK_Table
234 | 
235 | smudgeplot.py all -o data/Scer/homopolymer_e4_wo data/Scer/kmerpairs_default_e_text.smu
236 | 
237 | smudgeplot.py all -o data/Scer/homopolymer_e4_with data/Scer/kmerpairs_hc_text.smu
238 | ```
239 | 
240 | ## Other
241 | 
242 | 
243 | ### .smu to smu.txt
244 | 
245 | For the legacy `.smu` files, we have a convertor for to flat files.
246 | 
247 | ```bash
248 | gcc src_ploidyplot/smu2text_smu.c -o exec/smu2text_smu
249 | exec/smu2text_smu data/ddSalArbu1/ddSalArbu1.k31_ploidy.smu | less
250 | ```


--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plot_covA_covB.R:
--------------------------------------------------------------------------------
 1 | library(ggplot2)
 2 | 
 3 | plot_unsquared_smudgeplot <- function(cov_tab, colour_ramp, xlim, ylim){
 4 |     # this is the adjustment for plotting
 5 |     # cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] <- cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] * 2
 6 |     cov_tab$col = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))]
 7 | 
 8 |     plot(NULL, xlim = xlim, ylim = ylim,
 9 |          xlab = 'covA',
10 |          ylab = 'covB', cex.lab = 1.4)
11 | 
12 |     ## This might bite me in the a.., instead of taking L as an argument, I guess it from the data
13 |     # L = floor(min(cov_tab_daAchMill1[, 'total_pair_cov']) / 2)
14 |     ggplot(cov_tab, aes(x=covA, y=covB, weight = weight)) +
15 |     geom_bin2d() +
16 |     theme_bw()
17 | 
18 | }
19 | 
20 | real_clean <- read.table('data/Fiin/kmerpairs_k51_text.smu', col.names = c('covA', 'covB', 'freq'))
21 | real_clean$weight <- real_clean$freq / sum(real_clean$freq)
22 | 
23 | # plot(real_clean[, 'covA'], real_clean[, 'covB'])
24 | 
25 | xlim <- range(real_clean[, 'covA'])
26 | ylim <- range(real_clean[, 'covB'])
27 | 
28 | library(smudgeplot)
29 | args <- list()
30 | args$col_ramp <- 'viridis'
31 | args$invert_cols <- F
32 | colour_ramp <- get_col_ramp(args)
33 | real_clean$col <- colour_ramp[1 + round(31 * real_clean$freq / max(real_clean$freq))]
34 | 
35 | plot(NULL, xlim = xlim, ylim = ylim,
36 |      xlab = 'covA',
37 |      ylab = 'covB', cex.lab = 1.4)
38 | 
39 | ggplot(real_clean, aes(x=covA, y=covB, weight = weight)) +
40 |   geom_bin2d() +
41 |   theme_bw()
42 | 
43 | head(real_clean)
44 | 
45 | 
46 | # plotSquare <- function(row){
47 | #     rect(as.numeric(row['covA']) - 0.5, as.numeric(row['covB']) - 0.5, as.numeric(row['covA']) + 0.5, as.numeric(row['covB']) + 0.5, col = row['col'], border = NA)
48 | # }
49 | # apply(real_clean, 1, plotSquare)
50 | 


--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plotting.R:
--------------------------------------------------------------------------------
 1 | # HM Revenue and custom Tax Return form;
 2 | # Needs to be 2023 Jan to 2024 Jan
 3 | 
 4 | 
 5 | 
 6 | library(smudgeplot)
 7 | source('playground/alternative_plotting_functions.R')
 8 | 
 9 | cov_tab_daAchMill1 <- read.table('data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt', col.names = c('covB', 'covA', 'freq'))
10 | ylim = c(0, 250)
11 | 
12 | xlim = c(0, 0.5)
13 | 
14 | 
15 | 
16 | cov_tab_daAchMill1[, 'total_pair_cov'] <- cov_tab_daAchMill1[, 1] + cov_tab_daAchMill1[, 2]
17 | cov_tab_daAchMill1[, 'minor_variant_rel_cov'] <- cov_tab_daAchMill1[, 1] / cov_tab_daAchMill1[, 'total_pair_cov']
18 | 
19 | args <- list()
20 | args$col_ramp <- 'viridis'
21 | args$invert_cols <- F
22 | colour_ramp <- get_col_ramp(args)
23 | cols = colour_ramp[1 + round(31 * cov_tab_daAchMill1$freq / max(cov_tab_daAchMill1$freq))]
24 | 
25 | # solving this "density" problem; cov1 cov1; have twice lower probability than cov1 cov2; we need to multiply these points, but it needs to be somehow corrected for the fit / summaries
26 | 
27 | plot_dot_smudgeplot(cov_tab_daAchMill1, colour_ramp, xlim, ylim)
28 | 
29 | plot_unsquared_smudgeplot(cov_tab_daAchMill1, colour_ramp, xlim, ylim)
30 | 
31 | # plots the line where there will be nothing
32 | plot_seq_error_line(cov_tab_daAchMill1, .col = 'red')
33 | 
34 | head(cov_tab_daAchMill1[order(cov_tab_daAchMill1[,'total_pair_cov']), ], 12)
35 | colour_ramp
36 | 3 / 8:13
37 | 
38 | ####
39 | 
40 | cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T)
41 | head(cov_tab_Fiin_ideal)
42 | 
43 | xlim = c(0, 0.5)
44 | ylim = c(0, max(cov_tab_Fiin_ideal[, 'total_pair_cov']))
45 | 
46 | plot_dot_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
47 | 
48 | plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
49 | 


--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plotting_functions.R:
--------------------------------------------------------------------------------
  1 | plot_alt <- function(cov_tab, ylim, colour_ramp, logscale = F){
  2 |     A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB']
  3 |     cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2
  4 |     if (logscale){
  5 |         cov_tab[, 'freq'] <- log10(cov_tab[, 'freq'])
  6 |     }
  7 |     cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))]
  8 | 
  9 |     plot(NULL, xlim = c(0, 0.5), ylim = ylim,
 10 |          xlab = 'Normalized minor kmer coverage: B / (A + B)',
 11 |          ylab = 'Total coverage of the kmer pair: A + B', cex.lab = 1.4)
 12 |     min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov']))
 13 |     nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab)
 14 |     return(0)
 15 | }
 16 | 
 17 | plot_one_coverage <- function(cov, cov_tab){
 18 |     cov_row_to_plot <- cov_tab[cov_tab[, 'total_pair_cov'] == cov, ]
 19 |     width <- 1 / (2 * cov)
 20 |     cov_row_to_plot$left <- cov_row_to_plot[, 'minor_variant_rel_cov'] - width
 21 |     cov_row_to_plot$right <- sapply(cov_row_to_plot[, 'minor_variant_rel_cov'], function(x){ min(0.5, x + width)})
 22 |     apply(cov_row_to_plot, 1, plot_one_box, cov)
 23 | }
 24 | 
 25 | plot_one_box <- function(one_box_row, cov){
 26 |     left <- as.numeric(one_box_row['left'])
 27 |     right <- as.numeric(one_box_row['right'])
 28 |     rect(left, cov - 0.5, right, cov + 0.5, col = one_box_row['col'], border = NA)
 29 | }
 30 | 
 31 | plot_dot_smudgeplot <- function(cov_tab, colour_ramp, xlim, ylim, background_col = 'grey', cex = 0.4){
 32 |     # this is the adjustment for plotting
 33 |     cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] <- cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] * 2
 34 |     cov_tab$col = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))]
 35 | 
 36 |     plot(NULL, xlim = xlim, ylim = ylim, xlab = 'Normalized minor kmer coverage: B / (A + B)',
 37 |          ylab = 'Total coverage of the kmer pair: A + B')
 38 |     rect(xlim[1], ylim[1], xlim[2], ylim[2], col = background_col, border = NA)
 39 |     points(cov_tab[, 'minor_variant_rel_cov'], cov_tab[, 'total_pair_cov'], col = cov_tab$col, pch = 20, cex = cex)
 40 | }
 41 | 
 42 | plot_peakmap <- function(cov_tab, xlim, ylim, background_col = 'grey', cex = 2){
 43 |     # this is the adjustment for plotting
 44 |     plot(NULL, xlim = xlim, ylim = ylim, xlab = 'Normalized minor kmer coverage: B / (A + B)',
 45 |          ylab = 'Total coverage of the kmer pair: A + B')
 46 |     points(cov_tab[, 'minor_variant_rel_cov'], cov_tab[, 'total_pair_cov'], col = cov_tab$smudge, pch = 20, cex = cex)
 47 |     legend('bottomleft', col = 1:8, pch = 20, title = 'smudge', legend = 1:8)
 48 | }
 49 | 
 50 | plot_seq_error_line <- function (.cov_tab, .L = NA, .col = "black") {
 51 |     if (is.na(.L)) {
 52 |         .L <- min(.cov_tab[, "covB"])
 53 |     }
 54 |     max_cov_pair <- max(.cov_tab[, "total_pair_cov"])
 55 |     cov_range <- seq((2 * .L) - 2, max_cov_pair, length = 500)
 56 |     lines((.L - 1)/cov_range, cov_range, lwd = 2.5, lty = 2, 
 57 |         col = .col)
 58 | }
 59 | 
 60 | plot_isoA_line <- function (.covA, .L, .col = "black", .ymax = 250, .lwd, .lty) {
 61 |     min_covB <- .L # min(.cov_tab[, 'covB']) # should be L really
 62 |     max_covB <- .covA
 63 |     B_covs <- seq(min_covB, max_covB, length = 500)
 64 |     isoline_x <- B_covs/ (B_covs + .covA)
 65 |     isoline_y <- B_covs + .covA
 66 |     lines(isoline_x[isoline_y < .ymax], isoline_y[isoline_y < .ymax], lwd = .lwd, lty = .lty, col = .col)
 67 | }
 68 | 
 69 | plot_isoB_line <- function (.covB, .ymax, .col = "black", .lwd, .lty) {
 70 |     cov_range <- seq((2 * .covB) - 2, .ymax, length = 500)
 71 |     lines((.covB)/cov_range, cov_range, lwd = .lwd, lty = .lty, col = .col)
 72 | }
 73 | 
 74 | plot_iso_grid <- function(.cov, .L, .ymax, .col = 'black', .lwd = 2, .lty = 2){
 75 |     for (i in 0:15){
 76 |         cov <- (i + 0.5) * .cov
 77 |         plot_isoA_line(cov, .L = .L, .ymax = .ymax, .col, .lwd = .lwd, .lty = .lty)
 78 |         if (i < 8){
 79 |             plot_isoB_line(cov, .ymax, .col, .lwd = .lwd, .lty = .lty)
 80 |         }
 81 |     }
 82 | }
 83 | 
 84 | plot_smudge_labels <- function(cov_est, ymax, xmax = 0.49, .cex = 1.3, .L = 4){
 85 |     for (As in 1:(floor(ymax / cov_est) - 1)){
 86 |         label <- paste0(As, "Aerr")
 87 |         text(.L / (As * cov_est), (As * cov_est) + .L, label, 
 88 |                  offset = 0, cex = .cex, xpd = T, pos = ifelse(As == 1, 3, 4))
 89 |     }
 90 |     for (ploidy in 2:floor(ymax / cov_est)){
 91 |         for (Bs in 1:floor(ploidy / 2)){
 92 |             As = ploidy - Bs
 93 |             label <- paste0(As, "A", Bs, "B")
 94 |             text(ifelse(As == Bs, (xmax + 0.49)/2, Bs / ploidy), ploidy * cov_est, label, 
 95 |                  offset = 0, cex = .cex, xpd = T, 
 96 |                  pos = ifelse(As == Bs, 2, 1))
 97 |         }
 98 |     }
 99 | }
100 | 
101 | create_smudge_container <- function(cov, cov_tab, smudge_filtering_threshold){
102 |     smudge_container <- list()
103 |     total_genomic_kmers <- sum(cov_tab[ , 'freq'])
104 |     
105 |     for (Bs in 1:8){
106 |         cov_tab_isoB <- cov_tab[cov_tab[ , 'covB'] > cov * ifelse(Bs == 1, 0, Bs - 0.5) & cov_tab[ , 'covB'] < cov * (Bs + 0.5), ]
107 |         # cov_tab_isoB[, 'Bs'] <- Bs
108 |         cov_tab_isoB[, 'As'] <- round(cov_tab_isoB[, 'covA'] / cov) #these are be individual smudges cutouts given coverages
109 |         cov_tab_isoB[cov_tab_isoB[, 'As'] == 0, 'As'] = 1
110 |         for( As in Bs:(16 - Bs)){
111 |             cov_tab_one_smudge <- cov_tab_isoB[cov_tab_isoB[, 'As'] == As, ]
112 |             if (sum(cov_tab_one_smudge[, 'freq']) / total_genomic_kmers > smudge_filtering_threshold){
113 |                 label <- paste0(As, "A", Bs, "B") 
114 |                 smudge_container[[label]] <- cov_tab_one_smudge[,-which(names(cov_tab_one_smudge) %in% c('is_error', 'As'))]
115 |             }
116 |         }
117 |     }
118 |     return(smudge_container)
119 | }


--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plotting_testing.R:
--------------------------------------------------------------------------------
 1 | library(smudgeplot)
 2 | library(argparse)
 3 | source('playground/alternative_fitting/alternative_plotting_functions.R')
 4 | 
 5 | parser <- ArgumentParser()
 6 | parser$add_argument("-i", "-infile", help="Input file")
 7 | parser$add_argument("-o", "-outfile", help="Output file")
 8 | args <- parser$parse_args()
 9 | 
10 | args$col_ramp <- 'viridis'
11 | args$invert_cols <- F
12 | 
13 | # cov_tab_daAchMill1 <- read.table('data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt', col.names = c('covB', 'covA', 'freq'))
14 | # cov_tab <- read.table(args$file, col.names = c('covB', 'covA', 'freq'))
15 | # args <- list()
16 | # args$i <- 'data/ddSalArbu1/ddSalArbu1.k31_ploidy_converted.smu_with_peaks.txt'
17 | # args$o <- 'data/ddSalArbu1/smudge_with_peaks'
18 | 
19 | cov_tab <- read.table(args$i, col.names = c('covB', 'covA', 'freq','peak'))
20 | 
21 | xlim = c(0, 0.5)
22 | ylim = c(0, 300)
23 | 
24 | 
25 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 1] + cov_tab[, 2]
26 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 1] / cov_tab[, 'total_pair_cov']
27 | 
28 | colour_ramp <- viridis(32)
29 | colour_ramp_log <- get_col_ramp(args, 16)
30 | # cols = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))]
31 | 
32 | # solving this "density" problem; cov1 cov1; have twice lower probability than cov1 cov2; we need to multiply these points, but it needs to be somehow corrected for the fit / summaries
33 | 
34 | plot_dot_smudgeplot(cov_tab, colour_ramp, xlim, ylim)
35 | 
36 | pdf(paste0(args$o, '_background.pdf'))
37 |     plot_alt(cov_tab, ylim, colour_ramp, F)
38 | dev.off()
39 | 
40 | pdf(paste0(args$o, '_log_background.pdf'))
41 |     plot_alt(cov_tab, ylim, colour_ramp, T)
42 | dev.off()
43 | 
44 | pdf(paste0(args$o, '_peaks.pdf'))
45 |     plot_peakmap(cov_tab, xlim = xlim, ylim = ylim)
46 | dev.off()
47 | 
48 | # plots the line where there will be nothing
49 | # plot_seq_error_line(cov_tab, .col = 'red')
50 | 
51 | # head(cov_tab[order(cov_tab[,'total_pair_cov']), ], 12)
52 | # colour_ramp
53 | # 3 / 8:13
54 | 
55 | ####
56 | 
57 | # cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T)
58 | # head(cov_tab_Fiin_ideal)
59 | 
60 | # xlim = c(0, 0.5)
61 | # ylim = c(0, max(cov_tab_Fiin_ideal[, 'total_pair_cov']))
62 | 
63 | # pdf('data/Fiin/idealised/straw_plot_test3.pdf')
64 | #     plot_dot_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
65 | # dev.off()
66 | 
67 | # pdf('data/Fiin/idealised/straw_plot_test2.pdf')
68 | #     plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
69 | # dev.off()
70 | 
71 | # # testing the packaged version
72 | 
73 | # library(smudgeplot)
74 | # args <- list()
75 | # args$col_ramp <- 'viridis'
76 | # args$invert_cols <- F
77 | # colour_ramp <- get_col_ramp(args)
78 | 
79 | # cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T)
80 | 
81 | # pdf('data/Fiin/idealised/straw_plot_test.pdf')
82 | #     plot_alt(cov_tab_Fiin_ideal, c(50, 700), colour_ramp)
83 | # dev.off()
84 | 
85 | # source('playground/alternative_fitting/alternative_plotting_functions.R')
86 | 
87 | # pdf('data/Fiin/idealised/straw_plot_test2.pdf')
88 | #     plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, c(0, 0.5), c(50, 700))
89 | # dev.off()
90 | 
91 | 


--------------------------------------------------------------------------------
/playground/alternative_fitting/pair_clustering.py:
--------------------------------------------------------------------------------
 1 | 
 2 | # cov2freq = defualtdict(covA, covB) -> freq
 3 | # cov2peak = dict(covA, covB) -> peak
 4 | # dict(peak) -> summit (if relevant)
 5 | # import numpy as np
 6 | 
 7 | import argparse
 8 | from pandas import read_csv # type: ignore
 9 | from collections import defaultdict
10 | from statistics import mean
11 | # import matplotlib.pyplot as plt
12 | 
13 | ####
14 | 
15 | parser = argparse.ArgumentParser()
16 | parser.add_argument('infile', nargs='?', help='name of the input tsv file with covarages and frequencies.')
17 | parser.add_argument('-nf', '-noise_filter', help='Do not agregate into smudge k-mer pairs with frequency lower than this parameter', type=int, default=50)
18 | parser.add_argument('-d', '-distance', help='Manthattan distance of k-mer pairs that are considered neioboring for the local agregation purposes.', type=int, default=5)
19 | parser.add_argument('--mask_errors', help='instead of reporting assignments to individual smudges, just remove all monotonically decreasing points from the error line', action="store_true", default = False)
20 | args = parser.parse_args()
21 | 
22 | ### what should be aguments at some point
23 | # smu_file = 'data/ddSalArbu1/ddSalArbu1.k31_ploidy_converted.smu.txt'
24 | # distance = 5
25 | # noise_filter = 100
26 | 
27 | smu_file = args.infile
28 | distance = args.d
29 | noise_filter = args.nf
30 | 
31 | ### load data
32 | # cov_tab = np.loadtxt(smu_file, dtype=int)
33 | cov_tab = read_csv(smu_file, names = ['covB', 'covA', 'freq'], sep='\t')
34 | cov_tab = cov_tab.sort_values('freq', ascending = False)
35 | L = min(cov_tab['covB']) # important only when --mask_errors is on
36 | 
37 | # generate a dictionary that gives us for each combination of coverages a frequency
38 | cov2freq = defaultdict(int)
39 | cov2peak = defaultdict(int)
40 | # for idx, covB, covA, freq in cov_tab.itertuples():
41 | #     cov2freq[(covA, covB)] = freq
42 | # I can create this one when I iterate though the data though
43 | 
44 | # plt.hist(means)
45 | # plt.hist([x for x in means if x < 100 and x > -100])
46 | # plt.show() 
47 | 
48 | ### clustering
49 | next_peak = 1
50 | for idx, covB, covA, freq in cov_tab.itertuples():
51 |     cov2freq[(covA, covB)] = freq # with this I can get rid of lines 23 24 that pre-makes this table
52 |     if freq < noise_filter:
53 |         break
54 |     highest_neigbour_coords = (0, 0)
55 |     highest_neigbour_freq = 0
56 |     # for each kmer pair I will retrieve all neibours (Manhattan distance)
57 |     for xA in range(covA - distance,covA + distance + 1):
58 |         # for explored A coverage in neiborhood, we explore all possible B coordinates
59 |         distanceB = distance - abs(covA - xA)
60 |         for xB in range(covB - distanceB,covB + distanceB + 1):
61 |             xB, xA = sorted([xA, xB]) # this is to make sure xB is smaller than xA
62 |             # iterating only though those that were assigned already
63 |             # and recroding only the one with highest frequency
64 |             if cov2peak[(xA, xB)] and cov2freq[(xA, xB)] > highest_neigbour_freq:
65 |                 highest_neigbour_coords = (xA, xB)
66 |                 highest_neigbour_freq = cov2freq[(xA, xB)]
67 |     if highest_neigbour_freq > 0:
68 |         cov2peak[(covA, covB)] = cov2peak[(highest_neigbour_coords)]
69 |     else:
70 |         # print("new peak:", (covA, covB))
71 |         if args.mask_errors:
72 |             if covB < L + args.d:
73 |                 cov2peak[(covA, covB)] = 1 # error line
74 |             else:
75 |                 cov2peak[(covA, covB)] = 0 # central smudges
76 |         else:
77 |             cov2peak[(covA, covB)] = next_peak # if I want to keep info about all locally agregated smudges
78 |             next_peak += 1
79 | 
80 | cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True)
81 | for idx, covB, covA, freq in cov_tab.itertuples():
82 |     print(covB, covA, freq, cov2peak[(covA, covB)])
83 |     # if idx > 20:
84 |     #     break
85 | 


--------------------------------------------------------------------------------
/playground/interactive_plot_strawberry_full_kmer_families_fooling_around.R:
--------------------------------------------------------------------------------
 1 | library("methods")
 2 | library("argparse")
 3 | library("smudgeplot")
 4 | # library("hexbin")
 5 | 
 6 | # preprocessing
 7 | # to get simply number of memebers / family (exploration)
 8 | # cat data/strawberry_iinumae/kmer_counts_L109_U.tsv  | cut -f 1 > data/strawberry_iinumae/kmer_counts_L109_U_family_members.tsv
 9 | # awk '{row_sum = 0; row_max = 0; row_min = 10000; for (i=2; i <= NF; i++){ row_sum += $i; if ($i > row_max){row_max = $i} if ($i < row_min){row_min = $i} } print row_sum "\t" row_min "\t" row_max }' data/strawberry_iinumae/kmer_counts_L109_U.tsv > data/strawberry_iinumae/kmer_counts_L109_U_sums_min_max.tsv
10 | # (exploration)
11 | #
12 | #
13 | 
14 | args <- ArgumentParser()$parse_args()
15 | args$homozygous <- F
16 | args$input <- 'data/Fiin/kmerpairs_k51_text.smu'
17 | args$output = './data/Fiin/testrun'
18 | args$title = 'F. iinumae'
19 | args$nbins <- 40
20 | args$L <- NULL
21 | args$n_cov <- NULL
22 | args$k <- 21
23 | 
24 | 


--------------------------------------------------------------------------------
/playground/more_away_pairs.py:
--------------------------------------------------------------------------------
  1 | def get_2away_pairs(local_index_to_kmer, k):
  2 |   """local_index_to_kmer is a dictionary where the value is a kmer portion, and the key is the index of the original kmer in which the kmer portion is found. get_2away_pairs returns a list of pairs where each pair of indices corresponds to a pair of kmers different in exactly two bases."""
  3 | 
  4 |   #These are the base cases for the recursion. If k==1, the kmers obviously can't differ in exactly two bases, so return an empty list. if k==2, return every pair of indices where the kmers at those indices differ at exactly two bases.
  5 |   if k == 1:
  6 |     return []
  7 |   if k == 2:
  8 |     return [(i, j) for (i,j) in combinations(local_index_to_kmer, 2) if local_index_to_kmer[i][0] != local_index_to_kmer[j][0] and local_index_to_kmer[i][1] != local_index_to_kmer[j][1]]
  9 | 
 10 |   #Get the two halves of the kmer
 11 |   k_L = k//2
 12 |   k_R = k-k_L
 13 | 
 14 |   #initialize dictionaries in which the key is the hash of half of the kmer, and the value is a list of indices of the kmers with that same hash
 15 |   kmer_L_hashes = defaultdict(list)
 16 |   kmer_R_hashes = defaultdict(list)
 17 | 
 18 |   #initialize pairs, which will be returned by get_1away_pairs
 19 |   pairs = []
 20 | 
 21 |   #initialize dictionaries containing the left halves and the right halves (since we will have to check cases where the left half differs by 1 and the right half differs by 1)
 22 |   local_index_to_kmer_L = {}
 23 |   local_index_to_kmer_R = {}
 24 | 
 25 |   #for each kmer, calculate its left hash and right hash, then add its index to the corresponding entries of the dictionary
 26 |   for i, kmer in local_index_to_kmer.items():
 27 |     kmer_L = kmer[:k_L]
 28 |     kmer_R = kmer[k_L:]
 29 |     local_index_to_kmer_L[i] = kmer_L
 30 |     local_index_to_kmer_R[i] = kmer_R
 31 |     kmer_L_hashes[kmer_to_int(kmer_L)] += [i]
 32 |     kmer_R_hashes[kmer_to_int(kmer_R)] += [i]
 33 | 
 34 |   #for each left hash in which there are multiple kmers with that left hash, find the list of pairs in which the right half differs by 2. (aka, if left half matches, recurse on right half).
 35 |   for kmer_L_hash_indices in kmer_L_hashes.values(): #same in first half
 36 |     if len(kmer_L_hash_indices) > 1:
 37 |       pairs += get_2away_pairs({kmer_L_hash_index:local_index_to_kmer[kmer_L_hash_index][k_L:] for kmer_L_hash_index in kmer_L_hash_indices}, k_R) #differ by 2 in right half
 38 | 
 39 |   #for each right hash in which there are multiple kmers with that right hash, find the list of pairs in which the left half differs by 2. (aka, if right half matches, recurse on left half).
 40 |   for kmer_R_hash_indices in kmer_R_hashes.values(): #same in second half
 41 |     if len(kmer_R_hash_indices) > 1:
 42 |       pairs += get_2away_pairs({kmer_R_hash_index:local_index_to_kmer[kmer_R_hash_index][:k_L] for kmer_R_hash_index in kmer_R_hash_indices}, k_L) #differ by 2 in left half
 43 | 
 44 |   #Find matching pairs where the left half is one away, and the right half is one away
 45 |   possible_pairs_L = set(get_1away_pairs(local_index_to_kmer_L,k_L))
 46 |   possible_pairs_R = set(get_1away_pairs(local_index_to_kmer_R,k_R))
 47 |   pairs += list(possible_pairs_L.intersection(possible_pairs_R))
 48 |   return(pairs)
 49 | 
 50 | 
 51 | ###This code has not been cleaned... needs to be edited!!!
 52 | def get_3away_pairs(kmers):
 53 |   """kmers is a list of kmers. get_3away_pairs returns a list of pairs where each pair of kmers is different in exactly three bases."""
 54 |   k = len(kmers[0])
 55 |   if k == 1 or k==2:
 56 |     return []
 57 |   if k == 3:
 58 |     return [pair for pair in combinations(kmers, 2) if pair[0][0] != pair[1][0] and pair[0][1] != pair[1][1] and pair[0][2] != pair[1][2]]
 59 |   k_L = k//2
 60 |   k_R = k-k_L
 61 |   kmer_L_hashes = defaultdict(list)
 62 |   kmer_R_hashes = defaultdict(list)
 63 |   pairs = []
 64 |   kmers_L = []
 65 |   kmers_R = []
 66 |   for i, kmer in enumerate(kmers):
 67 |     kmer_L = kmer[:k_L]
 68 |     kmer_R = kmer[k_L:]
 69 |     #print(kmer_L)
 70 |     #print(kmer_R)
 71 |     kmers_L.append(kmer_L)
 72 |     kmers_R.append(kmer_R)
 73 |     kmer_L_hashes[kmer_to_int(kmer_L)] += [i]
 74 |     kmer_R_hashes[kmer_to_int(kmer_R)] += [i]
 75 |   for kmer_L_hash in kmer_L_hashes.values(): #same in first half
 76 |     if len(kmer_L_hash) > 1:
 77 |       kmer_L = kmers[kmer_L_hash[0]][:k_L] #first half
 78 |       pairs += [tuple(kmer_L + kmer for kmer in pair) for pair in get_3away_pairs([kmers[i][k_L:] for i in kmer_L_hash])] #differ by 3 in second half
 79 |   for kmer_R_hash in kmer_R_hashes.values(): #same in second half
 80 |     if len(kmer_R_hash) > 1:
 81 |       kmer_R = kmers[kmer_R_hash[0]][k_L:] #second half
 82 |       #print(kmer_R)
 83 |       pairs += [tuple(kmer + kmer_R for kmer in pair) for pair in get_3away_pairs([kmers[i][:k_L] for i in kmer_R_hash])] #differ by 3 in first half
 84 |   possible_pairs = []
 85 |   possible_pairs_L = get_1away_pairs(kmers_L)
 86 |   possible_pairs_R = get_2away_pairs(kmers_R)
 87 |   #print(kmers_L)
 88 |   #print(kmers_R)
 89 |   #print(possible_pairs_L)
 90 |   #print(possible_pairs_R)
 91 |   for possible_pair_L in possible_pairs_L:
 92 |     for possible_pair_R in possible_pairs_R:
 93 |       possible_kmer1 = possible_pair_L[0]+possible_pair_R[0]
 94 |       possible_kmer2 = possible_pair_L[1]+possible_pair_R[1]
 95 |       if possible_kmer1 in kmers and possible_kmer2 in kmers:
 96 |         pairs += [(possible_kmer1, possible_kmer2)]
 97 |   possible_pairs = []
 98 |   possible_pairs_L = get_2away_pairs(kmers_L)
 99 |   possible_pairs_R = get_1away_pairs(kmers_R)
100 |   for possible_pair_L in possible_pairs_L:
101 |     for possible_pair_R in possible_pairs_R:
102 |       possible_kmer1 = possible_pair_L[0]+possible_pair_R[0]
103 |       possible_kmer2 = possible_pair_L[1]+possible_pair_R[1]
104 |       if possible_kmer1 in kmers and possible_kmer2 in kmers:
105 |         pairs += [(possible_kmer1, possible_kmer2)]
106 |   return(pairs)


--------------------------------------------------------------------------------
/playground/playground.R:
--------------------------------------------------------------------------------
  1 | files <- c('data/Avag1/coverages_2.tsv',
  2 |            'data/Lcla1/Lcla1_pairs_coverages_2.tsv',
  3 |            'data/Mflo2/coverages_2.tsv',
  4 |            'data/Rvar1/Rvar1_pairs_coverages_2.tsv',
  5 |            'data/Ps791/Ps791_pairs_coverages_2.tsv',
  6 |            'data/Aric1/Aric1_pairs_coverages_2.tsv',
  7 |            "data/Rmag1/Rmag1_pairs_coverages_2.tsv")
  8 | 
  9 | ###
 10 | library(smudgeplot)
 11 | args <- list()
 12 | args$input <- 'data/Mflo2/Mflo2_coverages_2.tsv'
 13 | args$output <- "figures/Mflo2_v0.1.0"
 14 | args$nbins <- 40
 15 | args$kmer_size <- 21
 16 | args$homozygous <- F
 17 | 
 18 | # args <- list()
 19 | # args$input <- 'data/rice/SRR1919013_k21_l35_u500_coverages.tsv'
 20 | # args$output <- "data/rice/smudge"
 21 | # args$nbins <- 40
 22 | # args$kmer_size <- 21
 23 | # args$homozygous <- F
 24 | 
 25 | ###
 26 | 
 27 | i <- 7
 28 | n <- NA
 29 | cov <- read.table(args$input)
 30 | 
 31 | # run bits of smudgeplot_plot.R to get k, and peak summary
 32 | 
 33 | filter <- total_pair_cov < 350
 34 | total_pair_cov_filt <- total_pair_cov[filter]
 35 | minor_variant_rel_cov_filt <- minor_variant_rel_cov[filter]
 36 | 
 37 | ymax <- max(total_pair_cov_filt)
 38 | ymin <- min(total_pair_cov_filt)
 39 | 
 40 | # the lims trick will make sure that the last column of squares will have the same width as the other squares
 41 | smudge_container <- get_smudge_container(minor_variant_rel_cov, total_pair_cov, .nbins = 40)
 42 | 
 43 | x <- seq(xlim[1], ((nbins - 1) / nbins) * xlim[2], length = nbins)
 44 | y <- c(seq(ylim[1] - 0.1, ((nbins - 1) / nbins) * ylim[2], length = nbins), ylim[2])
 45 | 
 46 | .peak_points <- peak_points
 47 | .smudge_container <- smudge_container
 48 | .total_pair_cov <- total_pair_cov
 49 | .treshold <- 0.05
 50 | fig_title <- 'test'
 51 | 
 52 | image(smudge_container, col = colour_ramp)
 53 | # contour(x.bin, y.bin, freq2D, add=TRUE, col=rgb(1,1,1,.7))
 54 | 
 55 | #######
 56 | # PLOT
 57 | #######
 58 | 
 59 | library(plotly)
 60 | packageVersion('plotly')
 61 | 
 62 | p <- plot_ly(x = k_toplot$x, y = k_toplot$y, z = k_toplot$z) %>% add_surface()
 63 | htmlwidgets::saveWidget(p, "Ps791_smudge_surface.html")
 64 | # Create a shareable link to your chart
 65 | # Set up API credentials: https://plot.ly/r/getting-started
 66 | chart_link = api_create(p, filename="Ps791_smudge_surface-2")
 67 | chart_link
 68 | 
 69 | layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
 70 | # 1 smudge plot
 71 | plot_smudgeplot(k_toplot, n, colour_ramp)
 72 | plot_expected_haplotype_structure(n, peak_sizes, T)
 73 | # annotate_peaks(peak_points, ymin, ymax)
 74 | # annotate_summits(peak_points, peak_sizes, ymin, ymax, 'black')
 75 | # TODO fix plot_seq_error_line(total_pair_cov)
 76 | # 2,3 hist
 77 | # TODO rescale histogram axis by the scale of the smudgeplot
 78 | plot_histograms(minor_variant_rel_cov, total_pair_cov)
 79 | # 4 legend
 80 | plot_legend(k_toplot, total_pair_cov, colour_ramp)
 81 | 
 82 | # findInterval(c(0.1, 0.2, 0.33, 0.5), seq(0, 0.5, length = 41))
 83 | 
 84 | ##########################################################
 85 | ## TEST
 86 | ## idea here was to propagate from the highest point and expand the peak till it's monotonic
 87 | # starting_point <- which(dens_m == max(dens_m), arr.ind = TRUE)
 88 | # starting_val <- dens_m[starting_point]
 89 | # peak_points <- data.frame(x = starting_point[,2], y = starting_point[,1], value = starting_val)
 90 | #
 91 | # points_to_explore <- get_neibours(starting_val)
 92 | # val_to_comp <- starting_val
 93 | #
 94 | # for(one_point in 1:nrow(points_to_explore)){
 95 | #     one_point <- points_to_explore[one_point,]
 96 | #     point_val <- dens_m[t(one_point)]
 97 | #     if(point_val < val_to_comp){
 98 | #         peak_points <- rbind(peak_points,
 99 | #             data.frame(x = one_point[2], y = one_point[1], value = point_val))
100 | #     }
101 | # }
102 | #
103 | # get_neibours <- function(point){
104 | #     neibours_vec <- matrix(rep(starting_point,8) + c(-1,-1,0,-1,1,-1,-1,0,+1,0,-1,1,0,1,1,1),
105 | #                            ncol = 2, byrow = T)
106 | #     neibours_vec[rowSums(neibours_vec <= 30 & neibours_vec >= 1) == 2,]
107 | # }
108 | #
109 | 
110 | ##########################
111 | ###ALTERNATIVE PLTTING ###
112 | ##########################
113 | # library('spatialfil')
114 | # msnFit(high_cov_filt, minor_variant_rel_cov)
115 | 
116 | ## alternative plotting
117 | # library(hexbin) # honeycomb plot
118 | # h <- hexbin(df)
119 | # # h@count <- sqrt(h@count)
120 | # plot(h, colramp=rf)
121 | # gplot.hexbin(h, colorcut=10, colramp=rf)
122 | 
123 | 
124 | ### TEST  plot lines at expected coverages
125 | #
126 | # for(i in 2:6){
127 | #       lines(c(0, 0.6), rep(i * n, 2), lwd = 1.4)
128 | #       text(0.1, i * n, paste0(i,'x'), pos = 3)
129 | # }
130 | 
131 | # FUTURE - wrapper
132 | # smudgeplot < - function(.k, .minor_variant_rel_cov, .total_pair_cov, .n,
133 | #                             .sqrt_scale = T, .cex = 1.4, .fig_title = NA){
134 | #     if( .sqrt_scale == T ){
135 | #         # to display densities on squared root scale (bit like log scale but less agressive)
136 | #         .k$z <- sqrt(.k$z)
137 | #     }
138 | #
139 | #     pal <- brewer.pal(11,'Spectral')
140 | #     rf <- colorRampPalette(rev(pal[3:11]))
141 | #     colour_ramp <- rf(32)
142 | #
143 | #     layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
144 | #
145 | #     # 2D HISTOGRAM
146 | #     plot_smudgeplot(...)
147 | #
148 | #     # 1D historgram - minor_variant_rel_cov on top
149 | #     plot_histogram(...)
150 | #
151 | #     # 1D historgram - total pair coverage - right
152 | #     plot_histogram(...)
153 | #
154 | #     # LEGEND (topright corener)
155 | #     plot_legend(...)
156 | #
157 | # }
158 | 


--------------------------------------------------------------------------------
/playground/playground.py:
--------------------------------------------------------------------------------
  1 | #-----
  2 | # WHat I tried ot make plots work
  3 | # https://matplotlib.org/faq/howto_faq.html#generate-images-without-having-a-window-appear
  4 | import matplotlib
  5 | matplotlib.use('Agg')
  6 | import matplotlib.pyplot as plt
  7 | #-------
  8 | 
  9 | #Load the particular dumps file you wish
 10 | #These were created using jellyfish dump -c -L lower -U upper SRR_k21.jf > SRR_k21.dumps
 11 | dumps_file = 'ERR2135445.dumps' #aric1 -L 20 -U 350
 12 | dumps_file = 'SRR801084_k21.dumps' #avag1 -L 30 -U 300
 13 | dumps_file = 'SRR4242457_k21.dumps' #mare2 -L 13 -U 132
 14 | dumps_file = 'SRR4242472_k21.dumps' #ment1 -L 50 -U 350
 15 | dumps_file = 'SRR4242474_k21.dumps' #mflo2 -L 60 -U 400
 16 | dumps_file = 'SRR4242467_k21.dumps' #minc3 -L 25 -U 210
 17 | dumps_file = 'SRR4242462_k21.dumps' #mjav2 -L 80 -U 600
 18 | dumps_file = 'ERR2135453_k21.dumps' #rmac1 -L 100 -U 700
 19 | dumps_file = 'ERR2135451_k21.dumps' #rmag1 -L 60 -U 500
 20 | 
 21 | #  dumps_file = 'kmers_dump_L120_U1500.tsv'
 22 | 
 23 | 
 24 | 
 25 | 
 26 | 
 27 | plt.hist(coverages_2, bins = 1000)
 28 | plt.savefig('coverages_2_hist.png')
 29 | plt.close()
 30 | 
 31 | #  then plot a histogram of the coverages
 32 | 
 33 | plt.hist(coverages_3, bins = 1000)
 34 | plt.savefig('coverages_3_hist.png')
 35 | plt.close()
 36 | 
 37 | #n, bins, patches = plt.hist(coverages_3, bins = 1000)
 38 | #bins[np.argmax(n)]
 39 | 
 40 | #save families_4 to a pickle file, then plot a histogram of the coverages
 41 | 
 42 | plt.hist(coverages_4, bins = 1000)
 43 | plt.savefig('coverages_4_hist.png')
 44 | plt.close()
 45 | 
 46 | 
 47 | plt.hist(coverages_5, bins = 1000)
 48 | plt.savefig('coverages_5_hist.png')
 49 | plt.close()
 50 | 
 51 | #save families_6 to a pickle file, then plot a histogram of the coverages
 52 | plt.hist(coverages_6, bins = 1000)
 53 | plt.savefig('coverages_6_hist.png')
 54 | plt.close()
 55 | 
 56 | ###some code to load previously saved pickle files
 57 | # test_kmers = pickle.load(open('test_kmers.p', 'rb'))
 58 | # test_coverages = pickle.load(open('test_coverages.p', 'rb'))
 59 | G = pickle.load(open('G.p', 'rb'))
 60 | component_lengths = pickle.load(open('component_lengths.p', 'rb'))
 61 | families_2 = pickle.load(open('families_2.p', 'rb'))
 62 | coverages_2 = pickle.load(open('coverages_2.p', 'rb'))
 63 | # one_away_pair = pickle.load(open('one_away_pairs.p', 'rb'))
 64 | 
 65 | # perhaps faster way how to calculate coverages_2
 66 | # coverages_2 = [test_coverages[cov_i1] + test_coverages[cov_i2] for cov_i1, cov_i2 in families_2]
 67 | 
 68 | 
 69 | #-----
 70 | #   for coverage in coverages_2:
 71 | #
 72 | 
 73 | ###Everything below this is just scratch work
 74 | #f = open('ERR2135445_l20_u100.fa', 'r')
 75 | #g = open('new.fa', 'w')
 76 | #for line in f:
 77 | #  if line.startswith('>'):
 78 | #    g.write('>' + str(int(line[1:-1])+10000) + '\n')
 79 | #  else:
 80 | #    g.write(line)
 81 | #f.close()
 82 | #g.close()
 83 | 
 84 | #get_3away_pairs(['AAAAAAAA', 'AACTAAGA', 'AACAATGA', 'AAAAATCG'])
 85 | 
 86 | 
 87 | #get_1away_pairs(['AAA', 'AAC'])
 88 | 
 89 | #kmers = [''.join([random.choice('ACGT') for _ in range(20)]) for _ in range(10)]
 90 | 
 91 | #df2 = df[:1000000]
 92 | 
 93 | #for pair in pairs:
 94 | #  #f.write(str(df2[df2[0] == pair[0]].iloc[0,1])+'\n')
 95 | #  #f.write(str(df2[df2[0] == pair[1]].iloc[0,1])+'\n')
 96 | #  [x[1] for x in pairs if x[0] == pair[0]]+[x[0] for x in pairs if x[1] == pair[0]]
 97 | #  a = df2[df2[0] == pair[0]].iloc[0,1]/89.2
 98 | #  b = df2[df2[0] == pair[1]].iloc[0,1]/89.2
 99 | #  f.write(str((a, b, a+b))+'\n')
100 | 
101 | #Counter([min([Counter(pair[0])[x] for x in ['A', 'C', 'G', 'T']]) for pair in pairs])
102 | 
103 | #high_complexity_pairs = [pair for pair in pairs if min([Counter(pair[0])[x] for x in ['A', 'C', 'G', 'T']])==5]
104 | 
105 | #for hcpair in high_complexity_pairs:
106 | #  f.write(str(df2[df2[0] == hcpair[0]]))
107 | #  f.write(str(df2[df2[0] == hcpair[1]]))
108 | 
109 | #570620  TAAAATAATTTTTTTCTTAAA  115
110 | #878881  TAAAATAATTTTTTTCTAAAA  67
111 | #182
112 | 
113 | #526664  AATTACCATTCAACCAGTTTC  156
114 | #922303  AATTACCATTCAACCAGATTC  166
115 | #322
116 | 
117 | #394517  AAGAGAAAAGAAAAAAGTAAT  180
118 | #788086  AAGAGAAAAGAAAAAAGAAAT  180
119 | #360
120 | 
121 | #420665  AAAAAAAAGTGTTTTACTTTG  119
122 | #946878  AAAAAAAAGTGTTTTACTCTG  95
123 | #214
124 | 
125 | #594426  ACAAAATATTACCTTTATCTA  117
126 | #768315  ACAAAATATTACCTTTATTTA  152
127 | #269
128 | 
129 | #536269  ACAGATTGGCTTGTTTGAGCC  103
130 | #711261  ACAGATTGGCTTGTTTGAACC  99
131 | 
132 | #383862  ATTTCATTTGTTAGAAAAAAA  139
133 | #907248  ATTTCATTTGTTAGAAAAGAA  162
134 | 
135 | #438051  TCAACAGAAAATAATGGAGCA  152
136 | #962365  TCAACAGAAAATAATGGAACA  143
137 | 
138 | #425231  AAAAAAAAACGAAAAAATTTT  15
139 | #734086  AAAAAAAAACGAAAAAAATTT  18
140 | 
141 | #607197  AAAAAAAAACACGACATGTTT  154
142 | #783001  AAAAAAAAACACGACATGCTT  134
143 | 
144 | 
145 | #test_kmers = {i:kmer for (i, kmer) in enumerate(kmers[:100000])}
146 | 
147 | #members = [x[0] for x in one_away_pairs] + [x[1] for x in one_away_pairs] + [x[0] for x in two_away_pairs] + [x[1] for x in two_away_pairs]
148 | #G = nx.Graph()
149 | #for one_away_pair in one_away_pairs:
150 | #  G.add_edge(*one_away_pair)
151 | #for two_away_pair in two_away_pairs:
152 | #  G.add_edge(*two_away_pair)
153 | 
154 | #component_lengths = [len(x) for x in nx.connected_components(G)]
155 | #Counter(component_lengths)
156 | #families = [list(x) for x in nx.connected_components(G) if len(x) == 2]
157 | #coverages = [df2.iloc[families[i][0], 1]+df2.iloc[families[i][1], 1] for i in range(len(families))]
158 | #plt.hist(coverages, bins = 100)
159 | #plt.savefig('coverages_hist.png')
160 | #plt.close()
161 | 
162 | #families_3 = [list(x) for x in nx.connected_components(G) if len(x) == 3]
163 | #coverages_3 = [df2.iloc[families_3[i][0], 1]+df2.iloc[families_3[i][1], 1] for i in range(len(families_3))]
164 | #plt.hist(coverages_3, bins = 100)
165 | #plt.savefig('coverages_3_hist.png')
166 | #plt.close()
167 | 


--------------------------------------------------------------------------------
/playground/popart.R:
--------------------------------------------------------------------------------
 1 | library(smudgeplot)
 2 | 
 3 | args <- ArgumentParser()$parse_args()
 4 | args$output <- "data/Scer/sploidyplot_test"
 5 | args$nbins <- 40
 6 | args$kmer_size <- 21
 7 | args$homozygous <- FALSE
 8 | args$L <- c()
 9 | args$col_ramp <- 'viridis'
10 | args$invert_cols <- TRUE
11 | 
12 | cov_tab <- read.table("data/Scer/PloidyPlot3_text.smu", col.names = c('covB', 'covA', 'freq'), skip = 2) #nolint
13 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 1] + cov_tab[, 2]
14 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 1] / cov_tab[, 'total_pair_cov']
15 | 
16 | cov_tab_1n_est <- round(estimate_1n_coverage_1d_subsets(cov_tab), 1)
17 | 
18 | xlim <- c(0, 0.5)
19 | # max(total_pair_cov); 10*draft_n
20 | ylim <- c(0, 150)
21 | nbins <- 40
22 | 
23 | smudge_container <- get_smudge_container(cov_tab, nbins, xlim, ylim)
24 | smudge_container$z <- smudge_container$dens
25 | 
26 | plot_popart <- function(cov_tab, ylim, colour_ramp){
27 |     A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB']
28 |     cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2
29 |     cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))]
30 | 
31 |     plot(NULL, xlim = c(0.1, 0.5), ylim = ylim, xaxt = "n", yaxt = "n", xlab = '', ylab = '', bty = 'n')
32 |     min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov']))
33 |     nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab)
34 |     return(0)
35 | }
36 | 
37 | par(mfrow = c(2, 5))
38 | 
39 | args$col_ramp <- "viridis"
40 | args$invert_cols <- FALSE
41 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11)
42 | # plot_smudgeplot(smudge_container, 15.5, colour_ramp)
43 | plot_popart(cov_tab, c(20, 120), colour_ramp)
44 | 
45 | args$invert_cols <- TRUE
46 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11)
47 | # plot_smudgeplot(smudge_container, 15.5, colour_ramp)
48 | plot_popart(cov_tab, c(20, 120), colour_ramp)
49 | 
50 | 
51 | for (ramp in c("grey.colors", "magma", "plasma", "mako", "inferno", "rocket", "heat.colors", "cm.colors")){
52 |     args$col_ramp <- ramp
53 |     colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11)
54 |     plot_popart(cov_tab, c(20, 120), colour_ramp)
55 | }
56 | 
57 | 


--------------------------------------------------------------------------------
/src_ploidyplot/PloidyPlot.c:
--------------------------------------------------------------------------------
   1 | /******************************************************************************************
   2 |  *
   3 |  *  PloidyPlot: A C-backed searching quickly for hetmers:
   4 |  *                 unique k-mer pairs different by exactly one nucleotide
   5 |  *
   6 |  *  Author:  Gene Myers
   7 |  *  Date  :  May, 2021
   8 |  *  Reduced to the k-mer pair search by Kamil Jaron in August, 2023
   9 |  *
  10 |  ********************************************************************************************/
  11 | 
  12 | #include <stdlib.h>
  13 | #include <stdio.h>
  14 | #include <ctype.h>
  15 | #include <string.h>
  16 | #include <unistd.h>
  17 | #include <dirent.h>
  18 | #include <math.h>
  19 | #include <pthread.h>
  20 | 
  21 | #undef  SOLO_CHECK
  22 | 
  23 | #undef  DEBUG_GENERAL
  24 | #undef  DEBUG_RECURSION
  25 | #undef  DEBUG_THREADS
  26 | #undef  DEBUG_BOUNDARY
  27 | #undef  DEBUG_SCAN
  28 | #undef  DEBUG_BIG_SCAN
  29 | 
  30 | #include "libfastk.h"
  31 | #include "matrix.h"
  32 | 
  33 | static char *Usage[] = { " [-v] [-T<int(4)>] [-P<dir(/tmp)>]",
  34 |                          " [-o<output>] [-e<int(4)>] <source>[.ktab]"
  35 |                        };
  36 | 
  37 | static int VERBOSE;
  38 | static int NTHREADS;   //  At most 64 allowed
  39 | 
  40 | #ifdef SOLO_CHECK
  41 | 
  42 | static uint8 *CENT;
  43 | static int64  CIDX;
  44 | 
  45 | #endif
  46 | 
  47 | static int ETHRESH;   //  Error threshold
  48 | 
  49 | #define SMAX  1000    //  Max. value of CovA+CovB
  50 | #define FMAX   500    //  Max. value of min(CovA,CovB)
  51 | 
  52 | static int BLEVEL;     //  <= 4
  53 | static int BWIDTH;     //  = 4^BLEVEL
  54 | 
  55 | static int64 MEMORY_LIMIT = 0x100000000ll;  // 4GB 
  56 | static int64 Cache_Size;                    // Divided evenly amont the threads
  57 | 
  58 | static int KMER;
  59 | static int KBYTE;
  60 | static int TBYTE;
  61 | 
  62 | static int PASS1;
  63 | 
  64 | /****************************************************************************************
  65 |  *
  66 |  *  Print & compare utilities
  67 |  *
  68 |  *****************************************************************************************/
  69 | 
  70 | #define  COUNT_PTR(p)   ((uint16 *) (p+KBYTE))
  71 | 
  72 | static char dna[4] = { 'a', 'c', 'g', 't' };
  73 | 
  74 | static char *fmer[256], _fmer[1280];
  75 | 
  76 | static void setup_fmer_table()
  77 | { char *t;
  78 |   int   i, l3, l2, l1, l0;
  79 | 
  80 |   i = 0;
  81 |   t = _fmer;
  82 |   for (l3 = 0; l3 < 4; l3++)
  83 |    for (l2 = 0; l2 < 4; l2++)
  84 |     for (l1 = 0; l1 < 4; l1++)
  85 |      for (l0 = 0; l0 < 4; l0++)
  86 |        { fmer[i] = t;
  87 |          *t++ = dna[l3];
  88 |          *t++ = dna[l2];
  89 |          *t++ = dna[l1];
  90 |          *t++ = dna[l0];
  91 |          *t++ = 0;
  92 |          i += 1;
  93 |        }
  94 | }
  95 | 
  96 | static void print_hap(uint8 *seq, int len, int half)
  97 | { static int firstime = 1;
  98 |   int i, b, h, k;
  99 | 
 100 |   if (firstime)
 101 |     { firstime = 0;
 102 |       setup_fmer_table();
 103 |     }
 104 | 
 105 |   h = half >> 2;
 106 |   b = len >> 2;
 107 |   for (i = 0; i < h; i++)
 108 |     printf("%s",fmer[seq[i]]);
 109 |   k = 6;
 110 |   for (i = h << 2; k >= 0; i++)
 111 |     { if (i == half)
 112 |         printf("%c",dna[(seq[h] >> k) & 0x3]-32);
 113 |       else
 114 |         printf("%c",dna[(seq[h] >> k) & 0x3]);
 115 |       k -= 2;
 116 |     }
 117 |   for (i = h+1; i < b; i++)
 118 |     printf("%s",fmer[seq[i]]);
 119 |   k = 6;
 120 |   for (i = b << 2; i < len; i++)
 121 |     { printf("%c",dna[(seq[b] >> k) & 0x3]);
 122 |       k -= 2;
 123 |     }
 124 | }
 125 | 
 126 | static inline int mycmp(uint8 *a, uint8 *b, int n)
 127 | { while (n-- > 0)
 128 |     { if (*a++ != *b++)
 129 |         return (a[-1] < b[-1] ? -1 : 1);
 130 |     }
 131 |   return (0);
 132 | }
 133 | 
 134 | 
 135 | /****************************************************************************************
 136 |  *
 137 |  *  Find Het Pairs by merging 4 lists:
 138 |  *    Out of core: stream indices from fng[a] to end[a] for a in [0,3]
 139 |  *    In core:     table pointer from ptr[a] to eptr[a] for a in [0,3]
 140 |  *
 141 |  *****************************************************************************************/
 142 | 
 143 | typedef struct
 144 |   { int           level;    //  position of variation
 145 |     int64        *bound;    //  17-element array of next level transition points
 146 |                           //  Stream scans (big & small):
 147 |     Kmer_Stream  *fng[4];   //  4 streams for scan
 148 |     int64         end[4];   //  index of end of each scan
 149 |                           //  Small scans:
 150 |     int           tid;      //  thread assigned to this subtree
 151 |     uint8        *cache;    //  in-core cache for subtrees that fit
 152 |                           //  In-core scans:
 153 |     int64         cidx;     //  Table index of 1st cache entry
 154 |     uint8        *ept[4];   //  Each list is in [ptr[a],eptr[a])
 155 |     uint8        *ptr[4];   //    in steps of TBYTES
 156 |                           //  Plot:
 157 |     int64       **plot;     //  Accumulate A+B, B/(A+B) pairs here
 158 | 
 159 | #ifdef SOLO_CHECK
 160 |     uint8        *cptr;
 161 | #endif
 162 |   } TP;
 163 | 
 164 | static uint8 *Pair;  //  Incidence array (# of table entries)
 165 | 
 166 | static uint8 Prefix[4] = { 0x3f, 0x0f, 0x03, 0x00 };
 167 | static uint8 Shift[4]  = { 6, 4, 2, 0 };
 168 | 
 169 | static void *analysis_thread_1(void *args)
 170 | { TP           *parm  = (TP *) args;
 171 |   int64        *end   = parm->end;
 172 |   Kmer_Stream **fng   = parm->fng;
 173 |   int           level = parm->level;
 174 |   int64        *bound = parm->bound;
 175 | 
 176 |   int ll = ((level+1)>>2);
 177 |   int ls = Shift[(level+1)&0x3];
 178 | 
 179 |   int mask  = Prefix[level&0x3];
 180 |   int offs  = (level >> 2) + 1;
 181 |   int rem   = KBYTE - offs;
 182 | 
 183 |   uint8 *ent[4];
 184 |   int    lst[4];
 185 |   int    in[4], itop;
 186 |   int    cnt[4];
 187 |   int    mc, hc;
 188 |   uint8 *mr, *hr;
 189 |   int    a, i, x;
 190 |   
 191 |   for (a = 0; a < 4; a++)
 192 |     { ent[a] = Current_Entry(fng[a],NULL);
 193 |       lst[a] = (ent[a][ll] >> ls) & 0x3;
 194 |       if (fng[a]->cidx < end[a])
 195 |         { x = (ent[a][ll] >> ls) & 0x3;
 196 |           while (x > lst[a])
 197 |             bound[(a<<2)+(++(lst[a]))] = fng[a]->cidx;
 198 |         }
 199 |     }
 200 | 
 201 | #ifdef DEBUG_SCAN
 202 |   for (a = 0; a < 4; a++)
 203 |     { printf(" %c %10lld: ",dna[a],fng[a]->cidx);
 204 |       print_hap(ent[a],KMER,level);
 205 |     }
 206 |   printf("\n");
 207 | #endif
 208 | 
 209 |   while (1)
 210 |     { for (a = 0; a < 4; a++)
 211 |         if (fng[a]->cidx < end[a])
 212 |           break;
 213 |       if (a >= 4)
 214 |         break;
 215 | 
 216 |       mr = ent[a]+offs;
 217 |       mc = mr[-1] & mask;
 218 |       in[0] = a;
 219 |       itop  = 1;
 220 |       for (a++; a < 4; a++)
 221 |         if (fng[a]->cidx < end[a])
 222 |           { hr = ent[a]+offs;
 223 |             hc = hr[-1] & mask;
 224 |             if (hc == mc)
 225 |               { int v = mycmp(hr,mr,rem);
 226 |                 if (v == 0)
 227 |                   in[itop++] = a;
 228 |                 else if (v < 0)
 229 |                   { mc = hc;
 230 |                     mr = hr;
 231 |                     in[0] = a;
 232 |                     itop  = 1;
 233 |                   }
 234 |               }
 235 |             else if (hc < mc)
 236 |               { mc = hc;
 237 |                 mr = hr;
 238 |                 in[0] = a;
 239 |                 itop  = 1;
 240 |               }
 241 |           }
 242 | 
 243 |       if (itop > 1)
 244 |         { cnt[0] = *((uint16 *) (ent[in[0]]+KBYTE));
 245 |           for (i = 1; i < itop; i++)
 246 |             { cnt[i] = *((uint16 *) (ent[in[i]]+KBYTE));
 247 |               for (a = 0; a < i; a++)
 248 |                 { x = cnt[a]+cnt[i];
 249 |                   if (x <= SMAX)
 250 |                     { Pair[fng[in[i]]->cidx] += 1;
 251 |                       Pair[fng[in[a]]->cidx] += 1;
 252 |                     }
 253 |                 }
 254 |             }
 255 |         }
 256 | 
 257 |       for (i = 0; i < itop; i++)
 258 |         { Kmer_Stream *t;
 259 | 
 260 |           a = in[i];
 261 |           t = fng[a];
 262 | #ifdef DEBUG_SCAN
 263 |           if (i == 0) printf("\n");
 264 |           printf(" %c %10lld: ",dna[a&0x3],fng[a]->cidx); 
 265 |           print_hap(ent[a],KMER,level);
 266 |           printf("\n");
 267 | #endif
 268 |           Next_Kmer_Entry(t);
 269 |           if (t->cidx < end[a])
 270 |             { Current_Entry(t,ent[a]);
 271 |               x = (ent[a][ll] >> ls) & 0x3;
 272 |               while (x > lst[a])
 273 |                 bound[(a<<2)+(++(lst[a]))] = t->cidx;
 274 |             }
 275 |         }
 276 | 
 277 | #ifdef DEBUG_SCAN
 278 |       printf("\n");
 279 |       for (a = 0; a < 4; a++)
 280 |         { printf(" %c %10lld: ",dna[a],fng[a]->cidx);
 281 |           print_hap(ent[a],KMER,level);
 282 |         }
 283 |       printf("\n");
 284 | #endif
 285 |     }
 286 | 
 287 |   for (a = 0; a < 4; a++)
 288 |     free(ent[a]);
 289 | 
 290 |   return (NULL);
 291 | }
 292 | 
 293 | static void *analysis_thread_2(void *args)
 294 | { TP           *parm  = (TP *) args;
 295 |   int64        *end   = parm->end;
 296 |   Kmer_Stream **fng   = parm->fng;
 297 |   int           level = parm->level;
 298 |   int64        *bound = parm->bound;
 299 |   int64       **plot  = parm->plot;
 300 | 
 301 |   int ll = ((level+1)>>2);
 302 |   int ls = Shift[(level+1)&0x3];
 303 | 
 304 |   int mask  = Prefix[level&0x3];
 305 |   int offs  = (level >> 2) + 1;
 306 |   int rem   = KBYTE - offs;
 307 | 
 308 |   uint8 *ent[4];
 309 |   int    lst[4];
 310 |   int    in[4], itop;
 311 |   int    cnt[4];
 312 |   int    mc, hc;
 313 |   uint8 *mr, *hr;
 314 |   int    a, i, x;
 315 |   
 316 |   for (a = 0; a < 4; a++)
 317 |     { ent[a] = Current_Entry(fng[a],NULL);
 318 |       lst[a] = (ent[a][ll] >> ls) & 0x3;
 319 |       if (fng[a]->cidx < end[a])
 320 |         { x = (ent[a][ll] >> ls) & 0x3;
 321 |           while (x > lst[a])
 322 |             bound[(a<<2)+(++(lst[a]))] = fng[a]->cidx;
 323 |         }
 324 |     }
 325 | 
 326 | #ifdef DEBUG_SCAN
 327 |   for (a = 0; a < 4; a++)
 328 |     { printf(" %c %10lld: ",dna[a],fng[a]->cidx);
 329 |       print_hap(ent[a],KMER,level);
 330 |     }
 331 |   printf("\n");
 332 | #endif
 333 | 
 334 |   while (1)
 335 |     { for (a = 0; a < 4; a++)
 336 |         if (fng[a]->cidx < end[a])
 337 |           break;
 338 |       if (a >= 4)
 339 |         break;
 340 | 
 341 |       mr = ent[a]+offs;
 342 |       mc = mr[-1] & mask;
 343 |       in[0] = a;
 344 |       itop  = 1;
 345 |       for (a++; a < 4; a++)
 346 |         if (fng[a]->cidx < end[a])
 347 |           { hr = ent[a]+offs;
 348 |             hc = hr[-1] & mask;
 349 |             if (hc == mc)
 350 |               { int v = mycmp(hr,mr,rem);
 351 |                 if (v == 0)
 352 |                   in[itop++] = a;
 353 |                 else if (v < 0)
 354 |                   { mc = hc;
 355 |                     mr = hr;
 356 |                     in[0] = a;
 357 |                     itop  = 1;
 358 |                   }
 359 |               }
 360 |             else if (hc < mc)
 361 |               { mc = hc;
 362 |                 mr = hr;
 363 |                 in[0] = a;
 364 |                 itop  = 1;
 365 |               }
 366 |           }
 367 | 
 368 |       if (itop > 1)
 369 | #ifdef SOLO_CHECK
 370 |         { for (i = 0; i < itop; i++)
 371 |             if (mycmp(ent[in[i]],CENT,KBYTE) == 0)
 372 |               for (a = 0; a < itop; a++)
 373 |                 if (a != i)
 374 |                   { printf("  ");
 375 |                     print_hap(ent[in[a]],KMER,level);
 376 |                     printf(": %d\n",*((uint16 *) (ent[in[a]]+KBYTE)));
 377 |                   }
 378 |         }
 379 | #else
 380 |         { cnt[0] = *((uint16 *) (ent[in[0]]+KBYTE));
 381 |           for (i = 1; i < itop; i++)
 382 |             { cnt[i] = *((uint16 *) (ent[in[i]]+KBYTE));
 383 |               if (Pair[fng[in[i]]->cidx] <= 1)
 384 |                 for (a = 0; a < i; a++)
 385 |                   { x = cnt[a]+cnt[i];
 386 |                     if (x <= SMAX && Pair[fng[in[a]]->cidx] <= 1)
 387 |                       { if (cnt[a] < cnt[i])
 388 |                           plot[x][cnt[a]] += 1;
 389 |                         else
 390 |                           plot[x][cnt[i]] += 1;
 391 |                       }
 392 |                   }
 393 |             }
 394 |         }
 395 | #endif
 396 | 
 397 |       for (i = 0; i < itop; i++)
 398 |         { Kmer_Stream *t;
 399 | 
 400 |           a = in[i];
 401 |           t = fng[a];
 402 | #ifdef DEBUG_SCAN
 403 |           if (i == 0) printf("\n");
 404 |           printf(" %c %10lld: ",dna[a&0x3],fng[a]->cidx); 
 405 |           print_hap(ent[a],KMER,level);
 406 |           printf("\n");
 407 | #endif
 408 |           Next_Kmer_Entry(t);
 409 |           if (t->cidx < end[a])
 410 |             { Current_Entry(t,ent[a]);
 411 |               x = (ent[a][ll] >> ls) & 0x3;
 412 |               while (x > lst[a])
 413 |                 bound[(a<<2)+(++(lst[a]))] = t->cidx;
 414 |             }
 415 |         }
 416 | 
 417 | #ifdef DEBUG_SCAN
 418 |       printf("\n");
 419 |       for (a = 0; a < 4; a++)
 420 |         { printf(" %c %10lld: ",dna[a],fng[a]->cidx);
 421 |           print_hap(ent[a],KMER,level);
 422 |         }
 423 |       printf("\n");
 424 | #endif
 425 |     }
 426 | 
 427 |   for (a = 0; a < 4; a++)
 428 |     free(ent[a]);
 429 | 
 430 |   return (NULL);
 431 | }
 432 | 
 433 | static void *analysis_in_core_1(void *args)
 434 | { TP          *parm  = (TP *) args;
 435 |   uint8      **ptr   = parm->ptr;
 436 |   uint8      **ept   = parm->ept;
 437 |   int          level = parm-> level;
 438 |   uint8      **bound = (uint8 **) (parm->bound);
 439 |   uint8       *cache = parm->cache;
 440 |   int64        aidx  = parm->cidx;
 441 | 
 442 |   int ll = ((level+1)>>2);
 443 |   int ls = Shift[(level+1)&0x3];
 444 | 
 445 |   int mask  = Prefix[level&0x3];
 446 |   int offs  = (level >> 2) + 1;
 447 |   int rem   = KBYTE - offs;
 448 | 
 449 |   int    lst[4];
 450 |   int    in[4], itop;
 451 |   int    cnt[4];
 452 |   int    mc, hc;
 453 |   uint8 *mr, *hr;
 454 |   int    a, i, x;
 455 | 
 456 |   for (a = 0; a < 4; a++)
 457 |     { lst[a] = 0;
 458 |       if (ptr[a] < ept[a])
 459 |         { x = (ptr[a][ll] >> ls) & 0x3;
 460 |           while (x > lst[a])
 461 |             bound[(a<<2)+(++(lst[a]))] = ptr[a];
 462 |         }
 463 |     }
 464 | 
 465 | #ifdef DEBUG_SCAN
 466 |   for (a = 0; a < 4; a++)
 467 |     { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 
 468 |       print_hap(ptr[a],KMER,level);
 469 |     }
 470 |   printf("\n");
 471 | #endif
 472 | 
 473 |   while (1)
 474 |     { for (a = 0; a < 4; a++)
 475 |         if (ptr[a] < ept[a])
 476 |           break;
 477 |       if (a >= 4)
 478 |         break;
 479 | 
 480 |       mr = ptr[a]+offs;
 481 |       mc = mr[-1] & mask;
 482 |       in[0] = a;
 483 |       itop  = 1;
 484 |       for (a++; a < 4; a++)
 485 |         if (ptr[a] < ept[a])
 486 |           { hr = ptr[a]+offs;
 487 |             hc = hr[-1] & mask;
 488 |             if (hc == mc)
 489 |               { int v = mycmp(hr,mr,rem);
 490 |                 if (v == 0)
 491 |                   in[itop++] = a;
 492 |                 else if (v < 0)
 493 |                   { mc = hc;
 494 |                     mr = hr;
 495 |                     in[0] = a;
 496 |                     itop  = 1;
 497 |                   }
 498 |               }
 499 |             else if (hc < mc)
 500 |               { mc = hc;
 501 |                 mr = hr;
 502 |                 in[0] = a;
 503 |                 itop  = 1;
 504 |               }
 505 |           }
 506 | 
 507 |       if (itop > 1)
 508 |         { cnt[0] = *((uint16 *) (ptr[in[0]]+KBYTE));
 509 |           for (i = 1; i < itop; i++)
 510 |             { cnt[i] = *((uint16 *) (ptr[in[i]]+KBYTE));
 511 |               for (a = 0; a < i; a++)
 512 |                 { x = cnt[a]+cnt[i];
 513 |                   if (x <= SMAX)
 514 |                     { Pair[aidx + (ptr[in[i]]-cache)/TBYTE] += 1;
 515 |                       Pair[aidx + (ptr[in[a]]-cache)/TBYTE] += 1;
 516 |                     }
 517 |                 }
 518 |             }
 519 |         }
 520 | 
 521 |       for (i = 0; i < itop; i++)
 522 |         { a = in[i];
 523 | #ifdef DEBUG_SCAN
 524 |           if (i == 0) printf("\n");
 525 |           printf("%c %10ld: ",dna[a&0x3],(ptr[a]-parm->cache)/TBYTE); 
 526 |           print_hap(ptr[a],KMER,level);
 527 |           printf("\n");
 528 | #endif
 529 |           ptr[a] += TBYTE;
 530 |           if (ptr[a] < ept[a])
 531 |             { x = (ptr[a][ll] >> ls) & 0x3;
 532 |               while (x > lst[a])
 533 |                 bound[(a<<2)+(++(lst[a]))] = ptr[a];
 534 |             }
 535 |         }
 536 | 
 537 | #ifdef DEBUG_SCAN
 538 |       for (a = 0; a < 4; a++)
 539 |         { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 
 540 |           print_hap(ptr[a],KMER,level);
 541 |         }
 542 |       printf("\n");
 543 | #endif
 544 |     }
 545 | 
 546 |   return (NULL);
 547 | }
 548 | 
 549 | static void *analysis_in_core_2(void *args)
 550 | { TP          *parm  = (TP *) args;
 551 |   uint8      **ptr   = parm->ptr;
 552 |   uint8      **ept   = parm->ept;
 553 |   int          level = parm-> level;
 554 |   uint8      **bound = (uint8 **) (parm->bound);
 555 |   int64      **plot  = parm->plot;
 556 |   uint8       *cache = parm->cache;
 557 |   int64        aidx  = parm->cidx;
 558 | 
 559 |   int ll = ((level+1)>>2);
 560 |   int ls = Shift[(level+1)&0x3];
 561 | 
 562 |   int mask  = Prefix[level&0x3];
 563 |   int offs  = (level >> 2) + 1;
 564 |   int rem   = KBYTE - offs;
 565 | 
 566 |   int    lst[4];
 567 |   int    in[4], itop;
 568 |   int    cnt[4];
 569 |   int    mc, hc;
 570 |   uint8 *mr, *hr;
 571 |   int    a, i, x;
 572 | 
 573 |   for (a = 0; a < 4; a++)
 574 |     { lst[a] = 0;
 575 |       if (ptr[a] < ept[a])
 576 |         { x = (ptr[a][ll] >> ls) & 0x3;
 577 |           while (x > lst[a])
 578 |             bound[(a<<2)+(++(lst[a]))] = ptr[a];
 579 |         }
 580 |     }
 581 | 
 582 | #ifdef DEBUG_SCAN
 583 |   for (a = 0; a < 4; a++)
 584 |     { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 
 585 |       print_hap(ptr[a],KMER,level);
 586 |     }
 587 |   printf("\n");
 588 | #endif
 589 | 
 590 |   while (1)
 591 |     { for (a = 0; a < 4; a++)
 592 |         if (ptr[a] < ept[a])
 593 |           break;
 594 |       if (a >= 4)
 595 |         break;
 596 | 
 597 |       mr = ptr[a]+offs;
 598 |       mc = mr[-1] & mask;
 599 |       in[0] = a;
 600 |       itop  = 1;
 601 |       for (a++; a < 4; a++)
 602 |         if (ptr[a] < ept[a])
 603 |           { hr = ptr[a]+offs;
 604 |             hc = hr[-1] & mask;
 605 |             if (hc == mc)
 606 |               { int v = mycmp(hr,mr,rem);
 607 |                 if (v == 0)
 608 |                   in[itop++] = a;
 609 |                 else if (v < 0)
 610 |                   { mc = hc;
 611 |                     mr = hr;
 612 |                     in[0] = a;
 613 |                     itop  = 1;
 614 |                   }
 615 |               }
 616 |             else if (hc < mc)
 617 |               { mc = hc;
 618 |                 mr = hr;
 619 |                 in[0] = a;
 620 |                 itop  = 1;
 621 |               }
 622 |           }
 623 | 
 624 |       if (itop > 1)
 625 | #ifdef SOLO_CHECK
 626 |         { for (i = 0; i < itop; i++)
 627 |             if (mycmp(ptr[in[i]],CENT,KBYTE) == 0)
 628 |               for (a = 0; a < itop; a++)
 629 |                 if (a != i)
 630 |                   { printf("  ");
 631 |                     print_hap(ptr[in[a]],KMER,level);
 632 |                     printf(": %d\n",*((uint16 *) (ptr[in[a]]+KBYTE)));
 633 |                   }
 634 |         }
 635 | #else
 636 |         { cnt[0] = *((uint16 *) (ptr[in[0]]+KBYTE));
 637 |           for (i = 1; i < itop; i++)
 638 |             { cnt[i] = *((uint16 *) (ptr[in[i]]+KBYTE));
 639 |               if (Pair[aidx+(ptr[in[i]]-cache)/TBYTE] <= 1)
 640 |                 for (a = 0; a < i; a++)
 641 |                   { x = cnt[a]+cnt[i];
 642 |                     if (x <= SMAX && Pair[aidx+(ptr[in[a]]-cache)/TBYTE] <= 1)
 643 |                       { if (cnt[a] < cnt[i])
 644 |                           plot[x][cnt[a]] += 1;
 645 |                         else
 646 |                           plot[x][cnt[i]] += 1;
 647 |                       }
 648 |                   }
 649 |             }
 650 |         }
 651 | #endif
 652 | 
 653 |       for (i = 0; i < itop; i++)
 654 |         { a = in[i];
 655 | #ifdef DEBUG_SCAN
 656 |           if (i == 0) printf("\n");
 657 |           printf("%c %10ld: ",dna[a&0x3],(ptr[a]-parm->cache)/TBYTE); 
 658 |           print_hap(ptr[a],KMER,level);
 659 |           printf("\n");
 660 | #endif
 661 |           ptr[a] += TBYTE;
 662 |           if (ptr[a] < ept[a])
 663 |             { x = (ptr[a][ll] >> ls) & 0x3;
 664 |               while (x > lst[a])
 665 |                 bound[(a<<2)+(++(lst[a]))] = ptr[a];
 666 |             }
 667 |         }
 668 | 
 669 | #ifdef DEBUG_SCAN
 670 |       for (a = 0; a < 4; a++)
 671 |         { printf(" %c %10ld: ",dna[a],(ptr[a]-parm->cache)/TBYTE); 
 672 |           print_hap(ptr[a],KMER,level);
 673 |         }
 674 |       printf("\n");
 675 | #endif
 676 |     }
 677 | 
 678 |   return (NULL);
 679 | }
 680 | 
 681 | 
 682 | /****************************************************************************************
 683 |  *
 684 |  *  Find Het Pairs in top level nodes (level < BLEVEL <= 3) by paneling 4 merge intervals
 685 |  *    with all threads.
 686 |  *
 687 |  *****************************************************************************************/
 688 | 
 689 | static uint8 *Divpt;
 690 | 
 691 | static void big_window(int64 *adiv, int level, TP *parm)
 692 | { int64 bound[17];
 693 |   int   a;
 694 | 
 695 | #ifdef DEBUG_GENERAL
 696 |   printf("Doing big %d: %lld\n",level,adiv[4]-adiv[0]); fflush(stdout);
 697 | #endif
 698 | 
 699 |   { int64        e;
 700 |     uint8        lm;
 701 |     int          t, ls;
 702 |     Kmer_Stream *T;
 703 | #ifndef DEBUG_THREADS
 704 |     pthread_t    threads[NTHREADS];
 705 | #endif
 706 | 
 707 |     T = parm[0].fng[0];
 708 | 
 709 |     ls = Shift[level];      //  level < BLEVEL
 710 |     lm = 0xff ^ (0x3<<ls);
 711 |     for (a = 0; a < 4; a++)
 712 |       GoTo_Kmer_Index(parm[0].fng[a],adiv[a]);
 713 |     parm[0].level = level;
 714 |     parm[0].bound = bound;
 715 |     for (t = 1; t < NTHREADS; t++)
 716 |       { parm[t].level = level;
 717 |         parm[t].bound = bound;
 718 |         parm[t-1].end[0] = e = adiv[0] + t*(adiv[1]-adiv[0])/NTHREADS;
 719 |         T = parm[t].fng[0];
 720 |         GoTo_Kmer_Index(T,e);
 721 |         Current_Entry(T,Divpt);
 722 |         for (a = 1; a < 4; a++)
 723 |           { Divpt[0] = (Divpt[0] & lm) | (a << ls);
 724 |             T = parm[t].fng[a];
 725 |             GoTo_Kmer_Entry(T,Divpt);
 726 |             parm[t-1].end[a] = T->cidx;
 727 |           }
 728 |       }
 729 |     for (a = 0; a < 4; a++)
 730 |       parm[NTHREADS-1].end[a] = adiv[a+1];
 731 | 
 732 | #ifdef DEBUG_BIG_SCAN
 733 |     for (a = 0; a < 4; a++)
 734 |       for (t = 0; t < NTHREADS; t++)
 735 |         { printf("%c/%d %10lld: ",dna[a],t,parm[t].fng[a]->cidx);
 736 |           Current_Entry(parm[t].fng[a],Divpt);
 737 |           print_hap(Divpt,KMER,level);
 738 |           printf("\n");
 739 |         }
 740 | #endif
 741 | 
 742 |     for (a = 0; a < 16; a += 4)
 743 |       { bound[a] = adiv[a>>2];
 744 |         for (t = 1; t < 4; t++)
 745 |           bound[a+t] = -1;
 746 |       }
 747 | 
 748 |     if (PASS1)
 749 | #ifdef DEBUG_THREADS
 750 |       { for (t = 0; t < NTHREADS; t++)
 751 |           analysis_thread_1(parm+t);
 752 | #else
 753 |       { for (t = 1; t < NTHREADS; t++)
 754 |           pthread_create(threads+t,NULL,analysis_thread_1,parm+t);
 755 |         analysis_thread_1(parm);
 756 |         for (t = 1; t < NTHREADS; t++)
 757 |           pthread_join(threads[t],NULL);
 758 | #endif
 759 |       }
 760 |     else
 761 | #ifdef DEBUG_THREADS
 762 |       { for (t = 0; t < NTHREADS; t++)
 763 |           analysis_thread_2(parm+t);
 764 | #else
 765 |       { for (t = 1; t < NTHREADS; t++)
 766 |           pthread_create(threads+t,NULL,analysis_thread_2,parm+t);
 767 |         analysis_thread_2(parm);
 768 |         for (t = 1; t < NTHREADS; t++)
 769 |           pthread_join(threads[t],NULL);
 770 | #endif
 771 |       }
 772 | 
 773 |     for (a = 0; a < 16; a += 4)
 774 |       for (t = 1; t < 4; t++)
 775 |         if (bound[a+t] < 0)
 776 |           bound[a+t] = adiv[(a>>2)+1];
 777 |     bound[16] = adiv[4];
 778 | 
 779 | #ifdef DEBUG_BOUNDARY
 780 |     T = parm[0].fng[0];
 781 |     for (a = 0; a <= 16; a++)
 782 |       { if (a > 0)
 783 |           { GoTo_Kmer_Index(T,bound[a]-1);
 784 |             Current_Entry(T,Divpt);
 785 |             printf("%c %10lld: ",dna[a&0x3],T->cidx);
 786 |             print_hap(Divpt,KMER,level+1);
 787 |             printf("\n");
 788 |           }
 789 |         if (a < 16)
 790 |           { GoTo_Kmer_Index(T,bound[a]);
 791 |             Current_Entry(T,Divpt);
 792 |             printf("%c %10lld: ",dna[a&0x3],T->cidx);
 793 |             print_hap(Divpt,KMER,level+1);
 794 |             printf("\n");
 795 |           }
 796 |       }
 797 | #endif
 798 |     }
 799 | 
 800 |   level += 1;
 801 |   if (level < BLEVEL)
 802 |     for (a = 0; a < 16; a += 4)
 803 |       big_window(bound+a,level,parm);
 804 | }
 805 | 
 806 | 
 807 | /****************************************************************************************
 808 |  *
 809 |  *  Find Het Pairs for a lower level node by list merging
 810 |  *
 811 |  *****************************************************************************************/
 812 | 
 813 | static void in_core_recursion(uint8 **aptr, int level, TP *parm)
 814 | { uint8 *bound[17];
 815 |   int    a;
 816 | 
 817 | #ifdef SOLO_CHECK
 818 |   if (aptr[0] <= parm->cptr && parm->cptr < aptr[4])
 819 |     { printf("Inside %ld-%ld (%d %d)\n",
 820 |              (aptr[0]-parm->cache)/TBYTE,(aptr[4]-parm->cache)/TBYTE,parm->tid,level);
 821 |       printf("     ");
 822 |       print_hap(aptr[0],KMER,level);
 823 |       printf(" : ");
 824 |       print_hap(aptr[4],KMER,level);
 825 |       printf("\n");
 826 |     }
 827 | #endif
 828 | 
 829 |   if (aptr[4]-aptr[0] <= TBYTE) return;
 830 | 
 831 | #ifdef DEBUG_RECURSION
 832 |   printf("Doing in-core %d: %ld\n",level,(aptr[4]-aptr[0])/TBYTE);
 833 | #endif
 834 | 
 835 |   { int t;
 836 | 
 837 |     parm->ptr[0] = aptr[0];
 838 |     for (a = 1; a < 4; a++)
 839 |       parm->ptr[a] = parm->ept[a-1] = aptr[a];
 840 |     parm->ept[3] = aptr[4];
 841 | 
 842 |     parm->level = level;
 843 |     parm->bound = (int64 *) bound;
 844 | 
 845 |     for (a = 0; a < 16; a += 4)
 846 |       { bound[a] = aptr[a>>2];
 847 |         for (t = 1; t < 4; t++)
 848 |           bound[a+t] = NULL;
 849 |       }
 850 |     bound[16] = aptr[4];
 851 | 
 852 |     if (PASS1)
 853 |       analysis_in_core_1(parm);
 854 |     else
 855 |       analysis_in_core_2(parm);
 856 | 
 857 |     for (a = 0; a < 16; a += 4)
 858 |       for (t = 1; t < 4; t++)
 859 |         if (bound[a+t] == NULL)
 860 |           bound[a+t] = bound[a+4];
 861 | 
 862 | #ifdef DEBUG_BOUNDARY
 863 |     { uint8 *cache = parm->cache;
 864 | 
 865 |       if (level+1 < KMER)
 866 |         for (a = 0; a <= 16; a++)
 867 |           { if (a > 0)
 868 |               { printf("%c %10ld: ",dna[a&0x3],(bound[a]-cache)/TBYTE); 
 869 |                 print_hap(bound[a],KMER,level+1);
 870 |                 printf("\n");
 871 |               }
 872 |             if (a < 16)
 873 |               { printf("%c %c %10ld: ",dna[a>>2],dna[a&0x3],(bound[a]-cache)/TBYTE); 
 874 |                 print_hap(bound[a],KMER,level+1);
 875 |                 printf(" - ");
 876 |               }
 877 |           }
 878 |     }
 879 | #endif
 880 |   }
 881 | 
 882 |   level += 1;
 883 |   if (level < KMER)
 884 |     for (a = 0; a < 16; a += 4)
 885 |       in_core_recursion(bound+a,level,parm);
 886 | }
 887 | 
 888 | static pthread_mutex_t TMUTEX;
 889 | static pthread_cond_t  TCOND;
 890 | 
 891 | static int *Tstack;
 892 | static int  Tavail;
 893 | 
 894 | static void small_recursion(int64 *adiv, int level, TP *parm)
 895 | {
 896 | #ifdef DEBUG_RECURSION
 897 |   printf("Doing small %d: %lld [%lld-%lld]\n",level,adiv[4]-adiv[0],adiv[0],adiv[4]);
 898 | #endif
 899 | 
 900 |   if (adiv[4]-adiv[0] < Cache_Size)
 901 |     { uint8       *aptr[5];
 902 | 
 903 |       { uint8       *C = parm->cache;
 904 |         Kmer_Stream *T;
 905 |         int64   i;
 906 |         int     a;
 907 | 
 908 | #ifdef SOLO_CHECK
 909 |         if (adiv[0] <= CIDX && CIDX < adiv[4])
 910 |           { printf("Heading in %lld-%lld (%d %d)\n",adiv[0],adiv[4],parm->tid,level);
 911 |             parm->cptr = parm->cache + (CIDX - adiv[0])*TBYTE;
 912 |           }  
 913 |         else
 914 |           parm->cptr = NULL;
 915 | #endif
 916 | 
 917 |         T = parm->fng[0];
 918 |         GoTo_Kmer_Index(T,adiv[0]);
 919 |         parm->cidx = T->cidx;
 920 |         for (i = 0; T->cidx < adiv[4]; i++)
 921 |           { Current_Entry(T,C);
 922 |             C += TBYTE;
 923 |             Next_Kmer_Entry(T);
 924 |           } 
 925 |         for (a = 1; a <= 4; a++)
 926 |           aptr[a] = parm->cache + (adiv[a]-adiv[0])*TBYTE;
 927 |         aptr[0] = parm->cache;
 928 |       }
 929 | 
 930 |       in_core_recursion(aptr,level,parm);
 931 | 
 932 |       return;
 933 |     }
 934 | 
 935 |   { int64 bound[17];
 936 |     int   a;
 937 | 
 938 |     { int t;
 939 | 
 940 |       for (a = 0; a < 4; a++)
 941 |         { parm->end[a] = adiv[a+1];
 942 |           GoTo_Kmer_Index(parm->fng[a],adiv[a]);
 943 |         }
 944 |       parm->level = level;
 945 |       parm->bound = bound;
 946 | 
 947 |       for (a = 0; a < 16; a += 4)
 948 |         { bound[a] = adiv[a>>2];
 949 |           for (t = 1; t < 4; t++)
 950 |             bound[a+t] = -1;
 951 |         }
 952 | 
 953 |       if (PASS1)
 954 |         analysis_thread_1(parm);
 955 |       else
 956 |         analysis_thread_2(parm);
 957 | 
 958 |       for (a = 0; a < 16; a += 4)
 959 |         for (t = 1; t < 4; t++)
 960 |           if (bound[a+t] < 0)
 961 |             bound[a+t] = bound[a+4];
 962 |       bound[16] = adiv[4];
 963 | 
 964 | #ifdef DEBUG_BOUNDARY
 965 |       { Kmer_Stream *T;
 966 | 
 967 |         T = parm->fng[0];
 968 |         if (level+1 < KMER)
 969 |           for (a = 0; a <= 16; a++)
 970 |             { if (a > 0)
 971 |                 { GoTo_Kmer_Index(T,bound[a]-1);
 972 |                   Current_Entry(T,Divpt);
 973 |                   printf("%c %10lld: ",dna[a&0x3],T->cidx);
 974 |                   print_hap(Divpt,KMER,level+1);
 975 |                   printf("\n");
 976 |                 }
 977 |               if (a < 16)
 978 |                 { GoTo_Kmer_Index(T,bound[a]);
 979 |                   Current_Entry(T,Divpt);
 980 |                   printf("%c %10lld: ",dna[a&0x3],T->cidx); 
 981 |                   print_hap(Divpt,KMER,level+1);
 982 |                   printf("\n");
 983 |                 }
 984 |             }
 985 |       }
 986 | #endif
 987 |     }
 988 | 
 989 |     level += 1;
 990 |     if (level < KMER)
 991 |       for (a = 0; a < 16; a += 4)
 992 |         small_recursion(bound+a,level,parm);
 993 |   }
 994 | }
 995 | 
 996 | static void *small_window(void *args)
 997 | { TP *parm  = (TP *) args;
 998 |   int tid   = parm->tid;
 999 | 
1000 |   int64 adiv[5];
1001 | 
1002 |   { uint8 divpt[TBYTE];
1003 |     Kmer_Stream *T;
1004 |     int a, x;
1005 | 
1006 |     x = parm->level;
1007 |     T = parm->fng[0];
1008 | 
1009 |     for (a = 0; a < KBYTE; a++)
1010 |       divpt[a] = 0;
1011 |     for (a = 0; a < 4; a++)
1012 |       { if (BLEVEL == 4)
1013 |           { divpt[0] = x;
1014 |             divpt[1] = (a<<6);
1015 |           }
1016 |         else
1017 |           divpt[0] = (((x<<2) | a) << (6-2*BLEVEL));
1018 |         GoTo_Kmer_Entry(T,divpt);
1019 |         adiv[a] = T->cidx; 
1020 |       }
1021 |     if (++x < BWIDTH)
1022 |       { divpt[0] = (x << (8-2*BLEVEL));
1023 |         divpt[1] = 0;
1024 |         GoTo_Kmer_Entry(T,divpt);
1025 |         adiv[4] = T->cidx; 
1026 |       }
1027 |     else
1028 |       adiv[4] = T->nels;
1029 |   }
1030 | 
1031 |   small_recursion(adiv,BLEVEL,parm);
1032 | 
1033 |   pthread_mutex_lock(&TMUTEX);
1034 |     Tstack[Tavail++] = tid;
1035 |   pthread_mutex_unlock(&TMUTEX);
1036 | 
1037 |   pthread_cond_signal(&TCOND);
1038 | 
1039 |   return (NULL);
1040 | }
1041 | 
1042 | /****************************************************************************************
1043 |  *
1044 |  *  Main
1045 |  *
1046 |  *****************************************************************************************/
1047 | 
1048 | static char template[15] = "._SPAIR.XXXX";
1049 | 
1050 | #ifdef SOLO_CHECK
1051 | 
1052 | static uint8 code[128] =
1053 |   { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1054 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1055 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1056 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1057 |     0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
1058 |     0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1059 |     0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0,
1060 |     0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
1061 | 
1062 | static void compress_norm(char *s, int len, uint8 *t)
1063 | { int    i;
1064 |   char   c, d, e;
1065 |   char  *s0, *s1, *s2, *s3;
1066 | 
1067 |   s0 = s;
1068 |   s1 = s0+1;
1069 |   s2 = s1+1;
1070 |   s3 = s2+1;
1071 | 
1072 |   c = s0[len];
1073 |   d = s1[len];
1074 |   e = s2[len];
1075 |   s0[len] = s1[len] = s2[len] = 'a';
1076 | 
1077 |   for (i = 0; i < len; i += 4)
1078 |     *t++ = ((code[(int) s0[i]] << 6) | (code[(int) s1[i]] << 4)
1079 |          |  (code[(int) s2[i]] << 2) | code[(int) s3[i]] );
1080 | 
1081 |   s0[len] = c;
1082 |   s1[len] = d;
1083 |   s2[len] = e;
1084 | }
1085 | 
1086 | #endif
1087 | 
1088 | static uint8 comp[128] =
1089 |   { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1090 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1091 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1092 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1093 |     0, 3, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1094 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1095 |     0, 3, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
1096 |     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
1097 | 
1098 | static void compress_comp(char *s, int len, uint8 *t)
1099 | { int    i;
1100 |   char   c, d, e;
1101 |   char  *s0, *s1, *s2, *s3;
1102 | 
1103 |   s0 = s;
1104 |   s1 = s0-1;
1105 |   s2 = s1-1;
1106 |   s3 = s2-1;
1107 | 
1108 |   c = s1[0];
1109 |   d = s2[0];
1110 |   e = s3[0];
1111 |   s1[0] = s2[0] = s3[0] = 't';
1112 | 
1113 |   for (i = len-1; i >= 0; i -= 4)
1114 |     *t++ = ((comp[(int) s0[i]] << 6) | (comp[(int) s1[i]] << 4)
1115 |          |  (comp[(int) s2[i]] << 2) | comp[(int) s3[i]] );
1116 | 
1117 |   s1[0] = c;
1118 |   s2[0] = d;
1119 |   s3[0] = e;
1120 | }
1121 | 
1122 | static void examine_table(Kmer_Stream *T, int *trim, int *sym)
1123 | {
1124 |   //  Histogram of middle 100M counts and see if trimmed to ETHRESH
1125 | 
1126 |   { int64 hist[0x8000];
1127 |     int64 frst, last;
1128 |     int   hbyte = T->hbyte;
1129 |     int   i, nz;
1130 | 
1131 |     for (i = 0; i < 0x8000; i++)
1132 |       hist[i] = 0;
1133 | 
1134 |     if (T->nels+3 < 100000000)
1135 |       { frst = 0;
1136 |         last = T->nels;
1137 |       }
1138 |     else
1139 |       { frst = T->nels/2 - 50000000;
1140 |         last = T->nels/2 + 50000000;
1141 |       }
1142 | 
1143 |     for (GoTo_Kmer_Index(T,frst); T->cidx < last; Next_Kmer_Entry(T))
1144 |       hist[*((int16 *) (T->csuf+hbyte))] += 1;
1145 | 
1146 |     for (nz = 1; hist[nz] == 0; nz++)
1147 |       ;
1148 |     if (nz < ETHRESH)
1149 |       *trim = 0;
1150 |     else
1151 |       *trim = 1;
1152 |   }
1153 | 
1154 |   //  Walk to a non-palindromic k-mer and see if its complement is in T
1155 | 
1156 |   { int64  sidx;
1157 |     char  *seq;
1158 |     uint8 *cmp;
1159 |     int    kmer;
1160 | 
1161 |     kmer = T->kmer;
1162 | 
1163 |     sidx = 1;
1164 |     GoTo_Kmer_Index(T,sidx);
1165 |     seq = Current_Kmer(T,NULL);
1166 |     cmp = Current_Entry(T,NULL);
1167 |     while (1)
1168 |       { compress_comp(seq,kmer,cmp);
1169 |         if (GoTo_Kmer_Entry(T,cmp))
1170 |           { if (T->cidx != sidx)
1171 |               { *sym = 1;
1172 |                 break;
1173 |               }
1174 |           }
1175 |         else
1176 |           { *sym = 0;
1177 |             break;
1178 |           }
1179 |         sidx += 1;
1180 |         seq = Current_Kmer(T,seq);
1181 |       }
1182 |     free(cmp);
1183 |     free(seq);
1184 |   }
1185 | }
1186 | 
1187 | int main(int argc, char *argv[])
1188 | { Kmer_Stream *T;
1189 |   char        *input;
1190 |   char        *troot;
1191 |   int64      **PLOT;
1192 |   int          bypass;
1193 | 
1194 |   char  *SORT_PATH;
1195 |   char  *OUT;
1196 |   char  *SRC;
1197 | 
1198 |   //  Process command line arguments
1199 | 
1200 |   (void) print_hap;
1201 | 
1202 |   { int    i, j, k;
1203 |     int    flags[128];
1204 |     char  *eptr;
1205 | 
1206 |     ARG_INIT("PloidyPlot");
1207 |     OUT  = NULL;
1208 |     ETHRESH  = 4;
1209 |     NTHREADS = 4;
1210 |     SORT_PATH = "/tmp";
1211 | 
1212 |     j = 1;
1213 |     for (i = 1; i < argc; i++)
1214 |       if (argv[i][0] == '-')
1215 |         switch (argv[i][1])
1216 |         { default:
1217 |             ARG_FLAGS("vklfs")
1218 |             break;
1219 |           case 'e':
1220 |             ARG_POSITIVE(ETHRESH,"Error-mer threshold")
1221 |             break;
1222 |           case 'o':
1223 |             if (OUT == NULL)
1224 |               free(OUT);
1225 |             OUT = Strdup(argv[i]+2,"Allocating name");
1226 |             if (OUT == NULL)
1227 |               exit (1);
1228 |             break;
1229 |           case 'P':
1230 |             SORT_PATH = argv[i]+2;
1231 |             break;
1232 |           case 'T':
1233 |             ARG_POSITIVE(NTHREADS,"Number of threads")
1234 |             if (NTHREADS > 64)
1235 |               { fprintf(stderr,"%s: Warning, only 64 threads will be used\n",Prog_Name);
1236 |                 NTHREADS = 64;
1237 |               }
1238 |             break;
1239 |         }
1240 |       else
1241 |         argv[j++] = argv[i];
1242 |     argc = j;
1243 | 
1244 |     VERBOSE = flags['v'];
1245 | 
1246 | #ifdef SOLO_CHECK
1247 |     if (argc != 3)
1248 | #else
1249 |     if (argc != 2)
1250 | #endif
1251 |       { fprintf(stderr,"\nUsage: %s %s\n",Prog_Name,Usage[0]);
1252 |         fprintf(stderr,"       %*s %s\n",(int) strlen(Prog_Name),"",Usage[1]);
1253 |         fprintf(stderr,"\n");
1254 |         fprintf(stderr,"      -o: root name for output plots\n");
1255 |         fprintf(stderr,"          default is root path of <asm> argument\n");
1256 |         fprintf(stderr,"\n");
1257 |         fprintf(stderr,"      -e: count threshold below which k-mers are considered erroneous\n");
1258 |         fprintf(stderr,"      -v: verbose mode\n");
1259 |         fprintf(stderr,"      -T: number of threads to use\n");
1260 |         // This P argument does not work properly, only some of the files are saved where it claims it does
1261 |         fprintf(stderr,"      -P: Place all temporary files in directory -P.\n");
1262 |         exit (1);
1263 |       }
1264 | 
1265 |     SRC = argv[1];
1266 |     if (OUT == NULL)
1267 |       OUT = Root(argv[1],".ktab");
1268 | 
1269 |     troot = mktemp(template);
1270 |   }
1271 | 
1272 |   //  If appropriately named het-mer table found then ask if reuse
1273 | 
1274 |   { FILE *f;
1275 |     int   a;
1276 | 
1277 |     bypass = 0;
1278 |     f = fopen(Catenate(OUT,".smu","",""),"r");
1279 |     if (f != NULL)
1280 |       { fprintf(stdout,"\n  Found het-table %s.smu, use it? ",OUT);
1281 |         fflush(stdout);
1282 |         while ((a = getc(stdin)) != '\n')
1283 |           if (a == 'y' || a == 'Y')
1284 |             bypass = 1;
1285 | 
1286 |         if (bypass)
1287 |           { PLOT    = Malloc(sizeof(int64 *)*(SMAX+1),"Allocating thread working memory");
1288 |             PLOT[0] = Malloc(sizeof(int64)*(SMAX+1)*(FMAX+1),"Allocating plot");
1289 |             for (a = 1; a <= SMAX; a++)
1290 |               PLOT[a] = PLOT[a-1] + (FMAX+1);
1291 |             fread(PLOT[0],sizeof(int64),(SMAX+1)*(FMAX+1),f);
1292 |           }
1293 | 
1294 |         fclose(f);
1295 |       }
1296 |   }
1297 | 
1298 |   //  Open input table and see if it needs conditioning
1299 | 
1300 |   { char *command;
1301 |     char *tname;
1302 |     int   symm, trim;
1303 | 
1304 |     tname   = Malloc(strlen(SRC) + strlen(troot) + 10,"Allocating strings");
1305 |     command = Malloc(strlen(SRC) + strlen(troot) + 100,"Allocating strings");
1306 |     if (tname == NULL || command == NULL)
1307 |       exit (1);
1308 | 
1309 |     T = Open_Kmer_Stream(SRC);
1310 |     if (T == NULL)
1311 |       { fprintf(stderr,"%s: Cannot open k-mer table %s\n",Prog_Name,SRC);
1312 |         exit (1);
1313 |       }
1314 | 
1315 |     KMER  = T->kmer;
1316 |     KBYTE = T->kbyte;
1317 |     TBYTE = T->tbyte;
1318 | 
1319 |     if (bypass)
1320 |       goto skip_build;
1321 | 
1322 |     examine_table(T,&trim,&symm);
1323 | 
1324 |     if (VERBOSE)
1325 |       { fprintf(stderr,"\n  The input table is");
1326 |         if (trim)
1327 |           if (symm)
1328 |             fprintf(stderr," trimmed and symmetric\n");
1329 |           else
1330 |             fprintf(stderr," trimmed but not symmetric\n");
1331 |         else
1332 |           if (symm)
1333 |             fprintf(stderr," untrimmed yet symmetric\n");
1334 |           else
1335 |             fprintf(stderr," untrimmed and not symmetric\n");
1336 |       }
1337 | 
1338 |     sprintf(tname,"%s",SRC);
1339 |     input = NULL;
1340 | 
1341 |     //  Trim source table to k-mers with counts >= ETHRESH if needed
1342 | 
1343 |     if (!trim)
1344 |       { if (VERBOSE)
1345 |           { fprintf(stderr,"\n  Trimming k-mers in table with count < %d\n",ETHRESH);
1346 |             fflush(stderr);
1347 |           }
1348 | 
1349 |         sprintf(command,"Logex -T%d '%s.trim=A[%d-]' %s",NTHREADS,troot,ETHRESH,tname);
1350 |         if (system(command) != 0)
1351 |           { fprintf(stderr,"%s: Something went wrong with command:\n    %s\n",Prog_Name,command);
1352 |             exit (1);
1353 |           }
1354 | 
1355 |         sprintf(tname,"%s.trim",troot);
1356 |       }
1357 | 
1358 |     //  Make the (relevant) table symmetric if it is not
1359 | 
1360 |     if (!symm)
1361 |       { if (VERBOSE)
1362 |           { if (trim)
1363 |               fprintf(stderr,"\n  Making table symmetric\n");
1364 |             else
1365 |               fprintf(stderr,"\n  Making trimmed table symmetric\n");
1366 |             fflush(stderr);
1367 |           }
1368 | 
1369 |         sprintf(command,"Symmex -T%d -P%s %s %s.symx",NTHREADS,SORT_PATH,tname,troot);
1370 | 
1371 |         if (system(command) != 0)
1372 |           { fprintf(stderr,"%s: Something went wrong with command:\n    %s\n",Prog_Name,command);
1373 |             exit (1);
1374 |           }
1375 | 
1376 |         if (!trim)
1377 |           { sprintf(command,"Fastrm %s.trim",troot);
1378 |             if (system(command) != 0)
1379 |               { fprintf(stderr,"%s: Something went wrong with command:\n    %s\n",
1380 |                                Prog_Name,command);
1381 |                 exit (1);
1382 |               }
1383 |           }
1384 | 
1385 |         sprintf(tname,"%s.symx",troot);
1386 |       }
1387 | 
1388 |     //  input is the name of the relevant conditioned table, unless the original => NULL
1389 | 
1390 |     free(command);
1391 |     if (!(symm && trim))
1392 |       { input = tname;
1393 |         Free_Kmer_Stream(T);
1394 |         T = Open_Kmer_Stream(input);
1395 |       }
1396 |     else
1397 |       free(tname);
1398 |   }
1399 | 
1400 |   if (VERBOSE)
1401 |     { fprintf(stderr,"\n  Starting to count covariant pairs\n");
1402 |       fflush(stderr);
1403 |     }
1404 | 
1405 |   BLEVEL = 1;
1406 |   BWIDTH = 4;
1407 |   while (4*NTHREADS > BWIDTH)
1408 |     { BWIDTH *= 4;
1409 |       BLEVEL += 1;
1410 |     }
1411 | 
1412 |   Cache_Size = (MEMORY_LIMIT/NTHREADS)/TBYTE;
1413 | 
1414 | #ifdef SOLO_CHECK
1415 |   if ((int) strlen(argv[3]) != KMER)
1416 |     { fprintf(stderr,"%s: string is not of length %d\n",Prog_Name,KMER);
1417 |       exit (1);
1418 |     }
1419 |   CENT = Current_Entry(T,NULL);
1420 |   compress_norm(argv[3],KMER,CENT);
1421 |   if (GoTo_Kmer_Entry(T,CENT) < 0)
1422 |     { fprintf(stderr,"%s: string is not in table\n",Prog_Name);
1423 |       exit (1);
1424 |     }
1425 |   printf("%s: %d\n",argv[3],Current_Count(T));
1426 |   CIDX = T->cidx;
1427 | #endif
1428 | 
1429 | #ifdef DEBUG_GENERAL
1430 |   printf("Threads = %d BL = %d(%d) Cache = %lld\n",NTHREADS,BLEVEL,BWIDTH,Cache_Size);
1431 | #endif
1432 | 
1433 |   { TP      parm[NTHREADS];
1434 |     int     a, t;
1435 |     int64 **plot;
1436 |     uint8  *cache;
1437 | 
1438 |     for (t = 0; t < NTHREADS; t++)
1439 |       { plot    = Malloc(sizeof(int64 *)*(SMAX+1),"Allocating thread working memory");
1440 |         plot[0] = Malloc(sizeof(int64)*(SMAX+1)*(FMAX+1),"Allocating plot");
1441 |         for (a = 1; a <= SMAX; a++)
1442 |           plot[a] = plot[a-1] + (FMAX+1);
1443 |         bzero(plot[0],sizeof(int64)*(SMAX+1)*(FMAX+1));
1444 |         parm[t].plot = plot;
1445 |       }
1446 | 
1447 |     parm[0].fng[0] = T;
1448 |     for (t = 0; t < NTHREADS; t++)
1449 |       for (a = 0; a < 4; a++)
1450 |         if (a+t > 0)
1451 |           parm[t].fng[a] = Clone_Kmer_Stream(T);
1452 | 
1453 |     Divpt = Current_Entry(T,NULL);
1454 |     Pair  = Malloc(sizeof(uint8)*T->nels,"Allocating pair table");
1455 |     cache = Malloc(MEMORY_LIMIT,"Allocating cache buffer");
1456 |     if (Pair == NULL || cache == NULL)
1457 |       exit (1);
1458 | 
1459 |     bzero(Pair,T->nels);
1460 | 
1461 |     for (PASS1 = 1; PASS1 >= 0; PASS1--)
1462 |       {
1463 |         //  Analyze the top levels, each threaded, up to level, BLEVEL, where
1464 |         //    the number of nodes is greater than the number of threads
1465 |         //    by a factor of 4 or more
1466 | 
1467 |         { int64 adiv[5];
1468 | 
1469 |           for (t = 0; t < KBYTE; t++)
1470 |             Divpt[t] = 0;
1471 | 
1472 |           adiv[0] = 0;
1473 |           for (a = 1; a < 4; a++)
1474 |             { Divpt[0] = (a << 6);
1475 |               GoTo_Kmer_Entry(T,Divpt);
1476 |               adiv[a] = T->cidx;
1477 |             }
1478 |           adiv[4] = T->nels;
1479 | 
1480 |           big_window(adiv,0,parm);
1481 |         }
1482 | 
1483 |         //  Assign a thread to each subtree at level BLEVEL until all are done
1484 | 
1485 |         { pthread_t threads[NTHREADS];
1486 |           int       tstack[NTHREADS];
1487 |           int       x;
1488 | 
1489 |           Tstack = tstack;
1490 | 
1491 |           for (t = 0; t < NTHREADS; t++)
1492 |             { Tstack[t]   = t;
1493 |               parm[t].tid = t;
1494 |               parm[t].cache = cache + t*(MEMORY_LIMIT/NTHREADS);
1495 |             }
1496 |           Tavail = NTHREADS;
1497 | 
1498 |           pthread_mutex_init(&TMUTEX,NULL);
1499 |           pthread_cond_init(&TCOND,NULL);
1500 | 
1501 |           for (x = 0; x < BWIDTH; x++)
1502 |             { pthread_mutex_lock(&TMUTEX);
1503 | 
1504 |               if (Tavail <= 0)
1505 |                 pthread_cond_wait(&TCOND,&TMUTEX);
1506 | 
1507 |               t = Tstack[--Tavail];
1508 | 
1509 | #ifdef DEBUG_GENERAL
1510 |               printf("Launching %d on thread %d\n",x,t);
1511 | #endif
1512 | 
1513 |               pthread_mutex_unlock(&TMUTEX);
1514 | 
1515 |               parm[t].level = x;
1516 | 
1517 |               pthread_create(threads+t,NULL,small_window,parm+t);
1518 |             }
1519 | 
1520 |           pthread_mutex_lock(&TMUTEX);
1521 |           while (Tavail < NTHREADS)
1522 |             pthread_cond_wait(&TCOND,&TMUTEX);
1523 | 	  pthread_mutex_unlock(&TMUTEX);
1524 |         }
1525 |       }
1526 | 
1527 |     free(cache);
1528 |     free(Pair);
1529 |     free(Divpt);
1530 | 
1531 |     { char  *command;
1532 |       int64 *plot0, *plott;
1533 |       int    i;
1534 | 
1535 |       for (t = NTHREADS-1; t >= 0; t--)
1536 |         for (a = 3; a >= 0; a--)
1537 |           if (a+t > 0)
1538 |             Free_Kmer_Stream(parm[t].fng[a]);
1539 |       Free_Kmer_Stream(T);
1540 | 
1541 |       for (t = 1; t < NTHREADS; t++)
1542 |         for (i = 0; i <= SMAX; i++)
1543 |           { plot0 = parm[0].plot[i];
1544 |             plott = parm[t].plot[i];
1545 |             for (a = 0; a <= FMAX; a++)
1546 |               plot0[a] += plott[a];
1547 |           }
1548 | 
1549 |       for (t = NTHREADS-1; t >= 1; t--)
1550 |         { free(parm[t].plot[0]);
1551 |           free(parm[t].plot);
1552 |         }
1553 | 
1554 |       PLOT = parm[0].plot;
1555 | 
1556 |       if (input != NULL)
1557 |         { command = Malloc(strlen(input)+100,"Allocating strings");
1558 |           if (command == NULL)
1559 |             exit (1);
1560 |           sprintf(command,"Fastrm %s",input);
1561 |           if (system(command) != 0)
1562 |             { fprintf(stderr,"%s: Something went wrong with command:\n    %s\n",Prog_Name,command);
1563 |               exit (1);
1564 |             }
1565 |           free(command);
1566 |           free(input);
1567 |         }
1568 |     }
1569 |   }
1570 | 
1571 | #ifdef SOLO_CHECK
1572 | 
1573 |   exit (0);
1574 | 
1575 | #endif
1576 | 
1577 |   if (VERBOSE)
1578 |     { fprintf(stderr,"\n  Count complete, plotting\n");
1579 |       fflush(stderr);
1580 |     }
1581 | 
1582 | skip_build:
1583 | 
1584 | fprintf(stderr,"\n  About to save stuff\n");
1585 | 
1586 | FILE  *f;
1587 | int    a, i;
1588 | 
1589 | f = fopen(Catenate(OUT,"_text.smu","",""),"w");
1590 | fprintf(stderr,"\n  Saving stuff\n");
1591 | 
1592 | // fprintf(f, "// %dx%d matrix, the i'th number in the j'th row give the number of hetmer pairs (a,b)\n", SMAX,FMAX);
1593 | // fprintf(f, "//                     s.t. count(a)+count(b) = j+1 and min(count(a),count(b)) = i+1.\n");
1594 | for (a = 0; a <= SMAX; a++)
1595 |   {
1596 |     for (i = 0; i < FMAX; i++)
1597 |       if(PLOT[a][i] > 0)
1598 |       {
1599 |         fprintf(f,"%i\t%i\t%lld\n",i,a-i,PLOT[a][i]);
1600 |       }
1601 |   }
1602 | fclose(f);
1603 | 
1604 | free(OUT);
1605 | 
1606 | Catenate(NULL,NULL,NULL,NULL);
1607 | Numbered_Suffix(NULL,0,NULL);
1608 | free(Prog_Name);
1609 | 
1610 | exit (0);
1611 | }
1612 | 


--------------------------------------------------------------------------------
/src_ploidyplot/gene_core.c:
--------------------------------------------------------------------------------
  1 | #include <stdlib.h>
  2 | #include <stdio.h>
  3 | #include <string.h>
  4 | #include <ctype.h>
  5 | #include <unistd.h>
  6 | #include <dirent.h>
  7 | #include <zlib.h>
  8 | 
  9 | #include "gene_core.h"
 10 | 
 11 | /*******************************************************************************************
 12 |  *
 13 |  *  GENERAL UTILITIES
 14 |  *
 15 |  ********************************************************************************************/
 16 | 
 17 | char *Prog_Name;
 18 | 
 19 | void *Malloc(int64 size, char *mesg)
 20 | { void *p;
 21 | 
 22 |   if ((p = malloc(size)) == NULL)
 23 |     { if (mesg == NULL)
 24 |         fprintf(stderr,"%s: Out of memory\n",Prog_Name);
 25 |       else
 26 |         fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg);
 27 |     }
 28 |   return (p);
 29 | }
 30 | 
 31 | void *Realloc(void *p, int64 size, char *mesg)
 32 | { if (size <= 0)
 33 |     size = 1;
 34 |   if ((p = realloc(p,size)) == NULL)
 35 |     { if (mesg == NULL)
 36 |         fprintf(stderr,"%s: Out of memory\n",Prog_Name);
 37 |       else
 38 |         fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg);
 39 |     }
 40 |   return (p);
 41 | }
 42 | 
 43 | char *Strdup(char *name, char *mesg)
 44 | { char *s;
 45 | 
 46 |   if (name == NULL)
 47 |     return (NULL);
 48 |   if ((s = strdup(name)) == NULL)
 49 |     { if (mesg == NULL)
 50 |         fprintf(stderr,"%s: Out of memory\n",Prog_Name);
 51 |       else
 52 |         fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg);
 53 |     }
 54 |   return (s);
 55 | }
 56 | 
 57 | char *Strndup(char *name, int len, char *mesg)
 58 | { char *s;
 59 | 
 60 |   if (name == NULL)
 61 |     return (NULL);
 62 |   if ((s = strndup(name,len)) == NULL)
 63 |     { if (mesg == NULL)
 64 |         fprintf(stderr,"%s: Out of memory\n",Prog_Name);
 65 |       else
 66 |         fprintf(stderr,"%s: Out of memory (%s)\n",Prog_Name,mesg);
 67 |     }
 68 |   return (s);
 69 | }
 70 | 
 71 | char *PathTo(char *name)
 72 | { char *path, *find;
 73 | 
 74 |   if (name == NULL)
 75 |     return (NULL);
 76 |   if ((find = rindex(name,'/')) != NULL)
 77 |     path = Strndup(name,find-name,"Extracting path from");
 78 |   else
 79 |     path = Strdup(".","Allocating default path");
 80 |   return (path);
 81 | }
 82 | 
 83 | char *Root(char *name, char *suffix)
 84 | { char *path, *find, *dot;
 85 |   int   epos;
 86 | 
 87 |   if (name == NULL)
 88 |     return (NULL);
 89 |   find = rindex(name,'/');
 90 |   if (find == NULL)
 91 |     find = name;
 92 |   else
 93 |     find += 1;
 94 |   if (suffix == NULL)
 95 |     { dot = strrchr(find,'.');
 96 |       path = Strndup(find,dot-find,"Extracting root from");
 97 |     }
 98 |   else
 99 |     { epos  = strlen(find);
100 |       epos -= strlen(suffix);
101 |       if (epos > 0 && strcasecmp(find+epos,suffix) == 0)
102 |         path = Strndup(find,epos,"Extracting root from");
103 |       else
104 |         path = Strdup(find,"Allocating root");
105 |     }
106 |   return (path);
107 | }
108 | 
109 | char *Catenate(char *path, char *sep, char *root, char *suffix)
110 | { static char *cat = NULL;
111 |   static int   max = -1;
112 |   int len;
113 | 
114 |   if (path == NULL || root == NULL || sep == NULL || suffix == NULL)
115 |     { free(cat);
116 |       max = -1;
117 |       return (NULL);
118 |     }
119 |   len =  strlen(path);
120 |   len += strlen(sep);
121 |   len += strlen(root);
122 |   len += strlen(suffix);
123 |   if (len > max)
124 |     { max = ((int) (1.2*len)) + 100;
125 |       if ((cat = (char *) realloc(cat,max+1)) == NULL)
126 |         { fprintf(stderr,"%s: Out of memory (Making path name for %s)\n",Prog_Name,root);
127 |           return (NULL);
128 |         }
129 |     }
130 |   sprintf(cat,"%s%s%s%s",path,sep,root,suffix);
131 |   return (cat);
132 | }
133 | 
134 | char *Numbered_Suffix(char *left, int num, char *right)
135 | { static char *suffix = NULL;
136 |   static int   max = -1;
137 |   int len;
138 | 
139 |   if (left == NULL || right == NULL)
140 |     { free(suffix);
141 |       max = -1;
142 |       return (NULL);
143 |     }
144 |   len =  strlen(left);
145 |   len += strlen(right) + 40;
146 |   if (len > max)
147 |     { max = ((int) (1.2*len)) + 100;
148 |       if ((suffix = (char *) realloc(suffix,max+1)) == NULL)
149 |         { fprintf(stderr,"%s: Out of memory (Making number suffix for %d)\n",Prog_Name,num);
150 |           return (NULL);
151 |         }
152 |     }
153 |   sprintf(suffix,"%s%d%s",left,num,right);
154 |   return (suffix);
155 | }
156 | 
157 | 
158 | #define  COMMA  ','
159 | 
160 | //  Print big integers with commas/periods for better readability
161 | 
162 | void Print_Number(int64 num, int width, FILE *out)
163 | { if (width == 0)
164 |     { if (num < 1000ll)
165 |         fprintf(out,"%lld",num);
166 |       else if (num < 1000000ll)
167 |         fprintf(out,"%lld%c%03lld",num/1000ll,COMMA,num%1000ll);
168 |       else if (num < 1000000000ll)
169 |         fprintf(out,"%lld%c%03lld%c%03lld",num/1000000ll,
170 |                                            COMMA,(num%1000000ll)/1000ll,COMMA,num%1000ll);
171 |       else
172 |         fprintf(out,"%lld%c%03lld%c%03lld%c%03lld",num/1000000000ll,
173 |                                                    COMMA,(num%1000000000ll)/1000000ll,
174 |                                                    COMMA,(num%1000000ll)/1000ll,COMMA,num%1000ll);
175 |     }
176 |   else
177 |     { if (num < 1000ll)
178 |         fprintf(out,"%*lld",width,num);
179 |       else if (num < 1000000ll)
180 |         { if (width <= 4)
181 |             fprintf(out,"%lld%c%03lld",num/1000ll,COMMA,num%1000ll);
182 |           else
183 |             fprintf(out,"%*lld%c%03lld",width-4,num/1000ll,COMMA,num%1000ll);
184 |         }
185 |       else if (num < 1000000000ll)
186 |         { if (width <= 8)
187 |             fprintf(out,"%lld%c%03lld%c%03lld",num/1000000ll,COMMA,(num%1000000ll)/1000ll,
188 |                                                COMMA,num%1000ll);
189 |           else
190 |             fprintf(out,"%*lld%c%03lld%c%03lld",width-8,num/1000000ll,COMMA,(num%1000000ll)/1000ll,
191 |                                                 COMMA,num%1000ll);
192 |         }
193 |       else
194 |         { if (width <= 12)
195 |             fprintf(out,"%lld%c%03lld%c%03lld%c%03lld",num/1000000000ll,COMMA,
196 |                                                        (num%1000000000ll)/1000000ll,COMMA,
197 |                                                        (num%1000000ll)/1000ll,COMMA,num%1000ll);
198 |           else
199 |             fprintf(out,"%*lld%c%03lld%c%03lld%c%03lld",width-12,num/1000000000ll,COMMA,
200 |                                                         (num%1000000000ll)/1000000ll,COMMA,
201 |                                                         (num%1000000ll)/1000ll,COMMA,num%1000ll);
202 |         }
203 |     }
204 | }
205 | 
206 | //  Return the number of symbols to print num, base 10 (without commas as above)
207 | 
208 | int  Number_Digits(int64 num)
209 | { int digit;
210 | 
211 |   if (num == 0)
212 |     return (1);
213 |   if (num < 0)
214 |     { num = -num;
215 |       digit = 1;
216 |     }
217 |   else
218 |     digit = 0;
219 |   while (num >= 1)
220 |     { num /= 10;
221 |       digit += 1;
222 |     }
223 |   return (digit);
224 | }
225 | 
226 | 
227 | /*******************************************************************************************
228 |  *
229 |  *  READ AND ARROW COMPRESSION/DECOMPRESSION UTILITIES
230 |  *
231 |  ********************************************************************************************/
232 | 
233 | //  Compress read into 2-bits per base (from [0-3] per byte representation
234 | 
235 | void Compress_Read(int len, char *s)
236 | { int   i; 
237 |   char  c, d;
238 |   char *s0, *s1, *s2, *s3;
239 |   
240 |   s0 = s;
241 |   s1 = s0+1;
242 |   s2 = s1+1;
243 |   s3 = s2+1;
244 |   
245 |   c = s1[len];
246 |   d = s2[len];
247 |   s0[len] = s1[len] = s2[len] = 0;
248 |   
249 |   for (i = 0; i < len; i += 4)
250 |     *s++ = (char ) ((s0[i] << 6) | (s1[i] << 4) | (s2[i] << 2) | s3[i]);
251 |   
252 |   s1[len] = c;
253 |   s2[len] = d;
254 | }
255 | 
256 | //  Uncompress read form 2-bits per base into [0-3] per byte representation
257 | 
258 | void Uncompress_Read(int len, char *s)
259 | { int   i, tlen, byte;
260 |   char *s0, *s1, *s2, *s3;
261 |   char *t;
262 | 
263 |   s0 = s;
264 |   s1 = s0+1;
265 |   s2 = s1+1;
266 |   s3 = s2+1;
267 | 
268 |   tlen = (len-1)/4;
269 | 
270 |   t = s+tlen;
271 |   for (i = tlen*4; i >= 0; i -= 4)
272 |     { byte = *t--;
273 |       s0[i] = (char) ((byte >> 6) & 0x3);
274 |       s1[i] = (char) ((byte >> 4) & 0x3);
275 |       s2[i] = (char) ((byte >> 2) & 0x3);
276 |       s3[i] = (char) (byte & 0x3);
277 |     }
278 |   s[len] = 4;
279 | }
280 | 
281 | //  Convert read in [0-3] representation to ascii representation (end with '\n')
282 | 
283 | void Lower_Read(char *s)
284 | { static char letter[4] = { 'a', 'c', 'g', 't' };
285 | 
286 |   for ( ; *s != 4; s++)
287 |     *s = letter[(int) *s];
288 |   *s = '\0';
289 | }
290 | 
291 | void Upper_Read(char *s)
292 | { static char letter[4] = { 'A', 'C', 'G', 'T' };
293 | 
294 |   for ( ; *s != 4; s++)
295 |     *s = letter[(int) *s];
296 |   *s = '\0';
297 | }
298 | 
299 | void Letter_Arrow(char *s)
300 | { static char letter[4] = { '1', '2', '3', '4' };
301 | 
302 |   for ( ; *s != 4; s++)
303 |     *s = letter[(int) *s];
304 |   *s = '\0';
305 | }
306 | 
307 | //  Convert read in ascii representation to [0-3] representation (end with 4)
308 | 
309 | void Number_Read(char *s)
310 | { static char number[128] =
311 |     { 0, 0, 0, 0, 0, 0, 0, 0,
312 |       0, 0, 0, 0, 0, 0, 0, 0,
313 |       0, 0, 0, 0, 0, 0, 0, 0,
314 |       0, 0, 0, 0, 0, 0, 0, 0,
315 |       0, 0, 0, 0, 0, 0, 0, 0,
316 |       0, 0, 0, 0, 0, 0, 0, 0,
317 |       0, 0, 0, 0, 0, 0, 0, 0,
318 |       0, 0, 0, 0, 0, 0, 0, 0,
319 |       0, 0, 0, 1, 0, 0, 0, 2,
320 |       0, 0, 0, 0, 0, 0, 0, 0,
321 |       0, 0, 0, 0, 3, 0, 0, 0,
322 |       0, 0, 0, 0, 0, 0, 0, 0,
323 |       0, 0, 0, 1, 0, 0, 0, 2,
324 |       0, 0, 0, 0, 0, 0, 0, 0,
325 |       0, 0, 0, 0, 3, 0, 0, 0,
326 |       0, 0, 0, 0, 0, 0, 0, 0,
327 |     };
328 | 
329 |   for ( ; *s != '\0'; s++)
330 |     *s = number[(int) *s];
331 |   *s = 4;
332 | }
333 | 
334 | void Number_Arrow(char *s)
335 | { static char arrow[128] =
336 |     { 3, 3, 3, 3, 3, 3, 3, 3,
337 |       3, 3, 3, 3, 3, 3, 3, 3,
338 |       3, 3, 3, 3, 3, 3, 3, 3,
339 |       3, 3, 3, 3, 3, 3, 3, 3,
340 |       3, 3, 3, 3, 3, 3, 3, 3,
341 |       3, 3, 3, 3, 3, 3, 3, 3,
342 |       3, 0, 1, 2, 3, 3, 3, 3,
343 |       3, 3, 3, 3, 3, 3, 3, 3,
344 |       3, 3, 3, 3, 3, 3, 3, 2,
345 |       3, 3, 3, 3, 3, 3, 3, 3,
346 |       3, 3, 3, 3, 3, 3, 3, 3,
347 |       3, 3, 3, 3, 3, 3, 3, 3,
348 |       3, 3, 3, 3, 3, 3, 3, 3,
349 |       3, 3, 3, 3, 3, 3, 3, 3,
350 |       3, 3, 3, 3, 3, 3, 3, 3,
351 |       3, 3, 3, 3, 3, 3, 3, 3,
352 |     };
353 | 
354 |   for ( ; *s != '\0'; s++)
355 |     *s = arrow[(int) *s];
356 |   *s = 4;
357 | }
358 | 
359 | void Change_Read(char *s)
360 | { static char change[128] =
361 |     {   0,   0,   0,   0,   0,   0,   0,   0,
362 |         0,   0,   0,   0,   0,   0,   0,   0,
363 |         0,   0,   0,   0,   0,   0,   0,   0,
364 |         0,   0,   0,   0,   0,   0,   0,   0,
365 |         0,   0,   0,   0,   0,   0,   0,   0,
366 |         0,   0,   0,   0,   0,   0,   0,   0,
367 |         0,   0,   0,   0,   0,   0,   0,   0,
368 |         0,   0,   0,   0,   0,   0,   0,   0,
369 |         0, 'a',   0, 'c',   0,   0,   0, 'g',
370 |         0,   0,   0,   0,   0,   0,   0,   0,
371 |         0,   0,   0,   0, 't',   0,   0,   0,
372 |         0,   0,   0,   0,   0,   0,   0,   0,
373 |         0, 'A',   0, 'C',   0,   0,   0, 'G',
374 |         0,   0,   0,   0,   0,   0,   0,   0,
375 |         0,   0,   0,   0, 'T',   0,   0,   0,
376 |         0,   0,   0,   0,   0,   0,   0,   0,
377 |     };
378 | 
379 |   for ( ; *s != '\0'; s++)
380 |     *s = change[(int) *s];
381 | }
382 | 


--------------------------------------------------------------------------------
/src_ploidyplot/gene_core.h:
--------------------------------------------------------------------------------
  1 | #ifndef _CORE
  2 | 
  3 | #define _CORE
  4 | 
  5 | #include <stdio.h>
  6 | 
  7 | /*******************************************************************************************
  8 |  *
  9 |  *  MY STANDARD TYPE DECLARATIONS
 10 |  *
 11 |  ********************************************************************************************/
 12 | 
 13 | typedef unsigned char      uint8;
 14 | typedef unsigned short     uint16;
 15 | typedef unsigned int       uint32;
 16 | typedef unsigned long long uint64;
 17 | typedef signed char        int8;
 18 | typedef signed short       int16;
 19 | typedef signed int         int32;
 20 | typedef signed long long   int64;
 21 | typedef float              float32;
 22 | typedef double             float64;
 23 | 
 24 | /*******************************************************************************************
 25 |  *
 26 |  *  MACROS TO HELP PARSE COMMAND LINE
 27 |  *
 28 |  ********************************************************************************************/
 29 | 
 30 | extern char *Prog_Name;   //  Name of program, available everywhere
 31 | 
 32 | #define ARG_INIT(name)                  \
 33 |   Prog_Name = Strdup(name,"");          \
 34 |   for (i = 0; i < 128; i++)             \
 35 |     flags[i] = 0;
 36 | 
 37 | #define ARG_FLAGS(set)                                                                  \
 38 |   for (k = 1; argv[i][k] != '\0'; k++)                                                  \
 39 |     { if (index(set,argv[i][k]) == NULL)                                                \
 40 |         { fprintf(stderr,"%s: -%c is an illegal option\n",Prog_Name,argv[i][k]);        \
 41 |           exit (1);                                                                     \
 42 |         }                                                                               \
 43 |       flags[(int) argv[i][k]] = 1;                                                      \
 44 |     }
 45 | 
 46 | #define ARG_POSITIVE(var,name)                                                          \
 47 |   var = strtol(argv[i]+2,&eptr,10);                                                     \
 48 |   if (*eptr != '\0' || argv[i][2] == '\0')                                              \
 49 |     { fprintf(stderr,"%s: -%c '%s' argument is not an integer\n",			\
 50 |                      Prog_Name,argv[i][1],argv[i]+2);      				\
 51 |       exit (1);                                                                         \
 52 |     }                                                                                   \
 53 |   if (var <= 0)                                                                         \
 54 |     { fprintf(stderr,"%s: %s must be positive (%d)\n",Prog_Name,name,var);              \
 55 |       exit (1);                                                                         \
 56 |     }
 57 | 
 58 | #define ARG_NON_NEGATIVE(var,name)                                                      \
 59 |   var = strtol(argv[i]+2,&eptr,10);                                                     \
 60 |   if (*eptr != '\0' || argv[i][2] == '\0')                                              \
 61 |     { fprintf(stderr,"%s: -%c '%s' argument is not an integer\n",			\
 62 |                      Prog_Name,argv[i][1],argv[i]+2);      				\
 63 |       exit (1);                                                                         \
 64 |     }                                                                                   \
 65 |   if (var < 0)	                                                                        \
 66 |     { fprintf(stderr,"%s: %s must be non-negative (%d)\n",Prog_Name,name,var);          \
 67 |       exit (1);                                                                         \
 68 |     }
 69 | 
 70 | #define ARG_REAL(var)                                                                   \
 71 |   var = strtod(argv[i]+2,&eptr);                                                        \
 72 |   if (*eptr != '\0' || argv[i][2] == '\0')                                              \
 73 |     { fprintf(stderr,"%s: -%c '%s' argument is not a real number\n",			\
 74 |                      Prog_Name,argv[i][1],argv[i]+2);      				\
 75 |       exit (1);                                                                         \
 76 |     }
 77 | 
 78 | /*******************************************************************************************
 79 |  *
 80 |  *  MEMORY ALLOCATION,FILE HANDLING, AND PRETTY PRINTING UTILITIES
 81 |  *
 82 |  ********************************************************************************************/
 83 | 
 84 | //  The following general utilities return NULL if any of their input pointers are NULL, or if they
 85 | //    could not perform their function (in which case they also print an error to stderr).
 86 | 
 87 | void *Malloc(int64 size, char *mesg);                    //  Guarded versions of malloc, realloc
 88 | void *Realloc(void *object, int64 size, char *mesg);     //  and strdup, that output "mesg" to
 89 | char *Strdup(char *string, char *mesg);                  //  stderr if out of memory
 90 | char *Strndup(char *string, int len, char *mesg);        //  stderr if out of memory
 91 | 
 92 | char *PathTo(char *path);                // Return path portion of file name "path"
 93 | char *Root(char *path, char *suffix);    // Return the root name, excluding suffix, of "path"
 94 | 
 95 | // Catenate returns concatenation of path.sep.root.suffix in a *temporary* buffer
 96 | // Numbered_Suffix returns concatenation of left.<num>.right in a *temporary* buffer
 97 | 
 98 | char *Catenate(char *path, char *sep, char *root, char *suffix);
 99 | char *Numbered_Suffix(char *left, int num, char *right);
100 | 
101 | void Print_Number(int64 num, int width, FILE *out);   //  Print readable big integer
102 | int  Number_Digits(int64 num);                        //  Return # of digits in printed number
103 | 
104 | /*******************************************************************************************
105 |  *
106 |  *  ROUTINES FOR HANDLING DNA AND ARROW STRINGS
107 |  *
108 |  ********************************************************************************************/
109 | 
110 | #define COMPRESSED_LEN(len)  (((len)+3) >> 2)
111 | 
112 | void   Compress_Read(int len, char *s);   //  Compress read in-place into 2-bit form
113 | void Uncompress_Read(int len, char *s);   //  Uncompress read in-place into numeric form
114 | void      Print_Read(char *s, int width);
115 | 
116 | void Lower_Read(char *s);     //  Convert read from numbers to lowercase letters (0-3 to acgt)
117 | void Upper_Read(char *s);     //  Convert read from numbers to uppercase letters (0-3 to ACGT)
118 | void Number_Read(char *s);    //  Convert read from letters to numbers
119 | void Change_Read(char *s);    //  Convert read from one case to the other
120 | 
121 | void Letter_Arrow(char *s);   //  Convert arrow pw's from numbers to uppercase letters (0-3 to 1234)
122 | void Number_Arrow(char *s);   //  Convert arrow pw string from letters to numbers
123 | 
124 | #endif // _CORE
125 | 


--------------------------------------------------------------------------------
/src_ploidyplot/libfastk.h:
--------------------------------------------------------------------------------
  1 | /*******************************************************************************************
  2 |  *
  3 |  *  C library routines to access and operate upon FastK histogram, k-mer tables, and profiles
  4 |  *
  5 |  *  Author:  Gene Myers
  6 |  *  Date  :  November 2020
  7 |  *
  8 |  *******************************************************************************************/
  9 | 
 10 | #ifndef _LIBFASTK
 11 | #define _LIBFASTK
 12 | 
 13 | #include <stdio.h>
 14 | #include <stdlib.h>
 15 | #include <string.h>
 16 | #include <ctype.h>
 17 | #include <stdint.h>
 18 | #include <math.h>
 19 | #include <dirent.h>
 20 | #include <fcntl.h>
 21 | #include <sys/types.h>
 22 | #include <sys/uio.h>
 23 | #include <unistd.h>
 24 | #include <errno.h>
 25 | 
 26 | #include "gene_core.h"
 27 | 
 28 |   //  HISTOGRAM
 29 | 
 30 | typedef struct
 31 |   { int    kmer;    //  Histogram is for k-mers of this length
 32 |     int    unique;  // 1 => count  of unique k-mers, 0 => count of k-mer instances
 33 |     int    low;     //  Histogram is for range [low,hgh]
 34 |     int    high;
 35 |     int64 *hist;    //  hist[i] for i in [low,high] = # of k-mers occuring i times
 36 |   } Histogram;
 37 | 
 38 | Histogram *Load_Histogram(char *name);
 39 | void       Modify_Histogram(Histogram *H, int low, int high, int unique);
 40 | int        Write_Histogram(char *name, Histogram *H);
 41 | void       Free_Histogram(Histogram *H);
 42 | 
 43 | 
 44 |   //  K-MER TABLE
 45 | 
 46 | typedef struct
 47 |   { int     kmer;         //  Kmer length
 48 |     int     minval;       //  The minimum count of a k-mer in the table
 49 |     int64   nels;         //  # of unique, sorted k-mers in the table
 50 | 
 51 |     void   *private[7];   //  Private fields
 52 |   } Kmer_Table;
 53 | 
 54 | Kmer_Table *Load_Kmer_Table(char *name, int cut_off);
 55 | void        Free_Kmer_Table(Kmer_Table *T);
 56 | 
 57 | char       *Fetch_Kmer(Kmer_Table *T, int64 i, char *seq);
 58 | int         Fetch_Count(Kmer_Table *T, int64 i);
 59 | 
 60 | int64       Find_Kmer(Kmer_Table *T, char *kseq);
 61 | 
 62 | 
 63 |   //  K-MER STREAM
 64 | 
 65 | typedef struct
 66 |   { int    kmer;       //  Kmer length
 67 |     int    minval;     //  The minimum count of a k-mer in the stream
 68 |     int64  nels;       //  # of elements in entire table
 69 |                     // Current position
 70 |     int64  cidx;       //  current element index
 71 |     uint8 *csuf;       //  current element suffix
 72 |     int    cpre;       //  current element prefix
 73 |                     // Other useful parameters
 74 |     int    ibyte;      //  # of bytes in prefix
 75 |     int    kbyte;      //  Kmer encoding in bytes
 76 |     int    tbyte;      //  Kmer+count entry in bytes
 77 |     int    hbyte;      //  Kmer suffix in bytes (= kbyte - ibyte)
 78 |     int    pbyte;      //  Kmer,count suffix in bytes (= tbyte - ibyte)
 79 | 
 80 |     void  *private[10]; //  Private fields
 81 |   } Kmer_Stream;
 82 | 
 83 | Kmer_Stream *Open_Kmer_Stream(char *name);
 84 | Kmer_Stream *Clone_Kmer_Stream(Kmer_Stream *S);
 85 | void         Free_Kmer_Stream(Kmer_Stream *S);
 86 | 
 87 | void         First_Kmer_Entry(Kmer_Stream *S);
 88 | void         Next_Kmer_Entry(Kmer_Stream *S);
 89 | 
 90 | char        *Current_Kmer(Kmer_Stream *S, char *seq);
 91 | int          Current_Count(Kmer_Stream *S);
 92 | uint8       *Current_Entry(Kmer_Stream *S, uint8 *seq);
 93 | 
 94 | void         GoTo_Kmer_Index(Kmer_Stream *S, int64 idx);
 95 | int          GoTo_Kmer_String(Kmer_Stream *S, char *seq);
 96 | int          GoTo_Kmer_Entry(Kmer_Stream *S, uint8 *entry);
 97 | 
 98 | 
 99 |   //  PROFILES
100 | 
101 | typedef struct
102 |   { int    kmer;     //  Kmer length
103 |     int    nparts;   //  # of threads/parts for the profiles
104 |     int    nreads;   //  total # of reads in data set
105 |     int64 *nbase;    //  nbase[i] for i in [0,nparts) = id of last read in part i + 1
106 |     int64 *index;    //  index[i] for i in [0,nreads) = offset in relevant part of
107 |                      //    compressed profile for read i
108 |     void  *private[4]; // Private fields
109 |   } Profile_Index;
110 | 
111 | Profile_Index *Open_Profiles(char *name);
112 | Profile_Index *Clone_Profiles(Profile_Index *P);
113 | 
114 | void Free_Profiles(Profile_Index *P);
115 | 
116 | int Fetch_Profile(Profile_Index *P, int64 id, int plen, uint16 *profile);
117 | 
118 | #endif // _LIBFASTK
119 | 


--------------------------------------------------------------------------------
/src_ploidyplot/matrix.c:
--------------------------------------------------------------------------------
  1 | /*****************************************************************************************\
  2 | *                                                                                         *
  3 | *  Matrix inversion, determinants, and linear equations via LU-decomposition              *
  4 | *                                                                                         *
  5 | *  Author:  Gene Myers                                                                    *
  6 | *  Date  :  April 2007                                                                    *
  7 | *  Mod   :  June 2008 -- Added TDG's and Cubic Spline to enable snakes and curves         *
  8 | *           Dec 2008 -- Refined TDG's and cubic splines to Decompose/Solve paradigm       *
  9 | *                                                                                         *
 10 | \*****************************************************************************************/
 11 | 
 12 | #include <stdio.h>
 13 | #include <stdlib.h>
 14 | #include <string.h>
 15 | #include <math.h>
 16 | #include <float.h>
 17 | 
 18 | #include "gene_core.h"
 19 | #include "matrix.h"
 20 | 
 21 | #define TINY 1.0e-20
 22 | 
 23 | /****************************************************************************************
 24 |  *                                                                                      *
 25 |  *  LU-FACTORIZATION SYSTEM SOLVER                                                      *
 26 |  *                                                                                      *
 27 |  ****************************************************************************************/
 28 | 
 29 | 
 30 | //  M is a square double matrix where the row index moves the fastest.
 31 | //    LU_Decompose takes M and produces an LU factorization of M that
 32 | //    can then be used to rapidly solve the system for given right hand sides
 33 | //    and to compute M's determinant.  The return value is NULL if the matrix
 34 | //    is nonsingular.  If the matrix appears unstable (had to use a very nearly
 35 | //    zero pivot) then the integer pointed at by stable will be zero, and
 36 | //    non-zero otherwise.  M is subsumed and effectively destroyed by the routine.
 37 | 
 38 | LU_Factor *LU_Decompose(Double_Matrix *M, int *stable)
 39 | { LU_Factor      *F;
 40 |   int             n, i, j;
 41 |   int            *p, sign;
 42 |   double         *v;
 43 |   double         *avec[1001], **a;
 44 | 
 45 |   n = M->n;
 46 | 
 47 |   if (n > 1000)
 48 |     a = Malloc(sizeof(double)*n,"Allocating LU Factor work space");
 49 |   else
 50 |     a = avec;
 51 |   F = Malloc(sizeof(LU_Factor),"Allocating LU Factor");
 52 |   p = Malloc((sizeof(int) + sizeof(double))*n,"Allocating LU Factor");
 53 |   if (a == NULL || F == NULL || p == NULL)
 54 |     exit (1);
 55 | 
 56 |   v = (double *) (p+n);
 57 | 
 58 |   p[0] = 0;
 59 |   a[0] = M->m;
 60 |   for (i = 1; i < n; i++)
 61 |     { a[i] = a[i-1] + n;
 62 |       p[i] = i;
 63 |     }
 64 | 
 65 |   *stable = 1;
 66 |   sign    = 1;
 67 |   for (i = 0; i < n; i++)  // Find the scale factors for each row in v.
 68 |     { double b, f, *r;
 69 | 
 70 |       r = a[i];
 71 |       b = 0.;
 72 |       for (j = 0; j < n; j++)
 73 |         { f = fabs(r[j]);
 74 |           if (f > b)
 75 |             b = f;
 76 |         }
 77 |       if (b == 0.0)
 78 |         { free(p);
 79 |           free(F);
 80 |           if (n > 1000)
 81 |             free(a);
 82 |           return (NULL);
 83 |         }
 84 |       v[i] = 1./b;
 85 |     }
 86 | 
 87 |   for (j = 0; j < n; j++)      //  For each column
 88 |     { double b, s, *r;
 89 |       int    k, w;
 90 | 
 91 |       for (i = 0; i < j; i++)    // Determine U
 92 |         { r = a[i];
 93 |           s = r[j];
 94 |           for (k = 0; k < i; k++)
 95 |             s -= r[k]*a[k][j];
 96 |           r[j] = s;
 97 |         }
 98 | 
 99 |       b = -1.;
100 |       w = j;
101 |       for (i = j; i < n; i++)      // Determine L without dividing by pivot, in order to
102 |         { r = a[i];                //   determine who the pivot should be.
103 |           s = r[j];
104 |           for (k = 0; k < j; k++)
105 |             s -= r[k]*a[k][j];
106 |           r[j] = s;
107 | 
108 |           s = v[i]*fabs(s);        // Update best pivot seen thus far
109 |           if (s > b)
110 |             { b = s;
111 |               w = i;
112 |             }
113 | 	}
114 | 
115 |       if (w != j)                  // Pivot if necessary
116 |         { r    = a[w];
117 |           a[w] = a[j];
118 |           a[j] = r;
119 |           k    = p[w];
120 |           p[w] = p[j];
121 |           p[j] = k;
122 |           sign = -sign;
123 |           v[w] = v[j];
124 |         }
125 | 
126 |       if (fabs(a[j][j]) < TINY)    // Complete column of L by dividing by selected pivot
127 |         { if (a[j][j] < 0.)
128 |             a[j][j] = -TINY;
129 |           else
130 |             a[j][j] = TINY;
131 |           *stable = 0;
132 |         }
133 |       b = 1./a[j][j];
134 |       for (i = j+1; i < n; i++)
135 |         a[i][j] *= b;
136 |     }
137 | 
138 | #ifdef DEBUG_LU
139 |   { int i, j;
140 | 
141 |     printf("\nLU Decomposition\n");
142 |     for (i = 0; i < n; i++)
143 |       { printf("  %2d: ",p[i]);
144 |         for (j = 0; j < n; j++)
145 |           printf(" %8g",a[i][j]);
146 |         printf("\n");
147 |       }
148 |   }
149 | #endif
150 | 
151 |   if (n > 1000)
152 |     free(a);
153 | 
154 |   F->sign   = sign;
155 |   F->perm   = p;
156 |   F->lu_mat = M;
157 |   return (F);
158 | }
159 | 
160 | 
161 | //  Display LU factorization F to specified file
162 | 
163 | void Show_LU_Product(FILE *file, LU_Factor *F)
164 | { int    n, i, j, k;
165 |   int   *p;
166 |   double u, **a, *d;
167 | 
168 |   n = F->lu_mat->n;
169 |   d = F->lu_mat->m;
170 |   p = F->perm;
171 |   a = (double **) (p+n);
172 | 
173 |   for (i = 0; i < n; i++)
174 |     a[i] = d + p[i]*n;
175 | 
176 |   fprintf(file,"\nLU Product:\n");
177 |   for (i = 0; i < n; i++)
178 |     { for (j = 0; j < i; j++)
179 |         { u = 0.;
180 |           for (k = 0; k <= j; k++)
181 |             u += a[i][k] * a[k][j];
182 |           fprintf(file," %g",u);
183 |         }
184 |       for (j = i; j < n; j++)
185 |         { u = a[i][j];
186 |           for (k = 0; k < i; k++)
187 |             u += a[i][k] * a[k][j];
188 |           fprintf(file," %g",u);
189 |         }
190 |       fprintf(file,"\n");
191 |     }
192 | }
193 | 
194 | 
195 | //  Given rhs vector B and LU-factorization F, solve the system of equations
196 | //    and return the result in B.
197 | //  To invert M = L*U given the LU-decomposition, simply call LU_Solve with
198 | //    b = [ 0^k-1 1 0^n-k] to get the k'th column of the inverse matrix.
199 | 
200 | Double_Vector *LU_Solve(Double_Vector *B, LU_Factor *F)
201 | { double   *x;
202 |   int       n, i, j;
203 |   int      *p;
204 |   double   *a, *b, s, *r;
205 | 
206 |   n = F->lu_mat->n;
207 |   a = F->lu_mat->m;
208 |   p = F->perm;
209 |   b = B->m;
210 |   x = (double *) (p+n);
211 | 
212 |   for (i = 0; i < n; i++)
213 |     { r = a + p[i]*n;
214 |       s = b[p[i]];
215 |       for (j = 0; j < i; j++)
216 |         s -= r[j] * x[j];
217 |       x[i] = s;
218 |     }
219 | 
220 |   for (i = n; i-- > 0; )
221 |     { r = a + p[i]*n;
222 |       s = x[i]; 
223 |       for (j = i+1; j < n; j++)
224 |         s -= r[j] * b[j];
225 |       b[i] = s/r[i];
226 |     }
227 | 
228 |   return (B);
229 | }
230 |   
231 | 
232 | //  Transpose a matrix M in-place and as a convenience return a pointer to it
233 | 
234 | Double_Matrix *Transpose_Matrix(Double_Matrix *M)
235 | { int     n;
236 |   double *a;
237 |   int     p, q;
238 |   int     i, j;
239 | 
240 |   n = M->n;
241 |   a = M->m;
242 | 
243 |   p = 0;
244 |   for (j = 0; j < n; j++)                 //  Transpose the result
245 |     { q = j;
246 |       for (i = 0; i < j; i++)
247 |         { double x = a[p];
248 |           a[p++] = a[q];
249 |           a[q] = x;
250 |           q += n;
251 |         }
252 |       p += (n-j);
253 |     }
254 | 
255 |   return (M);
256 | }
257 | 
258 | 
259 | //  Generate the right inverse of the matrix that gave rise to the LU factorization f.
260 | //    That is for matrix A, return matrix A^-1 s.t. A * A^-1 = I.  If transpose is non-zero
261 | //    then the transpose of the right inverse is returned.
262 | 
263 | Double_Matrix *LU_Invert(LU_Factor *F, int transpose)
264 | { int            n, i, j;
265 |   Double_Matrix *M, G;
266 |   double        *m, *g;
267 | 
268 |   n = F->lu_mat->n;
269 | 
270 |   M = Malloc(sizeof(Double_Matrix),"Allocating matrix");
271 |   m = Malloc(sizeof(double)*n*n,"Allocating matrix");
272 |   if (M == NULL || m == NULL)
273 |     exit (1);
274 | 
275 |   M->n = n;
276 |   M->m = m;
277 |   G.n = n;
278 | 
279 |   g = m;
280 |   for (i = 0; i < n; i++)                 //  Find the inverse of each column in the
281 |     { G.m = g;
282 |       for (j = 0; j < n; j++)
283 |         g[j] = 0.;
284 |       g[i] = 1.;
285 |       LU_Solve(&G,F);
286 |       g += n;
287 |     }
288 | 
289 |   if (!transpose)
290 |     Transpose_Matrix(M);
291 | 
292 |   return (M);
293 | }
294 | 
295 | 
296 | //  Given an LU-factorization F, return the value of the determinant of the
297 | //    original matrix.
298 | 
299 | double LU_Determinant(LU_Factor *F)
300 | { int     i, n;
301 |   int    *p;
302 |   double *a, det;
303 | 
304 |   n = F->lu_mat->n;
305 |   a = F->lu_mat->m;
306 |   p = F->perm;
307 | 
308 |   det = F->sign;
309 |   for (i = 0; i < n; i++)
310 |     det *= a[p[i]*n+i];
311 |   return (det);
312 | }
313 | 


--------------------------------------------------------------------------------
/src_ploidyplot/matrix.h:
--------------------------------------------------------------------------------
 1 | /*****************************************************************************************\
 2 | *                                                                                         *
 3 | *  Matrix inversion, determinants, and linear equations via LU-decomposition              *
 4 | *                                                                                         *
 5 | *  Author:  Gene Myers                                                                    *
 6 | *  Date  :  April 2007                                                                    *
 7 | *  Mod   :  June 2008 -- Added TDG's and Cubic Spline to enable snakes and curves         *
 8 | *                                                                                         *
 9 | \*****************************************************************************************/
10 | 
11 | #ifndef _MATRIX_LIB
12 | 
13 | #define _MATRIX_LIB
14 | 
15 | typedef struct
16 |   { int     n;
17 |     double *m;
18 |   } Double_Matrix;
19 | 
20 | typedef Double_Matrix Double_Vector;
21 | 
22 | typedef struct
23 |   { Double_Matrix *lu_mat;  //  LU decomposion: L is below the diagonal and U is on and above it
24 |     int           *perm;    //  Permutation of the original rows of m due to pivoting
25 |     int            sign;    //  Sign to apply to the determinant due to pivoting (+1 or -1)
26 |   } LU_Factor;
27 | 
28 | LU_Factor     *LU_Decompose(Double_Matrix *M, int *stable);
29 | void           Show_LU_Factor(FILE *file, LU_Factor *F);
30 | Double_Vector *LU_Solve(Double_Vector *B, LU_Factor *F);
31 | Double_Matrix *Transpose_Matrix(Double_Matrix *M);
32 | Double_Matrix *LU_Invert(LU_Factor *F, int transpose);
33 | double         LU_Determinant(LU_Factor *F);
34 | 
35 | #endif
36 | 


--------------------------------------------------------------------------------
/tests/README.md:
--------------------------------------------------------------------------------
 1 | ### Manual tests
 2 | 
 3 | This is a place for manual tests that have not been automated (yet?).
 4 | Don't forget to re-install the package/script before execution. Somehting like
 5 | 
 6 | ```
 7 | make install INSTALL_PREFIX=~
 8 | ```
 9 | 
10 | should do the job.
11 | 
12 | #### interface tests
13 | 
14 | ##### data prep
15 | 
16 | Download `SRR3265401` - nice teteraploid Sacharomyces run I use often for testing.
17 | 
18 | ##### smudgeplot plot
19 | 
20 | Defaults:
21 | 
22 | ```
23 | smudgeplot.py all data/Scer/kmerpairs_default_e2_text.smu -o data/Scer/240918_trial
24 | ```
25 | 
26 | Testing parameters:
27 | 
28 | ```
29 | smudgeplot.py all data/Scer/kmerpairs_default_e2_text.smu -o data/Scer/240918_trial_params -t "Species 1" -c 20 -ylim 80 -col_ramp magma --invert_cols
30 | ```
31 | 
32 | ##### smudgeplot hetkmers
33 | 
34 | two different methods to extract homologous kmers
35 | 
36 | TODO
37 | 
38 | ##### Dicots
39 | 
40 | This is a large dataset of the first 540 dicot genomes sequenced by the Tree of Life. Some of them are completed, some of them are with insufficient coverage or otherwise qc failed data. The idea here is to be able to tell those apart, get reasonable defaults so the generated plot is meaningful for a reasonable number (i.e. nearly all) of them.
41 | 
42 | ```
43 | time ./exec/smudgeplot.py plot data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt -o data/dicots/alt_plots/daAchMill1 --alt_plot -q 0.9
44 | ```
45 | 
46 | ```bash
47 | for smu in data/dicots/smu_files/*.smu.txt; do
48 |     species=$(basename $smu)
49 |     echo $species $smu
50 |     time ./exec/smudgeplot.py plot $smu -o data/dicots/alt_plots/$species --alt_plot -q 0.9
51 | done
52 | 
53 | for smu in data/dicots/smu_files/*.smu.txt; do
54 |     species=$(basename $smu .smu.txt)
55 |     echo $species $smu
56 |     time ./exec/smudgeplot.py plot $smu -c 10 -o data/dicots/alt_plots_c10/$species --alt_plot -q 0.9
57 | done
58 | 
59 | for smu in $(ls data/dicots/smu_files/*.smu.txt | head -20); do
60 |     species=$(basename $smu .smu.txt)
61 |     echo $species $smu
62 |     smudgeplot.py all $smu -t $species -o data/dicots/automated_smudgeplots/$species
63 | done
64 | ```


--------------------------------------------------------------------------------
/tests/run_smudge_version.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | smudgeplot.py -v 2> version
 4 | version=$(cat version | cut -f 3 -d ' ')
 5 | rm version
 6 | echo "testing $version"
 7 | 
 8 | outdir=figures/$version
 9 | mkdir -p $outdir
10 | rm $outdir/*
11 | 
12 | Rscript install.R
13 | install -C exec/smudgeplot.py /usr/local/bin
14 | install -C exec/smudgeplot_plot.R /usr/local/bin
15 | 
16 | for sp in "Aric1" "Avag1" "Mflo2" "Rmag1"; do
17 | 	smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version" -t "$sp $version" data/$sp/*coverages_2.tsv
18 | done
19 | 
20 | for sp in "Ps791" "Rvar1"; do
21 | 	smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version" -t "$sp $version" data/$sp/*coverages_2.tsv -nbins 15
22 | done
23 | 
24 | sp="Lcla1"
25 | 
26 | smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version"_homozyg -t "$sp $version" --homozygous data/$sp/*coverages_2.tsv
27 | 
28 | smudgeplot.py plot -o $outdir/"$sp"_smudgeplot_"$version" -t "$sp $version" data/$sp/*coverages_2.tsv
29 | 


--------------------------------------------------------------------------------