├── .github
│   └── ISSUE_TEMPLATE
│       ├── a-bug.md
│       ├── feature-request.md
│       ├── smudgeplot-inference-problem.md
│       └── smudgeplot-interpretation.md
├── .gitignore
├── FAQ.md
├── LICENSE
├── Makefile
├── README.md
├── exec
│   ├── centrality_plot.R
│   ├── smudgeplot
│   ├── smudgeplot.py
│   └── smudgeplot_plot.R
├── playground
│   ├── BGA_tutorial.md
│   ├── DEVELOPMENT.md
│   ├── alternative_fitting
│   │   ├── README.md
│   │   ├── alternative_plot_covA_covB.R
│   │   ├── alternative_plotting.R
│   │   ├── alternative_plotting_functions.R
│   │   ├── alternative_plotting_testing.R
│   │   └── pair_clustering.py
│   ├── interactive_plot_strawberry_full_kmer_families_fooling_around.R
│   ├── more_away_pairs.py
│   ├── playground.R
│   ├── playground.py
│   └── popart.R
├── src_ploidyplot
│   ├── PloidyPlot.c
│   ├── gene_core.c
│   ├── gene_core.h
│   ├── libfastk.c
│   ├── libfastk.h
│   ├── matrix.c
│   └── matrix.h
└── tests
    ├── README.md
    └── run_smudge_version.sh
/.github/ISSUE_TEMPLATE/a-bug.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: A bug
3 | about: If it looks like an error in the code
4 | labels: potential_problems
5 |
6 | ---
7 |
8 | **What did you do**
9 |
10 | Tell us about the problem. What version of the software did you use (`smudgeplot -v`)? What was the input (possibly with a few example lines)? What command did you run? What error output did you get? And what did you expect to see instead?
11 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature-request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Any ideas how to improve smudgeplot?
4 | title: feature request: [short description]
5 | labels: enhancement
6 |
7 | ---
8 |
9 | **Background**
10 |
11 | We assume you have a reason for proposing an improvement. If it has a biological or algorithmic motivation, give us something to understand where the suggestion comes from...
12 |
13 | **Feature**
14 |
15 | What do you think the feature should do? Be as detailed as possible here. Don't hesitate to write down examples of how the feature should operate.
16 |
17 | **Contribution**
18 |
19 | Do you have an idea how to implement the feature? Would you be willing to contribute to its implementation?
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/smudgeplot-inference-problem.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Smudgeplot inference problem
3 | about: When a suspicious smudgeplot suggests something is wrong
4 |
5 | ---
6 |
7 | **About your genome**
8 |
9 | Tell us about your genome, so we understand why the smudgeplot seems to be wrong. Please also include the evidence you have (karyotype, in situ hybridization, ...).
10 |
11 | **smudgeplot**
12 |
13 | Please show us the command you used to generate the smudgeplot, and the smudgeplot itself if possible. Tell us what looks suspicious about the smudgeplot and how you expected it to look.
14 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/smudgeplot-interpretation.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Smudgeplot interpretation
3 | about: For interpretation problems of smudgeplot, send an issue by email if the data
4 | are sensitive
5 |
6 | ---
7 |
8 | I have trouble understanding my smudgeplot. I used the following command to generate it
9 |
10 | ```
11 | smudgeplot plot -i kmer_pairs_coverages_2.tsv -o my_org -t "Figure 1a: genome structure of X. odoratum" -L 40 -k 19
12 | ```
13 |
14 | and it looks like this:
15 |
16 | (add smudgeplot)
17 |
18 | Now, I already (know/have an indication of) the (genome size/number of chromosomes/ploidy/...) from (RADseq/flow cytometry/karyotypes/in situ/...) data. This does not make sense together with the smudgeplot because (it predicts unexpected ploidy/shows only one smudge/...).
19 |
20 | How should I understand my smudgeplot?
21 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | data
2 | figures
3 | playground
4 | docs
5 | exec/PloidyPlot
6 | exec/hetmers
7 | *.o
8 | .DS_Store
9 | smu2text_smu
10 |
--------------------------------------------------------------------------------
/FAQ.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | migrated to [wiki](https://github.com/tbenavi1/smudgeplot/wiki/FAQ)
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "{}"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright {yyyy} {name of copyright owner}
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
203 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | # PATH for libraries is guessed
2 | CFLAGS = -O3 -Wall -Wextra -Wno-unused-result -fno-strict-aliasing
3 |
4 | ifndef INSTALL_PREFIX
5 | INSTALL_PREFIX = /usr/local
6 | endif
7 |
8 | HET_KMERS_INST = $(INSTALL_PREFIX)/bin/smudgeplot.py $(INSTALL_PREFIX)/bin/hetmers
9 | SMUDGEPLOT_INST = $(INSTALL_PREFIX)/bin/smudgeplot_plot.R $(INSTALL_PREFIX)/bin/centrality_plot.R
10 |
11 | .PHONY : install
12 | install : $(HET_KMERS_INST) $(SMUDGEPLOT_INST)
13 |
14 | $(INSTALL_PREFIX)/bin/% : exec/%
15 | install -C $< $(INSTALL_PREFIX)/bin
16 |
17 | exec/hetmers: src_ploidyplot/PloidyPlot.c src_ploidyplot/libfastk.c src_ploidyplot/libfastk.h src_ploidyplot/matrix.c src_ploidyplot/matrix.h
18 | gcc $(CFLAGS) -o $@ src_ploidyplot/PloidyPlot.c src_ploidyplot/libfastk.c src_ploidyplot/matrix.c -lpthread -lm
19 |
20 |
21 | .PHONY : clean
22 | clean :
23 | rm -f exec/hetmers
24 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Smudgeplot
2 |
3 | **_Version: 0.4.0 Arched_**
4 |
5 | **_Authors: [Gene W Myers](https://github.com/thegenemyers), [Kamil S. Jaron](https://github.com/KamilSJaron), and Tianyi Ma_**
6 |
7 | ### Install the whole thing
8 |
9 | This version of smudgeplot operates on FastK k-mer databases. So, before installing smudgeplot, please install [FastK](https://github.com/thegenemyers/FASTK). The smudgeplot installation consists of one Python script, two R scripts, and a C backend that searches for all the k-mer pairs (hetmers) and needs to be compiled.
10 |
11 | #### Quick
12 |
13 | Assuming you have admin rights / can write to `/usr/local/bin`, you can simply run
14 |
15 | ```bash
16 | sudo make
17 | ```
18 | That should do everything necessary to make smudgeplot fully operational. You can run `smudgeplot.py --help` to see if it worked.
19 |
20 | #### Custom installation location
21 |
22 | If there is a different directory where you store your executables, you can pass the `INSTALL_PREFIX` variable to make. The binaries are then installed to `$INSTALL_PREFIX/bin`. For example
23 |
24 | ```bash
25 | make -s INSTALL_PREFIX=~
26 | ```
27 |
28 | will install smudgeplot to `~/bin/`.
29 |
30 | #### Manual installation
31 |
32 | Compiling the `C` executable
33 |
34 | ```
35 | make exec/hetmers # this compiles the hetmers backend (the k-mer pair searching engine of PloidyPlot)
36 | ```
37 |
38 | Now you can move all four files from the `exec` directory somewhere your system will find them (or alternatively, you can add that directory to your `$PATH` variable).
39 |
40 | ```
41 | install -C exec/smudgeplot.py /usr/local/bin
42 | install -C exec/hetmers /usr/local/bin
43 | install -C exec/smudgeplot_plot.R /usr/local/bin
44 | install -C exec/centrality_plot.R /usr/local/bin
45 | ```
46 |
47 | ### Running this version on Saccharomyces data
48 | Requires ~2.1GB of space and `FastK` and `smudgeplot` installed.
49 |
50 | ```bash
51 | # download data
52 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_1.fastq.gz
53 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_2.fastq.gz
54 |
55 | # move them to a reasonable place
56 | mkdir -p data/Scer
57 | mv *fastq.gz data/Scer/
58 |
59 | # run FastK to create a k-mer database
60 | FastK -v -t4 -k31 -M16 -T4 data/Scer/SRR3265401_[12].fastq.gz -Ndata/Scer/FastK_Table
61 |
62 | # Find all k-mer pairs in the dataset using hetmer module
63 | smudgeplot.py hetmers -L 12 -t 4 -o data/Scer/kmerpairs --verbose data/Scer/FastK_Table
64 | # this generates the `data/Scer/kmerpairs_text.smu` file;
65 | # it's a flat file with three columns: covB, covA, and freq (the number of k-mer pairs with those respective coverages)
66 |
67 | # use the .smu file to infer ploidy and create smudgeplot
68 | smudgeplot.py all -o data/Scer/trial_run data/Scer/kmerpairs_text.smu
69 |
70 | # check that a bunch of files were generated (3 PDFs, some summary tables, and logs)
71 | ls data/Scer/trial_run_*
72 | ```
73 |
74 | The y-axis scaling defaults to 100; one can specify the `ylim` argument to scale it differently.
75 |
76 | ```bash
77 | smudgeplot.py all -o data/Scer/trial_run_ylim70 data/Scer/kmerpairs_text.smu -ylim 70
78 | ```
79 |
80 | There is also a plotting module that requires the coverage and a list of smudges with their respective sizes in a tabular file. This plotting module performs no inference and should be used only if you already know the right answers.
81 |
82 | ### How smudgeplot works
83 |
84 | This tool extracts heterozygous kmer pairs from kmer count databases and performs gymnastics with them. We are able to disentangle genome structure by comparing the sum of kmer pair coverages (CovA + CovB) to their relative coverage (CovB / (CovA + CovB)). Such an approach also allows us to analyze obscure genomes with duplications, various ploidy levels, etc.
85 |
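The coordinate transformation described above can be sketched in a few lines of Python (a minimal illustration of the idea, not smudgeplot's actual implementation; the example coverage values are made up):

```python
# Minimal sketch: each heterozygous k-mer pair has a minor coverage (covB)
# and a major coverage (covA), with covB <= covA by convention.

def smudge_coordinates(covB, covA):
    """Map one k-mer pair to smudgeplot coordinates."""
    total = covA + covB            # x-axis: total pair coverage (CovA + CovB)
    ratio = covB / (covA + covB)   # y-axis: relative minor coverage, at most 0.5
    return total, ratio

# A diploid heterozygous pair (AB) at ~25x per haplotype:
print(smudge_coordinates(25, 25))   # -> (50, 0.5)
# A triploid AAB pair, where the major allele sits on two haplotypes:
print(smudge_coordinates(25, 50))   # -> (75, 0.333...)
```

Pairs from the same haplotype structure cluster around the same (total, ratio) point, which is what forms a smudge.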
86 | Smudgeplots are computed from raw (or, even better, trimmed) reads and show the haplotype structure using heterozygous k-mer pairs. For example (from an older version):
87 |
88 | 
89 |
90 | Every haplotype structure has a unique smudge on the graph and the heat of the smudge indicates how frequently the haplotype structure is represented in the genome compared to the other structures. The image above is an ideal case, where the sequencing coverage is sufficient to beautifully separate all the smudges, providing very strong and clear evidence of triploidy.
91 |
92 | This tool is planned to be a part of [GenomeScope](https://github.com/tbenavi1/genomescope2.0) in the near future.
93 |
94 | ### More about the use
95 |
96 | The input is a set of whole genome sequencing reads; the more coverage, the better. The method is designed to process big datasets, so don't hesitate to pool all your single-end/paired-end libraries together.
97 |
98 | The workflow is automatic, but it's not fool-proof; it requires some decisions. Use this tool jointly with [GenomeScope](https://github.com/tbenavi1/genomescope2.0). The tutorials on our wiki are currently outdated (built for version 0.2.5) and will be updated by the 18th of October.
99 |
100 | Smudgeplot generates two plots, one with coloration on a log scale and the other on a linear scale. The legend indicates approximate k-mer pair densities per tile. Note that a single polymorphism generates multiple heterozygous k-mers, so the reported numbers do not directly correspond to the number of variants; the actual number is approximately 1/k times the reported numbers, where k is the k-mer size (in the summary this is already recalculated). It's important to note that this process does not exhaustively attempt to find all heterozygous k-mers in the genome; only a sufficient sample is obtained to identify the relative genome structure. You can also report the minimal number of heterozygous loci, assuming the inference is correct.
101 |
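The 1/k relationship above can be illustrated with a back-of-the-envelope calculation (an illustrative sketch; the pair count used here is invented):

```python
# An isolated SNP is covered by up to k overlapping k-mers, so one variant
# can contribute up to k heterozygous k-mer pairs. Dividing by k therefore
# gives a rough lower-bound estimate of the number of underlying variants.

def approx_variants(het_kmer_pairs, k):
    """Approximate number of variants behind a heterozygous k-mer pair count."""
    return het_kmer_pairs / k

# e.g. 1.9 million heterozygous 21-mer pairs suggest very roughly 90k variants:
print(round(approx_variants(1_900_000, 21)))  # -> 90476
```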
102 | ### GenomeScope
103 |
104 | You can feed the kmer coverage histogram to GenomeScope. (Either run the [genomescope script](https://github.com/schatzlab/genomescope/blob/master/genomescope.R) or use the [web server](http://qb.cshl.edu/genomescope/))
105 |
106 | ```
107 | Rscript genomescope.R kmcdb_k21.hist [kmer_max] [verbose]
108 | ```
109 |
110 | This script estimates the size, heterozygosity, and repetitive fraction of the genome. By inspecting the fitted model you can determine the location of the smallest peak after the error tail. Then you can decide the low-end cutoff below which all k-mers are discarded as errors (approximately 0.5 times the haploid k-mer coverage), and the high-end cutoff above which all k-mers are discarded (approximately 8.5 times the haploid k-mer coverage).
111 |
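The rule-of-thumb cutoffs above can be written out as a tiny helper (an illustrative sketch, not part of GenomeScope or smudgeplot; the 0.5x/8.5x factors come from the text and the haploid coverage value is an example):

```python
# Rule-of-thumb k-mer histogram cutoffs from an estimated haploid coverage.

def kmer_cutoffs(haploid_cov):
    low = 0.5 * haploid_cov    # below this: discard k-mers as sequencing errors
    high = 8.5 * haploid_cov   # above this: discard k-mers (repeats, organelles)
    return round(low), round(high)

# e.g. with a haploid k-mer coverage of ~30x:
print(kmer_cutoffs(30))  # -> (15, 255)
```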
112 | ## Frequently Asked Questions
113 |
114 | Answers are collected on [our wiki](https://github.com/KamilSJaron/smudgeplot/wiki/FAQ). Smudgeplot is not very demanding on computational resources, but make sure you check the [memory requirements](https://github.com/KamilSJaron/smudgeplot/wiki/smudgeplot-hetkmers#memory-requirements) before you extract k-mer pairs (the `hetkmers` task). If you don't find an answer to your question in the FAQ, open an [issue](https://github.com/KamilSJaron/smudgeplot/issues/new/choose) or drop us an email.
115 |
116 | Check [projects](https://github.com/KamilSJaron/smudgeplot/projects) to see how the development goes.
117 |
118 | ## Contributions
119 |
120 | This is definitely an open project, contributions are welcome. You can check some of the ideas for the future in [projects](https://github.com/KamilSJaron/smudgeplot/projects) and in the development [dev](https://github.com/KamilSJaron/smudgeplot/tree/dev) branch. The file [playground/DEVELOPMENT.md](playground/DEVELOPMENT.md) contains some development notes. The directory [playground](playground) contains some snippets, attempts, and other items of interest.
121 |
122 | ## Reference
123 |
124 | Ranallo-Benavidez, T.R., Jaron, K.S. & Schatz, M.C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. *Nature Communications* **11**, 1432 (2020). https://doi.org/10.1038/s41467-020-14998-3
125 |
126 | ## Acknowledgements
127 |
128 | This [blogpost](http://www.everydayanalytics.ca/2014/09/5-ways-to-do-2d-histograms-in-r.html) by Myles Harrison has largely inspired the visual output of smudgeplots. The colourblind-friendly colour theme was suggested by @ggrimes. We are grateful for the helpful comments of beta testers and pre-release chatters!
129 |
--------------------------------------------------------------------------------
/exec/centrality_plot.R:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env Rscript
2 | args = commandArgs(trailingOnly=TRUE)
3 | input_name <- args[1]
4 | # input_name = '~/test/daAchMill1_centralities.txt'
5 |
6 | output_name <- gsub('\\.txt$', '.pdf', input_name) # anchored pattern: replace only a trailing .txt extension
7 | tested_covs <- read.table(input_name, col.names = c('cov', 'centrality'))
8 |
9 | pdf(output_name)
10 | plot(tested_covs[, 'cov'], tested_covs[, 'centrality'], xlab = 'Coverage', ylab = 'Centrality [(theoretical_center - actual_center) / coverage ]', pch = 20)
11 | dev.off()
--------------------------------------------------------------------------------
/exec/smudgeplot:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import argparse
4 | import sys
5 | import os
6 | from math import log
7 | from math import ceil
8 | import numpy as np
9 | from scipy.signal import argrelextrema
10 |
11 | version = '0.2.0'
12 |
13 | class parser():
14 | def __init__(self):
15 | argparser = argparse.ArgumentParser(
16 | # description='Inference of ploidy and heterozygosity structure using whole genome sequencing data',
17 | usage='''smudgeplot [options] \n
18 | tasks: cutoff Calculate meaningful values for lower/upper kmer histogram cutoff.
19 | hetkmers Calculate unique kmer pairs from a Jellyfish or KMC dump file.
20 | plot Generate 2d histogram; infere ploidy and plot a smudgeplot.\n\n''')
21 | argparser.add_argument('task', help='Task to execute; for task specific options execute smudgeplot -h')
22 | argparser.add_argument('-v', '--version', action="store_true", default = False, help="print the version and exit")
23 | # print version is a special case
24 | if len(sys.argv) > 1:
25 | if sys.argv[1] in ['-v', '--version']:
26 | self.task = "version"
27 | return
28 | # the following line either prints help and die; or assign the name of task to variable task
29 | self.task = argparser.parse_args([sys.argv[1]]).task
30 | else:
31 | self.task = ""
32 | # if the task is known (i.e. defined in this file);
33 | if hasattr(self, self.task):
34 | # load arguments of that task
35 | getattr(self, self.task)()
36 | else:
37 | argparser.print_usage()
38 | print('"' + self.task + '" is not a valid task name')
39 | exit(1)
40 |
41 | def hetkmers(self):
42 | '''
43 | Calculate unique kmer pairs from a Jellyfish or KMC dump file.
44 | '''
45 | argparser = argparse.ArgumentParser(prog = 'smudgeplot hetkmers',
46 | description='Calculate unique kmer pairs from a Jellyfish or KMC dump file.')
47 | argparser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='Alphabetically sorted Jellyfish or KMC dump file (stdin).')
48 | argparser.add_argument('-o', help='The pattern used to name the output (kmerpairs).', default='kmerpairs')
49 | argparser.add_argument('-k', help='The length of the kmer.', default=21)
50 | argparser.add_argument('-t', help='Number of processes to use.', default = 4)
51 | argparser.add_argument('--middle', dest='middle', action='store_const', const = True, default = False,
52 | help='Get all kmer pairs one SNP away from each other (default: just the middle one).')
53 | self.arguments = argparser.parse_args(sys.argv[2:])
54 |
55 | def plot(self):
56 | '''
57 | Generate 2d histogram; infer ploidy and plot a smudgeplot.
58 | '''
59 | argparser = argparse.ArgumentParser(prog = 'smudgeplot plot', description='Generate 2d histogram for smudgeplot')
60 | argparser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='name of the input tsv file with covarages (default \"coverages_2.tsv\")."')
61 | argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot')
62 | argparser.add_argument('-q', help='Remove kmer pairs with coverage over the specified quantile; (default none).', type=float, default=1)
63 | argparser.add_argument('-L', help='The lower boundary used when dumping kmers (default min(total_pair_cov) / 2).', type=int, default=0)
64 | argparser.add_argument('-n', help='The expected haploid coverage (default estimated from data).', type=int, default=0)
65 | argparser.add_argument('-t', '--title', help='name printed at the top of the smudgeplot (default none).', default='')
66 | argparser.add_argument('-m', '-method', help='The algorithm for annotation of smudges (default \'local_aggregation\')', default='local_aggregation')
67 | argparser.add_argument('-nbins', help='The number of nbins used for smudgeplot matrix (nbins x nbins) (default autodetection).', type=int, default=0)
68 | # argparser.add_argument('-k', help='The length of the kmer.', default=21)
69 | argparser.add_argument('-kmer_file', help='Name of the input files containing kmer seuqences (assuming the same order as in the coverage file)', default = "")
70 | argparser.add_argument('--homozygous', action="store_true", default = False, help="Assume no heterozygosity in the genome - plotting a paralog structure; (default False).")
71 | self.arguments = argparser.parse_args(sys.argv[2:])
72 |
73 | def cutoff(self):
74 | '''
75 | Calculate meaningful values for lower/upper kmer histogram cutoff.
76 | '''
77 | argparser = argparse.ArgumentParser(prog = 'smudgeplot cutoff', description='Calculate meaningful values for lower/upper kmer histogram cutoff.')
78 | argparser.add_argument('infile', type=argparse.FileType('r'), help='Name of the input kmer histogram file (default \"kmer.hist\")."')
79 | argparser.add_argument('boundary', help='Which bounary to compute L (lower, default) or U (upper)', default = 'L')
80 | self.arguments = argparser.parse_args(sys.argv[2:])
81 |
82 |
83 | def round_up_nice(x):
84 | digits = ceil(log(x, 10))
85 | if digits <= 1:
86 | multiplier = 10 ** (digits - 1)
87 | else:
88 | multiplier = 10 ** (digits - 2)
89 | return(ceil(x / multiplier) * multiplier)
90 |
91 | def cutoff(args):
92 | # kmer_hist = open("data/Mflo2/kmer.hist","r")
93 | kmer_hist = args.infile
94 | hist = np.array([int(line.split()[1]) for line in kmer_hist])
95 | if args.boundary == "L":
96 | local_minima = argrelextrema(hist, np.less)[0][0]
97 | L = max(10, int(round(local_minima * 1.25)))
98 | print(L, end = '')
99 | else:
100 | # take the 99.8% quantile of k-mers that occur more than once in the read set
101 | hist_rel_cumsum = np.cumsum(hist[1:]) / np.sum(hist[1:])
102 | U = round_up_nice(np.argmax(hist_rel_cumsum > 0.998))
103 | print(U, end = '')
104 |
105 | def main():
106 | _parser = parser()
107 |
108 | print('Running smudgeplot v' + version)
109 | if _parser.task == "version":
110 | exit(0)
111 |
112 | print('Task: ' + _parser.task)
113 |
114 | if _parser.task == "cutoff":
115 | cutoff(_parser.arguments)
116 |
117 | # if _parser.task == "hetkmers":
118 | # hetkmers(_parser.arguments)
119 | #
120 | # if _parser.task == "plot":
121 | # call .R script
122 | # plot(_parser.arguments)
123 |
124 | print('Done!')
125 | exit(0)
126 |
127 | if __name__=='__main__':
128 | main()
--------------------------------------------------------------------------------
/exec/smudgeplot.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import argparse
4 | import sys
5 | import numpy as np
6 | from pandas import read_csv # type: ignore
7 | from pandas import DataFrame # type: ignore
8 | from numpy import arange
9 | from numpy import argmin
10 | from numpy import concatenate
11 | from os import system
12 | from math import log
13 | from math import ceil
14 | from statistics import fmean
15 | from collections import defaultdict
16 | # import matplotlib as mpl
17 | # import matplotlib.pyplot as plt
18 | # from matplotlib.pyplot import plot
19 |
20 | version = '0.4.0dev'
21 |
22 | ############################
23 | # processing of user input #
24 | ############################
25 |
26 | class parser():
27 | def __init__(self):
28 | argparser = argparse.ArgumentParser(
29 | # description='Inference of ploidy and heterozygosity structure using whole genome sequencing data',
30 | usage='''smudgeplot [options] \n
31 | tasks: cutoff Calculate meaningful values for lower kmer histogram cutoff.
32 | hetmers Calculate unique kmer pairs from a FastK k-mer database.
33 | peak_agregation Aggregates smudges using the local aggregation algorithm.
34 | plot Generate 2d histogram; infer ploidy and plot a smudgeplot.
35 | all Runs all the steps (with default options)\n\n''')
36 | # removing this for now;
37 | # extract Extract kmer pairs within specified coverage sum and minor coverage ratio ranges
38 | argparser.add_argument('task', help='Task to execute; for task specific options execute smudgeplot -h')
39 | argparser.add_argument('-v', '--version', action="store_true", default = False, help="print the version and exit")
40 | # print version is a special case
41 | if len(sys.argv) > 1:
42 | if sys.argv[1] in ['-v', '--version']:
43 | self.task = "version"
44 | return
45 | # the following line either prints help and exits, or assigns the task name to self.task
46 | self.task = argparser.parse_args([sys.argv[1]]).task
47 | else:
48 | self.task = ""
49 | # if the task is known (i.e. defined in this file);
50 | if hasattr(self, self.task):
51 | # load arguments of that task
52 | getattr(self, self.task)()
53 | else:
54 | argparser.print_usage()
55 | sys.stderr.write('"' + self.task + '" is not a valid task name\n')
56 | exit(1)
57 |
58 | def hetmers(self):
59 | '''
60 | Calculate unique kmer pairs from a FastK k-mer database.
61 | '''
62 | argparser = argparse.ArgumentParser(prog = 'smudgeplot hetmers',
63 | description='Calculate unique kmer pairs from FastK k-mer database.')
64 | argparser.add_argument('infile', nargs='?', help='Input FastK database (.ktab) file.')
65 | argparser.add_argument('-L', help='Count threshold below which k-mers are considered erroneous', type=int)
66 | argparser.add_argument('-t', help='Number of threads (default 4)', type=int, default=4)
67 | argparser.add_argument('-o', help='The pattern used to name the output (kmerpairs).', default='kmerpairs')
68 | argparser.add_argument('-tmp', help='Directory where all temporary files will be stored (default .).', default='.')
69 | argparser.add_argument('--verbose', action="store_true", default = False, help='verbose mode')
70 | self.arguments = argparser.parse_args(sys.argv[2:])
71 |
72 | def plot(self):
73 | '''
74 | Generate 2d histogram; infer ploidy and plot a smudgeplot.
75 | '''
76 | argparser = argparse.ArgumentParser(prog = 'smudgeplot plot', description='Generate 2d histogram for smudgeplot')
77 | argparser.add_argument('infile', help='name of the input tsv file with coverages and frequencies')
78 | argparser.add_argument('smudgefile', help='name of the input tsv file with sizes of individual smudges')
79 | argparser.add_argument('n', help='The expected haploid coverage.', type=float)
80 | argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot')
81 |
82 | argparser = self.add_plotting_arguments(argparser)
83 |
84 | self.arguments = argparser.parse_args(sys.argv[2:])
85 |
86 | def cutoff(self):
87 | '''
88 | Calculate meaningful values for lower/upper kmer histogram cutoff.
89 | '''
90 | argparser = argparse.ArgumentParser(prog = 'smudgeplot cutoff', description='Calculate meaningful values for lower/upper kmer histogram cutoff.')
91 | argparser.add_argument('infile', type=argparse.FileType('r'), help='Name of the input kmer histogram file (default \"kmer.hist\").')
92 | argparser.add_argument('boundary', help='Which boundary to compute: L (lower) or U (upper)')
93 | self.arguments = argparser.parse_args(sys.argv[2:])
94 |
95 | def peak_agregation(self):
96 | '''
97 | Extract kmer pairs within specified coverage sum and minor coverage ratio ranges.
98 | '''
99 | argparser = argparse.ArgumentParser()
100 | argparser.add_argument('infile', nargs='?', help='name of the input tsv file with coverages and frequencies.')
101 | argparser.add_argument('-nf', '-noise_filter', help='Do not aggregate k-mer pairs with frequency lower than this into a smudge', type=int, default=50)
102 | argparser.add_argument('-d', '-distance', help='Manhattan distance within which k-mer pairs are considered neighbouring for the local aggregation.', type=int, default=5)
103 | argparser.add_argument('--mask_errors', help='instead of reporting assignments to individual smudges, just remove all monotonically decreasing points from the error line', action="store_true", default = False)
104 | self.arguments = argparser.parse_args(sys.argv[2:])
105 |
106 | def all(self):
107 | argparser = argparse.ArgumentParser()
108 | argparser.add_argument('infile', nargs='?', help='name of the input tsv file with coverages and frequencies.')
109 | argparser.add_argument('-o', help='The pattern used to name the output (smudgeplot).', default='smudgeplot')
110 | argparser.add_argument('-cov_min', help='Minimal coverage to explore (default 6)', default=6, type = int)
111 | argparser.add_argument('-cov_max', help='Maximal coverage to explore (default 60)', default=60, type = int)
112 | argparser.add_argument('-cov', help='Define coverage instead of inferring it. Disables cov_min and cov_max.', default=0, type=int)
113 |
114 | argparser = self.add_plotting_arguments(argparser)
115 |
116 | self.arguments = argparser.parse_args(sys.argv[2:])
117 |
118 | def add_plotting_arguments(self, argparser):
119 | argparser.add_argument('-c', '-cov_filter', help='Filter pairs with one of them having coverage below the specified threshold (default 0; disables parameter L)', type=int, default=0)
120 | argparser.add_argument('-t', '--title', help='name printed at the top of the smudgeplot (default none).', default='')
121 | argparser.add_argument('-ylim', help='The upper limit for the coverage sum (the y axis)', type = int, default=0)
122 | argparser.add_argument('-col_ramp', help='An R palette used for the plot (default "viridis", other sensible options are "magma", "mako" or "grey.colors" - recommended in combination with --invert_cols).', default='viridis')
123 | argparser.add_argument('--invert_cols', action="store_true", default = False, help="Revert the colour palette (default False).")
124 | return(argparser)
125 |
126 | def format_aguments_for_R_plotting(self):
127 | plot_args = ""
128 | if self.arguments.c != 0:
129 | plot_args += " -c " + str(self.arguments.c)
130 | if self.arguments.title:
131 | plot_args += " -t \"" + self.arguments.title + "\""
132 | if self.arguments.ylim != 0:
133 | plot_args += " -ylim " + str(self.arguments.ylim)
134 | if self.arguments.col_ramp:
135 | plot_args += " -col_ramp \"" + self.arguments.col_ramp + "\""
136 | if self.arguments.invert_cols:
137 | plot_args += " --invert_cols"
138 | return(plot_args)
139 |
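The `format_aguments_for_R_plotting` method above concatenates user-supplied strings (title, palette name) straight into a shell command line; wrapping each value with Python's `shlex.quote` makes such concatenation robust to spaces and shell metacharacters. A minimal sketch — `build_args` is a hypothetical helper, not part of smudgeplot:

```python
import shlex

def build_args(title: str, col_ramp: str) -> str:
    """Assemble R plotting arguments with shell-safe quoting (illustrative only)."""
    parts = []
    if title:
        parts += ["-t", shlex.quote(title)]      # quoted only if needed
    if col_ramp:
        parts += ["-col_ramp", shlex.quote(col_ramp)]
    return " ".join(parts)

# Plain tokens pass through untouched; anything with spaces gets single-quoted.
print(build_args("my species", "viridis"))  # -t 'my species' -col_ramp viridis
```

The same idea applies to the `system(...)` calls further down: building an argument list and passing it to `subprocess.run` avoids shell interpolation entirely.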
140 | ###############
141 | # task cutoff #
142 | ###############
143 |
144 | # taken from https://stackoverflow.com/a/29614335
145 | def local_min(ys):
146 | return [i for i, y in enumerate(ys)
147 | if ((i == 0) or (ys[i - 1] >= y))
148 | and ((i == len(ys) - 1) or (y < ys[i+1]))]
149 |
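To see what `local_min` returns on a k-mer-histogram-shaped series, here is a small self-contained run; the helper is copied from above and the input counts are invented:

```python
def local_min(ys):
    # indices i where ys[i] is a local minimum (taken from https://stackoverflow.com/a/29614335)
    return [i for i, y in enumerate(ys)
            if ((i == 0) or (ys[i - 1] >= y))
            and ((i == len(ys) - 1) or (y < ys[i + 1]))]

# Toy histogram: error k-mers decay (100, 40, 10, 5), then the genomic peak rises.
hist = [100, 40, 10, 5, 8, 20, 30, 25]
print(local_min(hist))  # the first reported index marks the error/genomic boundary
```

The `cutoff` task uses only the first minimum (`local_min(hist)[0]`) and inflates it by 25 % to place the lower coverage cutoff L.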
150 | def round_up_nice(x):
151 | digits = ceil(log(x, 10))
152 | if digits <= 1:
153 | multiplier = 10 ** (digits - 1)
154 | else:
155 | multiplier = 10 ** (digits - 2)
156 | return(ceil(x / multiplier) * multiplier)
157 |
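`round_up_nice` rounds a value up to two significant digits (one digit for values below 10). A quick check with made-up inputs, using a verbatim copy of the function:

```python
from math import ceil, log

def round_up_nice(x):
    # round up to 2 significant digits (1 for x < 10) -- same logic as above
    digits = ceil(log(x, 10))
    multiplier = 10 ** (digits - 1) if digits <= 1 else 10 ** (digits - 2)
    return ceil(x / multiplier) * multiplier

print(round_up_nice(7), round_up_nice(137), round_up_nice(1234))
```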
158 | def cutoff(args):
159 | # kmer_hist = open("data/Scer/kmc_k31.hist","r")
160 | kmer_hist = args.infile
161 | hist = [int(line.split()[1]) for line in kmer_hist]
162 | if args.boundary == "L":
163 | local_minima = local_min(hist)[0]
164 | L = max(10, int(round(local_minima * 1.25)))
165 | sys.stdout.write(str(L))
166 | else:
167 | sys.stderr.write('Warning: We discourage using the original hetmer algorithm.\n\tThe updated (recommended) version does not take the argument U\n')
168 | # take the 99.8% quantile of k-mers that occur more than once in the read set
169 | number_of_kmers = sum(hist[1:])
170 | hist_rel_cumsum = [sum(hist[1:i+1]) / number_of_kmers for i in range(1, len(hist))]
172 | U = round_up_nice(min([i for i, q in enumerate(hist_rel_cumsum) if q > 0.998]))
173 | sys.stdout.write(str(U))
174 | sys.stdout.flush()
175 |
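The upper cutoff U is simply the first coverage bin where the cumulative fraction of non-singleton k-mers exceeds 99.8 %; with numpy (as in the older copy of this script) the index search collapses to one `argmax`. A toy histogram with invented counts:

```python
import numpy as np

# hist[i] = number of distinct k-mers with count i+1; the first bin (likely errors) is dropped
hist = np.array([5000, 900, 90, 9, 1])
frac = np.cumsum(hist[1:]) / np.sum(hist[1:])  # cumulative fractions: 0.9, 0.99, 0.999, 1.0
idx = int(np.argmax(frac > 0.998))             # first index strictly above the 99.8% quantile
print(idx)
```

In the real task this index is then passed through `round_up_nice` to get a readable cutoff value.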
176 | ########################
177 | # task peak_agregation #
178 | ########################
179 |
180 | def load_hetmers(smufile):
181 | cov_tab = read_csv(smufile, names = ['covB', 'covA', 'freq'], sep='\t')
182 | cov_tab = cov_tab.sort_values('freq', ascending = False)
183 | return(cov_tab)
184 |
185 | def local_agregation(cov_tab, distance, noise_filter, mask_errors):
186 | # generate a dictionary that gives us for each combination of coverages a frequency
187 | cov2freq = defaultdict(int)
188 | cov2peak = defaultdict(int)
189 |
190 | L = min(cov_tab['covB']) # important only when --mask_errors is on
191 |
192 | ### clustering
193 | next_peak = 1
194 | for idx, covB, covA, freq in cov_tab.itertuples():
195 | cov2freq[(covA, covB)] = freq # build the frequency dictionary on the fly; values not processed yet are never needed
196 | if freq < noise_filter:
197 | break
198 | highest_neighbour_coords = (0, 0)
199 | highest_neighbour_freq = 0
200 | # for each k-mer pair, retrieve all neighbours within the Manhattan distance
201 | for xA in range(covA - distance, covA + distance + 1):
202 | # for each A coverage in the neighbourhood, explore all possible B coordinates
203 | distanceB = distance - abs(covA - xA)
204 | for xB in range(covB - distanceB, covB + distanceB + 1):
205 | nB, nA = sorted([xA, xB]) # canonical order (minor coverage first) without mutating the loop variables
206 | # iterate only through those that were assigned already
207 | # and record only the one with the highest frequency
208 | if cov2peak[(nA, nB)] and cov2freq[(nA, nB)] > highest_neighbour_freq:
209 | highest_neighbour_coords = (nA, nB)
210 | highest_neighbour_freq = cov2freq[(nA, nB)]
211 | if highest_neighbour_freq > 0:
212 | cov2peak[(covA, covB)] = cov2peak[highest_neighbour_coords]
213 | else:
214 | # print("new peak:", (covA, covB))
215 | if mask_errors:
216 | if covB < L + distance:
217 | cov2peak[(covA, covB)] = 1 # error line
218 | else:
219 | cov2peak[(covA, covB)] = 0 # central smudges
220 | else:
221 | cov2peak[(covA, covB)] = next_peak # keep info about all locally aggregated smudges
222 | next_peak += 1
223 | return(cov2peak)
224 |
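The nested loops in `local_agregation` enumerate every lattice point within a given Manhattan distance of a coverage pair, canonicalised so the minor coverage comes second. Pulled out on its own (the coordinates below are made up):

```python
def manhattan_neighbours(covA, covB, distance):
    """All (covA', covB') with |covA - covA'| + |covB - covB'| <= distance,
    canonicalised so that covB' <= covA' (illustrative sketch)."""
    coords = set()
    for xA in range(covA - distance, covA + distance + 1):
        distanceB = distance - abs(covA - xA)   # remaining budget for the B axis
        for xB in range(covB - distanceB, covB + distanceB + 1):
            nB, nA = sorted([xA, xB])           # minor coverage first
            coords.add((nA, nB))
    return sorted(coords)

print(manhattan_neighbours(10, 5, 1))  # the 5 cells within Manhattan distance 1
```

Because k-mer pairs are visited in order of decreasing frequency, checking only already-assigned neighbours is enough to attach each pair to the locally densest peak.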
225 | def peak_agregation(args):
226 | ### load data
227 | cov_tab = load_hetmers(args.infile)
228 |
229 | cov2peak = local_agregation(cov_tab, args.d, args.nf, mask_errors = False)
230 |
231 | cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True)
232 | for idx, covB, covA, freq in cov_tab.itertuples():
233 | peak = cov2peak[(covA, covB)]
234 | sys.stdout.write(f"{covB}\t{covA}\t{freq}\t{peak}\n")
235 | sys.stdout.flush()
236 |
237 | def get_smudge_container(cov_tab, cov, smudge_filter):
238 | smudge_container = dict()
239 | genomic_cov_tab = cov_tab[cov_tab['peak'] == 0] # this removes all the marked errors
240 | total_kmer_pairs = sum(genomic_cov_tab['freq'])
241 |
242 | for Bs in range(1,9):
243 | min_cov = 0 if Bs == 1 else cov * (Bs - 0.5)
244 | max_cov = cov * (Bs + 0.5)
245 | cov_tab_isoB = genomic_cov_tab.loc[(genomic_cov_tab["covB"] > min_cov) & (genomic_cov_tab["covB"] < max_cov)] #
246 |
247 | for As in range(Bs,(17 - Bs)):
248 | min_cov = 0 if As == 1 else cov * (As - 0.5)
249 | max_cov = cov * (As + 0.5)
250 | cov_tab_iso_smudge = cov_tab_isoB.loc[(cov_tab_isoB["covA"] > min_cov) & (cov_tab_isoB["covA"] < max_cov)]
251 | if sum(cov_tab_iso_smudge['freq']) / total_kmer_pairs > smudge_filter:
252 | # sys.stderr.write(f"{As}A{Bs}B: {sum(cov_tab_iso_smudge['freq']) / total_kmer_pairs}\n")
253 | smudge_container["A" * As + "B" * Bs] = cov_tab_iso_smudge
254 | return(smudge_container)
255 |
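`get_smudge_container` brackets each smudge between half-integer multiples of the 1n coverage, with the lower bound dropped to 0 for the first copy so low-coverage tails are not lost. The window arithmetic on its own, with an assumed 1n coverage of 20:

```python
def smudge_window(cov, n_copies):
    # coverage interval for the n-th copy: (cov*(n-0.5), cov*(n+0.5)), open at 0 for n == 1
    min_cov = 0 if n_copies == 1 else cov * (n_copies - 0.5)
    max_cov = cov * (n_copies + 0.5)
    return (min_cov, max_cov)

for n in (1, 2, 3):
    print(n, smudge_window(20, n))  # e.g. copy 2 spans coverages 30..50
```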
256 | def get_centrality(smudge_container, cov):
257 | centralities = list()
258 | freqs = list()
259 | for smudge in smudge_container.keys():
260 | As = smudge.count('A')
261 | Bs = smudge.count('B')
262 | smudge_tab = smudge_container[smudge]
263 | kmer_in_the_smudge = sum(smudge_tab['freq'])
264 | freqs.append(kmer_in_the_smudge)
265 | # center as a mean
266 | # center_A = sum((smudge_tab['freq'] * smudge_tab['covA'])) / kmer_in_the_smudge
267 | # center_B = sum((smudge_tab['freq'] * smudge_tab['covB'])) / kmer_in_the_smudge
268 | # center as a mode
269 | center = smudge_tab.loc[smudge_tab['freq'].idxmax()]
270 | center_A = center['covA']
271 | center_B = center['covB']
272 | ## empirical distance to edge
273 | # distA = min([abs(smudge_tab['covA'].max() - center['covA']), abs(center['covA'] - smudge_tab['covA'].min())])
274 | # distB = min([abs(smudge_tab['covB'].max() - center['covB']), abs(center['covB'] - smudge_tab['covB'].min())])
275 | ## theoretical distance to edge
276 | # distA = min(abs(center['covA'] - (cov * (As - 0.5))), abs((cov * (As + 0.5)) - center['covA']))
277 | # distB = min(abs(center['covB'] - (cov * (Bs - 0.5))), abs((cov * (Bs + 0.5)) - center['covB']))
278 | ## theoretical relative distance to the center
279 | distA = abs((center_A - (cov * As)) / cov)
280 | distB = abs((center_B - (cov * Bs)) / cov)
281 |
282 | # sys.stderr.write(f"Processing: {As}A{Bs}B; with center: {distA}, {distB}\n")
283 | centrality = distA + distB
284 | centralities.append(centrality)
285 |
286 | if len(centralities) == 0:
287 | return(1)
288 | return(fmean(centralities, weights=freqs))
289 |
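Each smudge's centrality is the relative offset of its mode from the theoretical centre, summed over both axes; smudges are then combined with a frequency-weighted mean (`statistics.fmean` accepts `weights` from Python 3.11). A worked toy example with invented numbers, writing the weighted mean out explicitly so it runs on any Python:

```python
cov = 20.0  # assumed 1n coverage (made up for illustration)

# (As, Bs, mode_A, mode_B): an AAB smudge whose mode sits at (41, 19),
# and a perfectly centred AB smudge at (20, 20)
smudges = [(2, 1, 41, 19), (1, 1, 20, 20)]
freqs = [3000, 7000]  # k-mer pairs per smudge

centralities = []
for As, Bs, center_A, center_B in smudges:
    distA = abs((center_A - cov * As) / cov)  # relative offset on the A axis
    distB = abs((center_B - cov * Bs) / cov)  # relative offset on the B axis
    centralities.append(distA + distB)

# frequency-weighted mean, i.e. what fmean(centralities, weights=freqs) computes
weighted = sum(c * w for c, w in zip(centralities, freqs)) / sum(freqs)
print(weighted)
```

A correctly guessed 1n coverage puts every smudge mode near its theoretical centre, so the weighted centrality is minimised there.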
290 | def test_coverage_range(cov_tab, min_c, max_c, smudge_size_cutoff = 0.02):
291 | # covs_to_test = range(min_c, max_c)
292 | covs_to_test = arange(min_c + 0.05, max_c + 0.05, 2)
293 | cov_centralities = list()
294 | for cov in covs_to_test:
295 | smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff)
296 | cov_centralities.append(get_centrality(smudge_container, cov))
297 |
298 | best_coverage = covs_to_test[argmin(cov_centralities)]
299 |
300 | tenths_to_test = arange(best_coverage - 1.9, best_coverage + 1.9, 0.2)
301 | tenths_centralities = list()
302 | for cov in tenths_to_test:
303 | smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff)
304 | tenths_centralities.append(get_centrality(smudge_container, cov))
305 |
306 | best_tenth = tenths_to_test[argmin(tenths_centralities)]
307 | sys.stderr.write(f"Best coverage to precision of one tenth: {round(best_tenth, 2)}\n")
308 |
309 | hundredths_to_test = list(arange(best_tenth - 0.19, best_tenth + 0.19, 0.01))
310 | hundredths_centralities = list()
311 | for cov in hundredths_to_test:
312 | smudge_container = get_smudge_container(cov_tab, cov, smudge_size_cutoff)
313 | hundredths_centralities.append(get_centrality(smudge_container, cov))
314 |
315 | final_cov = hundredths_to_test[argmin(hundredths_centralities)]
316 | just_to_be_sure_cov = final_cov/2
317 |
318 | hundredths_to_test.append(just_to_be_sure_cov)
319 | smudge_container = get_smudge_container(cov_tab, just_to_be_sure_cov, smudge_size_cutoff)
320 | hundredths_centralities.append(get_centrality(smudge_container, just_to_be_sure_cov))
321 |
322 | final_cov = hundredths_to_test[argmin(hundredths_centralities)]
323 | sys.stderr.write(f"Best coverage to precision of one hundredth: {round(final_cov, 3)}\n")
324 |
325 | all_coverages = concatenate((covs_to_test, tenths_to_test, hundredths_to_test))
326 | all_centralities = concatenate((cov_centralities, tenths_centralities, hundredths_centralities))
327 |
328 | return(DataFrame({'coverage': all_coverages, 'centrality': all_centralities}))
329 |
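`test_coverage_range` is a coarse-to-fine grid search: a 2-unit sweep, then a 0.2 grid around the winner, then a 0.01 grid around that winner (plus a half-coverage sanity check). The same three-stage refinement on a toy objective — the target value 17.23 and the quadratic objective are invented for illustration:

```python
from numpy import arange, argmin

def coarse_to_fine(objective, min_c, max_c):
    # stage 1: 2-unit grid, offset by 0.05 as in test_coverage_range
    covs = arange(min_c + 0.05, max_c + 0.05, 2)
    best = covs[argmin([objective(c) for c in covs])]
    # stage 2: tenths around the coarse winner
    tenths = arange(best - 1.9, best + 1.9, 0.2)
    best = tenths[argmin([objective(c) for c in tenths])]
    # stage 3: hundredths around the tenths winner
    hundredths = arange(best - 0.19, best + 0.19, 0.01)
    return hundredths[argmin([objective(c) for c in hundredths])]

target = 17.23
best = coarse_to_fine(lambda c: (c - target) ** 2, 6, 60)
print(best)
```

The real objective (centrality over annotated smudges) is not convex, which is why the script also re-tests half the best coverage: a diploid fit at 2n can look almost as central as the true 1n fit.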
330 | #####################
331 | # the script itself #
332 | #####################
333 |
334 | def main():
335 | _parser = parser()
336 |
337 | sys.stderr.write('Running smudgeplot v' + version + "\n")
338 | if _parser.task == "version":
339 | exit(0)
340 |
341 | sys.stderr.write('Task: ' + _parser.task + "\n")
342 |
343 | if _parser.task == "cutoff":
344 | cutoff(_parser.arguments)
345 |
346 | if _parser.task == "hetmers":
347 | # PloidyPlot is expected to be installed on the system, as well as the R library supporting it
348 | args = _parser.arguments
349 | plot_args = " -o" + str(args.o)
350 | plot_args += " -e" + str(args.L)
351 | plot_args += " -T" + str(args.t)
352 | if args.verbose:
353 | plot_args += " -v"
354 | if args.tmp != '.':
355 | plot_args += " -P" + args.tmp
356 | plot_args += " " + args.infile
357 |
358 | sys.stderr.write("Calling: hetmers (PloidyPlot kmer pair search) " + plot_args + "\n")
359 | system("hetmers " + plot_args)
360 |
361 | if _parser.task == "plot":
362 | # the plotting script is expected to be installed on the system, as well as the R library supporting it
363 | args = _parser.arguments
364 |
365 | plot_args = f'-i "{args.infile}" -s "{args.smudgefile}" -n {args.n} -o "{args.o}" ' + _parser.format_aguments_for_R_plotting()
366 |
367 | sys.stderr.write("Calling: smudgeplot_plot.R " + plot_args + "\n")
368 | system("smudgeplot_plot.R " + plot_args)
369 |
370 | if _parser.task == "peak_agregation":
371 | peak_agregation(_parser.arguments)
372 |
373 | if _parser.task == "all":
374 | args = _parser.arguments
375 |
376 | sys.stderr.write("\nLoading data\n")
377 | cov_tab = load_hetmers(args.infile)
378 | # cov_tab = load_hetmers("data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt")
379 |
380 | sys.stderr.write("\nMasking errors using local agregation algorithm\n")
381 | cov2peak = local_agregation(cov_tab, distance = 1, noise_filter = 1000, mask_errors = True)
382 | cov_tab['peak'] = [cov2peak[(covA, covB)] for idx, covB, covA, freq in cov_tab.itertuples()]
383 |
384 | cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True)
385 | total_kmers = sum(cov_tab['freq'])
386 | genomic_kmers = sum(cov_tab[cov_tab['peak'] == 0]['freq'])
387 | total_error_kmers = sum(cov_tab[cov_tab['peak'] == 1]['freq'])
388 | error_fraction = total_error_kmers / total_kmers
389 | sys.stderr.write(f"Total kmers: {total_kmers}\n\tGenomic kmers: {genomic_kmers}\n\tSequencing errors: {total_error_kmers}\n\tFraction of errors: {round(total_error_kmers/total_kmers, 3)}\n")
390 |
391 | with open(args.o + "_masked_errors_smu.txt", 'w') as error_annotated_smu:
392 | error_annotated_smu.write("covB\tcovA\tfreq\tis_error\n")
393 | for idx, covB, covA, freq, is_error in cov_tab.itertuples():
394 | error_annotated_smu.write(f"{covB}\t{covA}\t{freq}\t{is_error}\n") # might not be needed
395 |
396 | sys.stderr.write("\nInferring 1n coverage using the grid algorithm\n")
397 |
398 | smudge_size_cutoff = 0.001 # fraction of all k-mer pairs a smudge needs to be considered a valid smudge
399 |
400 | if args.cov == 0: # coverage not specified by the user
401 | centralities = test_coverage_range(cov_tab, args.cov_min, args.cov_max, smudge_size_cutoff)
402 | np.savetxt(args.o + "_centralities.txt", np.around(centralities, decimals=6), fmt="%.4f", delimiter = '\t')
403 | # plot(centralities['coverage'], centralities['coverage'])
404 |
405 | if error_fraction < 0.75:
406 | cov = centralities['coverage'][argmin(centralities['centrality'])]
407 | else:
408 | sys.stderr.write(f"Too many errors observed: {error_fraction}, not trusting coverage inference\n")
409 | cov = 0
410 |
411 | sys.stderr.write(f"\nInferred coverage: {cov}\n")
412 | else:
413 | cov = args.cov
414 |
415 | final_smudges = get_smudge_container(cov_tab, cov, 0.03)
416 | # sys.stderr.write(str(final_smudges) + '\n')
417 |
418 | annotated_smudges = list(final_smudges.keys())
419 | smudge_sizes = [round(sum(final_smudges[smudge]['freq']) / genomic_kmers, 4) for smudge in annotated_smudges]
420 |
421 | sys.stderr.write(f'Detected smudges / sizes ({args.o}_smudge_sizes.txt):\n')
422 | sys.stderr.write('\t' + str(annotated_smudges) + '\n')
423 | sys.stderr.write('\t' + str(smudge_sizes) + '\n')
424 |
425 | # saving smudge sizes
426 | smudge_table = DataFrame({'smudge': annotated_smudges, 'size': smudge_sizes})
427 | np.savetxt(args.o + "_smudge_sizes.txt", smudge_table, fmt='%s', delimiter = '\t')
428 |
429 | sys.stderr.write("\nPlotting\n")
430 |
431 | system("centrality_plot.R " + args.o + "_centralities.txt")
432 | # Rscript playground/alternative_fitting/alternative_plotting_testing.R -i data/dicots/peak_agregation/$ToLID.cov_tab_peaks -o data/dicots/peak_agregation/$ToLID
433 | args = _parser.arguments
434 |
435 | plot_args = f'-i "{args.o}_masked_errors_smu.txt" -s "{args.o}_smudge_sizes.txt" -n {round(cov, 3)} -o "{args.o}" ' + _parser.format_aguments_for_R_plotting()
436 |
437 | sys.stderr.write("Calling: smudgeplot_plot.R " + plot_args + "\n")
438 | system("smudgeplot_plot.R " + plot_args)
439 |
440 | sys.stderr.write("\nDone!\n")
441 | exit(0)
442 |
443 | if __name__=='__main__':
444 | main()
445 |
--------------------------------------------------------------------------------
/exec/smudgeplot_plot.R:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env Rscript
2 |
3 | suppressPackageStartupMessages(library("methods"))
4 | suppressPackageStartupMessages(library("argparse"))
5 | suppressPackageStartupMessages(library("viridis"))
6 |
7 | # suppressPackageStartupMessages(library("smudgeplot"))
8 |
9 | #################
10 | ### functions ###
11 | #################
12 | # retiring the smudgeplot R package
13 | get_col_ramp <- function(.args, delay = 0){
14 | colour_ramp <- eval(parse(text = paste0(.args$col_ramp,"(", 32 - delay, ")")))
15 | if (.args$invert_cols){
16 | colour_ramp <- rev(colour_ramp)
17 | }
18 | colour_ramp <- c(rep(colour_ramp[1], delay), colour_ramp)
19 | return(colour_ramp)
20 | }
21 |
22 | wtd.quantile <- function(x, q=0.25, weight=NULL) {
23 | o <- order(x)
24 | n <- sum(weight)
25 | order <- 1 + (n - 1) * q
26 | low <- pmax(floor(order), 1)
27 | high <- pmin(ceiling(order), n)
28 | low_contribution <- high - order
29 | allq <- approx(x=cumsum(weight[o])/sum(weight), y=x[o], xout = c(low, high)/n, method = "constant",
30 | f = 1, rule = 2)$y
31 | low_contribution * allq[1] + (1 - low_contribution) * allq[2]
32 | }
33 |
34 | wtd.iqr <- function(x, w=NULL) {
35 | wtd.quantile(x, q=0.75, weight=w) - wtd.quantile(x, q=0.25, weight=w)
36 | }
37 |
38 | plot_alt <- function(cov_tab, ylim, colour_ramp, log = F){
39 | A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB']
40 | cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2
41 | if (log){
42 | cov_tab[, 'freq'] <- log10(cov_tab[, 'freq'])
43 | }
44 | cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))]
45 |
46 | # c(bottom, left, top, right)
47 | par(mar=c(4.8,4.8,1,1))
48 | plot(NULL, xlim = c(0, 0.5), ylim = ylim,
49 | xlab = 'Normalized minor kmer coverage: B / (A + B)',
50 | ylab = 'Total coverage of the kmer pair: A + B', cex.lab = 1.4, bty = 'n')
51 | min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov']))
52 | nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab)
53 | return(0)
54 | }
55 |
56 | plot_one_coverage <- function(cov, cov_tab){
57 | cov_row_to_plot <- cov_tab[cov_tab[, 'total_pair_cov'] == cov, ]
58 | width <- 1 / (2 * cov)
59 | cov_row_to_plot$left <- cov_row_to_plot[, 'minor_variant_rel_cov'] - width
60 | cov_row_to_plot$right <- sapply(cov_row_to_plot[, 'minor_variant_rel_cov'], function(x){ min(0.5, x + width)})
61 | apply(cov_row_to_plot, 1, plot_one_box, cov)
62 | }
63 |
64 | plot_one_box <- function(one_box_row, cov){
65 | left <- as.numeric(one_box_row['left'])
66 | right <- as.numeric(one_box_row['right'])
67 | rect(left, cov - 0.5, right, cov + 0.5, col = one_box_row['col'], border = NA)
68 | }
69 |
70 | plot_isoA_line <- function (.covA, .L, .col = "black", .ymax = 250, .lwd, .lty) {
71 | min_covB <- .L # min(.cov_tab[, 'covB']) # should be L really
72 | max_covB <- .covA
73 | B_covs <- seq(min_covB, max_covB, length = 500)
74 | isoline_x <- B_covs/ (B_covs + .covA)
75 | isoline_y <- B_covs + .covA
76 | lines(isoline_x[isoline_y < .ymax], isoline_y[isoline_y < .ymax], lwd = .lwd, lty = .lty, col = .col)
77 | }
78 |
79 | plot_isoB_line <- function (.covB, .ymax, .col = "black", .lwd, .lty) {
80 | cov_range <- seq((2 * .covB) - 2, .ymax, length = 500)
81 | lines((.covB)/cov_range, cov_range, lwd = .lwd, lty = .lty, col = .col)
82 | }
83 |
84 | plot_iso_grid <- function(.cov, .L, .ymax, .col = 'black', .lwd = 2, .lty = 2){
85 | for (i in 0:15){
86 | cov <- (i + 0.5) * .cov
87 | plot_isoA_line(cov, .L = .L, .ymax = .ymax, .col, .lwd = .lwd, .lty = .lty)
88 | if (i < 8){
89 | plot_isoB_line(cov, .ymax, .col, .lwd = .lwd, .lty = .lty)
90 | }
91 | }
92 | }
93 |
94 | plot_expected_haplotype_structure <- function(.n, .peak_sizes,
95 | .adjust = F, .cex = 1.3, xmax = 0.49){
96 | .peak_sizes <- .peak_sizes[.peak_sizes[, 'size'] > 0.05, ]
97 | .peak_sizes[, 'ploidy'] <- nchar(.peak_sizes[, 'structure'])
98 |
99 | decomposed_struct <- strsplit(.peak_sizes[, 'structure'], '')
100 | .peak_sizes[, 'corrected_minor_variant_cov'] <- sapply(decomposed_struct, function(x){ sum(x == 'B') } ) / .peak_sizes[, 'ploidy']
101 | .peak_sizes[, 'label'] <- reduce_structure_representation(.peak_sizes[, 'structure'])
102 |
103 | bordercases <- .peak_sizes$corrected_minor_variant_cov == 0.5
104 |
105 | for(i in 1:nrow(.peak_sizes)){
106 | # xmax is in the middle of the last square in the 2d histogram,
107 | # which is too far from the edge, so I average it with 0.49,
108 | # which will pull the label a bit towards the edge
109 | text( ifelse( bordercases[i] & .adjust, (xmax + 0.49) / 2, .peak_sizes$corrected_minor_variant_cov[i]),
110 | .peak_sizes$ploidy[i] * .n, .peak_sizes[i, 'label'],
111 | offset = 0, cex = .cex, xpd = T, pos = ifelse( bordercases[i] & .adjust, 2, 1))
112 | }
113 | }
114 |
115 | reduce_structure_representation <- function(smudge_labels){
116 | structures_to_adjust <- (sapply(smudge_labels, nchar) > 4)
117 |
118 | if ( any(structures_to_adjust) ) {
119 | decomposed_struct <- strsplit(smudge_labels[structures_to_adjust], '')
120 | As <- sapply(decomposed_struct, function(x){ sum(x == 'A') } )
121 | Bs <- sapply(decomposed_struct, length) - As
122 | smudge_labels[structures_to_adjust] <- paste0(As, 'A', Bs, 'B')
123 | }
124 | return(smudge_labels)
125 | }
126 |
127 | plot_legend <- function(kmer_max, .colour_ramp, .log_scale = T){
128 | par(mar=c(0,0,2,1))
129 | plot.new()
130 | print_title <- ifelse(.log_scale, 'log k-mer pairs', 'k-mer pairs')
131 | title(print_title)
132 | for(i in 1:32){
133 | rect(0,(i - 0.01) / 33, 0.5, (i + 0.99) / 33, col = .colour_ramp[i])
134 | }
135 | # kmer_max <- max(smudge_container$dens)
136 | if( .log_scale == T ){
137 | for(i in 0:6){
138 | text(0.75, i / 6, rounding(10^(log10(kmer_max) * i / 6)), offset = 0)
139 | }
140 | } else {
141 | for(i in 0:6){
142 | text(0.75, i / 6, rounding(kmer_max * i / 6), offset = 0)
143 | }
144 | }
145 | }
146 |
147 | rounding <- function(number){
148 | if(number > 1000){
149 | round(number / 1000) * 1000
150 | } else if (number > 100){
151 | round(number / 100) * 100
152 | } else {
153 | round(number / 10) * 10
154 | }
155 | }
156 |
157 | #############
158 | ## SETTING ##
159 | #############
160 |
161 | parser <- ArgumentParser()
162 | parser$add_argument("--homozygous", action="store_true", default = F,
163 | help="Assume no heterozygosity in the genome - plotting a paralog structure; [default FALSE]")
164 | parser$add_argument("-i", "--input", default = "*_smu.txt",
165 | help="name of the input tsv file with coverages [default \"*_smu.txt\"]")
166 | parser$add_argument("-s", "--smudges", default = "not_specified",
167 | help="name of the input tsv file with annotated smudges and their respective sizes")
168 | parser$add_argument("-o", "--output", default = "smudgeplot",
169 | help="name pattern used for the output files (OUTPUT_smudgeplot.png, OUTPUT_summary.txt, OUTPUT_warnings.txt) [default \"smudgeplot\"]")
170 | parser$add_argument("-t", "--title",
171 | help="name printed at the top of the smudgeplot [default none]")
172 | parser$add_argument("-q", "--quantile_filt", type = "double",
173 | help="Remove kmer pairs with coverage over the specified quantile; [default none]")
174 | parser$add_argument("-n", "--n_cov", type = "double",
175 | help="the haploid coverage of the sequencing data [default inference from data]")
176 | parser$add_argument("-c", "-cov_filter", type = "integer",
177 | help="Filter pairs with one of them having coverage below the specified threshold [default 0]")
178 | parser$add_argument("-ylim", type = "integer",
179 | help="The upper limit for the coverage sum (the y axis)")
180 | parser$add_argument("-col_ramp", default = "viridis",
181 | help="A colour ramp available in your R session [viridis]")
182 | parser$add_argument("--invert_cols", action="store_true", default = F,
183 | help="Set this flag to invert the colours of the smudgeplot (dark for high, light for low densities)")
184 |
185 | args <- parser$parse_args()
186 |
187 | colour_ramp_log <- get_col_ramp(args, 16) # create palette for the log plots
188 | colour_ramp <- get_col_ramp(args) # create palette for the linear plots
189 |
190 | if ( !file.exists(args$input) ) {
191 | stop("Input file not found. Please use --help to get help", call.=FALSE)
192 | }
193 |
194 | cov_tab <- read.table(args$input, header = T) # col.names = c('covB', 'covA', 'freq', 'is_error'),
195 | smudge_tab <- read.table(args$smudges, col.names = c('structure', 'size'))
196 |
197 | # total coverage of the kmer pair
198 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 'covA'] + cov_tab[, 'covB']
199 | # calculate relative coverage of the minor allele
200 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 'covB'] / cov_tab[, 'total_pair_cov']
201 |
202 | ##### coverage filtering
203 |
204 | if ( !is.null(args$c) ){
205 | threshold <- args$c
206 | low_cov_filt <- cov_tab[, 'covA'] < threshold | cov_tab[, 'covB'] < threshold
207 | # smudge_warn(args$output, "Removing", sum(cov_tab[low_cov_filt, 'freq']),
208 | # "kmer pairs for which one of the pair had coverage below",
209 | # threshold, paste0("(Specified by argument -c ", args$c, ")"))
210 | cov_tab <- cov_tab[!low_cov_filt, ]
211 | # smudge_warn(args$output, "Processing", sum(cov_tab[, 'freq']), "kmer pairs")
212 | }
213 |
214 | ##### quantile filtering
215 | if ( !is.null(args$q) ){
216 | # quantile filtering (remove top q%, it's not really informative)
217 | threshold <- wtd.quantile(cov_tab[, 'total_pair_cov'], args$q, cov_tab[, 'freq'])
218 | high_cov_filt <- cov_tab[, 'total_pair_cov'] < threshold
219 | # smudge_warn(args$output, "Removing", sum(cov_tab[!high_cov_filt, 'freq']),
220 | # "kmer pairs with coverage higher than",
221 | # threshold, paste0("(", args$q, " quantile)"))
222 | cov_tab <- cov_tab[high_cov_filt, ]
223 | }
224 |
225 | cov <- args$n_cov
226 | if (cov == wtd.quantile(cov_tab[, 'total_pair_cov'], 0.95, cov_tab[, 'freq'])){
227 | ylim <- c(min(cov_tab[, 'total_pair_cov']), max(cov_tab[, 'total_pair_cov']))
228 | } else {
229 | ylim <- c(min(cov_tab[, 'total_pair_cov']) - 1, # or 0?
230 | min(max(100, 10*cov), max(cov_tab[, 'total_pair_cov'])))
231 | }
232 |
233 | xlim <- c(0, 0.5)
234 | error_fraction <- sum(cov_tab[, 'is_error'] * cov_tab[, 'freq']) / sum(cov_tab[, 'freq']) * 100
235 | error_string <- paste("err =", round(error_fraction, 1), "%")
236 | cov_string <- paste0("1n = ", cov)
237 |
238 | if (!is.null(args$ylim)){ # if ylim is specified, set the boundary by the argument instead
239 | ylim[2] <- args$ylim
240 | }
241 |
242 | fig_title <- ifelse(length(args$title) == 0, NA, args$title[1])
243 | # histogram_bins = max(30, args$nbins)
244 |
245 | ##########
246 | # LINEAR #
247 | ##########
248 | pdf(paste0(args$output,'_smudgeplot.pdf'))
249 |
250 | # layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
251 | layout(matrix(c(4,2,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
252 | # 1 smudge plot
253 | plot_alt(cov_tab, ylim, colour_ramp_log)
254 | if (cov > 0){
255 | plot_expected_haplotype_structure(cov, smudge_tab, T, xmax = 0.49)
256 | }
257 |
258 |
259 | # 4 legend
260 | plot_legend(max(cov_tab[, 'freq']), colour_ramp, F)
261 |
262 | ### add annotation
263 | # print smudge sizes
264 | plot.new()
265 | if (cov > 0){
266 | legend('topleft', bty = 'n', reduce_structure_representation(smudge_tab[,'structure']), cex = 1.1)
267 | legend('top', bty = 'n', legend = round(smudge_tab[,2], 2), cex = 1.1)
268 | legend('bottomleft', bty = 'n', legend = c(cov_string, error_string), cex = 1.1)
269 | } else {
270 | legend('bottomleft', bty = 'n', legend = error_string, cex = 1.1)
271 | }
272 |
273 | plot.new()
274 | mtext(bquote(italic(.(fig_title))), side=3, adj=0.1, line=-3, cex = 1.6)
275 |
276 |
277 | dev.off()
278 |
279 | ############
280 | # log plot #
281 | ############
282 |
283 | pdf(paste0(args$output,'_smudgeplot_log10.pdf'))
284 |
285 | layout(matrix(c(4,2,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
286 | # cov_tab[, 'freq'] <- log10(cov_tab[, 'freq'])
287 | # 1 smudge plot
288 | plot_alt(cov_tab, ylim, colour_ramp_log, log = T)
289 |
290 | if (cov > 0){
291 | plot_expected_haplotype_structure(cov, smudge_tab, T, xmax = 0.49)
292 | }
293 |
294 | # 4 legend
295 | plot_legend(max(cov_tab[, 'freq']), colour_ramp_log, T)
296 |
297 | # print smudge sizes
298 | plot.new()
299 | if (cov > 0){
300 | legend('topleft', bty = 'n', reduce_structure_representation(smudge_tab[,'structure']), cex = 1.1)
301 | legend('top', bty = 'n', legend = round(smudge_tab[,2], 2), cex = 1.1)
302 | legend('bottomleft', bty = 'n', legend = c(cov_string, error_string), cex = 1.1)
303 | } else {
304 | legend('bottomleft', bty = 'n', legend = error_string, cex = 1.1)
305 | }
306 |
307 |
308 | plot.new()
309 | mtext(bquote(italic(.(fig_title))), side=3, adj=0.1, line=-3, cex = 1.6)
310 |
311 | dev.off()
--------------------------------------------------------------------------------
/playground/BGA_tutorial.md:
--------------------------------------------------------------------------------
1 | ## Smudgeplot
2 |
3 | Have you ever sequenced something not well studied? Something that might show strange genomic signatures? Smudgeplot is a visualisation technique for whole-genome sequencing reads from a single individual. The technique is based on the idea of het-mers: k-mer pairs that are exactly one nucleotide away from each other while forming a unique pair in the sequencing dataset. These k-mer pairs are assumed to mostly represent the two alleles of a heterozygous site, but they can also capture pairs of imperfect paralogs, or sequencing errors paired up with a homozygous genomic k-mer. The ploidy predicted by smudgeplot is simply the ploidy with the highest number of k-mer pairs (whether that is a reasonable estimate must be evaluated for each individual case!).
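As a toy illustration of the het-mer idea (not the actual PloidyPlot implementation, which works on a FastK database), pairing up k-mers that differ at exactly one position and have a unique partner could be sketched like this:

```python
from itertools import combinations

def hamming_one(kmer_a, kmer_b):
    """True if two equally long k-mers differ at exactly one position."""
    return (len(kmer_a) == len(kmer_b)
            and sum(a != b for a, b in zip(kmer_a, kmer_b)) == 1)

def find_het_mers(kmer_counts):
    """Pair k-mers one substitution apart that form a unique pair.

    kmer_counts: dict mapping k-mer -> coverage. Returns (kmerA, kmerB)
    tuples with the higher-coverage k-mer first, keeping only k-mers
    with exactly one partner in the whole set.
    """
    partners = {kmer: [] for kmer in kmer_counts}
    for a, b in combinations(kmer_counts, 2):
        if hamming_one(a, b):
            partners[a].append(b)
            partners[b].append(a)
    pairs = set()
    for kmer, hits in partners.items():
        if len(hits) == 1 and len(partners[hits[0]]) == 1:
            a, b = sorted((kmer, hits[0]), key=kmer_counts.get, reverse=True)
            pairs.add((a, b))
    return sorted(pairs)
```

This brute-force pairing is quadratic in the number of k-mers; the C kernel avoids that by working on the indexed database, but the pairing criterion is the same idea.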
4 |
5 |
6 |
7 | ### Installing the software
8 |
9 | Open gitpod and install the development version of smudgeplot (the `sploidyplot` branch) and FastK.
10 |
11 | ```
12 | mkdir src bin && cd src # create directories for source code & binaries
13 | git clone -b sploidyplot https://github.com/KamilSJaron/smudgeplot
14 | git clone https://github.com/thegenemyers/FastK
15 | ```
16 |
17 | Now `make` installs the smudgeplot R package, compiles the C kernel for searching for k-mer pairs, and copies all the executables to `/workspace/bin/` (which will be our dedicated spot for executables).
18 |
19 | ```
20 | cd smudgeplot && make -s INSTALL_PREFIX=/workspace && cd ..
21 | cd FastK && make FastK Histex
22 | install -c Histex FastK /workspace/bin/
23 | ```
24 |
25 |
26 | ### 8 Datasets
27 |
28 | Species name and SRA/ENA ID:
29 | - Pseudoloma neurophilia: SRR926312
30 | - Tubulinosema ratisbonensis: ERR3154977
31 | - Nosema ceranae: SRR17317293
32 | - Nematocida ausubeli: SRR058692
33 | - Nematocida ausubeli: SRR350188
34 | - Hamiltosporidium tvaerminnensis: SRR16954898
35 | - Encephalitozoon hellem: SRR14017862
36 | - Agmasoma penaei: SRR926341
37 |
38 | TODO: get their URLs
39 |
40 | Finally, once your session is running, start downloading the data. For example:
41 |
42 | ```
43 | wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR926/SRR926341/SRR926341_[12].fastq.gz
44 | ```
45 |
46 | ### Constructing a database
47 |
48 | The whole process operates on raw or trimmed sequencing reads. From those we generate a k-mer database using [FastK](https://github.com/thegenemyers/FASTK). FastK is currently the fastest k-mer counter out there and the only one supported by the latest version of smudgeplot*. This database contains an index of all the k-mers and their coverages in the sequencing read set. Within this set the user must choose a threshold for excluding low-frequency k-mers that will be considered errors. That choice is not too difficult to make by looking at the k-mer spectrum. Among all the retained k-mers we then find all the het-mers and plot a 2D histogram.
49 |
50 |
51 | *Note: Previous versions of smudgeplot (up to 2.5.0) operated on k-mer "dump" flat files, which you can generate with any k-mer counter you like. As you can imagine, text files are very inefficient to operate on; the new version operates directly on the optimised k-mer database instead.
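The error threshold mentioned above is usually read off the k-mer spectrum by eye; a minimal sketch of automating that choice (a hypothetical helper, not part of smudgeplot) is to pick the first local minimum of the histogram:

```python
def choose_error_threshold(hist):
    """Return the coverage at the first local minimum of a k-mer spectrum.

    hist: list of (coverage, count) pairs sorted by coverage. K-mers with
    coverage below the returned value would be treated as sequencing errors.
    """
    counts = [count for _, count in hist]
    for i in range(1, len(counts) - 1):
        if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]:
            return hist[i][0]
    return hist[0][0]  # monotone spectrum: no clear error/genomic boundary
```

For a typical spectrum with a steep error slope followed by a genomic peak, this lands in the valley between the two.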
52 |
53 | ```
54 | FastK -v -t4 -k31 -M16 -T4 *.fastq.gz -NSRR8495097
55 | ```
56 |
57 | 20'
58 |
59 | 23:24
60 |
61 | one file also takes ~20'; it is mostly a function of the number of k-mers. We could maybe speed it up by choosing a higher -t?
62 |
63 | ### Getting the k-mer spectra out of it
64 |
65 | ```
66 | Histex -G SRR8495097 > SRR8495097_k31.hist
67 |
68 | | GeneScopeFK -o data/Pvir1/GenomeScopeFK/ -k 17
69 |
70 |
71 | # Run PloidyPlot to find all k-mer pairs in the dataset
72 | PloidyPlot -e12 -k -v -T4 -odata/Scer/kmerpairs data/Scer/FastK_Table
73 | # this generates the `data/Scer/kmerpairs_text.smu` file;
74 | # it's a flat file with three columns: covB, covA, and freq (the number of k-mer pairs with those respective coverages)
75 |
76 | # use the .smu file to infer ploidy and create smudgeplot
77 | smudgeplot.py plot -n 15 -t Sacharomyces -o data/Scer/trial_run data/Scer/kmerpairs_text.smu
78 | ```
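The `_text.smu` flat file described in the comments above is easy to parse yourself; a small sketch (assuming the stated column order covB, covA, freq and whitespace separation) that derives the two coordinates the smudgeplot is drawn in:

```python
def load_smu(lines):
    """Parse rows of a *_text.smu file (covB covA freq per line) and add
    the total pair coverage and the relative coverage of the minor k-mer."""
    table = []
    for line in lines:
        covB, covA, freq = (int(field) for field in line.split())
        total = covA + covB
        table.append({
            "covB": covB, "covA": covA, "freq": freq,
            "total_pair_cov": total,                # y axis of the smudgeplot
            "minor_variant_rel_cov": covB / total,  # x axis, B / (A + B)
        })
    return table

# e.g. with open("data/Scer/kmerpairs_text.smu") as fh:
#          cov_tab = load_smu(fh)
```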
79 |
--------------------------------------------------------------------------------
/playground/DEVELOPMENT.md:
--------------------------------------------------------------------------------
1 | # STANDARDS
2 |
3 | - spaces around operators
4 | - snake_case for variables and functions
5 | - camelCase for classes and methods
6 | - verbose naming is more important than a detailed documentation
7 | - R code is tested using `testthat` and python code in `dev` branch using `unittest`
8 |
9 | ## versioning
10 |
11 | - we try to keep `master` branch clean (i.e. production ready code).
12 | - `dev` branch should also be working code; however, mistakes are permitted here. Most of the development is done in sub-branches of the `dev` branch; once a feature is implemented, it should be merged into `dev`. I like to use `--no-ff` to keep a record of what was developed where (this might be a bad practice; if you know why, let me know).
13 | - One rule I would like to keep: there must be at least a 72-hour incubation period between commits being merged into `dev` and merging `dev` into `master`. The reason is simple: if there are any mistakes that were not spotted, there is a chance to catch them. It also takes a while to run all the tests and such (the travis.ci testing is not working yet, but I hope it will be quite soon).
14 |
15 | ## language
16 |
17 | The future is a `C` backend based on [FastK](https://github.com/thegenemyers/FASTK), inference and plotting in `R`, and a `python` user interface.
--------------------------------------------------------------------------------
/playground/alternative_fitting/README.md:
--------------------------------------------------------------------------------
1 | # Sploidyplot
2 |
3 | The goal is to have a smudge inference based on an explicit model. I hoped for a model that would make a lot of sense - based on negative binomials.
4 |
5 | Gene made his own EM algorithm, I think; I could not decipher it.
6 | Richard and Tianyi made me an EM algorithm that works simply on coverages A and B, also using normal distributions.
7 |
8 |
9 | ## alternative plotting
10 |
11 | A minimalist attempt
12 |
13 | ```R
14 | minidata <- daArtCamp1[daArtCamp1[, 'total_pair_cov'] < 20, ]
15 | coverages_to_plot <- unique(minidata[, 'total_pair_cov'])
16 | number_of_coverages_to_plot <- length(coverages_to_plot)
17 | mini_ylim <- c(5, 20)
18 | L <- 4
19 | cols <- c(rgb(1,0,0, 0.5), rgb(1,1,0, 0.5), rgb(0,1,1, 0.5), rgb(1,0,1, 0.5), rgb(0,1,0, 0.5), rgb(0,0,1, 0.5))
20 | # plot(1:6, pch = 20, cex = 5, col = cols)
21 |
22 | plot_dot_smudgeplot(minidata, rep('black', 32), xlim, mini_ylim, cex = 3)
23 | points((L - 1) / coverages_to_plot, coverages_to_plot, cex = 3, pch = 20, col = 'blue')
24 |
25 | for( cov in 8:19){
26 | rect(0, cov - 0.5, 0.5, cov + 0.5, col = NA, border = 'black')
27 | width <- 1 / (2 * cov)
28 | min_ratio <- L / cov
29 | rect(min_ratio - width, cov - 0.5, min(0.5, min_ratio + width), cov + 0.5, col = sample(cols, 1))
30 | }
31 |
32 | text(rep(0.05, number_of_coverages_to_plot), 8:19, 8:19)
33 | ```
34 |
35 | This is a more serious attempt that does not really work.
36 |
37 | Alternative local aggregation
38 |
39 | ```bash
40 | for ToLID in daAchMill1 daAchPtar1 daAdoMosc1 daAjuCham1 daAjuRept1 daArcMinu1 daArtCamp1 daArtMari1 daArtVulg1 daAtrBela1; do
41 | python3 playground/alternative_fitting/pair_clustering.py data/dicots/smu_files/$ToLID.k31_ploidy.smu.txt --mask_errors > data/dicots/peak_agregation/$ToLID.cov_tab_peaks
42 | Rscript playground/alternative_fitting/alternative_plotting_testing.R -i data/dicots/peak_agregation/$ToLID.cov_tab_peaks -o data/dicots/peak_agregation/$ToLID
43 | done
44 | ```
45 |
46 | This worked well. The aggregation produced beautiful blocks, mostly of the same shape, as noticed by Richard. He suggested we should fix their shape and fit only a single parameter: coverage.
47 |
48 | ```R
49 | smudge_tab <- read.table('data/dicots/peak_agregation/daArtMari1.cov_tab_peaks', col.names = c('covB', 'covA', 'freq', 'smudge'))
50 |
51 | all_smudges <- unique(smudge_tab[, 'smudge'])
52 | all_smudge_sizes <- sapply(all_smudges, function(x){ sum(smudge_tab[smudge_tab[, 'smudge'] == x, 'freq']) })
53 |
54 | # plot(sort(all_smudge_sizes, decreasing = T) / sum(all_smudge_sizes), ylim = c(0, 1))
55 | # sort(all_smudge_sizes, decreasing = T) / sum(all_smudge_sizes) > 0.02
56 | # 2% of the data sounds reasonable
57 |
58 | smudges <- all_smudges[all_smudge_sizes / sum(all_smudge_sizes) > 0.02 & all_smudges != 0]
59 | smudge_sizes <- all_smudge_sizes[all_smudge_sizes / sum(all_smudge_sizes) > 0.02 & all_smudges != 0]
60 |
61 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2]
62 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov']
63 |
64 |
65 | smudge_tab_reduced <- smudge_tab[smudge_tab[, 'smudge'] %in% smudges, ]
66 | source('playground/alternative_fitting/alternative_plotting_functions.R')
67 |
68 | per_smudge_cov_list <- lapply(smudges, function(x){ smudge_tab_reduced[smudge_tab_reduced$smudge == x, ] })
69 | names(per_smudge_cov_list) <- smudges
70 |
71 | cov_sum_summary <- sapply(per_smudge_cov_list, function(x){ summary(x[, 'total_pair_cov']) } )
72 | rel_cov_summary <- sapply(per_smudge_cov_list, function(x){ summary(x[, 'minor_variant_rel_cov']) } )
73 |
74 | colnames(cov_sum_summary) <- colnames(rel_cov_summary) <- smudges
75 |
76 | data.frame(smudges = smudges, total_pair_cov = round(cov_sum_summary[4, ], 1), minor_variant_rel_cov = round(rel_cov_summary[4, ], 3))
77 |
78 | head(per_smudge_cov_list[['2']])
79 |
80 | one_smudge <- per_smudge_cov_list[['2']]
81 | one_smudge[one_smudge[, 'minor_variant_rel_cov'] == 0.5, ]
82 |
83 | table(one_smudge[, 'covB'])
84 | (one_smudge[, 'minor_variant_rel_cov'])
85 |
86 |
87 |
88 | # cov_range <- seq((2 * .L) - 2, max_cov_pair, length = 500)
89 | # lines((.L - 1)/cov_range, cov_range, lwd = 2.5, lty = 2,
90 |
91 | plot_peakmap(smudge_tab_reduced, xlim = c(0, 0.5), ylim = c(0, 300))
92 |
93 | plot_seq_error_line(smudge_tab, 4)
94 | plot_seq_error_line(smudge_tab, 13)
95 | plot_seq_error_line(smudge_tab, 48)
96 | plot_seq_error_line(smudge_tab, 80)
97 |
98 | one_smudge <- per_smudge_cov_list[['4']]
99 | min(one_smudge[ ,'total_pair_cov'])
100 |
101 | one_smudge[one_smudge[ ,'total_pair_cov'] == 61, ]
102 | one_smudge <- one_smudge[order(one_smudge[, 'minor_variant_rel_cov']), ]
103 |
104 | right_part_of_the_smudge <- one_smudge[one_smudge[ ,'minor_variant_rel_cov'] > 0.2131147, ]
105 |
106 | all_minor_var_rel_covs <- sort(unique(round(right_part_of_the_smudge[, 'minor_variant_rel_cov'], 2)))
107 | corresponding_min_cov_sums <- sapply(all_minor_var_rel_covs, function(x){ min(right_part_of_the_smudge[round(right_part_of_the_smudge[, 'minor_variant_rel_cov'], 2) == x, 'total_pair_cov']) } )
108 |
109 | lines(all_minor_var_rel_covs, corresponding_min_cov_sums, lwd = 3, lty = 3, col = 'red')
110 |
111 | subtract_line <- function(rel_cov, cov_tab){
112 | approx_rel_cov = round(rel_cov, 2)
113 | band_covs = round(cov_tab[, 'minor_variant_rel_cov'], 2) == approx_rel_cov
114 | cov_tab[band_covs, ][which.min(cov_tab[band_covs, 'total_pair_cov']), ]
115 | }
116 |
117 | edge_points <- t(sapply(all_minor_var_rel_covs, subtract_line, right_part_of_the_smudge))
118 | total_pair_cov <- sapply(1:29, function(x){edge_points[[x,5]]})
119 | minor_variant_rel_cov <- sapply(1:29, function(x){edge_points[[x,6]]})
120 | lm(total_pair_cov ~ minor_variant_rel_cov + I(minor_variant_rel_cov^2))
121 |
122 | plot_isoA_line <- function (.covA, .cov_tab, .col = "black") {
123 | min_covB <- min(.cov_tab[, 'covB']) # should be L really
124 | max_covB <- .covA
125 | B_covs <- seq(min_covB, max_covB, length = 500)
126 | lines(B_covs/ (B_covs + .covA), B_covs + .covA, lwd = 2.5, lty = 2,
127 | col = .col)
128 | }
129 |
130 | plot_isoA_line(48, smudge_tab, 'blue')
131 | plot_isoA_line(79, smudge_tab, 'blue')
132 | plot_isoA_line(110, smudge_tab, 'blue')
133 | plot_isoA_line(141, smudge_tab, 'blue')
134 | plot_isoA_line(172, smudge_tab, 'blue')
135 |
136 | ```
137 |
138 | HA, looks great! Let's plot it on the background...
139 |
140 | ```R
141 | library(smudgeplot)
142 | source('playground/alternative_fitting/alternative_plotting_functions.R')
143 | colour_ramp <- viridis(32)
144 |
145 |
146 | smudge_tab <- read.table('data/dicots/peak_agregation/daAchMill1.cov_tab_errors', col.names = c('covB', 'covA', 'freq', 'is_error'))
147 | # smudge_tab <- read.table('data/dicots/peak_agregation/daArtMari1.cov_tab_peaks', col.names = c('covB', 'covA', 'freq', 'smudge'))
148 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2]
149 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov']
150 | cov = 31.1 # this is from GenomeScope this time
151 |
152 | plot_alt(smudge_tab[smudge_tab[, 'is_error'] != 0, ], c(0, 100), colour_ramp, T)
153 | plot_alt(smudge_tab[smudge_tab[, 'is_error'] != 1, ], c(0, 100), colour_ramp, T)
154 | plot_alt(smudge_tab, c(0, 100), colour_ramp, T)
155 | plot_iso_grid(31.1, 4, 100)
156 | plot_smudge_labels(18.1, 100)
157 | # .peak_points, .peak_sizes, .min_kmerpair_cov, .max_kmerpair_cov, col = "red"
158 | dev.off()
159 |
160 | plot_iso_grid()
161 |
162 | plot_smudge_labels(cov, 240)
163 | text(0.49, cov / 2, "2err", offset = 0, cex = 1.3, xpd = T, pos = 2)
164 | ```
165 |
166 | Say we will test ploidy up to 16 (capturing up to octoploid paralogs). That makes
167 |
168 | ```R
169 | smudge_tab_with_err <- read.table('data/dicots/peak_agregation/daAchMill1.cov_tab_errors', col.names = c('covB', 'covA', 'freq', 'is_error'))
170 |
171 | smudge_filtering_threshold <- 0.01 # at least 1% of genomic kmers
172 | colour_ramp <- viridis(32)
173 |
174 | # # error band, done on non filtered data
175 | # smudge_tab[, 'edgepoint'] <- F
176 | # smudge_tab[smudge_tab[, 'covB'] < L + 3, 'edgepoint'] <- T
177 | # plot_alt(smudge_tab[smudge_tab[, 'edgepoint'], ], c(0, 500), colour_ramp, T)
178 |
179 | cov <- 19.55 # this will be unknown
180 | L <- min(smudge_tab_with_err[, 'covB'])
181 | smudge_tab <- smudge_tab_with_err[smudge_tab_with_err[, 'is_error'] == 0, ]
182 | genomic_kmer_pairs <- sum(smudge_tab[ ,'freq'])
183 |
184 | plot_alt(smudge_tab, c(0, 300), colour_ramp, T)
185 |
186 | smudge_tab[, 'total_pair_cov'] <- smudge_tab[, 1] + smudge_tab[, 2]
187 | smudge_tab[, 'minor_variant_rel_cov'] <- smudge_tab[, 1] / smudge_tab[, 'total_pair_cov']
188 |
189 | plot_alt(smudge_tab, c(0, 300), colour_ramp)
190 | plot_all_smudge_labels(cov, 300)
191 | dev.off()
192 |
193 | #### isolating all smudges given cov
194 |
195 | # total_genomic_kmer_pairs <- sum(smudge_tab[, 'freq'])
196 |
197 | # plot_alt(smudge_container[[1]], c(0, 300), colour_ramp, T)
198 | # looks good!
199 |
200 | # two functions need to be sources from the smudgeplot package here
201 | covs_to_test <- seq(10.05, 60.05, by = 0.1)
202 | centrality_grid <- sapply(covs_to_test, run_replicate, smudge_tab, smudge_filtering_threshold)
203 | covs_to_test[which.max(centrality_grid)]
204 |
205 | sapply(c(21.71, 21.72, 21.73), run_replicate, smudge_tab, smudge_filtering_threshold)
206 | # 21.72 is our winner!
207 |
208 | tested_covs <- test_grid_of_coverages(smudge_tab, smudge_filtering_threshold, min_to_explore, max_to_explore)
209 | plot(tested_covs[, 'cov'], tested_covs[, 'centrality'])
210 |
211 | ```
212 |
213 | Fixing the main package
214 |
215 | ```bash
216 | for smu_file in data/dicots/smu_files/*.k31_ploidy.smu.txt; do
217 | ToLID=$(basename $smu_file .k31_ploidy.smu.txt);
218 | smudgeplot.py all $smu_file -o data/dicots/grid_fits/$ToLID
219 | done
220 |
221 |
222 | ```
223 |
224 |
225 | ## Homopolymer compressed testing
226 |
227 | Datasets with lots of errors. Saccharomyces will do.
228 |
229 | ```
230 | FastK -v -c -t4 -k31 -M16 -T4 data/Scer/SRR3265401_[12].fastq.gz -Ndata/Scer/FastK_Table_hc
231 | hetmers -e4 -k -v -T4 -odata/Scer/kmerpairs_hc data/Scer/FastK_Table_hc
232 |
233 | smudgeplot.py hetmers -L 4 -t 4 -o data/Scer/kmerpairs_default_e --verbose data/Scer/FastK_Table
234 |
235 | smudgeplot.py all -o data/Scer/homopolymer_e4_wo data/Scer/kmerpairs_default_e_text.smu
236 |
237 | smudgeplot.py all -o data/Scer/homopolymer_e4_with data/Scer/kmerpairs_hc_text.smu
238 | ```
239 |
240 | ## Other
241 |
242 |
243 | ### .smu to smu.txt
244 |
245 | For the legacy `.smu` files, we have a converter to flat files.
246 |
247 | ```bash
248 | gcc src_ploidyplot/smu2text_smu.c -o exec/smu2text_smu
249 | exec/smu2text_smu data/ddSalArbu1/ddSalArbu1.k31_ploidy.smu | less
250 | ```
--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plot_covA_covB.R:
--------------------------------------------------------------------------------
1 | library(ggplot2)
2 |
3 | plot_unsquared_smudgeplot <- function(cov_tab, colour_ramp, xlim, ylim){
4 | # this is the adjustment for plotting
5 | # cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] <- cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] * 2
6 | cov_tab$col = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))]
7 |
8 | plot(NULL, xlim = xlim, ylim = ylim,
9 | xlab = 'covA',
10 | ylab = 'covB', cex.lab = 1.4)
11 |
12 | ## This might bite me in the a.., instead of taking L as an argument, I guess it from the data
13 | # L = floor(min(cov_tab_daAchMill1[, 'total_pair_cov']) / 2)
14 | ggplot(cov_tab, aes(x=covA, y=covB, weight = weight)) +
15 | geom_bin2d() +
16 | theme_bw()
17 |
18 | }
19 |
20 | real_clean <- read.table('data/Fiin/kmerpairs_k51_text.smu', col.names = c('covA', 'covB', 'freq'))
21 | real_clean$weight <- real_clean$freq / sum(real_clean$freq)
22 |
23 | # plot(real_clean[, 'covA'], real_clean[, 'covB'])
24 |
25 | xlim <- range(real_clean[, 'covA'])
26 | ylim <- range(real_clean[, 'covB'])
27 |
28 | library(smudgeplot)
29 | args <- list()
30 | args$col_ramp <- 'viridis'
31 | args$invert_cols <- F
32 | colour_ramp <- get_col_ramp(args)
33 | real_clean$col <- colour_ramp[1 + round(31 * real_clean$freq / max(real_clean$freq))]
34 |
35 | plot(NULL, xlim = xlim, ylim = ylim,
36 | xlab = 'covA',
37 | ylab = 'covB', cex.lab = 1.4)
38 |
39 | ggplot(real_clean, aes(x=covA, y=covB, weight = weight)) +
40 | geom_bin2d() +
41 | theme_bw()
42 |
43 | head(real_clean)
44 |
45 |
46 | # plotSquare <- function(row){
47 | # rect(as.numeric(row['covA']) - 0.5, as.numeric(row['covB']) - 0.5, as.numeric(row['covA']) + 0.5, as.numeric(row['covB']) + 0.5, col = row['col'], border = NA)
48 | # }
49 | # apply(real_clean, 1, plotSquare)
50 |
--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plotting.R:
--------------------------------------------------------------------------------
3 |
4 |
5 |
6 | library(smudgeplot)
7 | source('playground/alternative_plotting_functions.R')
8 |
9 | cov_tab_daAchMill1 <- read.table('data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt', col.names = c('covB', 'covA', 'freq'))
10 | ylim = c(0, 250)
11 |
12 | xlim = c(0, 0.5)
13 |
14 |
15 |
16 | cov_tab_daAchMill1[, 'total_pair_cov'] <- cov_tab_daAchMill1[, 1] + cov_tab_daAchMill1[, 2]
17 | cov_tab_daAchMill1[, 'minor_variant_rel_cov'] <- cov_tab_daAchMill1[, 1] / cov_tab_daAchMill1[, 'total_pair_cov']
18 |
19 | args <- list()
20 | args$col_ramp <- 'viridis'
21 | args$invert_cols <- F
22 | colour_ramp <- get_col_ramp(args)
23 | cols = colour_ramp[1 + round(31 * cov_tab_daAchMill1$freq / max(cov_tab_daAchMill1$freq))]
24 |
25 | # solving the "density" problem: a (cov1, cov1) pair has half the probability of a (cov1, cov2) pair; we need to double these points, but that needs to be corrected for in the fit / summaries
26 |
27 | plot_dot_smudgeplot(cov_tab_daAchMill1, colour_ramp, xlim, ylim)
28 |
29 | plot_unsquared_smudgeplot(cov_tab_daAchMill1, colour_ramp, xlim, ylim)
30 |
31 | # plots the line where there will be nothing
32 | plot_seq_error_line(cov_tab_daAchMill1, .col = 'red')
33 |
34 | head(cov_tab_daAchMill1[order(cov_tab_daAchMill1[,'total_pair_cov']), ], 12)
35 | colour_ramp
36 | 3 / 8:13
37 |
38 | ####
39 |
40 | cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T)
41 | head(cov_tab_Fiin_ideal)
42 |
43 | xlim = c(0, 0.5)
44 | ylim = c(0, max(cov_tab_Fiin_ideal[, 'total_pair_cov']))
45 |
46 | plot_dot_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
47 |
48 | plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
49 |
--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plotting_functions.R:
--------------------------------------------------------------------------------
1 | plot_alt <- function(cov_tab, ylim, colour_ramp, logscale = F){
2 | A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB']
3 | cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2
4 | if (logscale){
5 | cov_tab[, 'freq'] <- log10(cov_tab[, 'freq'])
6 | }
7 | cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))]
8 |
9 | plot(NULL, xlim = c(0, 0.5), ylim = ylim,
10 | xlab = 'Normalized minor kmer coverage: B / (A + B)',
11 | ylab = 'Total coverage of the kmer pair: A + B', cex.lab = 1.4)
12 | min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov']))
13 | nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab)
14 | return(0)
15 | }
16 |
17 | plot_one_coverage <- function(cov, cov_tab){
18 | cov_row_to_plot <- cov_tab[cov_tab[, 'total_pair_cov'] == cov, ]
19 | width <- 1 / (2 * cov)
20 | cov_row_to_plot$left <- cov_row_to_plot[, 'minor_variant_rel_cov'] - width
21 | cov_row_to_plot$right <- sapply(cov_row_to_plot[, 'minor_variant_rel_cov'], function(x){ min(0.5, x + width)})
22 | apply(cov_row_to_plot, 1, plot_one_box, cov)
23 | }
24 |
25 | plot_one_box <- function(one_box_row, cov){
26 | left <- as.numeric(one_box_row['left'])
27 | right <- as.numeric(one_box_row['right'])
28 | rect(left, cov - 0.5, right, cov + 0.5, col = one_box_row['col'], border = NA)
29 | }
30 |
31 | plot_dot_smudgeplot <- function(cov_tab, colour_ramp, xlim, ylim, background_col = 'grey', cex = 0.4){
32 | # this is the adjustment for plotting
33 | cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] <- cov_tab[cov_tab$covA == cov_tab$covB, 'freq'] * 2
34 | cov_tab$col = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))]
35 |
36 | plot(NULL, xlim = xlim, ylim = ylim, xlab = 'Normalized minor kmer coverage: B / (A + B)',
37 | ylab = 'Total coverage of the kmer pair: A + B')
38 | rect(xlim[1], ylim[1], xlim[2], ylim[2], col = background_col, border = NA)
39 | points(cov_tab[, 'minor_variant_rel_cov'], cov_tab[, 'total_pair_cov'], col = cov_tab$col, pch = 20, cex = cex)
40 | }
41 |
42 | plot_peakmap <- function(cov_tab, xlim, ylim, background_col = 'grey', cex = 2){
43 | # this is the adjustment for plotting
44 | plot(NULL, xlim = xlim, ylim = ylim, xlab = 'Normalized minor kmer coverage: B / (A + B)',
45 | ylab = 'Total coverage of the kmer pair: A + B')
46 | points(cov_tab[, 'minor_variant_rel_cov'], cov_tab[, 'total_pair_cov'], col = cov_tab$smudge, pch = 20, cex = cex)
47 | legend('bottomleft', col = 1:8, pch = 20, title = 'smudge', legend = 1:8)
48 | }
49 |
50 | plot_seq_error_line <- function (.cov_tab, .L = NA, .col = "black") {
51 | if (is.na(.L)) {
52 | .L <- min(.cov_tab[, "covB"])
53 | }
54 | max_cov_pair <- max(.cov_tab[, "total_pair_cov"])
55 | cov_range <- seq((2 * .L) - 2, max_cov_pair, length = 500)
56 | lines((.L - 1)/cov_range, cov_range, lwd = 2.5, lty = 2,
57 | col = .col)
58 | }
59 |
60 | plot_isoA_line <- function (.covA, .L, .col = "black", .ymax = 250, .lwd, .lty) {
61 | min_covB <- .L # min(.cov_tab[, 'covB']) # should be L really
62 | max_covB <- .covA
63 | B_covs <- seq(min_covB, max_covB, length = 500)
64 | isoline_x <- B_covs/ (B_covs + .covA)
65 | isoline_y <- B_covs + .covA
66 | lines(isoline_x[isoline_y < .ymax], isoline_y[isoline_y < .ymax], lwd = .lwd, lty = .lty, col = .col)
67 | }
68 |
69 | plot_isoB_line <- function (.covB, .ymax, .col = "black", .lwd, .lty) {
70 | cov_range <- seq((2 * .covB) - 2, .ymax, length = 500)
71 | lines((.covB)/cov_range, cov_range, lwd = .lwd, lty = .lty, col = .col)
72 | }
73 |
74 | plot_iso_grid <- function(.cov, .L, .ymax, .col = 'black', .lwd = 2, .lty = 2){
75 | for (i in 0:15){
76 | cov <- (i + 0.5) * .cov
77 | plot_isoA_line(cov, .L = .L, .ymax = .ymax, .col, .lwd = .lwd, .lty = .lty)
78 | if (i < 8){
79 | plot_isoB_line(cov, .ymax, .col, .lwd = .lwd, .lty = .lty)
80 | }
81 | }
82 | }
83 |
84 | plot_smudge_labels <- function(cov_est, ymax, xmax = 0.49, .cex = 1.3, .L = 4){
85 | for (As in 1:(floor(ymax / cov_est) - 1)){
86 | label <- paste0(As, "Aerr")
87 | text(.L / (As * cov_est), (As * cov_est) + .L, label,
88 | offset = 0, cex = .cex, xpd = T, pos = ifelse(As == 1, 3, 4))
89 | }
90 | for (ploidy in 2:floor(ymax / cov_est)){
91 | for (Bs in 1:floor(ploidy / 2)){
92 | As = ploidy - Bs
93 | label <- paste0(As, "A", Bs, "B")
94 | text(ifelse(As == Bs, (xmax + 0.49)/2, Bs / ploidy), ploidy * cov_est, label,
95 | offset = 0, cex = .cex, xpd = T,
96 | pos = ifelse(As == Bs, 2, 1))
97 | }
98 | }
99 | }
100 |
101 | create_smudge_container <- function(cov, cov_tab, smudge_filtering_threshold){
102 | smudge_container <- list()
103 | total_genomic_kmers <- sum(cov_tab[ , 'freq'])
104 |
105 | for (Bs in 1:8){
106 | cov_tab_isoB <- cov_tab[cov_tab[ , 'covB'] > cov * ifelse(Bs == 1, 0, Bs - 0.5) & cov_tab[ , 'covB'] < cov * (Bs + 0.5), ]
107 | # cov_tab_isoB[, 'Bs'] <- Bs
108 | cov_tab_isoB[, 'As'] <- round(cov_tab_isoB[, 'covA'] / cov) # these are the individual smudge cutouts given the coverage
109 | cov_tab_isoB[cov_tab_isoB[, 'As'] == 0, 'As'] = 1
110 | for( As in Bs:(16 - Bs)){
111 | cov_tab_one_smudge <- cov_tab_isoB[cov_tab_isoB[, 'As'] == As, ]
112 | if (sum(cov_tab_one_smudge[, 'freq']) / total_genomic_kmers > smudge_filtering_threshold){
113 | label <- paste0(As, "A", Bs, "B")
114 | smudge_container[[label]] <- cov_tab_one_smudge[,-which(names(cov_tab_one_smudge) %in% c('is_error', 'As'))]
115 | }
116 | }
117 | }
118 | return(smudge_container)
119 | }
--------------------------------------------------------------------------------
/playground/alternative_fitting/alternative_plotting_testing.R:
--------------------------------------------------------------------------------
1 | library(smudgeplot)
2 | library(argparse)
3 | source('playground/alternative_fitting/alternative_plotting_functions.R')
4 |
5 | parser <- ArgumentParser()
6 | parser$add_argument("-i", "-infile", help="Input file")
7 | parser$add_argument("-o", "-outfile", help="Output file")
8 | args <- parser$parse_args()
9 |
10 | args$col_ramp <- 'viridis'
11 | args$invert_cols <- F
12 |
13 | # cov_tab_daAchMill1 <- read.table('data/dicots/smu_files/daAchMill1.k31_ploidy.smu.txt', col.names = c('covB', 'covA', 'freq'))
14 | # cov_tab <- read.table(args$file, col.names = c('covB', 'covA', 'freq'))
15 | # args <- list()
16 | # args$i <- 'data/ddSalArbu1/ddSalArbu1.k31_ploidy_converted.smu_with_peaks.txt'
17 | # args$o <- 'data/ddSalArbu1/smudge_with_peaks'
18 |
19 | cov_tab <- read.table(args$i, col.names = c('covB', 'covA', 'freq','peak'))
20 |
21 | xlim = c(0, 0.5)
22 | ylim = c(0, 300)
23 |
24 |
25 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 1] + cov_tab[, 2]
26 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 1] / cov_tab[, 'total_pair_cov']
27 |
28 | colour_ramp <- viridis(32)
29 | colour_ramp_log <- get_col_ramp(args, 16)
30 | # cols = colour_ramp[1 + round(31 * cov_tab$freq / max(cov_tab$freq))]
31 |
32 | # solving the "density" problem: a (cov1, cov1) pair has half the probability of a (cov1, cov2) pair; we need to double these points, but that needs to be corrected for in the fit / summaries
33 |
34 | plot_dot_smudgeplot(cov_tab, colour_ramp, xlim, ylim)
35 |
36 | pdf(paste0(args$o, '_background.pdf'))
37 | plot_alt(cov_tab, ylim, colour_ramp, F)
38 | dev.off()
39 |
40 | pdf(paste0(args$o, '_log_background.pdf'))
41 | plot_alt(cov_tab, ylim, colour_ramp, T)
42 | dev.off()
43 |
44 | pdf(paste0(args$o, '_peaks.pdf'))
45 | plot_peakmap(cov_tab, xlim = xlim, ylim = ylim)
46 | dev.off()
47 |
48 | # plots the line where there will be nothing
49 | # plot_seq_error_line(cov_tab, .col = 'red')
50 |
51 | # head(cov_tab[order(cov_tab[,'total_pair_cov']), ], 12)
52 | # colour_ramp
53 | # 3 / 8:13
54 |
55 | ####
56 |
57 | # cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T)
58 | # head(cov_tab_Fiin_ideal)
59 |
60 | # xlim = c(0, 0.5)
61 | # ylim = c(0, max(cov_tab_Fiin_ideal[, 'total_pair_cov']))
62 |
63 | # pdf('data/Fiin/idealised/straw_plot_test3.pdf')
64 | # plot_dot_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
65 | # dev.off()
66 |
67 | # pdf('data/Fiin/idealised/straw_plot_test2.pdf')
68 | # plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, xlim, ylim)
69 | # dev.off()
70 |
71 | # # testing the packaged version
72 |
73 | # library(smudgeplot)
74 | # args <- list()
75 | # args$col_ramp <- 'viridis'
76 | # args$invert_cols <- F
77 | # colour_ramp <- get_col_ramp(args)
78 |
79 | # cov_tab_Fiin_ideal <- read.table('data/Fiin/idealised/kmerpairs_idealised_with_transformations.tsv', header = T)
80 |
81 | # pdf('data/Fiin/idealised/straw_plot_test.pdf')
82 | # plot_alt(cov_tab_Fiin_ideal, c(50, 700), colour_ramp)
83 | # dev.off()
84 |
85 | # source('playground/alternative_fitting/alternative_plotting_functions.R')
86 |
87 | # pdf('data/Fiin/idealised/straw_plot_test2.pdf')
88 | # plot_unsquared_smudgeplot(cov_tab_Fiin_ideal, colour_ramp, c(0, 0.5), c(50, 700))
89 | # dev.off()
90 |
91 |
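For reference, the covB/covA transformation computed on lines 25-26 above is simple enough to sketch standalone (a minimal Python illustration with made-up coverages; `transform` is a hypothetical helper name, not part of the script):

```python
def transform(covB, covA):
    """Map a k-mer pair's coverages (covB <= covA) to smudgeplot coordinates:
    the relative coverage of the minor variant and the total pair coverage."""
    total = covA + covB
    return covB / total, total

print(transform(25, 50))  # AAB-like pair -> (0.3333..., 75)
print(transform(50, 50))  # balanced AB pair sits at the x = 0.5 edge -> (0.5, 100)
```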
--------------------------------------------------------------------------------
/playground/alternative_fitting/pair_clustering.py:
--------------------------------------------------------------------------------
1 |
2 | # cov2freq = defaultdict(covA, covB) -> freq
3 | # cov2peak = dict(covA, covB) -> peak
4 | # dict(peak) -> summit (if relevant)
5 | # import numpy as np
6 |
7 | import argparse
8 | from pandas import read_csv # type: ignore
9 | from collections import defaultdict
10 | from statistics import mean
11 | # import matplotlib.pyplot as plt
12 |
13 | ####
14 |
15 | parser = argparse.ArgumentParser()
16 | parser.add_argument('infile', nargs='?', help='name of the input tsv file with coverages and frequencies.')
17 | parser.add_argument('-nf', '-noise_filter', help='Do not aggregate k-mer pairs with frequency lower than this parameter into smudges', type=int, default=50)
18 | parser.add_argument('-d', '-distance', help='Manhattan distance within which k-mer pairs are considered neighbouring for the local aggregation.', type=int, default=5)
19 | parser.add_argument('--mask_errors', help='instead of reporting assignments to individual smudges, just remove all monotonically decreasing points from the error line', action="store_true", default = False)
20 | args = parser.parse_args()
21 |
22 | ### what should be arguments at some point
23 | # smu_file = 'data/ddSalArbu1/ddSalArbu1.k31_ploidy_converted.smu.txt'
24 | # distance = 5
25 | # noise_filter = 100
26 |
27 | smu_file = args.infile
28 | distance = args.d
29 | noise_filter = args.nf
30 |
31 | ### load data
32 | # cov_tab = np.loadtxt(smu_file, dtype=int)
33 | cov_tab = read_csv(smu_file, names = ['covB', 'covA', 'freq'], sep='\t')
34 | cov_tab = cov_tab.sort_values('freq', ascending = False)
35 | L = min(cov_tab['covB']) # important only when --mask_errors is on
36 |
37 | # generate a dictionary that gives us for each combination of coverages a frequency
38 | cov2freq = defaultdict(int)
39 | cov2peak = defaultdict(int)
40 | # for idx, covB, covA, freq in cov_tab.itertuples():
41 | # cov2freq[(covA, covB)] = freq
42 | # I can create this one while I iterate through the data, though
43 |
44 | # plt.hist(means)
45 | # plt.hist([x for x in means if x < 100 and x > -100])
46 | # plt.show()
47 |
48 | ### clustering
49 | next_peak = 1
50 | for idx, covB, covA, freq in cov_tab.itertuples():
51 | cov2freq[(covA, covB)] = freq # with this I can get rid of lines 23 24 that pre-makes this table
52 | if freq < noise_filter:
53 | break
54 | highest_neigbour_coords = (0, 0)
55 | highest_neigbour_freq = 0
56 | # for each kmer pair I will retrieve all neibours (Manhattan distance)
57 | for xA in range(covA - distance,covA + distance + 1):
58 | # for explored A coverage in neiborhood, we explore all possible B coordinates
59 | distanceB = distance - abs(covA - xA)
60 | for xB in range(covB - distanceB,covB + distanceB + 1):
61 | xB, xA = sorted([xA, xB]) # this is to make sure xB is smaller than xA
62 | # iterating only though those that were assigned already
63 | # and recroding only the one with highest frequency
64 | if cov2peak[(xA, xB)] and cov2freq[(xA, xB)] > highest_neigbour_freq:
65 | highest_neigbour_coords = (xA, xB)
66 | highest_neigbour_freq = cov2freq[(xA, xB)]
67 | if highest_neigbour_freq > 0:
68 | cov2peak[(covA, covB)] = cov2peak[(highest_neigbour_coords)]
69 | else:
70 | # print("new peak:", (covA, covB))
71 | if args.mask_errors:
72 | if covB < L + args.d:
73 | cov2peak[(covA, covB)] = 1 # error line
74 | else:
75 | cov2peak[(covA, covB)] = 0 # central smudges
76 | else:
77 | cov2peak[(covA, covB)] = next_peak # if I want to keep info about all locally agregated smudges
78 | next_peak += 1
79 |
80 | cov_tab = cov_tab.sort_values(['covA', 'covB'], ascending = True)
81 | for idx, covB, covA, freq in cov_tab.itertuples():
82 |     print(covB, covA, freq, cov2peak[(covA, covB)])
83 |     # if idx > 20:
84 |     #     break
85 |
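The neighbourhood scan in the clustering loop above can be isolated into a small generator (a hedged sketch; `manhattan_neighbours` is a hypothetical name, not part of the script):

```python
def manhattan_neighbours(covA, covB, distance):
    """Yield (higher, lower) coordinate pairs within Manhattan distance
    |covA - xA| + |covB - xB| <= distance, mirroring the double loop above."""
    for xA in range(covA - distance, covA + distance + 1):
        # the remaining budget along B shrinks with the distance spent along A
        distanceB = distance - abs(covA - xA)
        for xB in range(covB - distanceB, covB + distanceB + 1):
            yield (max(xA, xB), min(xA, xB))  # keys are normalised as (larger, smaller)

neigh = set(manhattan_neighbours(100, 50, 2))
assert (100, 50) in neigh      # the point itself
assert (102, 50) in neigh      # two steps along A
assert (101, 51) in neigh      # one step along each axis
assert (102, 51) not in neigh  # Manhattan distance 3 is outside
```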
--------------------------------------------------------------------------------
/playground/interactive_plot_strawberry_full_kmer_families_fooling_around.R:
--------------------------------------------------------------------------------
1 | library("methods")
2 | library("argparse")
3 | library("smudgeplot")
4 | # library("hexbin")
5 |
6 | # preprocessing
7 | # to simply get the number of members / family (exploration)
8 | # cat data/strawberry_iinumae/kmer_counts_L109_U.tsv | cut -f 1 > data/strawberry_iinumae/kmer_counts_L109_U_family_members.tsv
9 | # awk '{row_sum = 0; row_max = 0; row_min = 10000; for (i=2; i <= NF; i++){ row_sum += $i; if ($i > row_max){row_max = $i} if ($i < row_min){row_min = $i} } print row_sum "\t" row_min "\t" row_max }' data/strawberry_iinumae/kmer_counts_L109_U.tsv > data/strawberry_iinumae/kmer_counts_L109_U_sums_min_max.tsv
10 | # (exploration)
11 | #
12 | #
13 |
14 | args <- ArgumentParser()$parse_args()
15 | args$homozygous <- F
16 | args$input <- 'data/Fiin/kmerpairs_k51_text.smu'
17 | args$output = './data/Fiin/testrun'
18 | args$title = 'F. iinumae'
19 | args$nbins <- 40
20 | args$L <- NULL
21 | args$n_cov <- NULL
22 | args$k <- 21
23 |
24 |
--------------------------------------------------------------------------------
/playground/more_away_pairs.py:
--------------------------------------------------------------------------------
1 | def get_2away_pairs(local_index_to_kmer, k):
2 |     """local_index_to_kmer is a dictionary where the value is a kmer portion, and the key is the index of the original kmer in which the kmer portion is found. get_2away_pairs returns a list of pairs where each pair of indices corresponds to a pair of kmers different in exactly two bases. Relies on combinations (itertools), defaultdict (collections), kmer_to_int and get_1away_pairs being available in the enclosing module."""
3 | 
4 |     #These are the base cases for the recursion. If k==1, the kmers obviously can't differ in exactly two bases, so return an empty list. If k==2, return every pair of indices where the kmers at those indices differ at exactly two bases.
5 |     if k == 1:
6 |         return []
7 |     if k == 2:
8 |         return [(i, j) for (i, j) in combinations(local_index_to_kmer, 2) if local_index_to_kmer[i][0] != local_index_to_kmer[j][0] and local_index_to_kmer[i][1] != local_index_to_kmer[j][1]]
9 | 
10 |     #Get the two halves of the kmer
11 |     k_L = k//2
12 |     k_R = k-k_L
13 | 
14 |     #initialize dictionaries in which the key is the hash of half of the kmer, and the value is a list of indices of the kmers with that same hash
15 |     kmer_L_hashes = defaultdict(list)
16 |     kmer_R_hashes = defaultdict(list)
17 | 
18 |     #initialize pairs, which will be returned by get_2away_pairs
19 |     pairs = []
20 | 
21 |     #initialize dictionaries containing the left halves and the right halves (since we will have to check cases where the left half differs by 1 and the right half differs by 1)
22 |     local_index_to_kmer_L = {}
23 |     local_index_to_kmer_R = {}
24 | 
25 |     #for each kmer, calculate its left hash and right hash, then add its index to the corresponding entries of the dictionary
26 |     for i, kmer in local_index_to_kmer.items():
27 |         kmer_L = kmer[:k_L]
28 |         kmer_R = kmer[k_L:]
29 |         local_index_to_kmer_L[i] = kmer_L
30 |         local_index_to_kmer_R[i] = kmer_R
31 |         kmer_L_hashes[kmer_to_int(kmer_L)] += [i]
32 |         kmer_R_hashes[kmer_to_int(kmer_R)] += [i]
33 | 
34 |     #for each left hash shared by multiple kmers, find the list of pairs in which the right half differs by 2 (i.e. if the left half matches, recurse on the right half).
35 |     for kmer_L_hash_indices in kmer_L_hashes.values(): #same in first half
36 |         if len(kmer_L_hash_indices) > 1:
37 |             pairs += get_2away_pairs({kmer_L_hash_index:local_index_to_kmer[kmer_L_hash_index][k_L:] for kmer_L_hash_index in kmer_L_hash_indices}, k_R) #differ by 2 in right half
38 | 
39 |     #for each right hash shared by multiple kmers, find the list of pairs in which the left half differs by 2 (i.e. if the right half matches, recurse on the left half).
40 |     for kmer_R_hash_indices in kmer_R_hashes.values(): #same in second half
41 |         if len(kmer_R_hash_indices) > 1:
42 |             pairs += get_2away_pairs({kmer_R_hash_index:local_index_to_kmer[kmer_R_hash_index][:k_L] for kmer_R_hash_index in kmer_R_hash_indices}, k_L) #differ by 2 in left half
43 | 
44 |     #Find matching pairs where the left half is one away, and the right half is one away
45 |     possible_pairs_L = set(get_1away_pairs(local_index_to_kmer_L, k_L))
46 |     possible_pairs_R = set(get_1away_pairs(local_index_to_kmer_R, k_R))
47 |     pairs += list(possible_pairs_L.intersection(possible_pairs_R))
48 |     return pairs
49 |
50 |
51 | ###This code has not been cleaned... needs to be edited!!!
52 | def get_3away_pairs(kmers):
53 |     """kmers is a list of kmers. get_3away_pairs returns a list of pairs where each pair of kmers is different in exactly three bases."""
54 |     k = len(kmers[0])
55 |     if k == 1 or k == 2:
56 |         return []
57 |     if k == 3:
58 |         return [pair for pair in combinations(kmers, 2) if pair[0][0] != pair[1][0] and pair[0][1] != pair[1][1] and pair[0][2] != pair[1][2]]
59 |     k_L = k//2
60 |     k_R = k-k_L
61 |     kmer_L_hashes = defaultdict(list)
62 |     kmer_R_hashes = defaultdict(list)
63 |     pairs = []
64 |     kmers_L = []
65 |     kmers_R = []
66 |     for i, kmer in enumerate(kmers):
67 |         kmer_L = kmer[:k_L]
68 |         kmer_R = kmer[k_L:]
69 |         kmers_L.append(kmer_L)
70 |         kmers_R.append(kmer_R)
71 |         kmer_L_hashes[kmer_to_int(kmer_L)] += [i]
72 |         kmer_R_hashes[kmer_to_int(kmer_R)] += [i]
73 |     for kmer_L_hash in kmer_L_hashes.values(): #same in first half
74 |         if len(kmer_L_hash) > 1:
75 |             kmer_L = kmers[kmer_L_hash[0]][:k_L] #first half
76 |             pairs += [tuple(kmer_L + kmer for kmer in pair) for pair in get_3away_pairs([kmers[i][k_L:] for i in kmer_L_hash])] #differ by 3 in second half
77 |     for kmer_R_hash in kmer_R_hashes.values(): #same in second half
78 |         if len(kmer_R_hash) > 1:
79 |             kmer_R = kmers[kmer_R_hash[0]][k_L:] #second half
80 |             pairs += [tuple(kmer + kmer_R for kmer in pair) for pair in get_3away_pairs([kmers[i][:k_L] for i in kmer_R_hash])] #differ by 3 in first half
81 |     possible_pairs_L = get_1away_pairs(kmers_L)
82 |     possible_pairs_R = get_2away_pairs(kmers_R)
83 |     for possible_pair_L in possible_pairs_L:
84 |         for possible_pair_R in possible_pairs_R:
85 |             possible_kmer1 = possible_pair_L[0] + possible_pair_R[0]
86 |             possible_kmer2 = possible_pair_L[1] + possible_pair_R[1]
87 |             if possible_kmer1 in kmers and possible_kmer2 in kmers:
88 |                 pairs += [(possible_kmer1, possible_kmer2)]
89 |     possible_pairs_L = get_2away_pairs(kmers_L)
90 |     possible_pairs_R = get_1away_pairs(kmers_R)
91 |     for possible_pair_L in possible_pairs_L:
92 |         for possible_pair_R in possible_pairs_R:
93 |             possible_kmer1 = possible_pair_L[0] + possible_pair_R[0]
94 |             possible_kmer2 = possible_pair_L[1] + possible_pair_R[1]
95 |             if possible_kmer1 in kmers and possible_kmer2 in kmers:
96 |                 pairs += [(possible_kmer1, possible_kmer2)]
97 |     return pairs
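Both recursions above bottom out in get_1away_pairs, which is defined elsewhere in the repository. For readers of this file in isolation, here is a minimal self-contained sketch of the same idea (my own simplified version using position masking, not the hash-and-recurse implementation the playground actually uses):

```python
from collections import defaultdict
from itertools import combinations

def get_1away_pairs_simple(kmers):
    """Return index pairs (i, j), i < j, of k-mers differing at exactly one base.
    Masking one position at a time buckets k-mers that agree everywhere else."""
    buckets = defaultdict(list)
    for i, kmer in enumerate(kmers):
        for pos in range(len(kmer)):
            buckets[(pos, kmer[:pos], kmer[pos + 1:])].append(i)
    pairs = set()
    for group in buckets.values():
        for i, j in combinations(group, 2):
            if kmers[i] != kmers[j]:  # identical k-mers share every bucket but differ in zero bases
                pairs.add((i, j))
    return sorted(pairs)

print(get_1away_pairs_simple(['AAA', 'AAC', 'ACC']))  # -> [(0, 1), (1, 2)]
```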
--------------------------------------------------------------------------------
/playground/playground.R:
--------------------------------------------------------------------------------
1 | files <- c('data/Avag1/coverages_2.tsv',
2 | 'data/Lcla1/Lcla1_pairs_coverages_2.tsv',
3 | 'data/Mflo2/coverages_2.tsv',
4 | 'data/Rvar1/Rvar1_pairs_coverages_2.tsv',
5 | 'data/Ps791/Ps791_pairs_coverages_2.tsv',
6 | 'data/Aric1/Aric1_pairs_coverages_2.tsv',
7 | "data/Rmag1/Rmag1_pairs_coverages_2.tsv")
8 |
9 | ###
10 | library(smudgeplot)
11 | args <- list()
12 | args$input <- 'data/Mflo2/Mflo2_coverages_2.tsv'
13 | args$output <- "figures/Mflo2_v0.1.0"
14 | args$nbins <- 40
15 | args$kmer_size <- 21
16 | args$homozygous <- F
17 |
18 | # args <- list()
19 | # args$input <- 'data/rice/SRR1919013_k21_l35_u500_coverages.tsv'
20 | # args$output <- "data/rice/smudge"
21 | # args$nbins <- 40
22 | # args$kmer_size <- 21
23 | # args$homozygous <- F
24 |
25 | ###
26 |
27 | i <- 7
28 | n <- NA
29 | cov <- read.table(args$input)
30 |
31 | # run bits of smudgeplot_plot.R to get k, and peak summary
32 |
33 | filter <- total_pair_cov < 350
34 | total_pair_cov_filt <- total_pair_cov[filter]
35 | minor_variant_rel_cov_filt <- minor_variant_rel_cov[filter]
36 |
37 | ymax <- max(total_pair_cov_filt)
38 | ymin <- min(total_pair_cov_filt)
39 |
40 | # the lims trick will make sure that the last column of squares will have the same width as the other squares
41 | smudge_container <- get_smudge_container(minor_variant_rel_cov, total_pair_cov, .nbins = 40)
42 |
43 | x <- seq(xlim[1], ((nbins - 1) / nbins) * xlim[2], length = nbins)
44 | y <- c(seq(ylim[1] - 0.1, ((nbins - 1) / nbins) * ylim[2], length = nbins), ylim[2])
45 |
46 | .peak_points <- peak_points
47 | .smudge_container <- smudge_container
48 | .total_pair_cov <- total_pair_cov
49 | .treshold <- 0.05
50 | fig_title <- 'test'
51 |
52 | image(smudge_container, col = colour_ramp)
53 | # contour(x.bin, y.bin, freq2D, add=TRUE, col=rgb(1,1,1,.7))
54 |
55 | #######
56 | # PLOT
57 | #######
58 |
59 | library(plotly)
60 | packageVersion('plotly')
61 |
62 | p <- plot_ly(x = k_toplot$x, y = k_toplot$y, z = k_toplot$z) %>% add_surface()
63 | htmlwidgets::saveWidget(p, "Ps791_smudge_surface.html")
64 | # Create a shareable link to your chart
65 | # Set up API credentials: https://plot.ly/r/getting-started
66 | chart_link = api_create(p, filename="Ps791_smudge_surface-2")
67 | chart_link
68 |
69 | layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
70 | # 1 smudge plot
71 | plot_smudgeplot(k_toplot, n, colour_ramp)
72 | plot_expected_haplotype_structure(n, peak_sizes, T)
73 | # annotate_peaks(peak_points, ymin, ymax)
74 | # annotate_summits(peak_points, peak_sizes, ymin, ymax, 'black')
75 | # TODO fix plot_seq_error_line(total_pair_cov)
76 | # 2,3 hist
77 | # TODO rescale histogram axis by the scale of the smudgeplot
78 | plot_histograms(minor_variant_rel_cov, total_pair_cov)
79 | # 4 legend
80 | plot_legend(k_toplot, total_pair_cov, colour_ramp)
81 |
82 | # findInterval(c(0.1, 0.2, 0.33, 0.5), seq(0, 0.5, length = 41))
83 |
84 | ##########################################################
85 | ## TEST
86 | ## idea here was to propagate from the highest point and expand the peak till it's monotonic
87 | # starting_point <- which(dens_m == max(dens_m), arr.ind = TRUE)
88 | # starting_val <- dens_m[starting_point]
89 | # peak_points <- data.frame(x = starting_point[,2], y = starting_point[,1], value = starting_val)
90 | #
91 | # points_to_explore <- get_neibours(starting_val)
92 | # val_to_comp <- starting_val
93 | #
94 | # for(one_point in 1:nrow(points_to_explore)){
95 | # one_point <- points_to_explore[one_point,]
96 | # point_val <- dens_m[t(one_point)]
97 | # if(point_val < val_to_comp){
98 | # peak_points <- rbind(peak_points,
99 | # data.frame(x = one_point[2], y = one_point[1], value = point_val))
100 | # }
101 | # }
102 | #
103 | # get_neibours <- function(point){
104 | # neibours_vec <- matrix(rep(starting_point,8) + c(-1,-1,0,-1,1,-1,-1,0,+1,0,-1,1,0,1,1,1),
105 | # ncol = 2, byrow = T)
106 | # neibours_vec[rowSums(neibours_vec <= 30 & neibours_vec >= 1) == 2,]
107 | # }
108 | #
109 |
110 | ##########################
111 | ### ALTERNATIVE PLOTTING ###
112 | ##########################
113 | # library('spatialfil')
114 | # msnFit(high_cov_filt, minor_variant_rel_cov)
115 |
116 | ## alternative plotting
117 | # library(hexbin) # honeycomb plot
118 | # h <- hexbin(df)
119 | # # h@count <- sqrt(h@count)
120 | # plot(h, colramp=rf)
121 | # gplot.hexbin(h, colorcut=10, colramp=rf)
122 |
123 |
124 | ### TEST plot lines at expected coverages
125 | #
126 | # for(i in 2:6){
127 | # lines(c(0, 0.6), rep(i * n, 2), lwd = 1.4)
128 | # text(0.1, i * n, paste0(i,'x'), pos = 3)
129 | # }
130 |
131 | # FUTURE - wrapper
132 | # smudgeplot < - function(.k, .minor_variant_rel_cov, .total_pair_cov, .n,
133 | # .sqrt_scale = T, .cex = 1.4, .fig_title = NA){
134 | # if( .sqrt_scale == T ){
135 | # # to display densities on square root scale (a bit like a log scale but less aggressive)
136 | # .k$z <- sqrt(.k$z)
137 | # }
138 | #
139 | # pal <- brewer.pal(11,'Spectral')
140 | # rf <- colorRampPalette(rev(pal[3:11]))
141 | # colour_ramp <- rf(32)
142 | #
143 | # layout(matrix(c(2,4,1,3), 2, 2, byrow=T), c(3,1), c(1,3))
144 | #
145 | # # 2D HISTOGRAM
146 | # plot_smudgeplot(...)
147 | #
148 | # # 1D histogram - minor_variant_rel_cov on top
149 | # plot_histogram(...)
150 | #
151 | # # 1D histogram - total pair coverage - right
152 | # plot_histogram(...)
153 | #
154 | # # LEGEND (topright corner)
155 | # plot_legend(...)
156 | #
157 | # }
158 |
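get_smudge_container (called above) is, at its core, a 2D histogram over (minor_variant_rel_cov, total_pair_cov). A minimal stand-in for that binning, clamping the upper edge into the last bin much like the "lims trick" comment describes (hypothetical Python, not the smudgeplot R implementation):

```python
def histogram_2d(xs, ys, nbins, xlim, ylim):
    """Bin (x, y) points into an nbins x nbins grid; returns grid[ix][iy]."""
    grid = [[0] * nbins for _ in range(nbins)]
    for x, y in zip(xs, ys):
        if not (xlim[0] <= x <= xlim[1] and ylim[0] <= y <= ylim[1]):
            continue  # points outside the plotting window are dropped
        # points exactly on the upper limit fall into the last bin
        ix = min(int(nbins * (x - xlim[0]) / (xlim[1] - xlim[0])), nbins - 1)
        iy = min(int(nbins * (y - ylim[0]) / (ylim[1] - ylim[0])), nbins - 1)
        grid[ix][iy] += 1
    return grid

# e.g. an AAB-like point plus two near-diagonal points
g = histogram_2d([0.33, 0.5, 0.49], [90, 120, 119], nbins=40, xlim=(0, 0.5), ylim=(0, 300))
```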
--------------------------------------------------------------------------------
/playground/playground.py:
--------------------------------------------------------------------------------
1 | #-----
2 | # What I tried to make the plots work
3 | # https://matplotlib.org/faq/howto_faq.html#generate-images-without-having-a-window-appear
4 | import matplotlib
5 | matplotlib.use('Agg')
6 | import matplotlib.pyplot as plt
7 | #-------
8 |
9 | #Load the particular dumps file you wish
10 | #These were created using jellyfish dump -c -L lower -U upper SRR_k21.jf > SRR_k21.dumps
11 | dumps_file = 'ERR2135445.dumps' #aric1 -L 20 -U 350
12 | dumps_file = 'SRR801084_k21.dumps' #avag1 -L 30 -U 300
13 | dumps_file = 'SRR4242457_k21.dumps' #mare2 -L 13 -U 132
14 | dumps_file = 'SRR4242472_k21.dumps' #ment1 -L 50 -U 350
15 | dumps_file = 'SRR4242474_k21.dumps' #mflo2 -L 60 -U 400
16 | dumps_file = 'SRR4242467_k21.dumps' #minc3 -L 25 -U 210
17 | dumps_file = 'SRR4242462_k21.dumps' #mjav2 -L 80 -U 600
18 | dumps_file = 'ERR2135453_k21.dumps' #rmac1 -L 100 -U 700
19 | dumps_file = 'ERR2135451_k21.dumps' #rmag1 -L 60 -U 500
20 |
21 | # dumps_file = 'kmers_dump_L120_U1500.tsv'
22 |
23 |
24 |
25 |
26 |
27 | plt.hist(coverages_2, bins = 1000)
28 | plt.savefig('coverages_2_hist.png')
29 | plt.close()
30 |
31 | # then plot a histogram of the coverages
32 |
33 | plt.hist(coverages_3, bins = 1000)
34 | plt.savefig('coverages_3_hist.png')
35 | plt.close()
36 |
37 | #n, bins, patches = plt.hist(coverages_3, bins = 1000)
38 | #bins[np.argmax(n)]
39 |
40 | #save families_4 to a pickle file, then plot a histogram of the coverages
41 |
42 | plt.hist(coverages_4, bins = 1000)
43 | plt.savefig('coverages_4_hist.png')
44 | plt.close()
45 |
46 |
47 | plt.hist(coverages_5, bins = 1000)
48 | plt.savefig('coverages_5_hist.png')
49 | plt.close()
50 |
51 | #save families_6 to a pickle file, then plot a histogram of the coverages
52 | plt.hist(coverages_6, bins = 1000)
53 | plt.savefig('coverages_6_hist.png')
54 | plt.close()
55 |
56 | ###some code to load previously saved pickle files
57 | # test_kmers = pickle.load(open('test_kmers.p', 'rb'))
58 | # test_coverages = pickle.load(open('test_coverages.p', 'rb'))
59 | G = pickle.load(open('G.p', 'rb'))
60 | component_lengths = pickle.load(open('component_lengths.p', 'rb'))
61 | families_2 = pickle.load(open('families_2.p', 'rb'))
62 | coverages_2 = pickle.load(open('coverages_2.p', 'rb'))
63 | # one_away_pair = pickle.load(open('one_away_pairs.p', 'rb'))
64 |
65 | # perhaps faster way how to calculate coverages_2
66 | # coverages_2 = [test_coverages[cov_i1] + test_coverages[cov_i2] for cov_i1, cov_i2 in families_2]
67 |
68 |
69 | #-----
70 | # for coverage in coverages_2:
71 | #
72 |
73 | ###Everything below this is just scratch work
74 | #f = open('ERR2135445_l20_u100.fa', 'r')
75 | #g = open('new.fa', 'w')
76 | #for line in f:
77 | # if line.startswith('>'):
78 | # g.write('>' + str(int(line[1:-1])+10000) + '\n')
79 | # else:
80 | # g.write(line)
81 | #f.close()
82 | #g.close()
83 |
84 | #get_3away_pairs(['AAAAAAAA', 'AACTAAGA', 'AACAATGA', 'AAAAATCG'])
85 |
86 |
87 | #get_1away_pairs(['AAA', 'AAC'])
88 |
89 | #kmers = [''.join([random.choice('ACGT') for _ in range(20)]) for _ in range(10)]
90 |
91 | #df2 = df[:1000000]
92 |
93 | #for pair in pairs:
94 | # #f.write(str(df2[df2[0] == pair[0]].iloc[0,1])+'\n')
95 | # #f.write(str(df2[df2[0] == pair[1]].iloc[0,1])+'\n')
96 | # [x[1] for x in pairs if x[0] == pair[0]]+[x[0] for x in pairs if x[1] == pair[0]]
97 | # a = df2[df2[0] == pair[0]].iloc[0,1]/89.2
98 | # b = df2[df2[0] == pair[1]].iloc[0,1]/89.2
99 | # f.write(str((a, b, a+b))+'\n')
100 |
101 | #Counter([min([Counter(pair[0])[x] for x in ['A', 'C', 'G', 'T']]) for pair in pairs])
102 |
103 | #high_complexity_pairs = [pair for pair in pairs if min([Counter(pair[0])[x] for x in ['A', 'C', 'G', 'T']])==5]
104 |
105 | #for hcpair in high_complexity_pairs:
106 | # f.write(str(df2[df2[0] == hcpair[0]]))
107 | # f.write(str(df2[df2[0] == hcpair[1]]))
108 |
109 | #570620 TAAAATAATTTTTTTCTTAAA 115
110 | #878881 TAAAATAATTTTTTTCTAAAA 67
111 | #182
112 |
113 | #526664 AATTACCATTCAACCAGTTTC 156
114 | #922303 AATTACCATTCAACCAGATTC 166
115 | #322
116 |
117 | #394517 AAGAGAAAAGAAAAAAGTAAT 180
118 | #788086 AAGAGAAAAGAAAAAAGAAAT 180
119 | #360
120 |
121 | #420665 AAAAAAAAGTGTTTTACTTTG 119
122 | #946878 AAAAAAAAGTGTTTTACTCTG 95
123 | #214
124 |
125 | #594426 ACAAAATATTACCTTTATCTA 117
126 | #768315 ACAAAATATTACCTTTATTTA 152
127 | #269
128 |
129 | #536269 ACAGATTGGCTTGTTTGAGCC 103
130 | #711261 ACAGATTGGCTTGTTTGAACC 99
131 |
132 | #383862 ATTTCATTTGTTAGAAAAAAA 139
133 | #907248 ATTTCATTTGTTAGAAAAGAA 162
134 |
135 | #438051 TCAACAGAAAATAATGGAGCA 152
136 | #962365 TCAACAGAAAATAATGGAACA 143
137 |
138 | #425231 AAAAAAAAACGAAAAAATTTT 15
139 | #734086 AAAAAAAAACGAAAAAAATTT 18
140 |
141 | #607197 AAAAAAAAACACGACATGTTT 154
142 | #783001 AAAAAAAAACACGACATGCTT 134
143 |
144 |
145 | #test_kmers = {i:kmer for (i, kmer) in enumerate(kmers[:100000])}
146 |
147 | #members = [x[0] for x in one_away_pairs] + [x[1] for x in one_away_pairs] + [x[0] for x in two_away_pairs] + [x[1] for x in two_away_pairs]
148 | #G = nx.Graph()
149 | #for one_away_pair in one_away_pairs:
150 | # G.add_edge(*one_away_pair)
151 | #for two_away_pair in two_away_pairs:
152 | # G.add_edge(*two_away_pair)
153 |
154 | #component_lengths = [len(x) for x in nx.connected_components(G)]
155 | #Counter(component_lengths)
156 | #families = [list(x) for x in nx.connected_components(G) if len(x) == 2]
157 | #coverages = [df2.iloc[families[i][0], 1]+df2.iloc[families[i][1], 1] for i in range(len(families))]
158 | #plt.hist(coverages, bins = 100)
159 | #plt.savefig('coverages_hist.png')
160 | #plt.close()
161 |
162 | #families_3 = [list(x) for x in nx.connected_components(G) if len(x) == 3]
163 | #coverages_3 = [df2.iloc[families_3[i][0], 1]+df2.iloc[families_3[i][1], 1] for i in range(len(families_3))]
164 | #plt.hist(coverages_3, bins = 100)
165 | #plt.savefig('coverages_3_hist.png')
166 | #plt.close()
167 |
--------------------------------------------------------------------------------
/playground/popart.R:
--------------------------------------------------------------------------------
1 | library(smudgeplot)
2 |
3 | args <- ArgumentParser()$parse_args()
4 | args$output <- "data/Scer/sploidyplot_test"
5 | args$nbins <- 40
6 | args$kmer_size <- 21
7 | args$homozygous <- FALSE
8 | args$L <- c()
9 | args$col_ramp <- 'viridis'
10 | args$invert_cols <- TRUE
11 |
12 | cov_tab <- read.table("data/Scer/PloidyPlot3_text.smu", col.names = c('covB', 'covA', 'freq'), skip = 2) #nolint
13 | cov_tab[, 'total_pair_cov'] <- cov_tab[, 1] + cov_tab[, 2]
14 | cov_tab[, 'minor_variant_rel_cov'] <- cov_tab[, 1] / cov_tab[, 'total_pair_cov']
15 |
16 | cov_tab_1n_est <- round(estimate_1n_coverage_1d_subsets(cov_tab), 1)
17 |
18 | xlim <- c(0, 0.5)
19 | # max(total_pair_cov); 10*draft_n
20 | ylim <- c(0, 150)
21 | nbins <- 40
22 |
23 | smudge_container <- get_smudge_container(cov_tab, nbins, xlim, ylim)
24 | smudge_container$z <- smudge_container$dens
25 |
26 | plot_popart <- function(cov_tab, ylim, colour_ramp){
27 | A_equals_B <- cov_tab[, 'covA'] == cov_tab[, 'covB']
28 | cov_tab[A_equals_B, 'freq'] <- cov_tab[A_equals_B, 'freq'] * 2
29 | cov_tab$col <- colour_ramp[1 + round(31 * cov_tab[, 'freq'] / max(cov_tab[, 'freq']))]
30 |
31 | plot(NULL, xlim = c(0.1, 0.5), ylim = ylim, xaxt = "n", yaxt = "n", xlab = '', ylab = '', bty = 'n')
32 | min_cov_to_plot <- max(ylim[1],min(cov_tab[, 'total_pair_cov']))
33 | nothing <- sapply(min_cov_to_plot:ylim[2], plot_one_coverage, cov_tab)
34 | return(0)
35 | }
36 |
37 | par(mfrow = c(2, 5))
38 |
39 | args$col_ramp <- "viridis"
40 | args$invert_cols <- FALSE
41 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11)
42 | # plot_smudgeplot(smudge_container, 15.5, colour_ramp)
43 | plot_popart(cov_tab, c(20, 120), colour_ramp)
44 |
45 | args$invert_cols <- TRUE
46 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11)
47 | # plot_smudgeplot(smudge_container, 15.5, colour_ramp)
48 | plot_popart(cov_tab, c(20, 120), colour_ramp)
49 |
50 |
51 | for (ramp in c("grey.colors", "magma", "plasma", "mako", "inferno", "rocket", "heat.colors", "cm.colors")){
52 | args$col_ramp <- ramp
53 | colour_ramp <- get_col_ramp(args) # get the default colour ramp (Spectral, 11)
54 | plot_popart(cov_tab, c(20, 120), colour_ramp)
55 | }
56 |
57 |
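plot_popart doubles the frequency of diagonal (covA == covB) entries before colouring, the same "density" issue noted in alternative_plotting_testing.R. A standalone sketch of that correction (hypothetical Python, not the R code; `double_diagonal` is an invented name):

```python
def double_diagonal(cov_records):
    """Double the frequency of diagonal entries; a covA == covB pair arises
    from only one ordered coverage combination, so its observed density is
    half that of an off-diagonal pair. Records are (covB, covA, freq)."""
    return [(covB, covA, freq * 2 if covA == covB else freq)
            for (covB, covA, freq) in cov_records]

print(double_diagonal([(50, 50, 10), (25, 50, 10)]))  # [(50, 50, 20), (25, 50, 10)]
```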
--------------------------------------------------------------------------------
/src_ploidyplot/PloidyPlot.c:
--------------------------------------------------------------------------------
1 | /******************************************************************************************
2 | *
3 | * PloidyPlot: a C-backed tool to search quickly for hetmers:
4 | * unique k-mer pairs different by exactly one nucleotide
5 | *
6 | * Author: Gene Myers
7 | * Date : May, 2021
8 | * Reduced to the k-mer pair search by Kamil Jaron in August, 2023
9 | *
10 | ********************************************************************************************/
11 |
12 | #include
13 | #include
14 | #include
15 | #include
16 | #include
17 | #include
18 | #include
19 | #include
20 |
21 | #undef SOLO_CHECK
22 |
23 | #undef DEBUG_GENERAL
24 | #undef DEBUG_RECURSION
25 | #undef DEBUG_THREADS
26 | #undef DEBUG_BOUNDARY
27 | #undef DEBUG_SCAN
28 | #undef DEBUG_BIG_SCAN
29 |
30 | #include "libfastk.h"
31 | #include "matrix.h"
32 |
33 | static char *Usage[] = { " [-v] [-T] [-P]",
34 | " [-o