├── seqlogo
    ├── tests
    │   ├── __init__.py
    │   └── test_seqlogo.py
    ├── __init__.py
    ├── seqlogo.py
    ├── utils.py
    └── core.py
├── .gitignore
├── requirements.txt
├── MANIFEST.in
├── setup.cfg
├── .github
    └── ISSUE_TEMPLATE
        ├── feature_request.md
        └── bug_report.md
├── environment.yml
├── LICENSE
├── setup.py
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── README.md
└── docs
    └── figures
        ├── ic_scale.svg
        └── no_ic_scale.svg

/seqlogo/tests/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------

/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | build/
3 | dist/
4 | seqLogo.egg-info
5 |
--------------------------------------------------------------------------------

/requirements.txt:
--------------------------------------------------------------------------------
1 | ghostscript>=0.6,<=9.18
2 | numpy>=1.16.1
3 | pandas>=0.24.0
4 | pytest>=4.2.0
5 | pytz>=2018.9
6 | seqlogo>=5.29.1
7 | weblogo>=3.6.0
8 | pre-commit
9 |
--------------------------------------------------------------------------------

/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.md
2 | include requirements.txt
3 | include environment.yml
4 | include LICENSE
5 | include CONTRIBUTING.md
6 | include CODE_OF_CONDUCT.md
7 | include setup.cfg
--------------------------------------------------------------------------------

/setup.cfg:
--------------------------------------------------------------------------------
1 | [bumpversion]
2 | current_version = 5.29.4
3 |
4 | [bdist_wheel]
5 | universal = 1
6 |
7 | [metadata]
8 | description-file = README.md
9 |
10 | [tool:pytest]
11 | addopts = --basetemp=$HOME
12 | filterwarnings = ignore
13 |
14 | [bumpversion:file:setup.py]
15 |
16 |
--------------------------------------------------------------------------------

/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Suggest an idea for this project
4 | title: ''
5 | labels: ''
6 | assignees: betteridiot
7 |
8 | ---
9 |
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 |
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 |
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 |
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 |
--------------------------------------------------------------------------------

/seqlogo/tests/test_seqlogo.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import numpy as np
3 | import pandas as pd
4 | import seqlogo
5 | import os
6 | import pytest
7 |
8 | def test_pwm():
9 |     np.random.seed(42)
10 |     random_pwm = np.random.dirichlet(np.ones(4), size=6)
11 |     assert seqlogo.Pwm(random_pwm), "PWM could not be generated"
12 |
13 | def test_pfm2pwm():
14 |     np.random.seed(42)
15 |     pfm = pd.DataFrame(np.random.randint(0, 36, size=(8, 4)))
16 |     assert seqlogo.pfm2pwm(pfm), "PWM from PFM could not be generated"
17 |
18 | def test_seqlogo_plot(tmpdir):
19 |     file = tmpdir.mkdir('pytest').join('test.svg')
20 |     np.random.seed(42)
21 |     pfm = pd.DataFrame(np.random.randint(0, 36, size=(8, 4)))
22 |     pwm = seqlogo.pfm2pwm(pfm)
23 |     seqlogo.seqlogo(pwm, format = 'eps', filename = file.strpath)
24 |     assert file.read()
--------------------------------------------------------------------------------

/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: betteridiot
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **To Reproduce**
14 | Steps to reproduce the behavior:
15 | 1. Go to '...'
16 | 2. Click on '....'
17 | 3. Scroll down to '....'
18 | 4. See error
19 |
20 | **Expected behavior**
21 | A clear and concise description of what you expected to happen.
22 |
23 | **Screenshots**
24 | If applicable, add screenshots to help explain your problem.
25 |
26 | **Desktop (please complete the following information):**
27 | - OS: [e.g. iOS]
28 | - Python Version [e.g. 22]
29 | - NumPy Version [e.g. 42]
30 | - Pandas Version [e.g. 42]
31 | - matplotlib Version [e.g. 42]
32 |
33 | **Additional context**
34 | Add any other context about the problem here.
35 |
--------------------------------------------------------------------------------

/seqlogo/__init__.py:
--------------------------------------------------------------------------------
1 | """
2 |
3 | `seqLogo` is a Python port of the Bioconductor package `seqLogo`_.
4 |
5 | The purpose of `seqLogo` is to process both Position Frequency Matrices (PFM)
6 | and Position Weight Matrices (PWM) and produce `WebLogo`_-like motif plots
7 |
8 |
9 | Note:
10 |     Additional support for extended alphabets has been added.
11 |
12 | The main class of `seqLogo` is:
13 |
14 | #. ``seqlogo.Pwm``: the main PWM handler
15 |
16 | However, additional helpful methods are exposed:
17 |
18 | #. ``seqlogo.pfm2pwm``: automatically converts a PFM to a PWM
19 | #. ``seqlogo.seqlogo``: the main method for plotting WebLogo-like motif plots
20 |
21 | .. _seqLogo:
22 |     http://bioconductor.org/packages/release/bioc/html/seqLogo.html
23 | .. _WebLogo:
24 |     http://weblogo.threeplusone.com/
25 |
26 | This code is part of the seqLogo distribution and governed by its
27 | license. Please see the LICENSE file that should have been included
28 | as part of this package.
29 |
30 | """
31 | from seqlogo.core import *
32 | from seqlogo.seqlogo import *
33 |
--------------------------------------------------------------------------------

/environment.yml:
--------------------------------------------------------------------------------
1 | name: seqlogo
2 | channels:
3 |   - conda-forge
4 |   - bioconda
5 |   - defaults
6 | dependencies:
7 |   - blas=1.1=openblas
8 |   - bzip2=1.0.6=h14c3975_1002
9 |   - ca-certificates=2018.11.29=ha4d7672_0
10 |   - certifi=2018.11.29=py37_1000
11 |   - libffi=3.2.1=hf484d3e_1005
12 |   - libgcc-ng=7.3.0=hdf63c60_0
13 |   - libgfortran-ng=7.2.0=hdf63c60_3
14 |   - libstdcxx-ng=7.3.0=hdf63c60_0
15 |   - ncurses=6.1=hf484d3e_1002
16 |   - numpy=1.16.1=py37_blas_openblash1522bff_0
17 |   - openblas=0.3.3=h9ac9557_1001
18 |   - openssl=1.0.2p=h14c3975_1002
19 |   - pandas=0.24.0=py37hf484d3e_0
20 |   - pip=19.0.1=py37_0
21 |   - python=3.7.1=hd21baee_1000
22 |   - python-dateutil=2.7.5=py_0
23 |   - pytz=2018.9=py_0
24 |   - readline=7.0=hf8c457e_1001
25 |   - setuptools=40.7.1=py37_0
26 |   - six=1.12.0=py37_1000
27 |   - sqlite=3.26.0=h67949de_1000
28 |   - tk=8.6.9=h84994c4_1000
29 |   - weblogo=3.6.0=pyh24bf2e0_1
30 |   - wheel=0.32.3=py37_0
31 |   - xz=5.2.4=h14c3975_1001
32 |   - zlib=1.2.11=h14c3975_1004
33 |   - pip:
34 |     - atomicwrites==1.3.0
35 |     - attrs==18.2.0
36 |     - ghostscript==0.6
37 |     - more-itertools==5.0.0
38 |     - pluggy==0.8.1
39 |     - py==1.7.0
40 |     - pytest==4.2.0
41 |     - seqlogo==5.29.1
42 | prefix: /root/miniconda3/envs/seqlogo
43 |
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2018, Marcus D Sherman
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
5 |
6 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
7 |
8 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
9 |
10 | 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
11 |
12 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
13 |
--------------------------------------------------------------------------------

/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools.command.test import test as TestCommand
2 | from setuptools import setup
3 | import os
4 | import sys
5 | from os import path
6 |
7 | __version__ = '5.29.8'
8 |
9 | def readme():
10 |     this_directory = path.abspath(path.dirname(__file__))
11 |     with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f:
12 |         long_description = f.read()
13 |     return long_description
14 |
15 |
16 | class PyTest(TestCommand):
17 |     user_options = [('pytest-args=', 'a', 'Arguments to pass to py.test')]
18 |
19 |     def initialize_options(self):
20 |         TestCommand.initialize_options(self)
21 |         self.pytest_args = []
22 |
23 |     def run_tests(self):
24 |         import pytest
25 |         errno = pytest.main(self.pytest_args)
26 |         if errno:
27 |             sys.exit(errno)
28 |         else:
29 |             errno = pytest.main(['-Wignore'])
30 |             sys.exit(errno)
31 |
32 |
33 | setup(
34 |     name='seqlogo',
35 |     version=__version__,
36 |     description='Python port of the R Bioconductor `seqlogo` package ',
37 |     long_description=readme(),
38 |     long_description_content_type='text/markdown',
39 |     url='https://github.com/betteridiot/seqlogo',
40 |     author='Marcus D. Sherman',
41 |     author_email='mdsherman@betteridiot.tech',
42 |     license='BSD 3-Clause',
43 |     install_requires=[
44 |         'numpy',
45 |         'pandas',
46 |         'weblogo',
47 |         'ghostscript',
48 |         'pytest'
49 |     ],
50 |     tests_require=['pytest'],
51 |     cmdclass = {'test' : PyTest},
52 |     packages=['seqlogo', 'seqlogo.tests'],
53 |     package_dir={'seqlogo': './seqlogo'},
54 |     package_data={'seqlogo': ['docs/*']},
55 |     classifiers=[
56 |         'Development Status :: 4 - Beta',
57 |         'Intended Audience :: Developers',
58 |         'Intended Audience :: End Users/Desktop',
59 |         'Intended Audience :: Science/Research',
60 |         'Intended Audience :: Information Technology',
61 |         'License :: OSI Approved :: BSD License',
62 |         'Natural Language :: English',
63 |         'Operating System :: Unix',
64 |         'Operating System :: MacOS',
65 |         'Programming Language :: Python :: 3.5',
66 |         'Programming Language :: Python :: 3.5',
67 |         'Programming Language :: Python :: 3.6',
68 |         'Programming Language :: Python :: 3.7',
69 |         'Programming Language :: Python :: Implementation :: CPython',
70 |         'Topic :: Scientific/Engineering',
71 |         'Topic :: Scientific/Engineering :: Information Analysis'
72 |     ],
73 |     keywords='sequence logo seqlogo bioinformatics genomics weblogo',
74 |     include_package_data=True,
75 |     zip_safe=False
76 | )
77 |
--------------------------------------------------------------------------------

/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # How to contribute to seqLogo
2 | ----
3 | **First and foremost:** by contributing to this project, you agree to abide by
4 | the Contributor Covenant [code of conduct](https://github.com/betteridiot/seqLogo/blob/master/CODE_OF_CONDUCT.md)
5 |
6 | ## Our Promise:
7 | The maintainers of this project promise to address each issue/question/pull
8 | request in the following manner:
9 | * Prompt acknowledgment of receipt of issue/question/pull request
10 | * Potentially assigning to a specific maintainer
11 | * If needed, a description of when work on a given issue has started, or an explanation of why the issue/pull request is not being addressed
12 | * Closing the issue upon the maintainer's determination of issue resolution.
13 |
14 | ## General Questions
15 |
16 | * Search the GitHub [Issues](https://github.com/betteridiot/seqLogo/issues) for this project to see if this question has already been addressed
17 | * If that does not answer your question, please feel free to create
18 | a new [issue](https://github.com/betteridiot/seqLogo/issues) that has a
19 | succinct (but informative) subject line. In the body of the issue, please ask
20 | your question, including any context regarding your specific problem or inquiry.
21 |
22 | A *bad* general question would look like:
23 | > "Your module will not work on my computer. How do I fix it?"
24 |
25 | A *good* general question would look like:
26 | > "I am running your code on Windows 10 version 1803 in an Anaconda build of Python version 3.6. When I try to \, I get \. Here is an example of how I invoked your code: \. Can you tell what is going on?"
27 |
28 | ## Bug reporting
29 |
30 | * Ensure the bug has not been already reported:
31 |     * Search the GitHub [Issues](https://github.com/betteridiot/seqLogo/issues) for this project
32 | * If the bug has not been previously reported, feel free to open a new issue:
33 |     * Please follow the Bug report template provided on issue creation as closely as possible
34 |
35 | ## Bugfixes, patches, or documentation corrections
36 | Any additions to the code should follow the [Google Python Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md) or [NumPy Style Guide](https://numpydoc.readthedocs.io/en/latest/) for documentation purposes.
37 |
38 | The preferred method of contributing in the form of a pull request is to fork
39 | the latest version of the project (likely the `master` branch), and clone that
40 | fork to your local machine:
41 |
42 | ```bash
43 |
44 | git clone git@github.com:your-username/seqlogo.git
45 |
46 | ```
47 |
48 | Commit your changes to your fork, and submit a [Pull Request](https://github.com/betteridiot/seqLogo/pulls)
49 |
50 | The maintainers will review your PR in a timely manner. Be aware that the
51 | maintainers may request an improvement or alternative, and also reserve the
52 | right to reject any Pull Request if it does not meet style or community
53 | guidelines.
54 |
55 | A good PR should contain the following items:
56 | * Follows the [Google Python Style Guide](https://github.com/google/styleguide/blob/gh-pages/pyguide.md) or [NumPy Style Guide](https://numpydoc.readthedocs.io/en/latest/).
57 | * **Tests:** either doctests or unit tests (written using `pytest`)
58 | * Contains a helpful/meaningful **commit message**.
59 |
--------------------------------------------------------------------------------

/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to making participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 |
12 | ## Our Standards
13 |
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 |
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 |
23 | Examples of unacceptable behavior by participants include:
24 |
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 |  advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 |  address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 |  professional setting
33 |
34 | ## Our Responsibilities
35 |
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 |
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 |
46 | ## Scope
47 |
48 | This Code of Conduct applies both within project spaces and in public spaces
49 | when an individual is representing the project or its community. Examples of
50 | representing a project or community include using an official project e-mail
51 | address, posting via an official social media account, or acting as an appointed
52 | representative at an online or offline event. Representation of a project may be
53 | further defined and clarified by project maintainers.
54 |
55 | ## Enforcement
56 |
57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 | reported by contacting the project team at mdsherman@betteridiot.tech. All
59 | complaints will be reviewed and investigated and will result in a response that
60 | is deemed necessary and appropriate to the circumstances. The project team is
61 | obligated to maintain confidentiality with regard to the reporter of an incident.
62 | Further details of specific enforcement policies may be posted separately.
63 |
64 | Project maintainers who do not follow or enforce the Code of Conduct in good
65 | faith may face temporary or permanent repercussions as determined by other
66 | members of the project's leadership.
67 |
68 | ## Attribution
69 |
70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
72 |
73 | [homepage]: https://www.contributor-covenant.org
74 |
75 | For answers to common questions about this code of conduct, see
76 | https://www.contributor-covenant.org/faq
77 |
--------------------------------------------------------------------------------

/seqlogo/seqlogo.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | import tempfile
4 | import re
5 |
6 | import pkg_resources
7 | weblogo_version = pkg_resources.get_distribution('weblogo').version
8 | try:
9 |     if pkg_resources.parse_version(weblogo_version) < pkg_resources.parse_version("3.7"):
10 |         import weblogolib as wl
11 |     else:
12 |         import weblogo as wl
13 | except ModuleNotFoundError:
14 |     import weblogo as wl
15 |
16 | from seqlogo import utils
17 |
18 |
19 | _sizes = {
20 |     'small': 3.54,
21 |     'medium': 5,
22 |     'large': 7.25,
23 |     'xlarge': 10.25
24 | }
25 |
26 |
27 | def seqlogo(pm, ic_scale = True, color_scheme = None, size = 'medium',
28 |             format = 'svg', filename = None, **kwargs):
29 |     """The plotting method of the `seqlogo` distribution. Depends on using
30 |     any of the 3 classes exposed by `seqlogo`:
31 |         * `seqlogo.Ppm`
32 |         * `seqlogo.Pwm`
33 |         * `seqlogo.CompletePm`
34 |
35 |     Given an `M x N` PM matrix, where `M` is the number of positions and `N`
36 |     is the number of letters, calculate and render a WebLogo-like motif plot.
37 |
38 |     When `ic_scale` is `True`, the height of each column is proportional to
39 |     its information content. The y-axis label and scale will reflect information
40 |     content ("bits"). Otherwise, all columns have the same height and the y-axis
41 |     label will reflect "probability"
42 |
43 |     Args:
44 |         pm (`seqlogo.Pm` subclass): a pre-formatted Pm instance
45 |         ic_scale (bool): whether or not to scale the column heights (default: True)
46 |         size (str): small (3.54 in), medium (5 in), large (7.25 in), xlarge (10.25 in) (default: 'medium')
47 |         format (str): desired WebLogo-supported output format. Options are 'eps', 'pdf', 'png', 'jpeg', and 'svg' (default: "svg")
48 |         filename (None | str): Name of the file to save the figure. If `None`:
49 |             the figure will not be saved. (default: None)
50 |         color_scheme (str): the color scheme to use for weblogo:
51 |             'auto': None
52 |             'monochrome': all black
53 |             'base pairing': (NA Only) TAU are orange, GC are blue
54 |             'classic': (NA Only) classic WebLogo color scheme for nucleic acids
55 |             'hydrophobicity': (AA only) Color based on hydrophobicity
56 |             'chemistry': (AA only) Color based on chemical properties
57 |             'charge': (AA Only) Color based on charge
58 |         **kwargs: all additional keyword arguments found at http://weblogo.threeplusone.com/manual.html
59 |     """
60 |     # Ensure color scheme matches the alphabet
61 |     if pm._alphabet_type in utils._NA_ALPHABETS:
62 |         if color_scheme is None:
63 |             color_scheme = 'classic'
64 |         if color_scheme not in utils.NA_COLORSCHEMES:
65 |             raise ValueError('{} color_scheme selected is not an allowed nucleic acid color scheme'.format(color_scheme))
66 |     elif pm._alphabet_type in utils._AA_ALPHABETS:
67 |         if color_scheme is None:
68 |             color_scheme = 'hydrophobicity'
69 |         if color_scheme not in utils.AA_COLORSCHEMES:
70 |             raise ValueError('{} color_scheme selected is not an allowed amino acid color scheme'.format(color_scheme))
71 |
72 |     color_scheme = wl.std_color_schemes[color_scheme]
73 |
74 |     # Setup the format writer
75 |     out_format = wl.formatters[format]
76 |
77 |     # Prepare the logo size
78 |     stack_width = (_sizes[size]/pm.length) * 72
79 |
80 |     # Initialize the options
81 |     if ic_scale:
82 |         unit_name = 'bits'
83 |     else:
84 |         unit_name = 'probability'
85 |     options = wl.LogoOptions(unit_name = unit_name, color_scheme = color_scheme,
86 |                              show_fineprint = False, stack_width = stack_width, **kwargs)
87 |
88 |     # Initialize the output format
89 |     logo_format = wl.LogoFormat(pm, options)
90 |
91 |     out = out_format(pm, logo_format)
92 |
93 |     # Create the file if the user supplied a filename
94 |     if filename:
95 |         with open('{}'.format(filename), 'wb') as out_file:
96 |             out_file.write(out)
97 |
98 |     if format == 'svg':
99 |         svg_hash = hash(out)
100 |         out = re.sub(rb'("#?glyph.*?)(")', rb'\1 %s\2' % str(svg_hash).encode(), out)
101 |
102 |     try:
103 |         if get_ipython():
104 |             import IPython.display as ipd
105 |             if format == 'svg':
106 |                 return ipd.SVG(out)
107 |             elif format in ('png', 'jpeg'):
108 |                 return ipd.Image(out)
109 |             else:
110 |                 raise ValueError('{} format not supported for plotting in console'.format(format))
111 |     except NameError:
112 |         if filename is None:
113 |             raise ValueError('If not in an IPython/Jupyter console and no filename is given, nothing will be rendered')
114 |
115 |
--------------------------------------------------------------------------------

/seqlogo/utils.py:
--------------------------------------------------------------------------------
1 | """
2 | Copyright (c) 2018, Marcus D. Sherman
3 |
4 | This code is part of the seqlogo distribution and governed by its
5 | license. Please see the LICENSE file that should have been included
6 | as part of this package.
7 |
8 | @author: "Marcus D. Sherman"
9 | @copyright: "Copyright 2018, University of Michigan, Mills Lab"
10 | @email: "mdsherman@betteridiot.tech"
11 |
12 | """
13 | from collections import OrderedDict
14 | from copy import deepcopy
15 | import numpy as np
16 | import pandas as pd
17 | import seqlogo as sl
18 |
19 |
20 | _AA_LETTERS = "ACDEFGHIKLMNPQRSTVWY"
21 | _REDUCED_AA_LETTERS = "ACDEFGHIKLMNPQRSTVWYX*-"
22 | _AMBIG_AA_LETTERS = "ACDEFGHIKLMNOPQRSTUVWYBJZX*-"
23 |
24 | _DNA_LETTERS = "ACGT"
25 | _REDUCED_DNA_LETTERS = "ACGTN-"
26 | _AMBIG_DNA_LETTERS = "ACGTRYSWKMBDHVN-"
27 |
28 | _RNA_LETTERS = "ACGU"
29 | _REDUCED_RNA_LETTERS = "ACGUN-"
30 | _AMBIG_RNA_LETTERS = "ACGURYSWKMBDHVN-"
31 |
32 | _IDX_LETTERS = {
33 |     "DNA": _DNA_LETTERS, "reduced DNA": _REDUCED_DNA_LETTERS, "ambig DNA": _AMBIG_DNA_LETTERS,
34 |     "RNA": _RNA_LETTERS, "reduced RNA": _REDUCED_RNA_LETTERS, "ambig RNA": _AMBIG_RNA_LETTERS,
35 |     "AA":_AA_LETTERS, "reduced AA": _REDUCED_AA_LETTERS, "ambig AA": _AMBIG_AA_LETTERS
36 | }
37 |
38 | _NA_ALPHABETS = set((
39 |     "DNA", "reduced DNA", "ambig DNA",
40 |     "RNA", "reduced RNA", "ambig RNA"
41 | ))
42 |
43 |
44 | _AA_ALPHABETS = set((
45 |     "AA", "reduced AA", "ambig AA"
46 | ))
47 |
48 | # Using Robinson-Robinson Frequencies --> Order matters
49 | _AA_background = OrderedDict((
50 |     ('A', 0.087135727479), ('C', 0.033468612677), ('D', 0.046870296325), ('E', 0.049525516559),
51 |     ('F', 0.039767240243), ('G', 0.088606655336), ('H', 0.033621241997), ('I', 0.036899088289),
52 |     ('K', 0.080483022246), ('L', 0.085361634465), ('M', 0.014743987313), ('N', 0.040418278548),
53 |     ('P', 0.050677889818), ('Q', 0.038269289735), ('R', 0.040894944605), ('S', 0.069597088795),
54 |     ('T', 0.058530491824), ('V', 0.064717068767), ('W', 0.010489421950), ('Y', 0.029922503029)
55 | ))
56 |
57 |
58 | _NA_background = {nt: 0.25 for nt in 'ACGT'}
59 |
60 |
61 | _conv_alph_len = {
62 |     'reduced AA': (len(_REDUCED_AA_LETTERS) - 3, len(_REDUCED_AA_LETTERS) - 3),
63 |     'ambig AA': (len(_AMBIG_AA_LETTERS) - 6, len(_AMBIG_AA_LETTERS) - 3),
64 |     'reduced DNA': (len(_REDUCED_DNA_LETTERS) - 2, len(_REDUCED_DNA_LETTERS) - 2),
65 |     'ambig DNA': (len(_AMBIG_DNA_LETTERS) - 12,len(_AMBIG_DNA_LETTERS) - 2),
66 |     'reduced RNA': (len(_REDUCED_RNA_LETTERS) - 2, len(_REDUCED_RNA_LETTERS) - 2),
67 |     'ambig RNA': (len(_AMBIG_RNA_LETTERS) - 12, len(_AMBIG_RNA_LETTERS) - 2)
68 | }
69 |
70 |
71 | def convert_pm(non_std_pm, pm_type = 'pfm', alphabet_type = 'reduced DNA', background = None, pseudocount = None):
72 |     ambig_start, len_alph = _conv_alph_len[alphabet_type]
73 |
74 |     # Convert all to pfm for easy weight calculations
75 |     if pm_type == 'pwm':
76 |         pm = sl.pwm2pfm(non_std_pm, background, pseudocount, alphabet_type)
77 |     elif pm_type == 'ppm':
78 |         pm = sl.ppm2pfm(non_std_pm, alphabet_type)
79 |     else:
80 |         pm = non_std_pm
81 |     new_pm = pd.DataFrame(non_std_pm[:,:ambig_start].copy(), columns = list(_IDX_LETTERS[alphabet_type][:ambig_start]))
82 |     if isinstance(pm, pd.DataFrame):
83 |         weights = pm.iloc[:,:len_alph].sum(axis = 1) / pm.sum(axis = 1)
84 |     elif isinstance(pm, np.ndarray):
85 |         weights = pm[:,:len_alph].sum(axis = 1) / pm.sum(axis = 1)
86 |
87 |     if 'ambig' in alphabet_type:
88 |         pd_pm = pd.DataFrame(pm, columns = list(_IDX_LETTERS[alphabet_type]))
89 |         # dealing with ambiguous sequences
90 |         for letter in _IDX_LETTERS[alphabet_type][ambig_start:-1]: # Don't care about '-'
91 |             ambig_pairs = _AMBIGUITIES[alphabet_type][letter]
92 |             new_pm.loc[:,list(ambig_pairs)] += (pd_pm[letter]/len(ambig_pairs))[:, np.newaxis]
93 |         if pm_type == 'ppm':
94 |             new_pm = sl.pfm2ppm(new_pm, alphabet_type.split()[1])
95 |         elif pm_type == 'pwm':
96 |             new_pm = sl.pfm2pwm(new_pm, background, pseudocount, alphabet_type.split()[1])
97 |         return new_pm.values, weights
98 |
99 |     # Dealing with reduced sequences
100 |     else:
101 |         # equally distribute the 'N/X/*/-' counts
102 |         new_pm += pm[:,len_alph:-1].sum(axis = 1)[:, np.newaxis] / new_pm.shape[1]
103 |
104 |         # Return in same form as given
105 |         if pm_type == 'pwm':
106 |             new_pm = sl.pfm2pwm(new_pm, background, pseudocount, alphabet_type.split()[1])
107 |         elif pm_type == 'ppm':
108 |             new_pm = sl.pfm2ppm(new_pm, alphabet_type.split()[1])
109 |         return new_pm, weights
110 |
111 |
112 | NA_COLORSCHEMES = set((
113 |     'monochrome', 'base pairing', 'classic'
114 | ))
115 |
116 | AA_ALPHABETS = set((
117 |     "AA", "reduced AA", "ambig AA"
118 | ))
119 |
120 | AA_COLORSCHEMES = set((
121 |     'monochrome', 'hydrophobicity', 'chemistry','charge'
122 | ))
123 |
124 | dna_ambiguity = {
125 |     "A": "A",
126 |     "C": "C",
127 |     "G": "G",
128 |     "T": "T",
129 |     "M": "AC",
130 |     "R": "AG",
131 |     "W": "AT",
132 |     "S": "CG",
133 |     "Y": "CT",
134 |     "K": "GT",
135 |     "V": "ACG",
136 |     "H": "ACT",
137 |     "D": "AGT",
138 |     "B": "CGT",
139 |     "N": 'ACGT'
140 | }
141 |
142 | rna_ambiguity = {
143 |     "A": "A",
144 |     "C": "C",
145 |     "G": "G",
146 |     "U": "U",
147 |     "M": "AC",
148 |     "R": "AG",
149 |     "W": "AU",
150 |     "S": "CG",
151 |     "Y": "CU",
152 |     "K": "GU",
153 |     "V": "ACG",
154 |     "H": "ACU",
155 |     "D": "AGU",
156 |     "B": "CGU",
157 |     "N": 'ACGU'
158 | }
159 |
160 | amino_acid_ambiguity = {
161 |     "A": "A",
162 |     "B": "ND",
163 |     "C": "C",
164 |     "D": "D",
165 |     "E": "E",
166 |     "F": "F",
167 |     "G": "G",
168 |     "H": "H",
169 |     "I": "I",
170 |     "K": "K",
171 |     "L": "L",
172 |     "M": "M",
173 |     "N": "N",
174 |     "P": "P",
175 |     "Q": "Q",
176 |     "R": "R",
177 |     "S": "S",
178 |     "T": "T",
179 |     "V": "V",
180 |     "W": "W",
181 |     "Y": "Y",
182 |     "Z": "QE",
183 |     "J": "IL",
184 |     'U': 'U',
185 |     'O': 'O',
186 |     'X': "ACDEFGHIKLMNPQRSTVWY",
187 |     '*': "ACDEFGHIKLMNPQRSTVWY"
188 | }
189 |
190 | _AMBIGUITIES = {
191 |     "ambig DNA": dna_ambiguity,
192 |     "ambig RNA": rna_ambiguity,
193 |     "ambig AA": amino_acid_ambiguity
194 | }
195 |
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
1 |
2 | [![PyPI version](https://badge.fury.io/py/seqlogo.svg)](https://pypi.org/project/seqlogo/)
3 | [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square)](http://bioconda.github.io/recipes/seqlogo/README.html)
4 | [![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://github.com/betteridiot/seqlogo/blob/master/LICENSE)
5 |
6 |
7 | # seqlogo
8 | Python port of Bioconductor's [seqLogo](http://bioconductor.org/packages/release/bioc/html/seqLogo.html) served by [WebLogo](http://weblogo.threeplusone.com/)
9 |
10 | ## Overview
11 |
12 | In the field of bioinformatics, a common task is to look for sequence motifs at
13 | different sites along the genome or within a protein sequence. One aspect of this
14 | analysis involves creating a variant of a Position Matrix (PM): a Position Frequency Matrix (PFM),
15 | a Position Probability Matrix (PPM), or a Position Weight Matrix (PWM). The formal format for
16 | a PWM file can be found [here](http://bioinformatics.intec.ugent.be/MotifSuite/pwmformat.php).
17 |
18 | ---
19 | #### Specification
20 | A PM file can be just a plain-text, whitespace-delimited matrix, such that the number of columns
21 | matches the number of letters in your desired alphabet and the number of rows is the number of positions
22 | in your sequence. Any comment lines that start with `#` will be skipped.
23 |
24 | *Note*: [TRANSFAC matrix](http://meme-suite.org/doc/transfac-format.html) and [MEME Motif](http://meme-suite.org/doc/meme-format.html) formats are not directly supported.
25 |
26 |
27 |
28 | Each entry of a PPM is the probability that a given letter is seen at a given position.
29 |
30 | This is often generated in a frequentist fashion. If a pipeline
31 | tallies all observed letters at each position, this is called a Position Frequency Matrix (PFM).
32 |
33 | The PFM can be converted to a PPM in a straightforward manner by dividing each count by the
34 | total count at its position, so that for any given position and letter, the matrix reports
35 | the probability of that letter at that position.
36 |
37 | A PWM is the PPM converted into log-likelihood. Pseudocounts can be applied to prevent
38 | probabilities of 0 from turning into -inf in the conversion process. Lastly, each position's
39 | log-likelihood is corrected for some background probability for every given letter in the
40 | selected alphabet.
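
The PFM → PPM → PWM chain described above can be sketched in a few lines of plain Python. This is a minimal illustration, not `seqlogo`'s own implementation: the helper names `pfm_to_ppm` and `ppm_to_pwm` are hypothetical, a uniform background is assumed when none is given, and the pseudocount is simply added to every count, which may differ from how `seqlogo` applies it.

```python
import math


def pfm_to_ppm(pfm, pseudocount=0.0):
    """Turn per-position counts into probabilities (each row sums to 1)."""
    ppm = []
    for row in pfm:
        adjusted = [count + pseudocount for count in row]
        total = sum(adjusted)
        ppm.append([value / total for value in adjusted])
    return ppm


def ppm_to_pwm(ppm, background=None):
    """Log2 ratio of each probability to the letter's background probability."""
    n_letters = len(ppm[0])
    if background is None:
        # Assumption: uniform background, e.g. 0.25 per letter for DNA
        background = [1.0 / n_letters] * n_letters
    return [
        [math.log2(p / b) if p > 0 else float("-inf")
         for p, b in zip(row, background)]
        for row in ppm
    ]


# Toy PFM: 3 positions (rows) x 4 letters (columns, in "ACGT" order)
pfm = [
    [8, 0, 0, 2],
    [1, 1, 7, 1],
    [0, 10, 0, 0],
]

ppm = pfm_to_ppm(pfm)
pwm = ppm_to_pwm(ppm)  # zero counts become -inf
smoothed_pwm = ppm_to_pwm(pfm_to_ppm(pfm, pseudocount=1.0))  # all finite
```

Without a pseudocount, any letter that was never observed at a position ends up at `-inf`; adding even a small pseudocount keeps every weight finite, which is the motivation given in the paragraph above.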
41 | 42 | --- 43 | #### Features 44 | * `seqlogo` can use any PM as an entry point for analysis (from a file or in array formats) 45 | and, subsequently, plot the sequence logos. 46 | 47 | * `seqlogo` was written to support BIOINF 529: Bioinformatics Concepts and Algorithms 48 | at the University of Michigan in the Department of Computational Medicine & Bioinformatics. 49 | 50 | * `seqlogo` attempts to blend the user-friendly API of Bioconductor's [seqLogo](http://bioconductor.org/packages/release/bioc/html/seqLogo.html) 51 | and the rendering power of the [WebLogo](http://weblogo.threeplusone.com/) Python API. 52 | 53 | * `seqlogo` supports the following alphabets: 54 | 55 | | Alphabet name | Alphabet Letters | 56 | | :--- | :--- | 57 | | **`"DNA"`** | `"ACGT"` | 58 | | `"reduced DNA"` | `"ACGTN-"` | 59 | | `"ambig DNA"` | `"ACGTRYSWKMBDHVN-"` | 60 | | **`"RNA"`** | `"ACGU"` | 61 | | `"reduced RNA"` | `"ACGUN-"` | 62 | | `"ambig RNA"` | `"ACGURYSWKMBDHVN-"` | 63 | | **`"AA"`** | `"ACDEFGHIKLMNPQRSTVWY"` | 64 | | `"reduced AA"` | `"ACDEFGHIKLMNPQRSTVWYX*-"` | 65 | | `"ambig AA"` | `"ACDEFGHIKLMNOPQRSTUVWYBJZX*-"` | 66 | (**Bolded** alphabet names are the most commonly used) 67 | * `seqlogo` can also render sequence logos in a number of formats: 68 | * `"svg"` (default) 69 | * `"eps"` 70 | * `"pdf"` 71 | * `"jpeg"` 72 | * `"png"` 73 | 74 | * All plots can be rendered in 4 different sizes: 75 | * `"small"`: 3.54" wide 76 | * `"medium"`: 5" wide 77 | * `"large"`: 7.25" wide 78 | * `"xlarge"`: 10.25" wide 79 | 80 | *Note*: all sizes are taken from [this](http://www.sciencemag.org/sites/default/files/Figure_prep_guide.pdf) publication 81 | guide from Science Magazine. 82 | 83 | --- 84 | #### Recommended settings: 85 | * For best results, use `seqlogo` within an IPython/Jupyter environment (for inline plotting purposes). 
86 | * Initially written for Python 3.7, but has been shown to work in versions 3.5+ (**Python 2.7 is not supported**) 87 | 88 | *** 89 | ## Setup 90 | 91 | ### Minimal Requirements: 92 | 1. `numpy` 93 | 2. `pandas` 94 | 3. `weblogo` 95 | 96 | **Note**: it is strongly encouraged that `jupyter` is installed as well. 97 | 98 | --- 99 | #### `conda` environment: 100 | 101 | To produce the ideal virtual environment that will run `seqlogo` on a `conda`-based 102 | build, clone the repo or download the environment.yml within the repo. Then run the following 103 | command: 104 | 105 | ```bash 106 | 107 | $ conda env create -f environment.yml 108 | 109 | ``` 110 | 111 | --- 112 | #### Installation 113 | 114 | To install using `pip` (recommended): 115 | 116 | ```bash 117 | 118 | $ pip install seqlogo 119 | 120 | ``` 121 | 122 | To install using `conda`: 123 | 124 | ```bash 125 | 126 | $ conda install -c bioconda seqlogo 127 | 128 | ``` 129 | 130 | 131 | Or install from GitHub directly: 132 | 133 | ```bash 134 | 135 | $ pip install git+https://github.com/betteridiot/seqlogo.git#egg=seqlogo 136 | 137 | ``` 138 | 139 | *** 140 | ## Quickstart 141 | 142 | ### Importing 143 | 144 | ```python 145 | 146 | import numpy as np 147 | import pandas as pd 148 | import seqlogo 149 | 150 | ``` 151 | 152 | ### Generate some PM data (without frequency data) 153 | 154 | Many demonstrations of PWMs start from PPM data rather than counts. 155 | Many packages preclude sequence logo generation from this entry point. However, 156 | `seqlogo` can handle it just fine. One point to note, though, is that if no count 157 | data is provided, `seqlogo` generates the PFM data by multiplying the 158 | probabilities by 100. This is **only** for `weblogolib` compatibility. 
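To make the "multiply by 100" imputation concrete, here is a small NumPy sketch. It is an assumption-labeled illustration rather than a call into `seqlogo`: it mirrors the `counts` property in `core.py`, which scales the matrix by 100 and casts it to `np.int64`, and it uses the first PPM row from this Quickstart as input.

```python
import numpy as np

# First row of the example PPM (letters A, C, G, T)
ppm_row = np.array([0.082197, 0.527252, 0.230641, 0.159911])

# Scale to pseudo-counts; the integer cast truncates instead of rounding
counts = (ppm_row * 100).astype(np.int64)

print(counts)        # [ 8 52 23 15]
print(counts.sum())  # 98
```

Because the cast truncates, an imputed row does not necessarily sum to exactly 100.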
159 | 160 | ```python 161 | 162 | # Setting seed for demonstration purposes 163 | >>> np.random.seed(42) 164 | 165 | # Making a fake PPM 166 | >>> random_ppm = np.random.dirichlet(np.ones(4), size=6) 167 | >>> ppm = seqlogo.Ppm(random_ppm) 168 | >>> ppm 169 | A C G T 170 | 0 0.082197 0.527252 0.230641 0.159911 171 | 1 0.070375 0.070363 0.024826 0.834435 172 | 2 0.161962 0.216972 0.003665 0.617401 173 | 3 0.735638 0.098290 0.082638 0.083434 174 | 4 0.179898 0.368931 0.280463 0.170708 175 | 5 0.498510 0.079138 0.182004 0.240349 176 | 177 | ``` 178 | 179 | ### Generate some frequency data and convert to PWM 180 | Sometimes the user has frequency data instead of a PWM. To construct a `Pwm` instance 181 | that automatically computes Information Content and PWM values, the user can use 182 | the `seqlogo.pfm2pwm()` function. 183 | 184 | ```python 185 | 186 | # Setting seed for demonstration purposes 187 | >>> np.random.seed(42) 188 | 189 | # Making some fake Position Frequency Data (PFM) 190 | >>> pfm = pd.DataFrame(np.random.randint(0, 36, size=(8, 4))) 191 | 192 | # Convert to Position Weight Matrix (PWM) 193 | >>> pwm = seqlogo.pfm2pwm(pfm) 194 | >>> pwm 195 | A C G T 196 | 0 0.698830 -0.301170 -1.301170 0.213404 197 | 1 0.263034 0.552541 -0.584962 -0.584962 198 | 2 0.148523 0.754244 0.148523 -3.375039 199 | 3 0.182864 -4.209453 0.314109 0.648528 200 | 4 -4.000000 0.321928 1.000000 -0.540568 201 | 5 -0.222392 -0.029747 0.085730 0.140178 202 | 6 0.697437 0.597902 -2.209453 -0.624491 203 | 7 0.736966 -0.584962 0.502500 -2.000000 204 | 205 | ``` 206 | 207 | ### `seqlogo.CompletePm` demo 208 | 209 | Here is a quickstart guide on how to leverage the power of `seqlogo.CompletePm` 210 | 211 | ```python 212 | 213 | # Setting seed for demonstration purposes 214 | >>> np.random.seed(42) 215 | 216 | # Making a fake PPM 217 | >>> random_ppm = np.random.dirichlet(np.ones(4), size=6) 218 | >>> cpm = seqlogo.CompletePm(ppm = random_ppm) 219 | 220 | # Pfm was imputed 221 | >>> 
print(cpm.pfm) 222 | A C G T 223 | 0 8 52 23 15 224 | 1 7 7 2 83 225 | 2 16 21 0 61 226 | 3 73 9 8 8 227 | 4 17 36 28 17 228 | 5 49 7 18 24 229 | 230 | # Shows how the PPM data was formatted 231 | >>> print(cpm.ppm) 232 | A C G T 233 | 0 0.082197 0.527252 0.230641 0.159911 234 | 1 0.070375 0.070363 0.024826 0.834435 235 | 2 0.161962 0.216972 0.003665 0.617401 236 | 3 0.735638 0.098290 0.082638 0.083434 237 | 4 0.179898 0.368931 0.280463 0.170708 238 | 5 0.498510 0.079138 0.182004 0.240349 239 | 240 | # Computing the PWM using default background and pseudocounts 241 | >>> print(cpm.pwm) 242 | A C G T 243 | 0 -1.604773 1.076564 -0.116281 -0.644662 244 | 1 -1.828788 -1.829031 -3.331983 1.738871 245 | 2 -0.626276 -0.204418 -6.091862 1.304279 246 | 3 1.557068 -1.346815 -1.597049 -1.583223 247 | 4 -0.474749 0.561423 0.165882 -0.550396 248 | 5 0.995695 -1.659494 -0.457960 -0.056800 249 | 250 | # See the consensus sequence 251 | >>> print(cpm.consensus) 252 | CTTACA 253 | 254 | # See the Information Content 255 | >>> print(cpm.ic) 256 | 0 0.305806 257 | 1 1.110856 258 | 2 0.637149 259 | 3 0.748989 260 | 4 0.074286 261 | 5 0.268034 262 | dtype: float64 263 | 264 | ``` 265 | 266 | ### Plot the sequence logo with information content scaling 267 | 268 | ```python 269 | 270 | # Setting seed for demonstration purposes 271 | >>> np.random.seed(42) 272 | 273 | # Making a fake PPM 274 | >>> random_ppm = np.random.dirichlet(np.ones(4), size=6) 275 | >>> ppm = seqlogo.Ppm(random_ppm) 276 | >>> seqlogo.seqlogo(ppm, ic_scale = True, format = 'svg', size = 'medium') 277 | 278 | ``` 279 | 280 | The above code will produce: 281 | 282 | ![](https://github.com/betteridiot/seqlogo/blob/master/docs/figures/ic_scale.svg) 283 | 284 | 285 | ### Plot the sequence logo with no information content scaling 286 | 287 | ```python 288 | 289 | # Setting seed for demonstration purposes 290 | >>> np.random.seed(42) 291 | 292 | # Making a fake PPM 293 | >>> random_ppm = np.random.dirichlet(np.ones(4), 
size=6) 294 | >>> ppm = seqlogo.Ppm(random_ppm) 295 | >>> seqlogo.seqlogo(ppm, ic_scale = False, format = 'svg', size = 'medium') 296 | 297 | ``` 298 | 299 | The above code will produce: 300 | 301 | ![](https://github.com/betteridiot/seqlogo/blob/master/docs/figures/no_ic_scale.svg) 302 | 303 | *** 304 | ## Documentation 305 | 306 | `seqlogo` exposes 5 classes to the user for handling PM data: 307 | 1. `seqlogo.Pm`: the base class for all other specialized PM subclasses 308 | 2. `seqlogo.Pfm`: The class used for handling PFM data 309 | 3. `seqlogo.Ppm`: The class used for handling PPM data 310 | 4. `seqlogo.Pwm`: The class used for handling PWM data 311 | 5. `seqlogo.CompletePm`: This final class will take any/all of the other PM subclass data 312 | and compute any of the other missing data. That is, if the user only provides a `seqlogo.Pfm` 313 | and passes it to `seqlogo.CompletePm`, it will solve for the PPM, PWM, consensus sequence, and 314 | information content. 315 | 316 | Additionally, `seqlogo` also provides 6 methods for converting PM structures: 317 | 1. `seqlogo.pfm2ppm`: converts a PFM to a PPM 318 | 2. `seqlogo.pfm2pwm`: converts a PFM to a PWM 319 | 3. `seqlogo.ppm2pfm`: converts a PPM to a PFM 320 | 4. `seqlogo.ppm2pwm`: converts a PPM to a PWM 321 | 5. `seqlogo.pwm2pfm`: converts a PWM to a PFM 322 | 6. `seqlogo.pwm2ppm`: converts a PWM to a PPM 323 | 324 | The signatures for each item above are as follows: 325 | 326 | ### Classes 327 | ```python 328 | 329 | seqlogo.CompletePm(pfm = None, ppm = None, pwm = None, background = None, pseudocount = None, 330 | alphabet_type = 'DNA', alphabet = None, default_pm = 'ppm'): 331 | """ 332 | Creates the CompletePm instance. If the user does not define any `pm_filename_or_array`, 333 | it will be initialized to empty. Will generate all other attributes as soon 334 | as a `pm_filename_or_array` is supplied. 
335 | 336 | Args: 337 | pfm (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 338 | PFM. If it is a filename, the file will be opened 339 | and parsed. If it is an `numpy.ndarray` or `pandas.DataFrame`, 340 | it will just be assigned. (default: None, skips '#' comment lines) 341 | ppm (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 342 | PPM. If it is a filename, the file will be opened 343 | and parsed. If it is an `numpy.ndarray` or `pandas.DataFrame`, 344 | it will just be assigned. (default: None, skips '#' comment lines) 345 | pwm (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 346 | PWM. If it is a filename, the file will be opened 347 | and parsed. If it is an `numpy.ndarray` or `pandas.DataFrame`, 348 | it will just be assigned. (default: None, skips '#' comment lines) 349 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 350 | using an Nucleic Acid alphabet: 0.25; if using an Aminio Acid alphabet: Robinson-Robinson Frequencies) 351 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. (defaults to 1e-10) 352 | alphabet_type (str): Desired alphabet to use. Order matters (default: 'DNA') 353 | "DNA" := "ACGT" 354 | "reduced DNA" := "ACGTN-" 355 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 356 | "RNA" := "ACGU" 357 | "reduced RNA" := "ACGUN-" 358 | "ambig RNA" := "ACGURYSWKMBDHVN-" 359 | "AA" : = "ACDEFGHIKLMNPQRSTVWY" 360 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 361 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 362 | "custom" := None 363 | (default: 'DNA') 364 | alphabet (str): if 'custom' is selected or a specialize alphabet is desired, this accepts a string (default: None) 365 | default_pm (str): which of the 3 pm's do you want to call '*home*'? 
(default: 'ppm') 366 | """ 367 | 368 | seqlogo.Pm(pm_filename_or_array = None, pm_type = 'ppm', alphabet_type = 'DNA', alphabet = None, 369 | background = None, pseudocount = None): 370 | """Initializes the Pm 371 | 372 | Creates the Pm instance. If the user does not define `pm_filename_or_array`, 373 | it will be initialized to empty. Will generate all other attributes as soon 374 | as a `pm_filename_or_array` is supplied. 375 | 376 | Args: 377 | pm_filename_or_array (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 378 | PM. If it is a filename, the file will be opened 379 | and parsed. If it is an `numpy.ndarray` or `pandas.DataFrame`, 380 | it will just be assigned. (default: None, skips '#' comment lines) 381 | alphabet_type (str): Desired alphabet to use. Order matters (default: 'DNA') 382 | "DNA" := "ACGT" 383 | "reduced DNA" := "ACGTN-" 384 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 385 | "RNA" := "ACGU" 386 | "reduced RNA" := "ACGUN-" 387 | "ambig RNA" := "ACGURYSWKMBDHVN-" 388 | "AA" : = "ACDEFGHIKLMNPQRSTVWY" 389 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 390 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 391 | "custom" := None 392 | (default: 'DNA') 393 | alphabet (str): if 'custom' is selected or a specialize alphabet is desired, this accepts a string (default: None) 394 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 395 | using an Nucleic Acid alphabet: 0.25; if using an Aminio Acid alphabet: Robinson-Robinson Frequencies) 396 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. (default: 1e-10) 397 | """ 398 | 399 | seqlogo.Pfm(pfm_filename_or_array = None, pm_type = 'pfm', alphabet_type = 'DNA', alphabet = None, 400 | background = None, pseudocount = None): 401 | """Initializes the Pfm 402 | 403 | Creates the Pfm instance. If the user does not define `pfm_filename_or_array`, 404 | it will be initialized to empty. 
Will generate all other attributes as soon 405 | as a `pfm_filename_or_array` is supplied. 406 | 407 | Args: 408 | pfm_filename_or_array (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 409 | PFM. If it is a filename, the file will be opened 410 | and parsed. If it is an `numpy.ndarray` or `pandas.DataFrame`, 411 | it will just be assigned. (default: None, skips '#' comment lines) 412 | alphabet_type (str): Desired alphabet to use. Order matters (default: 'DNA') 413 | "DNA" := "ACGT" 414 | "reduced DNA" := "ACGTN-" 415 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 416 | "RNA" := "ACGU" 417 | "reduced RNA" := "ACGUN-" 418 | "ambig RNA" := "ACGURYSWKMBDHVN-" 419 | "AA" : = "ACDEFGHIKLMNPQRSTVWY" 420 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 421 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 422 | "custom" := None 423 | (default: 'DNA') 424 | alphabet (str): if 'custom' is selected or a specialize alphabet is desired, this accepts a string (default: None) 425 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 426 | using an Nucleic Acid alphabet: 0.25; if using an Aminio Acid alphabet: Robinson-Robinson Frequencies) 427 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. (default: 1e-10) 428 | """ 429 | 430 | seqlogo.Ppm(ppm_filename_or_array = None, pm_type = 'ppm', alphabet_type = 'DNA', alphabet = None, 431 | background = None, pseudocount = None): 432 | """Initializes the Ppm 433 | 434 | Creates the Ppm instance. If the user does not define `ppm_filename_or_array`, 435 | it will be initialized to empty. Will generate all other attributes as soon 436 | as a `ppm_filename_or_array` is supplied. 437 | 438 | Args: 439 | ppm_filename_or_array (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 440 | PPM. If it is a filename, the file will be opened 441 | and parsed. 
If it is an `numpy.ndarray` or `pandas.DataFrame`, 442 | it will just be assigned. (default: None, skips '#' comment lines) 443 | alphabet_type (str): Desired alphabet to use. Order matters (default: 'DNA') 444 | "DNA" := "ACGT" 445 | "reduced DNA" := "ACGTN-" 446 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 447 | "RNA" := "ACGU" 448 | "reduced RNA" := "ACGUN-" 449 | "ambig RNA" := "ACGURYSWKMBDHVN-" 450 | "AA" : = "ACDEFGHIKLMNPQRSTVWY" 451 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 452 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 453 | "custom" := None 454 | (default: 'DNA') 455 | alphabet (str): if 'custom' is selected or a specialize alphabet is desired, this accepts a string (default: None) 456 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 457 | using an Nucleic Acid alphabet: 0.25; if using an Aminio Acid alphabet: Robinson-Robinson Frequencies) 458 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. (default: 1e-10) 459 | """ 460 | 461 | seqlogo.Pwm(pwm_filename_or_array = None, pm_type = 'pwm', alphabet_type = 'DNA', alphabet = None, 462 | background = None, pseudocount = None): 463 | """Initializes the Pwm 464 | 465 | Creates the Pwm instance. If the user does not define `pwm_filename_or_array`, 466 | it will be initialized to empty. Will generate all other attributes as soon 467 | as a `pwm_filename_or_array` is supplied. 468 | 469 | Args: 470 | pwm_filename_or_array (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 471 | PWM. If it is a filename, the file will be opened 472 | and parsed. If it is an `numpy.ndarray` or `pandas.DataFrame`, 473 | it will just be assigned. (default: None, skips '#' comment lines) 474 | alphabet_type (str): Desired alphabet to use. 
Order matters (default: 'DNA') 475 | "DNA" := "ACGT" 476 | "reduced DNA" := "ACGTN-" 477 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 478 | "RNA" := "ACGU" 479 | "reduced RNA" := "ACGUN-" 480 | "ambig RNA" := "ACGURYSWKMBDHVN-" 481 | "AA" : = "ACDEFGHIKLMNPQRSTVWY" 482 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 483 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 484 | "custom" := None 485 | (default: 'DNA') 486 | alphabet (str): if 'custom' is selected or a specialize alphabet is desired, this accepts a string (default: None) 487 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 488 | using an Nucleic Acid alphabet: 0.25; if using an Aminio Acid alphabet: Robinson-Robinson Frequencies) 489 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. (default: 1e-10) 490 | """ 491 | 492 | ``` 493 | 494 | ### Conversion Methods 495 | 496 | ```python 497 | 498 | seqlogo.pfm2ppm(pfm): 499 | """Converts a Pfm to a ppm array 500 | 501 | Args: 502 | pfm (Pfm): a fully initialized Pfm 503 | 504 | Returns: 505 | (np.array): converted values 506 | """ 507 | 508 | seqlogo.pfm2pwm(pfm, background = None, pseudocount = None): 509 | """Converts a Pfm to a pwm array 510 | 511 | Args: 512 | pfm (Pfm): a fully initialized Pfm 513 | background: accounts for relative weights from background. 
Must be a constant or same number of columns as Pwm (default: None) 514 | pseudocount (const): The number used to offset log-likelihood conversion from probabilites (default: None -> 1e-10) 515 | 516 | Returns: 517 | (np.array): converted values 518 | """ 519 | 520 | seqlogo.ppm2pfm(ppm): 521 | """Converts a Ppm to a pfm array 522 | 523 | Args: 524 | ppm (Ppm): a fully initialized Ppm 525 | 526 | Returns: 527 | (np.array): converted values 528 | """ 529 | 530 | seqlogo.ppm2pwm(ppm, background= None, pseudocount = None): 531 | """Converts a Ppm to a pwm array 532 | 533 | Args: 534 | ppm (Ppm): a fully initialized Ppm 535 | background: accounts for relative weights from background. Must be a constant or same number of columns as Pwm (default: None) 536 | pseudocount (const): The number used to offset log-likelihood conversion from probabilites (default: None -> 1e-10) 537 | 538 | Returns: 539 | (np.array): converted values 540 | 541 | Raises: 542 | ValueError: if the pseudocount isn't a constant or the same length as sequence 543 | """ 544 | 545 | seqlogo.pwm2pfm(pwm, background = None, pseudocount = None): 546 | """Converts a Pwm to a pfm array 547 | 548 | Args: 549 | pwm (Pwm): a fully initialized Pwm 550 | background: accounts for relative weights from background. Must be a constant or same number of columns as Pwm (default: None) 551 | pseudocount (const): The number used to offset log-likelihood conversion from probabilites (default: None -> 1e-10) 552 | 553 | Returns: 554 | (np.array): converted values 555 | """ 556 | 557 | seqlogo.pwm2ppm(pwm, background = None, pseudocount = None): 558 | """Converts a Pwm to a ppm array 559 | 560 | Args: 561 | pwm (Pwm): a fully initialized Pwm 562 | background: accounts for relative weights from background. 
Must be a constant or same number of columns as Pwm (default: None) 563 | pseudocount (const): The number used to offset log-likelihood conversion from probabilites (default: None -> 1e-10) 564 | 565 | Returns: 566 | (np.array): converted values 567 | 568 | Raises: 569 | ValueError: if the pseudocount isn't a constant or the same length as sequence 570 | """ 571 | 572 | ``` 573 | 574 | *** 575 | ## Contributing 576 | 577 | Please see our contribution guidelines [here](https://github.com/betteridiot/seqlogo/blob/master/CONTRIBUTING.md) 578 | 579 | *** 580 | ## Acknowledgments 581 | 582 | 1. Bembom O (2018). seqlogo: Sequence logos for DNA sequence alignments. R package version 1.48.0. 583 | 2. Crooks GE, Hon G, Chandonia JM, Brenner SE WebLogo: A sequence logo generator, 584 | Genome Research, 14:1188-1190, (2004). 585 | -------------------------------------------------------------------------------- /docs/figures/ic_scale.svg: -------------------------------------------------------------------------------- 1 | [SVG figure: sequence logo with information content scaling] -------------------------------------------------------------------------------- /docs/figures/no_ic_scale.svg: -------------------------------------------------------------------------------- 1 | [SVG figure: sequence logo without information content scaling] -------------------------------------------------------------------------------- /seqlogo/core.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from seqlogo import utils, core 4 | from functools import singledispatch, partial 5 | from numbers import Real 6 | 7 | 8 | # Check to see if currently within an IPython console 9 | try: 10 | if get_ipython(): 11 | from IPython.display import display 12 | ipy = True 13 | except NameError: 14 | ipy = False 15 | 16 | def _init_pm(pm_matrix, pm_type = 'ppm', alphabet_type = 'DNA', alphabet = None): 17 | """Checks for the file (if filename is supplied) and reads it in if present. 18 | Otherwise it just ensures that the position matrix (PM) dimensions match the 19 | expected alphabet. 20 | 21 | Args: 22 | pm_matrix (str or `numpy.ndarray` or `pandas.DataFrame`): The user supplied 23 | PM. If it is a filename, the file will be opened 24 | and parsed. If it is an `numpy.ndarray` or `pandas.DataFrame`, 25 | it will just be assigned. (default: None)) 26 | pm_type (str): whether the PM is a PWM or PFM (default: 'pwm') 27 | alphabet_type (str): Desired alphabet to use. 
Order matters (default: 'DNA') 28 | "DNA" := "ACGT" 29 | "reduced DNA" := "ACGTN-" 30 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 31 | "RNA" := "ACGU" 32 | "reduced RNA" := "ACGUN-" 33 | "ambig RNA" := "ACGURYSWKMBDHVN-" 34 | "AA" : = "ACDEFGHIKLMNPQRSTVWY" 35 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 36 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 37 | (default: "DNA") 38 | alphabet (str): if 'custom' is selected or a specialize alphabet is desired, this accepts a string (default: None) 39 | 40 | Returns: 41 | pm (pd.DataFrame): a properly formatted PM instance object 42 | 43 | Raises: 44 | FileNotFoundError if `pm_filename_or_array` is a string, but not a file 45 | ValueError if desired alphabet is not supported 46 | ValueError if the PM is not well formed 47 | ValueError if the probabilities do not add up to 1 48 | TypeError if `pm_filename_or_array` is not a file or array-like structure 49 | """ 50 | pm = _submit_pm(pm_matrix) 51 | 52 | if alphabet_type != 'custom': 53 | ex_alph_len = len(utils._IDX_LETTERS[alphabet_type]) 54 | else: 55 | ex_alph_len = len(alphabet) 56 | 57 | if not pm.shape[1] == ex_alph_len: 58 | if alphabet_type in utils._NA_ALPHABETS or alphabet_type in utils._AA_ALPHABETS or alphabet_type == 'custom': 59 | if pm.shape[0] == ex_alph_len: 60 | pm = pm.transpose() 61 | else: 62 | print(pm.shape) 63 | raise ValueError('{} alphabet selected, but PM is not {} rows long'.format(alphabet, ex_alph_len)) 64 | 65 | if alphabet_type != 'custom': 66 | pm.columns = list(utils._IDX_LETTERS[alphabet_type]) 67 | else: 68 | pm.columns = list(alphabet) 69 | 70 | if pm_type == 'ppm': 71 | if not np.allclose(pm.sum(axis=1), 1, 1e-9): 72 | raise ValueError('All or some PPM columns do not add to 1') 73 | 74 | return pm 75 | 76 | @partial(np.vectorize, otypes = [np.float64]) 77 | def __proportion(prob): 78 | """Vectorized proportion formula that feeds into _row_wise_ic 79 | 80 | Args: 81 | prob (`np.float64`): probability for a given letter at given position 82 | 83 
| returns (`np.float64`): normalized probability 84 | """ 85 | if prob > 0: 86 | return prob * np.log2(prob) 87 | else: 88 | return 0 89 | 90 | def _row_wise_ic(row): 91 | """Get the information content for each row across all letters 92 | 93 | Args: 94 | row (`pandas.Series`): row from the PWM 95 | 96 | Returns: 97 | The information content for the row 98 | """ 99 | return 2 + np.sum(__proportion(row), axis = 1) 100 | 101 | class Pm: 102 | """Main class for handling Position Matrices (PM). 103 | 104 | Base class for creating PMs. Will calculate both consensus sequence and information 105 | content if the pm_type is set to either 'ppm' or 'pwm' 106 | 107 | Attributes: 108 | pm (`pandas.DataFrame`): PM DataFrame generated by user-submitted PM 109 | consensus (str): The consensus sequence determined by the PM 110 | ic (`numpy.ndarray`): The information content for each position 111 | width (int): Length of the sequence/motif 112 | length (int): an alias for `width` 113 | alphabet_type (str): Desired alphabet type to use. (default: 'DNA') 114 | alphabet (str): Desired alphabet to use. Order matters (default: None) 115 | weight (`numpy.array`): 1-D array of ones. Used for WebLogo compatability. If the chosen 116 | alphabet type allows gaps, will base weights on gap average in that position 117 | counts (`pandas.DataFrame`): Counts of letters at the given position. If 118 | `counts` is not supplied (because PPM was the entry-point), the PPM will be cast 119 | as a PFM by multiplying it by 100 120 | pseudocount: some number to offset PPM values at to prevent -inf/+inf (default: 1e-10) 121 | background (Collection): must be an iterable with length of alphabet with each letter's respective respective background probability or constant. 
(default: for NA-0.25, for AA-Robinson-Robinson Frequencies) 122 | """ 123 | 124 | __all__ = ['pm', 'consensus', 'ic', 'width', 'counts', 'entropy', 125 | 'alphabet', 'alphabet_type', 'length', 'weight'] 126 | 127 | def __init__(self, pm_filename_or_array = None, pm_type = 'ppm', alphabet_type = 'DNA', alphabet = None, 128 | background = None, pseudocount = None): 129 | """Initializes the Pm 130 | 131 | Creates the Pm instance. If the user does not define `pm_filename_or_array`, 132 | it will be initialized to empty. Will generate all other attributes as soon 133 | as a `pm_filename_or_array` is supplied. 134 | 135 | Args: 136 | pm_filename_or_array (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 137 | PM. If it is a filename, the file will be opened 138 | and parsed. If it is a `numpy.ndarray` or `pandas.DataFrame`, 139 | it will just be assigned. (default: None, skips '#' comment lines) 140 | alphabet_type (str): Desired alphabet to use. Order matters (default: 'DNA') 141 | "DNA" := "ACGT" 142 | "reduced DNA" := "ACGTN-" 143 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 144 | "RNA" := "ACGU" 145 | "reduced RNA" := "ACGUN-" 146 | "ambig RNA" := "ACGURYSWKMBDHVN-" 147 | "AA" := "ACDEFGHIKLMNPQRSTVWY" 148 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 149 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 150 | "custom" := None 151 | (default: 'DNA') 152 | alphabet (str): if 'custom' is selected or a specialized alphabet is desired, this accepts a string (default: None) 153 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 154 | using a Nucleic Acid alphabet: 0.25; if using an Amino Acid alphabet: Robinson-Robinson Frequencies) 155 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. 
(default: 1e-10) 156 | """ 157 | 158 | self._pm = self._pfm = self._ppm = self._pwm = None 159 | self._weight = self._width = self._consensus = None 160 | self._counts = self._weight = self._ic = None 161 | self._alphabet_type = alphabet_type 162 | 163 | self._alphabet = alphabet 164 | self._pm_type = pm_type 165 | if pseudocount is None: 166 | self.pseudocount = 1e-10 167 | else: 168 | self.pseudocount = pseudocount 169 | if background is None: 170 | self.background = None 171 | else: 172 | self.background = background 173 | 174 | if pm_filename_or_array is not None: 175 | self._update_pm(pm_filename_or_array, pm_type, alphabet_type, alphabet, self.background, self.pseudocount) 176 | 177 | 178 | def _update_pm(self, pm, pm_type ='ppm', alphabet_type = 'DNA', alphabet = None, background = None, pseudocount = None): 179 | if alphabet_type is None: 180 | alphabet_type = self.alphabet_type 181 | 182 | # TODO: Check for non-standard alphabets and convert 183 | # if len(alphabet_type.split()) > 1: 184 | # sub_class, family = alphabet_type.split() 185 | # pm, self._weight = utils.convert_pm(pm, pm_type, alphabet_type, background, pseudocount) 186 | # alphabet_type = family 187 | # self._alphabet_type = alphabet_type 188 | 189 | # self._alphabet = alphabet 190 | # alphabet = utils._IDX_LETTERS[family] 191 | # alphabet = utils._IDX_LETTERS[self._alphabet_type] 192 | 193 | setattr(self, "_{}".format(self._pm_type), _init_pm(pm, pm_type, alphabet_type, alphabet)) 194 | self._width = self._get_width(self._get_pm) 195 | if not isinstance(self.pseudocount, Real): 196 | if len(self.pseudocount) != self.width: 197 | raise ValueError('pseudocount must be the same length as sequence or a constant') 198 | if self._weight is None: 199 | self._weight = np.ones((self.width,), dtype=np.int8) 200 | self._consensus = self._generate_consensus(self._get_pm) 201 | self.background = _check_background(self, background = background, alphabet_type = alphabet_type, alphabet= alphabet) 202 | if 
pm_type not in ('pm', 'pfm'): 203 | if pm_type == 'ppm': 204 | self._ic = (self.ppm * ppm2pwm(self.ppm, background = self.background, pseudocount = self.pseudocount, alphabet_type = alphabet_type, alphabet = alphabet)).sum(axis = 1) 205 | elif pm_type == 'pwm': 206 | self._ic = (pwm2ppm(self.pwm, background = self.background, pseudocount = self.pseudocount, alphabet_type = alphabet_type, alphabet = alphabet) * self.pwm).sum(axis = 1) 207 | 208 | @property 209 | def pseudocount(self): 210 | return self._pseudocount 211 | 212 | @pseudocount.setter 213 | def pseudocount(self, pseudocount): 214 | self._pseudocount = pseudocount 215 | 216 | @property 217 | def _get_pm(self): 218 | return getattr(self, "_{}".format(self._pm_type)) 219 | 220 | def __len__(self): 221 | return self._get_pm.shape[0] 222 | 223 | def __str__(self): 224 | if ipy: 225 | display(self._get_pm) 226 | return '' 227 | else: 228 | return self._get_pm.__str__() 229 | 230 | def __repr__(self): 231 | if ipy: 232 | display(self._get_pm) 233 | return '' 234 | else: 235 | return self._get_pm.__repr__() 236 | 237 | def sum(self, axis = None): 238 | return np.sum(self._get_pm, axis = axis) 239 | 240 | def __add__(self, other): 241 | return self._get_pm + other 242 | 243 | def __radd__(self, other): 244 | return other + self._get_pm 245 | 246 | def __sub__(self, other): 247 | return self._get_pm - other 248 | 249 | def __rsub__(self, other): 250 | return other - self._get_pm 251 | 252 | def __mul__(self, other): 253 | return self._get_pm * other 254 | 255 | def __rmul__(self, other): 256 | return other * self._get_pm 257 | 258 | def __truediv__(self, other): 259 | return self._get_pm / other 260 | 261 | def __rtruediv__(self, other): 262 | return other / self._get_pm 263 | 264 | def __floordiv__(self, other): 265 | return self._get_pm // other 266 | 267 | def __rfloordiv__(self, other): 268 | return other // self._get_pm 269 | 270 | def __divmod__(self, other): 271 | return np.divmod(self._get_pm, other) 272 | 
273 | def __rdivmod__(self, other): 274 | return np.divmod(other, self._get_pm) 275 | 276 | def __mod__(self, other): 277 | return self._get_pm % other 278 | 279 | def __rmod__(self, other): 280 | return other % self._get_pm 281 | 282 | def __pow__(self, other): 283 | return self._get_pm ** other 284 | 285 | def __rpow__(self, other): 286 | return other ** self._get_pm 287 | 288 | @property 289 | def shape(self): 290 | return self._get_pm.shape 291 | 292 | @property 293 | def T(self): 294 | return self._get_pm.T 295 | 296 | @property 297 | def weight(self): 298 | return self._weight 299 | 300 | @property 301 | def entropy_interval(self): 302 | """Used just for WebLogo API calls""" 303 | return None 304 | 305 | @property 306 | def length(self): 307 | return self.width 308 | 309 | @property 310 | def entropy(self): 311 | return self.ic 312 | 313 | @property 314 | def counts(self): 315 | if self._counts is None: 316 | self.counts = (getattr(self, "_{}".format(self._pm_type)) * 100).astype(np.int64).values 317 | return self._counts 318 | 319 | @counts.setter 320 | def counts(self, counts): 321 | self._counts = counts 322 | 323 | @classmethod 324 | def __dir__(cls): 325 | """Just used to clean up the attributes and methods shown when `dir()` is called""" 326 | return sorted(cls.__all__) 327 | 328 | @staticmethod 329 | def _generate_consensus(pm): 330 | if pm is not None: 331 | return ''.join(pm.idxmax(axis=1)) 332 | 333 | @staticmethod 334 | def _generate_ic(pm): 335 | if pm is not None: 336 | return _row_wise_ic(pm) 337 | 338 | @staticmethod 339 | def _get_width(pm): 340 | return pm.shape[0] 341 | 342 | @property 343 | def consensus(self): 344 | return self._consensus 345 | 346 | @property 347 | def ic(self): 348 | return self._ic 349 | 350 | @property 351 | def width(self): 352 | return self._width 353 | 354 | @property 355 | def alphabet_type(self): 356 | return self._alphabet_type 357 | 358 | @property 359 | def alphabet(self): 360 | if self._alphabet is None: 361 | 
if self.alphabet_type in utils._NA_ALPHABETS or self.alphabet_type in utils._AA_ALPHABETS: 362 | return utils._IDX_LETTERS[self.alphabet_type] 363 | elif self.alphabet_type == 'custom': 364 | raise ValueError("'custom' alphabet_type selected, but no alphabet was supplied") 365 | return self._alphabet 366 | 367 | @property 368 | def entropy(self): 369 | """Used just for WebLogo API call. Corrects for their conversion rate""" 370 | return self.ic / (1/np.log(2)) 371 | 372 | @property 373 | def length(self): 374 | return self.width 375 | 376 | @property 377 | def pm(self): 378 | return self._get_pm 379 | 380 | @pm.setter 381 | def pm(self, pm_filename_or_array, pm_type = 'ppm', alphabet_type = 'DNA', alphabet = None): 382 | self._update_pm(pm_filename_or_array, pm_type, alphabet_type, alphabet) 383 | 384 | @property 385 | def background(self): 386 | return self._background 387 | 388 | @background.setter 389 | def background(self, background): 390 | self._background = background 391 | 392 | class Ppm(Pm): 393 | """Main class for handling Position Probability Matrices (PPM). 394 | 395 | A PPM differs from a Position Frequency Matrix in that instead of counts for 396 | a given letter, the normalized weight is already calculated. 397 | 398 | This class automatically generates the consensus sequence for a given `alphabet`. 399 | It also calculates the Information Content (IC) for each position. 400 | 401 | Attributes: 402 | ppm (`pandas.DataFrame`): PPM DataFrame generated by user-submitted PPM 403 | consensus (str): The consensus sequence determined by the PPM 404 | ic (`numpy.ndarray`): The information content for each position 405 | width (int): Length of the sequence/motif 406 | length (int): an alias for `width` 407 | alphabet_type (str): Desired alphabet type to use. (default: 'DNA') 408 | alphabet (str): Desired alphabet to use. Order matters (default: None) 409 | weight (`numpy.array`): 1-D array of ones. Used for WebLogo compatability. 
If the chosen 410 | alphabet type allows gaps, will base weights on gap average in that position 411 | counts (`pandas.DataFrame`): Counts of letters at the given position. If 412 | `counts` is not supplied (because PPM was the entry-point), the PPM will be cast 413 | as a PFM by multiplying it by 100 414 | pseudocount: some number to offset PPM values at to prevent -inf/+inf (default: 1e-10) 415 | background (Collection): must be an iterable with length of alphabet with each letter's respective respective background probability or constant. (default: for NA-0.25, for AA-Robinson-Robinson Frequencies) 416 | """ 417 | 418 | __all__ = ['ppm', 'consensus', 'ic', 'width', 'counts', 'background', 419 | 'alphabet', 'alphabet_type', 'length', 'weight', 'pseudocount'] 420 | 421 | def __init__(self, *args, **kwargs): 422 | super().__init__(*args, pm_type='ppm', **kwargs) 423 | 424 | @classmethod 425 | def __dir__(cls): 426 | """Just used to clean up the attributes and methods shown when `dir()` is called""" 427 | return sorted(cls.__all__) 428 | 429 | @property 430 | def ppm(self): 431 | return self._get_pm 432 | 433 | @ppm.setter 434 | def ppm(self, ppm_filename_or_array, pm_type = 'ppm', alphabet_type = 'DNA', alphabet = None): 435 | super()._update_pm(ppm_filename_or_array, pm_type, alphabet_type, alphabet) 436 | 437 | class Pfm(Pm): 438 | """Main class for handling Position Frequency Matrices (PFM). 439 | 440 | A Position Frequency Matrix contains the counts for a given letter at a given position 441 | 442 | This class automatically generates the consensus sequence for a given `alphabet`. 443 | 444 | Attributes: 445 | pfm (`pandas.DataFrame`): PFM DataFrame generated by user-submitted PFM 446 | consensus (str): The consensus sequence determined by the PFM 447 | width (int): Length of the sequence/motif 448 | length (int): an alias for `width` 449 | alphabet_type (str): Desired alphabet type to use. (default: 'DNA') 450 | alphabet (str): Desired alphabet to use. 
Order matters (default: None) 451 | weight (`numpy.array`): 1-D array of ones. Used for WebLogo compatability. If the chosen 452 | alphabet type allows gaps, will base weights on gap average in that position 453 | counts (`pandas.DataFrame`): Synonym for pfm 454 | """ 455 | 456 | __all__ = ['pfm', 'consensus', 'width', 'counts', 'alphabet', 'alphabet_type', 'length', 'weight', 'background'] 457 | 458 | def __init__(self, *args, **kwargs): 459 | super().__init__(*args, pm_type='pfm', **kwargs) 460 | 461 | @classmethod 462 | def __dir__(cls): 463 | """Just used to clean up the attributes and methods shown when `dir()` is called""" 464 | return sorted(cls.__all__) 465 | 466 | @property 467 | def pfm(self): 468 | return self._get_pm 469 | 470 | @pfm.setter 471 | def pfm(self, pfm_filename_or_array, pm_type = 'pfm', alphabet_type = 'DNA', alphabet = None): 472 | super()._update_pm(pfm_filename_or_array, pm_type, alphabet_type, alphabet) 473 | 474 | class Pwm(Pm): 475 | """Main class for handling Position Weight Matrices (PWM). 476 | 477 | A PWM differs from a Position Frequency Matrix in that instead of counts for 478 | a given letter, the normalized weight is already calculated. 479 | 480 | This class automatically generates the consensus sequence for a given `alphabet` and PWM. It also calculates the Information Content (IC) for each position. 481 | 482 | Attributes: 483 | pwm (`pandas.DataFrame`): PWM DataFrame generated by user-submitted PWM 484 | consensus (str): The consensus sequence determined by the PWM 485 | ic (`numpy.ndarray`): The information content for each position 486 | width (int): Length of the sequence/motif 487 | length (int): an alias for `width` 488 | alphabet_type (str): Desired alphabet type to use. (default: 'DNA') 489 | alphabet (str): Desired alphabet to use. Order matters (default: None) 490 | weight (`numpy.array`): 1-D array of ones. Used for WebLogo compatability. 
If the chosen 491 | alphabet type allows gaps, will base weights on gap average in that position 492 | counts (`pandas.DataFrame`): Counts of letters at the given position. If 493 | `counts` is not supplied (because PPM was the entry-point), the PPM will be cast 494 | as a PFM by multiplying it by 100 495 | pseudocount: some number to offset PPM values at to prevent -inf/+inf (default: 1e-10) 496 | background (Collection): must be an iterable with length of alphabet with each letter's respective respective background probability or constant. (default: for NA-0.25, for AA-Robinson-Robinson Frequencies) 497 | """ 498 | 499 | __all__ = ['pwm', 'consensus', 'ic', 'width', 'counts', 'background', 500 | 'alphabet', 'alphabet_type', 'length', 'weight', 'pseudocount'] 501 | 502 | def __init__(self, *args, **kwargs): 503 | super().__init__(*args, pm_type='pwm', **kwargs) 504 | 505 | @classmethod 506 | def __dir__(cls): 507 | """Just used to clean up the attributes and methods shown when `dir()` is called""" 508 | return sorted(cls.__all__) 509 | 510 | @property 511 | def pwm(self): 512 | return self._get_pm 513 | 514 | @pwm.setter 515 | def pwm(self, pwm_filename_or_array, pm_type = 'pwm', alphabet_type = 'DNA', alphabet = None): 516 | super()._update_pm(pwm_filename_or_array, pm_type, alphabet_type, alphabet) 517 | 518 | @singledispatch 519 | def _submit_pm(pm_matrix): 520 | raise TypeError('pm_filename_or_array` must be a filename, `np.ndarray`, `pd.DataFrame`, or `Pm`') 521 | 522 | @_submit_pm.register(np.ndarray) 523 | def _(pm_matrix): 524 | return pd.DataFrame(data = pm_matrix) 525 | 526 | @_submit_pm.register(Pm) 527 | @_submit_pm.register(Pfm) 528 | @_submit_pm.register(pd.DataFrame) 529 | def _(pm_matrix): 530 | return pm_matrix 531 | 532 | @_submit_pm.register 533 | def _(pm_matrix: str) -> pd.DataFrame: 534 | if not os.path.isfile(pm_matrix): 535 | raise FileNotFoundError('{} was not found'.format(pm_matrix)) 536 | if alphabet not in utils._IDX_LETTERS: 537 | 
raise ValueError('alphabet must be a version of DNA, RNA, or AA') 538 | 539 | return pd.read_table(pm_matrix, delim_whitespace = True, header = None, comment = '#') 540 | 541 | def _check_background(pm, background = None, alphabet_type = "DNA", alphabet = None): 542 | """Just used to make sure background frequencies are acceptable or present""" 543 | 544 | # If the user supplied the background 545 | if background is not None: 546 | if not isinstance(background, Real): 547 | if not len(background) == pm.shape[1]: 548 | raise ValueError("background must be an iterable with length of alphabet with each letter's respective background probability or constant") 549 | else: 550 | return background 551 | else: 552 | return background 553 | 554 | # Attempt to figure out the background data from existing information 555 | else: 556 | try: 557 | alph = pm.alphabet 558 | except AttributeError: 559 | if alphabet is not None: 560 | alph = alphabet 561 | else: 562 | alph = utils._IDX_LETTERS[alphabet_type] 563 | if isinstance(pm, core.Pm): 564 | if pm.alphabet_type in utils._NA_ALPHABETS: 565 | background = utils._NA_background 566 | elif pm.alphabet_type in utils._AA_ALPHABETS: 567 | background = utils._AA_background 568 | else: 569 | raise ValueError('alphabet type ({}) not supported by default backgrounds. Please provide your own'.format(pm.alphabet_type)) 570 | elif isinstance(pm, pd.DataFrame): 571 | if alphabet_type in utils._NA_ALPHABETS: 572 | background = utils._NA_background 573 | elif alphabet_type in utils._AA_ALPHABETS: 574 | background = utils._AA_background 575 | else: 576 | raise ValueError('alphabet type ({}) not supported by default backgrounds. 
Please provide your own'.format(alphabet_type)) 577 | else: 578 | raise ValueError('provided position matrix must be of type Pm, Pfm, Ppm, or Pwm') 579 | # Have to make sure that the background is a constant or long enough to broadcast through the matrix 580 | if not isinstance(background, Real): 581 | assert len(alph) == len(background), 'Background must be of equal length of sequence or a constant' 582 | return np.array(list(background.values())) 583 | 584 | def pwm2ppm(pwm, background = None, pseudocount = None, alphabet_type = 'DNA', alphabet = None): 585 | """Converts a Pwm to a ppm array 586 | 587 | Args: 588 | pwm (Pwm): a fully initialized Pwm 589 | background: accounts for relative weights from background. Must be a constant or same number of columns as Pwm (default: None) 590 | pseudocount (const): The number used to offset log-likelihood conversion from probabilities (default: None -> 1e-10) 591 | 592 | Returns: 593 | (np.array): converted values 594 | 595 | Raises: 596 | ValueError: if the pseudocount isn't a constant or the same length as sequence 597 | """ 598 | background = _check_background(pwm, background, alphabet_type, alphabet) 599 | if pseudocount is not None: 600 | if not isinstance(pseudocount, Real) and len(pseudocount) != pwm.length: 601 | raise ValueError('pseudocount must be the same length as the sequence or a constant') 602 | else: 603 | pseudocount = 1e-10 604 | return _init_pm(np.power(2, pwm + np.log2(background)) - pseudocount, pm_type = 'ppm', alphabet_type = alphabet_type, alphabet = alphabet) 605 | 606 | def ppm2pwm(ppm, background= None, pseudocount = None, alphabet_type = 'DNA', alphabet = None): 607 | """Converts a Ppm to a pwm array 608 | 609 | Args: 610 | ppm (Ppm): a fully initialized Ppm 611 | background: accounts for relative weights from background. 
Must be a constant or same number of columns as Pwm (default: None) 612 | pseudocount (const): The number used to offset log-likelihood conversion from probabilites (default: None -> 1e-10) 613 | 614 | Returns: 615 | (np.array): converted values 616 | 617 | Raises: 618 | ValueError: if the pseudocount isn't a constant or the same length as sequence 619 | """ 620 | background = _check_background(ppm, background, alphabet_type, alphabet) 621 | if pseudocount is not None: 622 | if not isinstance(pseudocount, Real) and len(pseudocount) != ppm.length: 623 | raise ValueError('pseudocount must be the same length as the sequence or a constant') 624 | else: 625 | pseudocount = 1e-10 626 | return _init_pm(np.log2(ppm + pseudocount) - np.log2(background), pm_type = 'pwm', alphabet_type = alphabet_type, alphabet = alphabet) 627 | 628 | def ppm2pfm(ppm, alphabet_type = 'DNA', alphabet = None): 629 | """Converts a Ppm to a pfm array 630 | 631 | Args: 632 | ppm (Ppm): a fully initialized Ppm 633 | 634 | Returns: 635 | (np.array): converted values 636 | """ 637 | return _init_pm((ppm * 100).astype(np.float64), pm_type = 'pfm', alphabet_type = alphabet_type, alphabet = alphabet) 638 | 639 | def pfm2ppm(pfm, alphabet_type = 'DNA', alphabet = None): 640 | """Converts a Pfm to a ppm array 641 | 642 | Args: 643 | pfm (Pfm): a fully initialized Pfm 644 | 645 | Returns: 646 | (np.array): converted values 647 | """ 648 | return _init_pm((pfm.T / pfm.sum(axis = 1)).T, pm_type = 'ppm', alphabet_type = alphabet_type, alphabet = alphabet) 649 | 650 | def pfm2pwm(pfm, background = None, pseudocount = None, alphabet_type = 'DNA', alphabet = None): 651 | """Converts a Pfm to a pwm array 652 | 653 | Args: 654 | pfm (Pfm): a fully initialized Pfm 655 | background: accounts for relative weights from background. 
Must be a constant or same number of columns as Pwm (default: None) 656 | pseudocount (const): The number used to offset log-likelihood conversion from probabilites (default: None -> 1e-10) 657 | 658 | Returns: 659 | (np.array): converted values 660 | """ 661 | return _init_pm( 662 | ppm2pwm( 663 | pfm2ppm( 664 | pfm, 665 | alphabet_type = alphabet_type, 666 | alphabet = alphabet 667 | ), 668 | background, 669 | pseudocount, 670 | alphabet_type, 671 | alphabet), 672 | pm_type = 'pwm', 673 | alphabet_type = alphabet_type, 674 | alphabet = alphabet 675 | ) 676 | 677 | def pwm2pfm(pwm, background = None, pseudocount = None, alphabet_type = 'DNA', alphabet = None): 678 | """Converts a Pwm to a pfm array 679 | 680 | Args: 681 | pwm (Pwm): a fully initialized Pwm 682 | background: accounts for relative weights from background. Must be a constant or same number of columns as Pwm (default: None) 683 | pseudocount (const): The number used to offset log-likelihood conversion from probabilites (default: None -> 1e-10) 684 | 685 | Returns: 686 | (np.array): converted values 687 | """ 688 | return _init_pm(ppm2pfm(pwm2ppm(pwm, background, pseudocount)), pm_type = 'pfm', alphabet_type = alphabet_type, alphabet = alphabet) 689 | 690 | class CompletePm(Pm): 691 | """Final class of the seqlogo package. 692 | 693 | If the user supplies *any* PM structure (PFM, PPM, PWM), it will compute any missing values, to 694 | include information content, consensus, and weights. 695 | 696 | Attributes: 697 | pfm (pd.DataFrame): position frequency matrix. Calculated if missing if another is present 698 | ppm (pd.DataFrame): position probability matrix. Calculated if missing if another is present 699 | pwm (pd.DataFrame): position weight matrix. 
Calculated if missing if another is present 700 | ic (np.array): positional information content 701 | length (int): length of sequence 702 | width (int): synonym for `length` 703 | weight (np.array): array of weights (calculated if gapped, else just ones) 704 | counts (`numpy.ndarray` or `pandas.DataFrame` or `Pm`): count data for each letter 705 | at a given position. (default: None) 706 | alphabet_type (str): Desired alphabet to use. Order matters (default: 'DNA') 707 | "DNA" := "ACGT" 708 | "reduced DNA" := "ACGTN-" 709 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 710 | "RNA" := "ACGU" 711 | "reduced RNA" := "ACGUN-" 712 | "ambig RNA" := "ACGURYSWKMBDHVN-" 713 | "AA" := "ACDEFGHIKLMNPQRSTVWY" 714 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 715 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 716 | "custom" := None 717 | (default: 'DNA') 718 | alphabet (str): if 'custom' is selected or a specialized alphabet is desired, this accepts a string (default: None) 719 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 720 | using a Nucleic Acid alphabet: 0.25; if using an Amino Acid alphabet: Robinson-Robinson Frequencies) 721 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. (default: 1e-10) 722 | """ 723 | 724 | __all__ = ['pm', 'pfm', 'ppm', 'pwm', 'consensus', 'ic', 'width', 'counts', 725 | 'alphabet', 'alphabet_type', 'length', 'weight'] 726 | 727 | def __init__(self, pfm = None, ppm = None, pwm = None, background = None, pseudocount = None, 728 | alphabet_type = 'DNA', alphabet = None, default_pm = 'ppm'): 729 | """Initializes the CompletePm 730 | 731 | Creates the CompletePm instance. If the user does not supply any of `pfm`, `ppm`, or `pwm`, 732 | it will be initialized to empty. Will generate all other attributes as soon 733 | as one of them is supplied. 
734 | 735 | Args: 736 | pfm (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 737 | PFM. If it is a filename, the file will be opened 738 | and parsed. If it is a `numpy.ndarray` or `pandas.DataFrame`, 739 | it will just be assigned. (default: None, skips '#' comment lines) 740 | ppm (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 741 | PPM. If it is a filename, the file will be opened 742 | and parsed. If it is a `numpy.ndarray` or `pandas.DataFrame`, 743 | it will just be assigned. (default: None, skips '#' comment lines) 744 | pwm (str or `numpy.ndarray` or `pandas.DataFrame` or Pm): The user supplied 745 | PWM. If it is a filename, the file will be opened 746 | and parsed. If it is a `numpy.ndarray` or `pandas.DataFrame`, 747 | it will just be assigned. (default: None, skips '#' comment lines) 748 | background (constant or Collection): Offsets used to calculate background letter probabilities (defaults: If 749 | using a Nucleic Acid alphabet: 0.25; if using an Amino Acid alphabet: Robinson-Robinson Frequencies) 750 | pseudocount (constant): Some constant to offset PPM conversion to PWM to prevent -/+ inf. (defaults to 1e-10) 751 | alphabet_type (str): Desired alphabet to use. Order matters (default: 'DNA') 752 | "DNA" := "ACGT" 753 | "reduced DNA" := "ACGTN-" 754 | "ambig DNA" := "ACGTRYSWKMBDHVN-" 755 | "RNA" := "ACGU" 756 | "reduced RNA" := "ACGUN-" 757 | "ambig RNA" := "ACGURYSWKMBDHVN-" 758 | "AA" := "ACDEFGHIKLMNPQRSTVWY" 759 | "reduced AA" := "ACDEFGHIKLMNPQRSTVWYX*-" 760 | "ambig AA" := "ACDEFGHIKLMNOPQRSTUVWYBJZX*-" 761 | "custom" := None 762 | (default: 'DNA') 763 | alphabet (str): if 'custom' is selected or a specialized alphabet is desired, this accepts a string (default: None) 764 | default_pm (str): which of the three PMs ('pfm', 'ppm', or 'pwm') to treat as the default matrix 
(default: 'ppm') 765 | """ 766 | self._pm = self._pfm = self._ppm = self._pwm = None 767 | self._weight = self._width = self._consensus = None 768 | self._counts = self._weight = self._ic = None 769 | self._alphabet_type = alphabet_type 770 | self._alphabet = alphabet 771 | self._pm_type = "cpm" 772 | self._default_pm = default_pm 773 | 774 | if pseudocount is None: 775 | self.pseudocount = 1e-10 776 | else: 777 | self.pseudocount = pseudocount 778 | if background is None: 779 | self.background = None 780 | else: 781 | self.background = background 782 | 783 | if any([pfm is not None, ppm is not None, pwm is not None]): 784 | self._update_pm(pfm, ppm, pwm, background, pseudocount, alphabet_type, alphabet) 785 | 786 | def _update_pm( 787 | self, 788 | pfm = None, 789 | ppm = None, 790 | pwm = None, 791 | background = None, 792 | pseudocount = None, 793 | alphabet_type = None, 794 | alphabet = None 795 | ): 796 | if alphabet_type is None: 797 | alphabet_type = self.alphabet_type 798 | 799 | # Set the ones the user has provided 800 | if pfm is not None: 801 | if not isinstance(pfm, core.Pm): 802 | self._pfm = _init_pm(pfm, pm_type = 'pfm', alphabet_type = alphabet_type, alphabet = alphabet) 803 | else: 804 | self._pfm = pfm.pfm 805 | if ppm is not None: 806 | if not isinstance(ppm, core.Pm): 807 | self._ppm = _init_pm(ppm, pm_type = 'ppm', alphabet_type = alphabet_type, alphabet = alphabet) 808 | else: 809 | self._ppm = ppm.ppm 810 | if pwm is not None: 811 | if not isinstance(pwm, core.Pm): 812 | self._pwm = _init_pm(pwm, pm_type = 'pwm', alphabet_type = alphabet_type, alphabet = alphabet) 813 | else: 814 | self._pwm = pwm.pwm 815 | 816 | # Fill in the blanks 817 | if pfm is not None and ppm is None: 818 | self._ppm = pfm2ppm(self._pfm, alphabet_type = alphabet_type, alphabet = alphabet) 819 | 820 | if pfm is not None and pwm is None: 821 | self._pwm = pfm2pwm(self._ppm, background, pseudocount, alphabet_type = alphabet_type, alphabet = alphabet) 822 | 823 | if ppm is 
not None and pfm is None: 824 | self._pfm = ppm2pfm(self._ppm, alphabet_type = alphabet_type, alphabet = alphabet) 825 | 826 | if ppm is not None and pwm is None: 827 | self._pwm = ppm2pwm(self._ppm, background, pseudocount, alphabet_type = alphabet_type, alphabet = alphabet) 828 | 829 | if pwm is not None and ppm is None: 830 | self._ppm = pwm2ppm(self._pwm, background, pseudocount, alphabet_type = alphabet_type, alphabet = alphabet) 831 | 832 | if pwm is not None and pfm is None: 833 | self._pfm = ppm2pfm(self._ppm, alphabet_type = alphabet_type, alphabet = alphabet) 834 | 835 | self._width = self._get_width(self._get_pm) 836 | if not isinstance(self.pseudocount, Real): 837 | if len(self.pseudocount) != self.width: 838 | raise ValueError('pseudocount must be the same length as sequence or a constant') 839 | if self._alphabet_type not in ("DNA", "RNA", "AA"): 840 | if 'NA' in self._alphabet_type: 841 | self._weight = self._get_pm.iloc[:,:-2].sum(axis=1)/self._get_pm.sum(axis=1) 842 | else: 843 | self._weight = self._get_pm.iloc[:,:-3].sum(axis=1)/self._get_pm.sum(axis=1) 844 | else: 845 | self._weight = np.ones((self.width,), dtype=np.int8) 846 | self._counts = self.pfm 847 | self._consensus = self._generate_consensus(self._get_pm) 848 | self.background = _check_background(self, background, alphabet_type, alphabet) 849 | self._ic = (self.ppm * self.pwm).sum(axis=1) 850 | 851 | @property 852 | def counts(self): 853 | return self._counts.values 854 | 855 | @property 856 | def pfm(self): 857 | return self._pfm 858 | 859 | @pfm.setter 860 | def pfm(self, pfm_filename_or_array, pm_type = 'pfm', alphabet_type = 'DNA', alphabet = None): 861 | super()._update_pm(pfm_filename_or_array, pm_type, alphabet_type, alphabet) 862 | 863 | @property 864 | def ppm(self): 865 | return self._ppm 866 | 867 | @ppm.setter 868 | def ppm(self, ppm_filename_or_array, pm_type = 'ppm', alphabet_type = 'DNA', alphabet = None): 869 | super()._update_pm(ppm_filename_or_array, pm_type, 
alphabet_type, alphabet) 870 | 871 | @property 872 | def pwm(self): 873 | return self._pwm 874 | 875 | @pwm.setter 876 | def pwm(self, pwm_filename_or_array, pm_type = 'pwm', alphabet_type = 'DNA', alphabet = None): 877 | super()._update_pm(pwm_filename_or_array, pm_type, alphabet_type, alphabet) 878 | 879 | @classmethod 880 | def __dir__(cls): 881 | """Just used to clean up the attributes and methods shown when `dir()` is called""" 882 | return sorted(cls.__all__) 883 | 884 | @property 885 | def _get_pm(self): 886 | return getattr(self, "_{}".format(self._default_pm)) 887 | 888 | Cpm = CompletePm 889 | --------------------------------------------------------------------------------
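The conversion and information-content arithmetic used above can be sketched in plain Python. This is only an illustration of the formulas that appear in `ppm2pwm` (`log2(ppm + pseudocount) - log2(background)`) and in `CompletePm._update_pm` (`(ppm * pwm).sum(axis=1)`). The helper names `ppm_to_pwm` and `information_content` are hypothetical and are not part of seqlogo, and a constant DNA background of 0.25 is assumed.

```python
import math

def ppm_to_pwm(ppm, background=0.25, pseudocount=1e-10):
    """Log-odds transform: log2(p + pseudocount) - log2(background).

    Hypothetical helper mirroring the expression in ppm2pwm; assumes a
    constant background rather than per-letter frequencies.
    """
    return [
        [math.log2(p + pseudocount) - math.log2(background) for p in row]
        for row in ppm
    ]

def information_content(ppm, pwm):
    """Per-position information content as the row-wise sum of ppm * pwm."""
    return [
        sum(p * w for p, w in zip(prob_row, weight_row))
        for prob_row, weight_row in zip(ppm, pwm)
    ]

# Two positions over the alphabet "ACGT"
ppm = [
    [1.0, 0.0, 0.0, 0.0],      # fully conserved position
    [0.25, 0.25, 0.25, 0.25],  # uniform position
]
pwm = ppm_to_pwm(ppm)
ic = information_content(ppm, pwm)
print(ic)
```

With these inputs the conserved position comes out at roughly 2 bits and the uniform position at roughly 0 bits. The pseudocount keeps `log2` finite where a probability is exactly zero, and the zero probability then removes that term from the sum.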