├── .gitignore
├── .travis.yml
├── LICENSE.rst
├── NOTES.rst
├── README.rst
├── fetch_sample_data.sh
├── reading_sequence_files
│   ├── README.rst
│   ├── check_start_met.py
│   ├── check_stops.py
│   ├── count_fasta.py
│   ├── count_fasta_adv.py
│   ├── print_seq.py
│   ├── record_lengths.py
│   └── total_length.py
├── reading_writing_alignments
│   ├── README.rst
│   ├── count_gaps.py
│   └── sort_gaps.py
├── tests
│   ├── README.rst
│   ├── test_consistency.py
│   └── test_scripts.py
├── using_seqfeatures
│   ├── README.rst
│   ├── bases_in_genes.py
│   ├── extract_cds.py
│   ├── total_feature_lengths.py
│   └── total_gene_lengths.py
└── writing_sequence_files
    ├── README.rst
    ├── convert_gb_to_fasta.py
    ├── cut_final_star.py
    ├── cut_star_dangerous.py
    ├── filter_wanted_id.py
    ├── filter_wanted_id_in_order.py
    ├── length_filter.py
    └── length_filter_naive.py
/.gitignore:
--------------------------------------------------------------------------------
1 | #Ignore sample files
2 | *.gbk
3 | *.fna
4 | *.ffn
5 | *.faa
6 | *.fasta
7 | *.sth
8 |
9 | #Ignore backup files from some Unix editors
10 | *~
11 | *.swp
12 | *.bak
13 |
14 | #Ignore patches and any original files created by patch command
15 | *.diff
16 | *.patch
17 | *.orig
18 | *.rej
19 |
20 | #Ignore these hidden files from Mac OS X
21 | .DS_Store
22 |
23 | #Ignore hidden files from Dolphin file manager
24 | .directory
25 |
26 | #Ignore all compiled python files (e.g. from running the unit tests):
27 | *.pyc
28 | *.pyo
29 |
30 | #Ignore all Jython class files (present if using Jython)
31 | *.class
32 |
33 | #Ignore compressed archives of files
34 | *.zip
35 | *.tar.gz
36 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | #Special configuration file to run tests on Travis-CI via GitHub notifications
2 | #See http://travis-ci.org/ for details
3 | #
4 | #Note when testing Python 3, the 'python' command will invoke Python 3
5 | #and similarly for PyPy too.
6 |
7 | language: python
8 | python:
9 | - "2.6"
10 | - "2.7"
11 | - "3.3"
12 | - "3.4"
13 | - "pypy"
14 | - "pypy3"
15 |
16 | install:
17 | - pip install biopython
18 | - ./fetch_sample_data.sh
19 |
20 | script:
21 | - python tests/test_consistency.py
22 | - python tests/test_scripts.py
23 |
--------------------------------------------------------------------------------
/LICENSE.rst:
--------------------------------------------------------------------------------
1 | =====================
2 | Copyright and Licence
3 | =====================
4 |
5 | Copyright 2014-2015 by Peter Cock, The James Hutton Institute, Dundee, UK.
6 | All rights reserved.
7 |
8 | This work is licensed under a `Creative Commons Attribution-ShareAlike 4.0 International
9 | License `_ (CC-BY-SA 4.0).
10 |
11 | .. image:: http://i.creativecommons.org/l/by-sa/4.0/88x31.png
12 |
13 | Note this documentation links to and uses external and separately licenced sample data.
14 |
--------------------------------------------------------------------------------
/NOTES.rst:
--------------------------------------------------------------------------------
1 | As this material is aimed at Python beginners, we're avoiding a lot of
2 | useful but not fundamental things, including:
3 |
4 | * String formatting with the % operator
5 | * Exceptions and try/except error handling
6 | * The ``with`` statement for context management (e.g. closing file handles)
7 | * The augmented assignment operators, use ``count = count + 1`` not ``count += 1``
8 | * List comprehensions, generator expressions, generator functions (just use for loops)
9 |
10 | Also note that the examples should run under Python 2.6, 2.7 and 3.3
11 | (or later) without changes, i.e. the same versions of Python which
12 | are supported by Biopython.
13 |
14 | To this end, only simple print statements are used as ``print(some_string)``
15 | which will work on both Python 2 and 3, with or without using
16 | ``from __future__ import print_function``.
17 |
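As a minimal sketch of the convention described above (the count and filename are just illustrative values taken from the workshop text), a message is built with ``+`` and ``str(...)`` and passed to ``print`` as one single string, which behaves the same on Python 2 and Python 3:

```python
# Illustrative values only, borrowed from the workshop examples.
count = 4141
filename = "NC_000913.faa"

# Build one string first, then print it with a single argument.
# With exactly one argument, print(...) works on Python 2 and 3 alike.
message = "There were " + str(count) + " records in file " + filename
print(message)
```

Passing several comma-separated values to ``print(...)`` would print a tuple on Python 2, which is why the examples stick to a single string.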
18 | Additionally, basic automated testing is done on TravisCI via the special
19 | ``.travis.yml`` file, test results here:
20 |
21 | .. image:: https://travis-ci.org/peterjc/biopython_workshop.png?branch=master
22 | :alt: Current status of TravisCI build for master branch
23 | :target: https://travis-ci.org/peterjc/biopython_workshop/builds
24 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | =========================
2 | Introduction to Biopython
3 | =========================
4 |
5 | This is a basic introduction to Biopython, intended for a classroom based workshop.
6 | It assumes you have been introduced to both working at the command line, and basic
7 | Python - for example as covered in Martin Jones' free eBook
8 | `Python for Biologists `_.
9 |
10 | The Biopython website http://www.biopython.org has more information including the
11 | `Biopython Tutorial & Cookbook `_
12 | (html, `PDF available `_),
13 | which is worth going through once you have mastered the basics of Python. That Tutorial & Cookbook
14 | is also available as `Jupyter Notebooks `_,
15 | as is `another short introductory tutorial `_.
16 |
17 | =================
18 | Workshop Sections
19 | =================
20 |
21 | I've broken up the workshop into sections:
22 |
23 | * `Reading sequence files `_.
24 | * `Writing sequence files `_.
25 | * `Working with sequence features `_.
26 | * `Reading and writing alignment files `_.
27 |
28 | This material focuses on Biopython's `SeqIO `_
29 | and `AlignIO `_ modules (these links
30 | include an overview and tables of supported file formats), each of which
31 | also has a whole chapter in the `Biopython Tutorial & Cookbook
32 | `_
33 | (`PDF `_)
34 | which would be worth reading after this workshop to learn more.
35 |
36 | ========
37 | Notation
38 | ========
39 |
40 | Text blocks starting with ``$`` show something you would type and run at the
41 | command line prompt, where the ``$`` itself represents the prompt. For example:
42 |
43 | .. sourcecode:: console
44 |
45 | $ python -V
46 | Python 2.7.5
47 |
48 | Depending on how your system is configured, rather than just ``$`` you may see your
49 | user name and the current working directory. Here you would only type ``python -V``
50 | (python space minus capital V) to find out the default version of Python installed.
51 |
52 | Lines starting ``>>>`` represent the interactive Python prompt, and something
53 | you would type inside Python. For example:
54 |
55 | .. sourcecode:: pycon
56 |
57 | $ python
58 | Python 2.7.3 (default, Nov 7 2012, 23:34:47)
59 | [GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
60 | Type "help", "copyright", "credits" or "license" for more information.
61 | >>> 7 * 6
62 | 42
63 | >>> quit()
64 |
65 | Here you would only need to type ``7 * 6`` (and enter) into Python, the ``>>>``
66 | is already there. To quit the interactive Python prompt use ``quit()`` (and enter).
67 | This example would usually be shortened to just:
68 |
69 | .. sourcecode:: pycon
70 |
71 | >>> 7 * 6
72 | 42
73 |
74 | These text blocks are also used for entire short Python scripts, which you can
75 | copy and save as a plain text file with the extension ``.py`` to run them.
76 |
77 | ================
78 | Sample Solutions
79 | ================
80 |
81 | Each workshop section was written in a separate directory, and in addition
82 | to the main text (named ``README.rst`` which is a plain text file with markup
83 | to make it look pretty on GitHub), the folders contain sample solution
84 | Python scripts (named as in the text).
85 |
86 | ===========================
87 | Prerequisites & Sample Data
88 | ===========================
89 |
90 | If you are reading this on GitHub.com, you can view, copy/paste or download
91 | individual examples from your web browser.
92 |
93 | To make a local copy of the entire workshop, you can use the ``git``
94 | command line tool:
95 |
96 | .. sourcecode:: console
97 |
98 | $ git clone https://github.com/peterjc/biopython_workshop.git
99 |
100 | Alternatively, depending on your firewall settings, use:
101 |
102 | .. sourcecode:: console
103 |
104 | $ git clone git@github.com:peterjc/biopython_workshop.git
105 |
106 | To learn more about ``git`` and software version control, I recommend attending a
107 | `Software Carpentry Workshop `_
108 | or similar course.
109 |
110 | This should make a new sub-directory, ``biopython_workshop/`` which we will now
111 | change into:
112 |
113 | .. sourcecode:: console
114 |
115 | $ cd biopython_workshop
116 |
117 | Most of the examples use real biological data files. You should download them
118 | now using the `provided shell script `_:
119 |
120 | .. sourcecode:: console
121 |
122 | $ bash fetch_sample_data.sh
123 |
124 | We assume you have Python and Biopython 1.63 or later installed and working.
125 | Biopython 1.63 supports Python 2.6, 2.7 and 3.3 (and should work on more recent
126 | versions). The examples here assume you are using Python 2.6 or 2.7, but in
127 | general should work with Python 3 with minimal changes. Check this works:
128 |
129 | .. sourcecode:: console
130 |
131 | $ python -c "import Bio; print(Bio.__version__)"
132 | 1.63
133 |
134 | =======
135 | History
136 | =======
137 |
138 | This material was first used as part of a two-day course "Introduction to Python for
139 | Biologists" (Kathryn Crouch, Peter Cock and Tim Booth), part of a two-week course
140 | `Keystone Skills in Bioinformatics `_,
141 | held in February 2014 at Centre for Ecology & Hydrology (CEH), Wallingford, UK.
142 | In a morning session lasting about 2.5 hours (plus coffee break), we covered all
143 | of `reading sequence files `_ and
144 | `writing sequence files `_ - and I quickly
145 | talked through `alignment files `_.
146 |
147 | I presented much of it again later in February 2014 at the University of Dundee
148 | as part of the third year undergraduate course *BS32010 Applied Bioinformatics*
149 | run by Dr David Martin and Dr David Booth. In the two hour slot we covered all
150 | of `reading sequence files `_ and most of
151 | `writing sequence files `_.
152 |
153 | I repeated this in March 2015 for the same third year undergraduate course,
154 | *BS32010 Applied Bioinformatics* at the University of Dundee. In a three hour
155 | slot we covered `reading sequence files `_
156 | most of `writing sequence files `_ (up to
157 | editing sequences, but not filtering by identifier), and the start of
158 | `multiple-sequence alignments `_.
159 |
160 | =====================
161 | Copyright and Licence
162 | =====================
163 |
164 | Copyright 2014-2015 by Peter Cock, The James Hutton Institute, Dundee, UK.
165 | All rights reserved.
166 |
167 | This work is licensed under a `Creative Commons Attribution-ShareAlike 4.0 International
168 | License `_ (CC-BY-SA 4.0).
169 |
170 | .. image:: http://i.creativecommons.org/l/by-sa/4.0/88x31.png
171 |
172 | Note this documentation links to and uses external and separately licenced
173 | sample data files.
174 |
--------------------------------------------------------------------------------
/fetch_sample_data.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | # Set bash strict mode (fail on errors, undefined variables, and via pipes)
3 | set -euo pipefail
4 |
5 | if [ -x "$(command -v wget)" ]; then
6 | # e.g. Linux
7 | echo "Downloading files using wget"
8 | FETCH="wget"
9 | elif [ -x "$(command -v curl)" ]; then
10 | # e.g. Mac OS X
11 | echo "Downloading files using curl"
12 | FETCH="curl -O"
13 | else
14 | echo "ERROR: Failed to find wget or curl"
15 | exit 1
16 | fi
17 |
18 | echo "=============================================="
19 | echo "Fetching Escherichia coli K-12 files from NCBI"
20 | echo "=============================================="
21 |
22 | # Note: These files are no longer being updated...
23 | $FETCH ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.gbk
24 | $FETCH ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.fna
25 | $FETCH ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn
26 | $FETCH ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa
27 |
28 | echo "=========================================================="
29 | echo "Fetching proteins from Potato Genome Sequencing Consortium"
30 | echo "=========================================================="
31 |
32 | $FETCH http://potato.plantbiology.msu.edu/data/PGSC_DM_v3.4_pep_representative.fasta.zip
33 | unzip -o PGSC_DM_v3.4_pep_representative.fasta.zip
34 |
35 | echo "===================================="
36 | echo "Fetching PF08792 alignment from PFAM"
37 | echo "===================================="
38 |
39 | if [ -x "$(command -v wget)" ]; then
40 | # Note: Using -O to set the filename explicitly as default is format?format=stockholm
41 | wget -O "PF08792_seed.sth" "http://pfam.sanger.ac.uk/family/PF08792/alignment/seed/format?format=stockholm"
42 | elif [ -x "$(command -v curl)" ]; then
43 | # Note: Mac OS alternative needs -L due to link redirect:
44 | curl -o "PF08792_seed.sth" -L "http://pfam.sanger.ac.uk/family/PF08792/alignment/seed/format?format=stockholm"
45 | else
46 | echo "ERROR: Failed to find wget or curl"
47 | exit 1
48 | fi
49 |
--------------------------------------------------------------------------------
/reading_sequence_files/README.rst:
--------------------------------------------------------------------------------
1 | ===================================
2 | Reading Sequence Files in Biopython
3 | ===================================
4 |
5 | Dealing with assorted sequence file formats is one of the strengths of Biopython.
6 | The primary module we'll be using is `Bio.SeqIO `_,
7 | which is short for sequence input/output (following the naming convention set by
8 | `BioPerl's SeqIO module `_).
9 |
10 | For these examples we're going to use files for the famous bacterium *Escherichia coli*
11 | K12 (from the NCBI FTP server), and some potato genes from the PGSC - see the
12 | sample data instructions in the `introduction <../README.rst>`_ for how to download
13 | these files.
14 |
15 | -------------
16 | Built-in Help
17 | -------------
18 |
19 | Python code should be documented. You can (and should) write special comment strings
20 | called ``docstrings`` at the start of your own modules, classes and functions which
21 | are used by Python as the built-in help text. Let's look at some of the built-in
22 | Biopython documentation.
23 |
24 | We'll run the interactive Python prompt from within the command line terminal (but you
25 | could use a Python GUI, or `IPython `_, if you prefer - depending
26 | on what you are used to working with).
27 |
28 | Load Biopython's ``SeqIO`` module with the ``import`` command, and have a look at the
29 | built-in help:
30 |
31 | .. sourcecode:: pycon
32 |
33 | $ python2.7
34 | Python 2.7.3 (default, Nov 7 2012, 23:34:47)
35 | [GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
36 | Type "help", "copyright", "credits" or "license" for more information.
37 | >>> from Bio import SeqIO
38 | >>> help(SeqIO)
39 |
40 | You'll see the `SeqIO help text `_
41 | built into Biopython - the latest version of which should also be online. Pressing
42 | space will show the next page of help text, the up and down cursor arrows scroll,
43 | and ``q`` will quit the help and return to the Python prompt.
44 |
45 | Rather than showing the help for the entire ``SeqIO`` module, you can ask for the help
46 | on a particular object or function. Let's start with ``SeqIO.parse`` - and from now on
47 | the triple greater-than-sign prompt (``>>>``) will be used to indicate something you
48 | would type into Python:
49 |
50 | .. sourcecode:: pycon
51 |
52 | >>> help(SeqIO.parse)
53 |
54 | This gives some examples, and we'll start with something very similar.
55 |
56 | ----------------
57 | Counting Records
58 | ----------------
59 |
60 | We'll start by looking at the protein sequences in the FASTA amino acid file,
61 | ``NC_000913.faa``. First take a quick peek using some command line tools like
62 | ``head`` to look at the start of the file:
63 |
64 | .. sourcecode:: console
65 |
66 | $ head NC_000913.faa
67 | >gi|16127995|ref|NP_414542.1| thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
68 | MKRISTTITTTITITTGNGAG
69 | >gi|16127996|ref|NP_414543.1| fused aspartokinase I and homoserine dehydrogenase I [Escherichia coli str. K-12 substr. MG1655]
70 | MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
71 | FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
72 | RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
73 | AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
74 | LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
75 | QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
76 | ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
77 |
78 | We can use ``grep`` to count the number of proteins by using the regular
79 | expression pattern ``^>``. The caret is a special symbol meaning look at
80 | the start of a line, so this means look for lines starting with a greater
81 | than sign (which is how individual FASTA format sequences are marked):
82 |
83 | .. sourcecode:: console
84 |
85 | $ grep -c "^>" NC_000913.faa
86 | 4141
87 |
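To make the ``grep`` pattern concrete, here is a minimal sketch in plain Python of the same idea: count the lines which start with a greater than sign. The three short records and the function name ``count_fasta_headers`` are made up for illustration, standing in for the real ``NC_000913.faa`` file:

```python
# A tiny in-memory FASTA example (hypothetical records, for illustration only).
fasta_text = """>seq1 first example protein
MKRISTTITTTITITTGNGAG
>seq2 second example protein
MRVLKFGGTSVANAERFLRV
ADILESNARQGQVATVLSAP
>seq3 third example protein
MVKVYAPASSANMSVGFDVL
"""


def count_fasta_headers(text):
    """Count lines starting with '>', like grep -c "^>" does."""
    count = 0
    for line in text.splitlines():
        if line.startswith(">"):
            count = count + 1
    return count


header_count = count_fasta_headers(fasta_text)
print(str(header_count) + " header lines")
```

Note the second record spans two sequence lines but is still counted once, because only its header line starts with ``>``.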
88 | Now let's count the records with Biopython using the ``SeqIO.parse`` function:
89 |
90 | .. sourcecode:: pycon
91 |
92 | $ python
93 | Python 2.7.3 (default, Nov 7 2012, 23:34:47)
94 | [GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
95 | Type "help", "copyright", "credits" or "license" for more information.
96 | >>> from Bio import SeqIO
97 | >>> filename = "NC_000913.faa"
98 | >>> count = 0
99 | >>> for record in SeqIO.parse(filename, "fasta"):
100 | ...     count = count + 1
101 | ...
102 | >>> print("There were " + str(count) + " records in file " + filename)
103 | There were 4141 records in file NC_000913.faa
104 |
105 | Running more than a few commands like this at the Python prompt gets complicated,
106 | especially with indentation like this for loop. It is tough if you make a mistake
107 | and need to edit lines to rerun them (even with the up-arrow trick). It is also
108 | fiddly to copy and paste without the ``>>>`` prompt and ``...`` line continuation
109 | characters.
110 |
111 | Instead, using your favourite editor (e.g. ``nano`` or ``gedit``) create a plain
112 | text file (in the same directory as the *E. coli* files) named ``count_fasta.py``:
113 |
114 | .. sourcecode:: console
115 |
116 | $ nano count_fasta.py
117 |
118 | Edit your new file ``count_fasta.py`` to contain the following:
119 |
120 | .. sourcecode:: python
121 |
122 | from Bio import SeqIO
123 | filename = "NC_000913.faa"
124 | count = 0
125 | for record in SeqIO.parse(filename, "fasta"):
126 |     count = count + 1
127 | print("There were " + str(count) + " records in file " + filename)
128 |
129 | This time it should be easy to copy & paste in one go. We can now run this:
130 |
131 | .. sourcecode:: console
132 |
133 | $ python count_fasta.py
134 | There were 4141 records in file NC_000913.faa
135 |
136 | **Exercise**: Modify this to count the number of records in the other FASTA files,
137 | both from *E. coli* K12 and the potato genome (``PGSC_DM_v3.4_pep_representative.fasta``).
138 |
139 | **Advanced Exercise**: Using ``sys.argv`` get the filename as a command line argument,
140 | so that you can run it like this:
141 |
142 | .. sourcecode:: console
143 |
144 | $ python count_fasta_adv.py NC_000913.ffn
145 | There were 4321 records in file NC_000913.ffn
146 |
147 | ----------------------
148 | Looking at the records
149 | ----------------------
150 |
151 | In the above example, we used a for loop to count the records in a FASTA file,
152 | but didn't actually look at the information in the records. The ``SeqIO.parse``
153 | function was creating `SeqRecord objects `_.
154 | Biopython's ``SeqRecord`` objects are a container holding the sequence, and any
155 | annotation about it - most importantly the identifier.
156 |
157 | For FASTA files, the record identifier is taken to be the first word on the ``>``
158 | line - anything after a space is *not* part of the identifier.
159 |
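The first-word rule can be illustrated with a minimal sketch in plain Python. This is not Biopython's actual parser, and the helper name ``split_fasta_header`` is hypothetical; it simply splits a header line at the first run of whitespace:

```python
def split_fasta_header(line):
    """Hypothetical helper: split a '>' line into (identifier, rest of line)."""
    text = line[1:].strip()      # drop the leading ">" and surrounding whitespace
    parts = text.split(None, 1)  # split once, at the first run of whitespace
    if not parts:
        return "", ""            # a bare ">" line has no identifier at all
    identifier = parts[0]
    rest = ""
    if len(parts) > 1:
        rest = parts[1]
    return identifier, rest


header = ">gi|16127995|ref|NP_414542.1| thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]"
identifier, rest = split_fasta_header(header)
print("Identifier: " + identifier)
print("Remainder: " + rest)
```

Here the identifier is ``gi|16127995|ref|NP_414542.1|``, and everything after the first space is treated as free text.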
160 | This simple example prints out the record identifiers and their lengths:
161 |
162 | .. sourcecode:: python
163 |
164 | from Bio import SeqIO
165 | filename = "NC_000913.faa"
166 | for record in SeqIO.parse(filename, "fasta"):
167 |     print("Record " + record.id + ", length " + str(len(record.seq)))
168 |
169 | Notice that given a ``SeqRecord`` object we access the identifier as ``record.id``
170 | and the sequence object as ``record.seq``. As a shortcut, ``len(record)`` gives
171 | the sequence length, ``len(record.seq)``.
172 |
173 | If you save that as ``record_lengths.py`` and run it you'll get over four thousand
174 | lines of output:
175 |
176 | .. sourcecode:: console
177 |
178 | $ python record_lengths.py
179 | Record gi|16127995|ref|NP_414542.1|, length 21
180 | Record gi|16127996|ref|NP_414543.1|, length 820
181 | Record gi|16127997|ref|NP_414544.1|, length 310
182 | Record gi|16127998|ref|NP_414545.1|, length 428
183 | ...
184 | Record gi|16132219|ref|NP_418819.1|, length 46
185 | Record gi|16132220|ref|NP_418820.1|, length 228
186 |
187 | The output shown here is truncated!
188 |
189 | **Exercise**: Count how many sequences are less than 100 amino acids long.
190 |
191 | **Exercise**: Create a modified script ``total_length.py`` based on the above examples
192 | which counts the number of records and calculates the total length of all the
193 | sequences (i.e. ``21 + 820 + 310 + 428 + ... + 46 + 228``), giving:
194 |
195 | .. sourcecode:: console
196 |
197 | $ python total_length.py
198 | 4141 records, total length 1311442
199 |
200 | **Advanced Exercise**: Plot a histogram of the sequence length distribution (tip - see the
201 | `Biopython Tutorial & Cookbook `_).
202 |
203 | -----------------------
204 | Looking at the sequence
205 | -----------------------
206 |
207 | The record identifiers are very important, but more important still is the sequence
208 | itself. In the ``SeqRecord`` objects the identifiers are stored as standard Python
209 | strings (e.g. ``.id``). For the sequence, Biopython uses a string-like ``Seq`` object,
210 | accessed as ``.seq``.
211 |
212 | In many ways the ``Seq`` objects act like Python strings, you can print them, take
213 | their length using the ``len(...)`` function, and slice them with square brackets
214 | to get a sub-sequence or a single letter.
215 |
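As a quick reminder of these operations, here is a sketch using an ordinary Python string as a stand-in for a ``Seq`` object (the sequence is the first protein shown in ``NC_000913.faa`` earlier):

```python
# A plain Python string standing in for a Seq object.
protein = "MKRISTTITTTITITTGNGAG"

length = len(protein)        # number of letters in the sequence
first_letter = protein[0]    # indexing starts at zero
first_ten = protein[:10]     # slice from the start up to (not including) index 10
last_ten = protein[-10:]     # negative indices count back from the end

print(str(length) + " letters, starting with " + first_letter)
print(first_ten + "..." + last_ten)
```

Slicing a string gives another string; the text above says ``Seq`` objects support the same square bracket notation.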
216 | **Exercise**: Using ``SeqIO.parse(...)`` in a for loop, for each record print out the
217 | identifier, the first 10 letters of each sequence, and the last 10 letters. e.g.:
218 |
219 | .. sourcecode:: console
220 |
221 | $ python print_seq.py
222 | gi|16127995|ref|NP_414542.1| MKRISTTITT...ITITTGNGAG
223 | gi|16127996|ref|NP_414543.1| MRVLKFGGTS...LRTLSWKLGV
224 | gi|16127997|ref|NP_414544.1| MVKVYAPASS...DTAGARVLEN
225 | ...
226 | gi|16132219|ref|NP_418819.1| MTKVRNCVLD...AVILTILTAT
227 | gi|16132220|ref|NP_418820.1| MRITIILVAP...LHDIEKNITK
228 |
229 | ---------------------------------------
230 | Checking proteins start with methionine
231 | ---------------------------------------
232 |
233 | In the next example we'll check all the protein sequences start with a methionine
234 | (represented as the letter "M" in the standard IUPAC single letter amino acid code),
235 | and count how many records fail this. Let's create a script called ``check_start_met.py``:
236 |
237 | .. sourcecode:: python
238 |
239 | from Bio import SeqIO
240 | filename = "NC_000913.faa"
241 | bad = 0
242 | for record in SeqIO.parse(filename, "fasta"):
243 |     if not record.seq.startswith("M"):
244 |         bad = bad + 1
245 |         print(record.id + " starts " + record.seq[0])
246 | print("Found " + str(bad) + " records in " + filename + " which did not start with M")
247 |
248 | If you run that, you should find this *E. coli* protein set all had leading methionines:
249 |
250 | .. sourcecode:: console
251 |
252 | $ python check_start_met.py
253 | Found 0 records in NC_000913.faa which did not start with M
254 |
255 | Good - no strange proteins. This genome has been completely sequenced and a lot of
256 | work has been done on the annotation, so it is a 'Gold Standard'. Now try this on
257 | the potato protein file ``PGSC_DM_v3.4_pep_representative.fasta``:
258 |
259 | .. sourcecode:: console
260 |
261 | $ python check_start_met.py
262 | PGSC0003DMP400032467 starts T
263 | PGSC0003DMP400011427 starts Q
264 | PGSC0003DMP400068739 starts E
265 | ...
266 | PGSC0003DMP400011481 starts Y
267 | Found 208 records in PGSC_DM_v3.4_pep_representative.fasta which did not start with M
268 |
269 | **Exercise**: Modify this script to print out the description of the problem records,
270 | not just the identifier. *Tip*: Try reading the documentation, e.g. Biopython's wiki page
271 | on the `SeqRecord `_.
272 |
273 | **Discussion**: What did you notice about these record descriptions? Can you think of any
274 | reasons why there could be so many genes/proteins with a problem at the start?
275 |
276 | ------------------------
277 | Checking stop characters
278 | ------------------------
279 |
280 | In the standard one letter IUPAC amino acid codes for proteins, "*" is used for a
281 | stop codon. For many analysis tools having a "*" in the protein sequence can cause
282 | an error. There are two main reasons why you might see a "*" in a protein sequence.
283 |
284 | First, it might be there from translation up to and including the closing stop codon
285 | for the gene. In this case, you might want to remove it.
286 |
287 | Second, it could be there from a problematic/broken annotation where there is an
288 | in-frame stop codon. In this case, you might want to fix the annotation, remove
289 | the whole sequence, or perhaps cheat and replace the "*" with an "X" for an unknown
290 | amino acid.
291 |
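The two options above can be sketched on plain Python strings. The helper name ``tidy_protein`` and the short sequences are hypothetical, and whether masking an internal stop with "X" is acceptable depends on your analysis:

```python
def tidy_protein(sequence):
    """Hypothetical helper: drop one trailing '*', mask internal '*' as 'X'."""
    if sequence.endswith("*"):
        sequence = sequence[:-1]        # remove the closing stop codon symbol
    return sequence.replace("*", "X")   # mask any in-frame stop as unknown


terminal_stop = tidy_protein("MLNTCRVPLTDRK*")
internal_stop = tidy_protein("MLNT*RVPLTDRK")
print(terminal_stop)
print(internal_stop)
```

The first call just trims the final character, while the second leaves the length unchanged and swaps the "*" for an "X".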
292 | We'll talk about writing out sequence files soon, but first let's check the example
293 | protein FASTA files for any "*" symbols in the sequence. For this you can use several
294 | of the standard Python string operations which also apply to ``Seq`` objects, e.g.:
295 |
296 | .. sourcecode:: pycon
297 |
298 | >>> my_string = "MLNTCRVPLTDRKVKEKRAMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFTAYESE*"
299 | >>> my_string.startswith("M")
300 | True
301 | >>> my_string.endswith("*")
302 | True
303 | >>> len(my_string)
304 | 70
305 | >>> my_string.count("M")
306 | 3
307 | >>> my_string.count("*")
308 | 1
309 |
310 | **Exercise**: Write a Python script which checks ``NC_000913.faa`` and counts the number of
311 | sequences with a "*" in them (anywhere), and the number where the sequence ends with
312 | a "*". Then try it on ``PGSC_DM_v3.4_pep_representative.fasta`` as well. e.g.:
313 |
314 | .. sourcecode:: console
315 |
316 | $ python check_stops.py
317 | Checking NC_000913.faa for terminal stop codons
318 | 0 records with * in them
319 | 0 with * at the end
320 |
321 | **Discussion**: What did you notice about the "*" stop characters in these FASTA files?
322 | What should we do to 'fix' the problems?
323 |
324 | --------------
325 | Single Records
326 | --------------
327 |
328 | One of the example FASTA files for *E. coli* K12 is a single long sequence
329 | for the entire (circular) genome, file ``NC_000913.fna``. We can still use a
330 | for loop and ``SeqIO.parse(...)`` but it can feel awkward. Instead, for the
331 | special case where the sequence file contains one and only one record, you
332 | can use ``SeqIO.read(...)``.
333 |
334 | .. sourcecode:: pycon
335 |
336 | >>> from Bio import SeqIO
337 | >>> record = SeqIO.read("NC_000913.fna", "fasta")
338 | >>> print(record.id + " length " + str(len(record)))
339 | gi|556503834|ref|NC_000913.3| length 4641652
340 |
341 | **Exercise**: Try using ``SeqIO.read(...)`` on one of the protein files.
342 | What happens?
343 |
344 | ----------------------
345 | Different File Formats
346 | ----------------------
347 |
348 | So far we've only been using FASTA format files, which is why when we've called
349 | ``SeqIO.parse(...)`` or ``SeqIO.read(...)`` the second argument has been ``"fasta"``.
350 | The Biopython ``SeqIO`` module supports quite a few other important sequence file
351 | formats (see the table on the `SeqIO wiki page `_).
352 |
353 | If you work with finished genomes, you'll often see nicely annotated files in
354 | the EMBL or GenBank format. Let's try this with the *E. coli* K12 GenBank file,
355 | ``NC_000913.gbk``, based on the previous example:
356 |
357 | .. sourcecode:: pycon
358 |
359 | >>> from Bio import SeqIO
360 | >>> fasta_record = SeqIO.read("NC_000913.fna", "fasta")
361 | >>> print(fasta_record.id + " length " + str(len(fasta_record)))
362 | gi|556503834|ref|NC_000913.3| length 4641652
363 | >>> genbank_record = SeqIO.read("NC_000913.gbk", "genbank")
364 | >>> print(genbank_record.id + " length " + str(len(genbank_record)))
365 | NC_000913.3 length 4641652
366 |
367 | All we needed to change was the file format argument to the ``SeqIO.read(...)``
368 | function - and we could load a GenBank file instead. You'll notice the GenBank
369 | version was given a shorter identifier, and took longer to load. The reason is
370 | that there is a lot more information present - most importantly lots of features
371 | (where each gene is and so on). We'll return to this in a later section,
372 | `working with sequence features <../using_seqfeatures/README.rst>`_.
373 |
374 | ===================================
375 | Writing Sequence Files in Biopython
376 | ===================================
377 |
378 | We move on to `writing sequence files <../writing_sequence_files/README.rst>`_
379 | in the next section.
380 |
--------------------------------------------------------------------------------
/reading_sequence_files/check_start_met.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | #filename = "NC_000913.faa"
3 | filename = "PGSC_DM_v3.4_pep_representative.fasta"
4 | bad = 0
5 | for record in SeqIO.parse(filename, "fasta"):
6 |     if not record.seq.startswith("M"):
7 |         bad = bad + 1
8 |         print(record.id + " starts " + record.seq[0])
9 | print("Found " + str(bad) + " records in " + filename + " which did not start with M")
10 |
11 |
--------------------------------------------------------------------------------
/reading_sequence_files/check_stops.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | filename = "NC_000913.faa"
3 | #filename = "PGSC_DM_v3.4_pep_representative.fasta"
4 | contains_star = 0
5 | ends_with_star = 0
6 | print("Checking " + filename + " for terminal stop codons")
7 | for record in SeqIO.parse(filename, "fasta"):
8 |     if record.seq.count("*"):
9 |         contains_star = contains_star + 1
10 |     if record.seq.endswith("*"):
11 |         ends_with_star = ends_with_star + 1
12 | print(str(contains_star) + " records with * in them")
13 | print(str(ends_with_star) + " with * at the end")
14 |
15 |
16 |
--------------------------------------------------------------------------------
/reading_sequence_files/count_fasta.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | filename = "NC_000913.faa"
3 | count = 0
4 | for record in SeqIO.parse(filename, "fasta"):
5 |     count = count + 1
6 | print("There were " + str(count) + " records in file " + filename)
7 |
--------------------------------------------------------------------------------
/reading_sequence_files/count_fasta_adv.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import sys
3 | from Bio import SeqIO
4 |
5 | #Remember sys.argv[0] is the script itself
6 | for filename in sys.argv[1:]:
7 |     count = 0
8 |     for record in SeqIO.parse(filename, "fasta"):
9 |         count += 1 # this is shorthand for count = count + 1
10 |     print("There were " + str(count) + " records in file " + filename)
11 |
--------------------------------------------------------------------------------
/reading_sequence_files/print_seq.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | filename = "NC_000913.faa"
3 | for record in SeqIO.parse(filename, "fasta"):
4 |     start_seq = record.seq[:10] # first 10 letters
5 |     end_seq = record.seq[-10:] # last 10 letters
6 |     print(record.id + " " + start_seq + "..." + end_seq)
7 |
--------------------------------------------------------------------------------
/reading_sequence_files/record_lengths.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | filename = "NC_000913.faa"
3 | for record in SeqIO.parse(filename, "fasta"):
4 |     print("Record " + record.id + ", length " + str(len(record.seq)))
5 |
--------------------------------------------------------------------------------
/reading_sequence_files/total_length.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | filename = "NC_000913.faa"
3 | count = 0
4 | total = 0
5 | for record in SeqIO.parse(filename, "fasta"):
6 |     count = count + 1
7 |     # Can use len(record) as shortcut for len(record.seq)
8 |     total = total + len(record)
9 | print(str(count) + " records, total length " +str(total) + " in file " + filename)
10 |
--------------------------------------------------------------------------------
/reading_writing_alignments/README.rst:
--------------------------------------------------------------------------------
1 | =========================================
2 | Reading Multiple-sequence Alignment Files
3 | =========================================
4 |
5 | The previous sections looked at Biopython's ``SeqIO`` module for
6 | sequence file input and output
7 | (`reading <../reading_sequence_files/README.rst>`_ and
8 | `writing <../writing_sequence_files/README.rst>`_ sequence files).
9 |
10 | Now we come to the ``AlignIO`` module which, as the name suggests,
11 | is for alignment input and output. Note that this is focused on
12 | dealing with multiple sequence alignments of the kind typically
13 | used in phylogenetics - a separate ``SearchIO`` module targets
14 | pairwise alignments generated by search tools like BLAST.
15 |
16 | These examples use a number of real example sequence alignment files,
17 | see the sample data instructions in the `introduction <../README.rst>`_
18 | for how to download them.
19 |
20 | We're going to look at a small seed alignment for one of the PFAM
21 | domains, the `A2L zinc ribbon domain (A2L_zn_ribbon; PF08792)
22 | `_. This was picked
23 | almost at random - it is small enough to see the entire alignment
24 | on screen, and has some obvious gap-rich columns.
25 |
26 | From the alignments tab on the Pfam webpage, you can download
27 | the raw alignment in several different formats (Selex, Stockholm,
28 | FASTA, and MSF). Biopython is able to work with FASTA (very simple)
29 | and Stockholm format (richly annotated).
30 |
31 | --------------------------
32 | Loading a single Alignment
33 | --------------------------
34 |
35 | As in ``SeqIO``, under ``AlignIO`` we have both ``AlignIO.parse(...)``
36 | for looping over multiple separate alignments, and ``AlignIO.read(...)``
37 | for loading a file containing a single alignment.
38 |
39 | Most of the time you will be working with alignment files which contain
40 | a single alignment, so normally you will use ``AlignIO.read(..)``.
41 |
42 | Here is an example loading the Pfam seed alignment for the `A2L zinc ribbon
43 | domain (A2L_zn_ribbon; PF08792) `_:
44 |
45 | .. sourcecode:: pycon
46 |
47 | >>> from Bio import AlignIO
48 | >>> alignment = AlignIO.read("PF08792_seed.sth", "stockholm")
49 | >>> print(alignment)
50 | SingleLetterAlphabet() alignment with 14 rows and 37 columns
51 | SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181
52 | DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33
53 | SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33
54 | AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115
55 | SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133
56 | NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180
57 | KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247
58 | LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177
59 | RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33
60 | MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32
61 | MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32
62 | MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32
63 | NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36
64 | DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181
65 |
66 | Printing a Biopython alignment object will give you a display as above
67 | (but truncated for larger alignments).
68 |
69 | In many ways, the alignment acts like a list of ``SeqRecord``
70 | objects (just like you would get from ``SeqIO``). The length
71 | of the alignment is the number of rows for example, and you
72 | can loop over the rows as individual ``SeqRecord`` objects:
73 |
74 | .. sourcecode:: pycon
75 |
76 | >>> print(len(alignment))
77 | 14
79 | >>> for record in alignment:
80 | ...     print(record.id + " has " + str(record.seq.count("-")) + " gaps")
81 | ...
82 | VF282_IIV6/150-181 has 5 gaps
83 | Q9YW27_MSEPV/2-33 has 5 gaps
84 | Q9EMK1_AMEPV/2-33 has 5 gaps
85 | Q6XM16_9PHYC/83-115 has 4 gaps
86 | Q8QNH7_ESV1K/101-133 has 4 gaps
87 | Q4A276_EHV8U/148-180 has 4 gaps
88 | YR429_MIMIV/213-247 has 2 gaps
89 | VF385_ASFB7/145-177 has 4 gaps
90 | Q6VZA9_CNPV/2-33 has 5 gaps
91 | Q6TVP3_ORFSA/1-32 has 5 gaps
92 | VLTF3_VACCC/1-32 has 5 gaps
93 | Q8V3K7_SWPV1/1-32 has 5 gaps
94 | O11357_MCV1/5-36 has 5 gaps
95 | Y494R_PBCV1/148-181 has 3 gaps
96 |
97 | **Exercise**: Write a Python script called ``count_gaps.py`` which
98 | reports the number of records, the total number of gaps, and the
99 | mean (average) number of gaps per record:
100 |
101 | .. sourcecode:: console
102 |
103 | $ python count_gaps.py
104 | PF08792_seed.sth had 14 records,
105 | Total gaps 61, average per record 4.35714285714
106 |
107 | *Tip*: If you get zero as the average, and are using Python 2,
108 | add the following special import line to the start of your Python
109 | file. This will give natural division (as used in Python 3) rather
110 | than integer division (used by default in Python 2):
111 |
112 | .. sourcecode:: python
113 |
114 | from __future__ import division
115 |
116 | =========================================
117 | Writing Multiple-sequence Alignment Files
118 | =========================================
119 |
120 | As you might guess from using ``SeqIO.convert(...)`` and
121 | ``SeqIO.write(...)``, there are matching ``AlignIO.convert(...)``
122 | and ``AlignIO.write(...)`` functions.
123 |
124 | For example, this will convert the Stockholm formatted alignment
125 | into a relaxed PHYLIP format file:
126 |
127 | .. sourcecode:: python
128 |
129 | from Bio import AlignIO
130 | input_filename = "PF08792_seed.sth"
131 | output_filename = "PF08792_seed_converted.phy"
132 | AlignIO.convert(input_filename, "stockholm", output_filename, "phylip-relaxed")
133 |
134 | **Exercise**: Modify this example to convert the Stockholm file
135 | into a FASTA alignment file.
136 |
137 | This ``AlignIO.convert(...)`` example is equivalent to using
138 | ``AlignIO.read(...)`` and ``AlignIO.write(...)`` explicitly:
139 |
140 | .. sourcecode:: python
141 |
142 | from Bio import AlignIO
143 | input_filename = "PF08792_seed.sth"
144 | output_filename = "PF08792_seed_converted.phy"
145 | alignment = AlignIO.read(input_filename, "stockholm")
146 | AlignIO.write(alignment, output_filename, "phylip-relaxed")
147 |
148 | This form is most useful if you wish to modify the alignment in some way,
149 | which we will do next.
150 |
151 | ----------------
152 | Sorting the rows
153 | ----------------
154 |
155 | Downloading from Pfam gives you the option of picking the order
156 | the rows appear in - by default this is according to the *tree*
157 | order (clustering similar sequences together), but it can also
158 | be *alphabetical* (using the identifiers).
159 |
160 | We downloaded the file using the tree order, but here is how you
161 | can sort the rows by identifier within Biopython:
162 |
163 | .. sourcecode:: pycon
164 |
165 | >>> from Bio import AlignIO
166 | >>> alignment = AlignIO.read("PF08792_seed.sth", "stockholm")
167 | >>> alignment.sort()
168 | >>> print(alignment)
169 | SingleLetterAlphabet() alignment with 14 rows and 37 columns
170 | NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36
171 | NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180
172 | MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32
173 | RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33
174 | AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115
175 | SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133
176 | MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32
177 | SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33
178 | DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33
179 | SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181
180 | LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177
181 | MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32
182 | DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181
183 | KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247
184 |
185 | **Exercise**: Write a Python script ``sort_alignment_by_id.py``
186 | which uses ``AlignIO.read(..)`` and ``AlignIO.write(..)``
187 | to convert ``PF08792_seed.sth`` into a sorted FASTA file.
188 |
189 | By default the alignment's sort method uses the identifiers as
190 | the sort key, but much like how sorting a Python list works,
191 | you can override this.
192 |
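As a plain Python illustration of overriding the sort key (no Biopython
needed), the sketch below sorts a few of the identifiers shown above by the
start coordinate embedded after the slash. The helper name
``start_coordinate`` is made up for this example. With a real alignment the
key function would receive a ``SeqRecord``, and could apply the same idea to
``record.id``.

```python
# Hypothetical helper (not part of Biopython): pull the start coordinate
# out of a Pfam-style identifier such as "VF282_IIV6/150-181".
def start_coordinate(identifier):
    coords = identifier.split("/")[1]   # e.g. "150-181"
    return int(coords.split("-")[0])    # e.g. 150

ids = ["VF282_IIV6/150-181", "Q9YW27_MSEPV/2-33", "Q6XM16_9PHYC/83-115"]

# The default sort is alphabetical; key= swaps in our own ordering.
by_name = sorted(ids)
by_start = sorted(ids, key=start_coordinate)
print(by_name)
print(by_start)
```

The same ``key=`` argument is what ``alignment.sort(...)`` accepts, so any
function returning something sortable (a number, a string) can drive the
row order.
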
193 | **Advanced Exercise**: Define your own function taking a single
194 | argument (a ``SeqRecord``) which returns the number of gaps
195 | in the sequence. Use this to sort the alignment and print it
196 | to screen (or save it as a new file):
197 |
198 | .. sourcecode:: python
199 |
200 | from Bio import AlignIO
201 |
202 | def count_gaps(record):
203 |     """Counts number of gaps in record's sequence."""
204 |     return 0 # Fill in code
205 |
206 | filename = "PF08792_seed.sth"
207 | alignment = AlignIO.read(filename, "stockholm")
208 | alignment.sort(key=count_gaps)
209 | print(alignment)
210 |
211 | Expected output:
212 |
213 | .. sourcecode:: console
214 |
215 | $ python sort_gaps.py
216 | SingleLetterAlphabet() alignment with 14 rows and 37 columns
217 | KSQNVCSVPDCDGEKILN--QNDGYMVCKKCGFSEPI YR429_MIMIV/213-247
218 | DQIYTCT---CGGQMELWVNSTQSDLVCNECGATQPY Y494R_PBCV1/148-181
219 | AQDWRCDD--CNATLVYV--KKDAQRVCLECGKSTFF Q6XM16_9PHYC/83-115
220 | SKEWICEV--CNKELVYI--RKDAERVCPDCGLSHPY Q8QNH7_ESV1K/101-133
221 | NDDSKCIK--CGGPVLMQ--AARSLLNCQECGYSAAV Q4A276_EHV8U/148-180
222 | LKYKECKY--CHTDMVFN--TTQFGLQCPNCGCIQEL VF385_ASFB7/145-177
223 | SIPVVCT---CGNDKDFY--KDDDIYICQLCNAETVK VF282_IIV6/150-181
224 | DIIENCKY--CGSFDIE---KVKDIYTCGDCTQTYTT Q9YW27_MSEPV/2-33
225 | SDNIKCKY--CNSFNII---KNKDIYSCCDCSNCYTT Q9EMK1_AMEPV/2-33
226 | RNLKSCSN--CKHNGLI---TEYNHEFCIFCQSVFQL Q6VZA9_CNPV/2-33
227 | MNLRMCGG--CRRNGLV---SDADYEFCLFCETVFPM Q6TVP3_ORFSA/1-32
228 | MNLRLCSG--CRHNGIV---SEQGYEYCIFCESVFQK VLTF3_VACCC/1-32
229 | MNLKMCSG--CSHNGIV---SEHGYEFCIFCESIFQS Q8V3K7_SWPV1/1-32
230 | NALRHCHG--CKHNGLV---LEQGYEFCIFCQAVFQH O11357_MCV1/5-36
231 |
--------------------------------------------------------------------------------
/reading_writing_alignments/count_gaps.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | from Bio import AlignIO
3 |
4 | filename = "PF08792_seed.sth"
5 | alignment = AlignIO.read(filename, "stockholm")
6 | gaps = 0
7 | for record in alignment:
8 |     gaps = gaps + record.seq.count("-")
9 | count = len(alignment) # number of records
10 | print(filename + " had " + str(count) + " records,")
11 | print("Total gaps " + str(gaps) + ", average per record " + str(gaps / count))
12 |
--------------------------------------------------------------------------------
/reading_writing_alignments/sort_gaps.py:
--------------------------------------------------------------------------------
1 | from Bio import AlignIO
2 |
3 | def count_gaps(record):
4 |     """Counts number of gaps in record's sequence."""
5 |     return record.seq.count("-")
6 |
7 | filename = "PF08792_seed.sth"
8 | alignment = AlignIO.read(filename, "stockholm")
9 | alignment.sort(key=count_gaps)
10 | print(alignment)
11 |
--------------------------------------------------------------------------------
/tests/README.rst:
--------------------------------------------------------------------------------
1 | This folder (``tests``) contains a number of Python scripts used to
2 | test all the examples used in this workshop, and is not part of the
3 | material the workshop participants are expected to read.
4 |
--------------------------------------------------------------------------------
/tests/test_consistency.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """Check the sample python scripts match the embedded copies in README.rst files.
3 |
4 | This is a workaround for the fact that (due to security concerns)
5 | neither BitBucket nor GitHub support the reStructuredText include
6 | directive, which would have allowed direct embedding of small
7 | Python scripts into the documentation. See:
8 |
9 | - https://bitbucket.org/site/master/issue/5411/restructuredtext-include-directive
10 | - https://github.com/github/markup/issues/172
11 | """
12 |
13 | from __future__ import print_function
14 | import os
15 | import sys
16 |
17 | filename = os.path.split(__file__)[1]
18 | if os.path.isfile(filename):
19 |     #Already in the tests directory
20 |     base_path = ".."
21 | elif os.path.isfile(os.path.join("tests", filename)):
22 |     #Already in the repository root directory
23 |     base_path = "."
24 | else:
25 |     sys.stderr.write("Should be in base folder or tests folder.\n")
26 |     sys.exit(1)
27 |
28 | def load_and_indent(filename, indent=" "*4):
29 |     """Load a text file as a string, adding the indent to each line."""
30 |     lines = []
31 |     for line in open(filename):
32 |         lines.append(indent + line)
33 |     return "".join(lines)
34 |
35 | good = 0
36 | warn = 0
37 | errors = 0
38 | for dirpath, dirnames, filenames in os.walk(base_path):
39 |     if "README.rst" not in filenames:
40 |         continue
41 |     readme = os.path.join(dirpath, "README.rst")
42 |     if readme.endswith("/tests/README.rst"):
43 |         continue
44 |     print("-" * 40)
45 |     print("Checking %s" % readme)
46 |     #Which script files might this contain?
47 |     scripts = dict()
48 |     for f in filenames:
49 |         if f.endswith(".py"):
50 |             scripts[f] = load_and_indent(os.path.join(dirpath, f))
51 |     if not scripts:
52 |         print("No local script files for this")
53 |         continue
54 |     #Now check the README.rst file contains them...
55 |     print("Using: %s" % ", ".join(sorted(scripts)))
56 |     with open(readme) as handle:
57 |         text = handle.read()
58 |     for filename, script in sorted(scripts.items()):
59 |         filename_used = (("``%s``" % filename) in text) or (("$ python %s" % filename) in text)
60 |         script_embedded = script in text
61 |         if filename_used and script_embedded:
62 |             print(" - %s named and embedded" % filename)
63 |             good += 1
64 |         elif filename_used:
65 |             print(" - %s named but not embedded (warning)" % filename)
66 |             warn += 1
67 |         elif script_embedded:
68 |             print(" - %s not named, but embedded in text (ERROR)" % filename)
69 |             errors += 1
70 |         else:
71 |             print(" - %s neither named nor embedded (ERROR)" % filename)
72 |             errors += 1
73 | print("=" * 40)
74 | print("%i good, %i warnings, %i errors" % (good, warn, errors))
75 | if errors:
76 |     sys.stderr.write("Consistency test failed\n")
77 |     sys.exit(1)
78 |
79 |
80 |
--------------------------------------------------------------------------------
/tests/test_scripts.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | """Check the sample python scripts run.
3 |
4 | Useful to catch any Python 2 vs Python 3 syntax errors.
5 |
6 | TODO: Check output against embedded examples in README.rst?
7 | TODO: Handle any command line switches?
8 | """
9 |
10 | from __future__ import print_function
11 | import os
12 | import sys
13 | import subprocess
14 |
15 | filename = os.path.split(__file__)[1]
16 | if os.path.isfile(filename):
17 |     #Already in the tests directory
18 |     #base_path = ".."
19 |     #Assume sample data files in the repository root directory
20 |     os.chdir("..")
21 | if os.path.isfile(os.path.join("tests", filename)):
22 |     #Already in the repository root directory
23 |     base_path = "."
24 | else:
25 |     sys.stderr.write("Should be in base folder or tests folder.\n")
26 |     sys.exit(1)
27 |
28 |
29 | def abbreviate(text):
30 |     if len(text) <= 1000:
31 |         return text
32 |     lines = text.split("\n")
33 |     if len(lines) > 20:
34 |         lines = lines[:10] + ["..."] + lines[-10:]
35 |         return "\n".join(lines)
36 |     # Few lines but long text: keep the first and last 100 characters
37 |     return text[:100] + "\n...\n" + text[-100:]
38 |
39 |
40 | def check(script):
41 |     """Runs script and will increment good, warn or errors."""
42 |     global good, warn, errors
43 |     #TODO - This assumes 'python' will be aliased as on TravisCI
44 |     child = subprocess.Popen(["python", script],
45 |                              stdout=subprocess.PIPE,
46 |                              stderr=subprocess.PIPE,
47 |                              universal_newlines=True,
48 |                              )
49 |     stdout, stderr = child.communicate()
50 |     if child.returncode:
51 |         errors += 1
52 |         sys.stderr.write("Return code %i from %s\n" % (child.returncode, script))
53 |     elif stderr:
54 |         warn += 1
55 |         sys.stderr.write(stderr)
56 |     else:
57 |         good += 1
58 |     sys.stdout.write(abbreviate(stdout))
59 |
60 |
61 | good = 0
62 | warn = 0
63 | errors = 0
64 | for dirpath, dirnames, filenames in os.walk(base_path):
65 |     if "README.rst" not in filenames:
66 |         continue
67 |     readme = os.path.join(dirpath, "README.rst")
68 |     if readme.endswith("/tests/README.rst"):
69 |         continue
70 |     scripts = [f for f in filenames if f.endswith(".py")]
71 |     if not scripts:
72 |         continue
73 |     print("-" * 40)
74 |     print("Checking %s (%i scripts)" % (dirpath, len(scripts)))
75 |     print("-" * 40)
76 |     for f in scripts:
77 |         script = os.path.join(dirpath, f)
78 |         print("Checking %s" % script)
79 |         check(script)
80 | print("=" * 40)
81 | print("%i good, %i warnings, %i errors" % (good, warn, errors))
82 | if errors:
83 |     sys.stderr.write("Test failed\n")
84 |     sys.exit(1)
85 |
86 |
--------------------------------------------------------------------------------
/using_seqfeatures/README.rst:
--------------------------------------------------------------------------------
1 | ==============================
2 | Working with Sequence Features
3 | ==============================
4 |
5 | This picks up from the end of the section on `reading sequence files
6 | <../reading_sequence_files/README.rst>`_, but looks at the feature
7 | annotation included in some file formats like EMBL or GenBank.
8 |
9 | Most of the time GenBank files contain a single record for a single
10 | chromosome or plasmid, so we'll generally use the ``SeqIO.read(...)``
11 | function. Remember the second argument is the file format, so if we
12 | start from the code to read in a FASTA file:
13 |
14 | .. sourcecode:: pycon
15 |
16 | >>> from Bio import SeqIO
17 | >>> record = SeqIO.read("NC_000913.fna", "fasta")
18 | >>> print(record.id)
19 | gi|556503834|ref|NC_000913.3|
20 | >>> print(len(record))
21 | 4641652
22 | >>> print(len(record.features))
23 | 0
24 |
25 | Now switch the filename and the format:
26 |
27 | .. sourcecode:: pycon
28 |
29 | >>> from Bio import SeqIO
30 | >>> record = SeqIO.read("NC_000913.gbk", "genbank")
31 | >>> print(record.id)
32 | NC_000913.3
33 | >>> print(len(record))
34 | 4641652
35 | >>> print(len(record.features))
36 | 23086
37 |
38 | So what is this new ``.features`` thing? It is a Python list, containing
39 | a Biopython ``SeqFeature`` object for each feature in the GenBank file.
40 | For instance,
41 |
42 | .. sourcecode:: pycon
43 |
44 | >>> my_gene = record.features[3]
45 | >>> print(my_gene)
46 | type: gene
47 | location: [336:2799](+)
48 | qualifiers:
49 | Key: db_xref, Value: ['EcoGene:EG10998', 'GeneID:945803']
50 | Key: gene, Value: ['thrA']
51 | Key: gene_synonym, Value: ['ECK0002; Hs; JW0001; thrA1; thrA2; thrD']
52 | Key: locus_tag, Value: ['b0002']
53 |
54 | Doing a print like this tries to give a human readable display. There
55 | are three key properties, ``.type`` which is a string like ``gene``
56 | or ``CDS``, ``.location`` which describes where on the genome this
57 | feature is, and ``.qualifiers`` which is a Python dictionary full of
58 | all the annotation for the feature (things like gene identifiers).
59 |
60 | This is what this gene looks like in the raw GenBank file::
61 |
62 | gene 337..2799
63 | /gene="thrA"
64 | /locus_tag="b0002"
65 | /gene_synonym="ECK0002; Hs; JW0001; thrA1; thrA2; thrD"
66 | /db_xref="EcoGene:EG10998"
67 | /db_xref="GeneID:945803"
68 |
69 | Hopefully it is fairly clear how this maps to the ``SeqFeature`` structure.
70 | The `Biopython Tutorial & Cookbook `_
71 | (`PDF `_) goes into
72 | more detail about this.
73 |
74 | -----------------
75 | Feature Locations
76 | -----------------
77 |
78 | We're going to focus on using the location information for different feature
79 | types. Continuing with the same example:
80 |
81 | .. sourcecode:: pycon
82 |
83 | >>> from Bio import SeqIO
84 | >>> record = SeqIO.read("NC_000913.gbk", "genbank")
85 | >>> my_gene = record.features[3]
86 | >>> print(my_gene.qualifiers["locus_tag"])
87 | ['b0002']
88 | >>> print(my_gene.location)
89 | [336:2799](+)
90 | >>> print(my_gene.location.start)
91 | 336
92 | >>> print(my_gene.location.end)
93 | 2799
94 | >>> print(my_gene.location.strand)
95 | 1
96 |
97 | Recall in the GenBank file this simple location was ``337..2799``, yet
98 | in Biopython this has become a start value of 336 and 2799 as the end.
99 | The reason for this is to match how Python counting works, in particular
100 | how Python string slicing works. In order to pull out this sequence from the full
101 | genome we need to use slice values of 336 and 2799:
102 |
103 | .. sourcecode:: pycon
104 |
105 | >>> gene_seq = record.seq[336:2799]
106 | >>> len(gene_seq)
107 | 2463
108 | >>> print(gene_seq)
109 | ...
110 |
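The coordinate shift described above can be written out in a couple of lines
of plain Python. This is only a sketch for simple ``start..end`` locations,
and the function name ``genbank_to_slice`` is made up for this example (it is
not a Biopython function):

```python
def genbank_to_slice(start, end):
    """Convert a simple GenBank location start..end (one-based, inclusive)
    into Python slice values (zero-based start, exclusive end)."""
    return start - 1, end

slice_start, slice_end = genbank_to_slice(337, 2799)
print(slice_start, slice_end)    # 336 2799
print(slice_end - slice_start)   # 2463, the length of the gene
```

Only the start value changes. The end value stays the same because Python
slices stop just before the end index.
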
111 | This was a very simple location on the forward strand. If it had been on
112 | the reverse strand you'd need to take the reverse-complement. Also if the
113 | location had been a more complicated compound location like a *join* (used
114 | for eukaryotic genes where the CDS is made up of several exons), then the
115 | location would have sub-parts to consider.
116 |
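To make those complications concrete, here is a rough plain Python picture of
what has to happen for a reverse-strand or multi-part location. It is a
sketch under simplifying assumptions (unambiguous DNA letters only, parts
given in ascending order as Python slice coordinates), and the names
``reverse_complement`` and ``extract_parts`` are made up for this example. It
is not how Biopython itself is implemented.

```python
# Assumption: plain upper-case DNA with no ambiguity codes other than N.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Reverse the string and swap each base for its complement."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def extract_parts(parent, parts, strand):
    """Join the slices of parent given by parts, a list of (start, end)
    pairs in ascending order, then reverse-complement if strand is -1."""
    joined = "".join(parent[start:end] for start, end in parts)
    if strand == -1:
        return reverse_complement(joined)
    return joined

parent = "AAACCCGGGTTT"
print(extract_parts(parent, [(0, 3), (6, 9)], 1))    # AAAGGG
print(extract_parts(parent, [(0, 3), (6, 9)], -1))   # CCCTTT
```
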
117 | All these complications are taken care of for you via the ``.extract(...)``
118 | method which takes the full length parent record's sequence as an argument:
119 |
120 | .. sourcecode:: pycon
121 |
122 | >>> gene_seq = my_gene.extract(record.seq)
123 | >>> len(gene_seq)
124 | 2463
125 | >>> print(gene_seq)
126 | ...
127 |
128 | **Exercise**: Finish the following script by setting an appropriate
129 | feature name like the locus tag or GI number (use the ``.qualifiers``
130 | or ``.dbxrefs`` information) to extract all the coding sequences from
131 | the GenBank file:
132 |
133 | .. sourcecode:: python
134 |
135 | from Bio import SeqIO
136 | record = SeqIO.read("NC_000913.gbk", "genbank")
137 | output_handle = open("NC_000913_cds.fasta", "w")
138 | count = 0
139 | for feature in record.features:
140 |     if feature.type == "CDS":
141 |         count = count + 1
142 |         feature_name = "..." # Use feature.qualifiers or feature.dbxrefs here
143 |         feature_seq = feature.extract(record.seq)
144 |         # Simple FASTA output without line wrapping:
145 |         output_handle.write(">" + feature_name + "\n" + str(feature_seq) + "\n")
146 | output_handle.close()
147 | print(str(count) + " CDS sequences extracted")
148 |
149 | .. sourcecode:: console
150 |
151 | $ python extract_cds.py
152 | 4321 CDS sequences extracted
153 |
154 | Check your sequences using the NCBI provided FASTA file ``NC_000913.ffn``.
155 |
156 | **Advanced exercise**: Can you recreate the NCBI naming scheme as used
157 | in ``NC_000913.ffn``?
158 |
159 | **Advanced exercise**: Using the Biopython documentation, can you create
160 | a new ``SeqRecord`` object and then use ``SeqIO.write(...)``, which will
161 | produce line-wrapped FASTA output?
162 |
163 | ---------------
164 | Feature Lengths
165 | ---------------
166 |
167 | The length of Biopython's ``SeqFeature`` objects (and the location objects)
168 | is defined as the length of the sequence region they describe (i.e. how
169 | many bases are included; or for protein annotation how many amino acids).
170 |
171 | .. sourcecode:: pycon
172 |
173 | >>> len(my_gene)
174 | 2463
175 |
176 | Remember when we checked the length of ``my_gene.extract(record.seq)``
177 | that also gave 2463.
178 |
179 | This example loops over all the features looking for gene records, and
180 | calculates their total length:
181 |
182 | .. sourcecode:: python
183 |
184 | from Bio import SeqIO
185 | record = SeqIO.read("NC_000913.gbk", "genbank")
186 | total = 0
187 | for feature in record.features:
188 |     if feature.type == "gene":
189 |         total = total + len(feature)
190 | print("Total length of all genes is " + str(total))
191 |
192 | .. sourcecode:: console
193 |
194 | $ python total_gene_lengths.py
195 | Total length of genome is 4641652
196 | Total length of all genes is 4137243
197 |
198 | **Exercise**: Give a separate count for each feature type. Use a dictionary
199 | where the keys are the feature type (e.g. "gene" and "CDS") and the values
200 | are the count for that type.
201 |
202 | **Discussion**: What proportion of the genome is annotated as gene coding?
203 | What assumptions does this estimate of 89% make?
204 |
205 | .. sourcecode:: pycon
206 |
207 | >>> 4137243 * 100.0 / 4641652
208 | 89.13298541122859
209 |
210 | **Exercise**: Extend the previous script to also count the number of
211 | features of each type, and report this and the average length of that
212 | feature type. e.g.
213 |
214 | .. sourcecode:: console
215 |
216 | $ python total_feature_lengths.py
217 | Total length of genome is 4641652
218 | misc_feature
219 | - total number: 13686
220 | - total length: 6136082
221 | - average length: 448.347362268
222 | mobile_element
223 | - total number: 49
224 | - total length: 50131
225 | - average length: 1023.08163265
226 | ...
227 |
228 | **Discussion**: What proportion of the genome is annotated with *misc_feature*?
229 | Does this simple calculation give a meaningful answer?
230 |
231 | .. sourcecode:: pycon
232 |
233 | >>> 6136082 * 100.0 / 4641652
234 | 132.19608018869144
235 |
236 | This is an alternative approach, using some more advanced bits of Python like
237 | the set datatype, and the concept of iterating over the bases within a feature:
238 |
239 | .. sourcecode:: pycon
240 |
241 | >>> from Bio import SeqIO
242 | >>> record = SeqIO.read("NC_000913.gbk", "genbank")
243 | >>> bases = set()
244 | >>> for feature in record.features:
245 | ...     if feature.type == "misc_feature":
246 | ...         bases.update(feature.location)
247 | ...
248 | >>> print(len(bases) * 100.0 / len(record))
249 | 80.69355479471533
250 |
251 | **Exercise**: Without worrying too much about how it works, modify this example
252 | to count the number of bases in the *gene* features.
253 |
254 | .. sourcecode:: console
255 |
256 | $ python bases_in_genes.py
257 | 88.9494085295
258 |
259 | **Discussion**: Compare this calculation (88.95%) to one earlier (89.13%).
260 | Which is a better estimate of the proportion of the genome which encodes genes?
261 | When might these methods give very different answers? Any virologists in the group?
262 | How should this be defined given that any single base may be in more than one gene?
263 |
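The difference between the two calculations can be seen on a toy example
without any sequence file. The three features and the genome length below
are invented purely for illustration. Summing the feature lengths counts
overlapping bases more than once, while the set only counts each base once.

```python
# Made-up features as (start, end) pairs in Python slice coordinates,
# on a made-up genome of 200 bases. The features overlap each other.
features = [(0, 60), (40, 100), (50, 90)]
genome_length = 200

# Approach 1: add up the feature lengths (overlaps counted repeatedly)
summed = sum(end - start for start, end in features)
summed_percent = summed * 100.0 / genome_length

# Approach 2: collect the covered positions in a set (no duplicates)
covered = set()
for start, end in features:
    covered.update(range(start, end))
covered_percent = len(covered) * 100.0 / genome_length

print(summed_percent)    # 80.0
print(covered_percent)   # 50.0
```
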
264 | ------------------------
265 | Translating CDS features
266 | ------------------------
267 |
268 | When dealing with GenBank files and trying to get the protein sequence of the
269 | genes, you'll need to look at the CDS features (coding sequences) - not the
270 | gene features (although for simple cases they'll have the same location).
271 |
272 | Sometimes, as in the *E. coli* example, you will find the translation is
273 | provided in the qualifiers:
274 |
275 | >>> from Bio import SeqIO
276 | >>> record = SeqIO.read("NC_000913.gbk", "genbank")
277 | >>> my_cds = record.features[4]
278 | >>> print(my_cds.qualifiers["locus_tag"])
279 | ['b0002']
280 | >>> print(my_cds.qualifiers["translation"])
281 | ['MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNI...KLGV']
282 |
283 | This has been truncated for display here - the whole protein sequence is
284 | present. However, many times the annotation will not include the amino acid
285 | translation - but we can get it by translating the nucleotide sequence.
286 |
287 | >>> cds_seq = my_cds.extract(record.seq)
288 | >>> protein_seq = cds_seq.translate(table=11)
289 | >>> len(protein_seq)
290 | 821
291 | >>> print(protein_seq)
292 | MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNI...KLGV*
293 |
294 | Notice that because this is a bacterium, we used the NCBI translation table 11,
295 | rather than the default (suitable for humans etc).
296 |
297 | **Advanced Exercise**: Using this information, and the CDS extraction script
298 | from earlier, translate all the CDS features into a FASTA file.
299 |
300 | Check your sequences using the NCBI provided FASTA file ``NC_000913.faa``.
301 |
--------------------------------------------------------------------------------
/using_seqfeatures/bases_in_genes.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | record = SeqIO.read("NC_000913.gbk", "genbank")
3 | bases = set() # Python's built in set datatype
4 | for feature in record.features:
5 |     if feature.type == "gene":
6 |         # This adds all the possible base coordinates
7 |         # within the feature location to the set. Try
8 |         # print(list(feature.location)) on a gene...
9 |         bases.update(feature.location)
10 | # The Python set doesn't store duplicates, so len(bases)
11 | # is the number of unique bases in at least one gene.
12 | print(len(bases) * 100.0 / len(record))
13 |
--------------------------------------------------------------------------------
/using_seqfeatures/extract_cds.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | record = SeqIO.read("NC_000913.gbk", "genbank")
3 | output_handle = open("NC_000913_cds.fasta", "w")
4 | count = 0
5 | for feature in record.features:
6 |     if feature.type == "CDS":
7 |         count = count + 1
8 |         feature_name = feature.qualifiers["locus_tag"][0]
9 |         feature_seq = feature.extract(record.seq)
10 |         # Simple FASTA output without line wrapping:
11 |         output_handle.write(">" + feature_name + "\n" + str(feature_seq) + "\n")
12 | output_handle.close()
13 | print(str(count) + " CDS sequences extracted")
14 |
--------------------------------------------------------------------------------
/using_seqfeatures/total_feature_lengths.py:
--------------------------------------------------------------------------------
1 | from __future__ import division
2 | # (needed under Python 2 for sensible division)
3 |
4 | from Bio import SeqIO
5 | record = SeqIO.read("NC_000913.gbk", "genbank")
6 | print("Total length of genome is " + str(len(record)))
7 | totals = dict()
8 | counts = dict()
9 | for feature in record.features:
10 |     if feature.type in totals:
11 |         totals[feature.type] = totals[feature.type] + len(feature)
12 |         counts[feature.type] = counts[feature.type] + 1
13 |     else:
14 |         # First time we see this feature type, so start from its length
15 |         totals[feature.type] = len(feature)
16 |         counts[feature.type] = 1
17 | for f_type in totals:
18 |     print(f_type)
19 |     print(" - total number: " + str(counts[f_type]))
20 |     print(" - total length: " + str(totals[f_type]))
21 |     ave_len = totals[f_type] / counts[f_type]
22 |     print(" - average length: " + str(ave_len))
23 |
--------------------------------------------------------------------------------
/using_seqfeatures/total_gene_lengths.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | record = SeqIO.read("NC_000913.gbk", "genbank")
3 | print("Total length of genome is " + str(len(record)))
4 | total = 0
5 | for feature in record.features:
6 |     if feature.type == "gene":
7 |         total = total + len(feature)
8 | print("Total length of all genes is " + str(total))
9 |
--------------------------------------------------------------------------------
/writing_sequence_files/README.rst:
--------------------------------------------------------------------------------
1 | ===================================
2 | Writing Sequence Files in Biopython
3 | ===================================
4 |
5 | The `previous section <../reading_sequence_files/README.rst>`_ talked
6 | about reading sequence files in Biopython using the ``SeqIO.parse(...)``
7 | function. Now we'll focus on writing sequence files using the sister
8 | function ``SeqIO.write(...)``.
9 |
10 | The more gently paced *Biopython Tutorial and Cookbook*
11 | (also available as a PDF)
12 | first covers creating your own records
13 | (``SeqRecord`` objects) and
14 | then how to write them out. We're going to skip that here, and work
15 | with ready-made ``SeqRecord`` objects loaded with ``SeqIO.parse(...)``.
16 |
17 | Let's start with something really simple...
18 |
19 | --------------------------
20 | Converting a sequence file
21 | --------------------------
22 |
23 | Recall we looked at the *E. coli* K12 chromosome as a FASTA file
24 | ``NC_000913.fna`` and as a GenBank file ``NC_000913.gbk``. Suppose
25 | we only had the GenBank file, and wanted to turn it into a FASTA file?
26 |
27 | Biopython's ``SeqIO`` module can read and write lots of sequence file
28 | formats, and has a handy helper function to convert a file:
29 |
30 | .. sourcecode:: pycon
31 |
32 | >>> from Bio import SeqIO
33 | >>> help(SeqIO.convert)
34 |
35 | Here's a very simple script which uses this function:
36 |
37 | .. sourcecode:: python
38 |
39 | from Bio import SeqIO
40 | input_filename = "NC_000913.gbk"
41 | output_filename = "NC_000913_converted.fasta"
42 | count = SeqIO.convert(input_filename, "gb", output_filename, "fasta")
43 | print(str(count) + " records converted")
44 |
45 | Save this as ``convert_gb_to_fasta.py`` and run it:
46 |
47 | .. sourcecode:: console
48 |
49 | $ python convert_gb_to_fasta.py
50 | 1 records converted
51 |
52 | Notice that the ``SeqIO.convert(...)`` function returns the number of
53 | sequences it converted - here only one. Also have a look at the output file:
54 |
55 | .. sourcecode:: console
56 |
57 | $ head NC_000913_converted.fasta
58 | >NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome.
59 | AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
60 | TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
61 | TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
62 | ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
63 | AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
64 | CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
65 | ACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
66 | AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTG
67 | GCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAA
68 |
69 | **Warning**: The output will over-write any pre-existing file of the same name.
70 |
71 | **Advanced Exercise**: Modify this to add command line parsing to take
72 | the input and output filenames as arguments.
73 |
74 | The ``SeqIO.convert(...)`` function is effectively a shortcut combining
75 | ``SeqIO.parse(...)`` for input and ``SeqIO.write(...)`` for output. Here's how
76 | you'd do this explicitly:
77 |
78 | .. sourcecode:: python
79 |
80 | from Bio import SeqIO
81 | input_filename = "NC_000913.gbk"
82 | output_filename = "NC_000913_converted.fasta"
83 | records_iterator = SeqIO.parse(input_filename, "gb")
84 | count = SeqIO.write(records_iterator, output_filename, "fasta")
85 | print(str(count) + " records converted")
86 |
87 | Previously we'd always used the results from ``SeqIO.parse(...)`` in a for
88 | loop - but here the for loop happens inside the ``SeqIO.write(...)`` function.
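
The idea of a function that runs the for loop for you can be sketched without
Biopython. In this minimal illustration ``parse_records`` and ``write_records``
are hypothetical stand-ins (they are not part of ``SeqIO``), and the "records"
are just lines of text:

```python
import io


def parse_records(text):
    """Hypothetical stand-in for a parser: yields one record at a time."""
    for line in text.splitlines():
        yield line


def write_records(records, handle):
    """Hypothetical stand-in for a writer: loops over whatever it is given."""
    count = 0
    for record in records:
        handle.write(record + "\n")
        count = count + 1
    return count


records_iterator = parse_records("first\nsecond\nthird")
output_handle = io.StringIO()  # in-memory handle, so no file is created
count = write_records(records_iterator, output_handle)
print(str(count) + " records written")
```

The caller never writes a loop; the iterator is consumed inside the writer,
which also returns the number of records it saw.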
89 |
90 | **Exercise**: Check this does the same as the ``SeqIO.convert(...)`` version above.
91 |
92 | The ``SeqIO.write(...)`` function is happy to be given multiple records
93 | like this, or simply as a list of ``SeqRecord`` objects. You can also give
94 | it just one record:
95 |
96 | .. sourcecode:: python
97 |
98 | from Bio import SeqIO
99 | input_filename = "NC_000913.gbk"
100 | output_filename = "NC_000913_converted.fasta"
101 | record = SeqIO.read(input_filename, "gb")
102 | SeqIO.write(record, output_filename, "fasta")
103 |
104 | We'll be doing this in the next example, where we call ``SeqIO.write(...)``
105 | several times in order to build up a multi-record output file.
106 |
107 | -------------------------
108 | Filtering a sequence file
109 | -------------------------
110 |
111 | Suppose we wanted to filter a FASTA file by length, for example
112 | exclude protein sequences less than 100 amino acids long.
113 |
114 | The *Biopython Tutorial and Cookbook*
115 | (also available as a PDF)
116 | has filtering examples combining
117 | ``SeqIO.write(...)`` with more
118 | advanced Python features like generator expressions and so on.
119 | These are all worth learning about later, but in this workshop
120 | we will stick with the simpler for-loop.
121 |
122 | You might try something like this:
123 |
124 | .. sourcecode:: python
125 |
126 |     from Bio import SeqIO
127 |     input_filename = "NC_000913.faa"
128 |     output_filename = "NC_000913_long_only.faa"
129 |     count = 0
130 |     total = 0
131 |     for record in SeqIO.parse(input_filename, "fasta"):
132 |         total = total + 1
133 |         if 100 <= len(record):
134 |             count = count + 1
135 |             SeqIO.write(record, output_filename, "fasta")
136 |     print(str(count) + " records selected out of " + str(total))
137 |
138 | Save this as ``length_filter_naive.py``, and run it, and check it worked.
139 |
140 | .. sourcecode:: console
141 |
142 | $ python length_filter_naive.py
143 | 3719 records selected out of 4141
144 |
145 | **Discussion:** What goes wrong and why? Have a look at the output file...
146 |
147 | .. sourcecode:: console
148 |
149 | $ grep -c "^>" NC_000913_long_only.faa
150 | 1
151 | $ cat NC_000913_long_only.faa
152 | >gi|16132220|ref|NP_418820.1| predicted methyltransferase [Escherichia coli str. K-12 substr. MG1655]
153 | MRITIILVAPARAENIGAAARAMKTMGFSDLRIVDSQAHLEPATRWVAHGSGDIIDNIKV
154 | FPTLAESLHDVDFTVATTARSRAKYHYYATPVELVPLLEEKSSWMSHAALVFGREDSGLT
155 | NEELALADVLTGVPMVADYPSLNLGQAVMVYCYQLATLIQQPAKSDATADQHQLQALRER
156 | AMTLLTTLAVADDIKLVDWLQQRLGLLEQRDTAMLHRLLHDIEKNITK
157 |
158 | The problem is that our output file only contains *one* sequence, actually
159 | the last long sequence in the FASTA file. Why? What happened is each time
160 | round the loop when we called ``SeqIO.write(...)`` to save one record, it
161 | overwrote the existing data.
162 |
163 | The simplest solution is to open and close the file explicitly, using a *file handle*.
164 | The ``SeqIO`` functions are happy to work with either filenames (strings) or
165 | file handles, and this is a case where the more low-level handle is useful.
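
The difference between re-opening a file for every record and writing through
one open handle can be seen with plain Python files. This is only a sketch: the
helper names and the temporary filenames are made up for illustration, and the
first helper simply mimics the overwriting behaviour described above.

```python
import os
import tempfile


def write_each_time(path, lines):
    # Re-opening in "w" mode truncates the file on every pass,
    # so only the final line survives.
    for line in lines:
        with open(path, "w") as handle:
            handle.write(line + "\n")


def write_with_one_handle(path, lines):
    # Opening once keeps everything written through the handle.
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")


def count_lines(path):
    with open(path) as handle:
        return sum(1 for _ in handle)


tmp_dir = tempfile.mkdtemp()
naive_path = os.path.join(tmp_dir, "naive.txt")
handle_path = os.path.join(tmp_dir, "handle.txt")
lines = [">seq1", ">seq2", ">seq3"]

write_each_time(naive_path, lines)
write_with_one_handle(handle_path, lines)
print(count_lines(naive_path))   # 1
print(count_lines(handle_path))  # 3
```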
166 |
167 | Here's a working version of the script, save this as ``length_filter.py``:
168 |
169 | .. sourcecode:: python
170 |
171 |     from Bio import SeqIO
172 |     input_filename = "NC_000913.faa"
173 |     output_filename = "NC_000913_long_only.faa"
174 |     count = 0
175 |     total = 0
176 |     output_handle = open(output_filename, "w")
177 |     for record in SeqIO.parse(input_filename, "fasta"):
178 |         total = total + 1
179 |         if 100 <= len(record):
180 |             count = count + 1
181 |             SeqIO.write(record, output_handle, "fasta")
182 |     output_handle.close()
183 |     print(str(count) + " records selected out of " + str(total))
184 |
185 | This time we get the expected output - and it is much faster (needlessly
186 | creating and replacing several thousand small files is slow):
187 |
188 | .. sourcecode:: console
189 |
190 | $ python length_filter.py
191 | 3719 records selected out of 4141
192 | $ grep -c "^>" NC_000913_long_only.faa
193 | 3719
194 |
195 | Yay!
196 |
197 |
198 | -----------------
199 | Editing sequences
200 | -----------------
201 |
202 | One of the examples in the `previous section <../reading_sequence_files/README.rst>`_
203 | looked at the potato protein sequences, and found that they all had a terminal "*"
204 | character (stop codon). Python strings, Biopython ``Seq`` and ``SeqRecord`` objects
205 | can all be *sliced* to extract a sub-sequence or partial record. In this case,
206 | we want to take everything up to but excluding the final letter:
207 |
208 | .. sourcecode:: pycon
209 |
210 | >>> my_seq = "MTAIVIGAKILGIIYSSPQLRKCNSATQNDHSDLQISFWKDHLRQCTTNS*"
211 | >>> cut_seq = my_seq[:-1] # remove last letter
212 | >>> print(cut_seq)
213 | MTAIVIGAKILGIIYSSPQLRKCNSATQNDHSDLQISFWKDHLRQCTTNS
214 |
215 | Consider the following example (which I'm calling ``cut_star_dangerous.py``):
216 |
217 | .. sourcecode:: python
218 |
219 |     from Bio import SeqIO
220 |     input_filename = "PGSC_DM_v3.4_pep_representative.fasta"
221 |     output_filename = "PGSC_DM_v3.4_pep_rep_no_stars.fasta"
222 |     output_handle = open(output_filename, "w")
223 |     for record in SeqIO.parse(input_filename, "fasta"):
224 |         cut_record = record[:-1] # remove last letter
225 |         SeqIO.write(cut_record, output_handle, "fasta")
226 |     output_handle.close()
227 |
228 | This should work fine on this potato file... but what might go wrong if you
229 | used it on another protein file? What happens if (some of) the input records
230 | don't end with a "*"?
231 |
232 | **Exercise**: Modify this example to only remove the last letter if it is a "*"
233 | (and save the original record unchanged if it does not end with "*"). The sample
234 | solution is called ``cut_final_star.py`` instead.
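
The risk with unconditional trimming is easy to see on plain Python strings.
In this small sketch the two helper functions and the short example sequences
are made up for illustration:

```python
def cut_last_letter(seq):
    # Always drops the final character, whatever it is.
    return seq[:-1]


def cut_final_star(seq):
    # Only drops the final character when it is a "*".
    if seq.endswith("*"):
        return seq[:-1]
    return seq


with_star = "MTAIVIGAK*"
without_star = "MTAIVIGAK"

print(cut_last_letter(with_star))     # MTAIVIGAK
print(cut_last_letter(without_star))  # MTAIVIGA (a real residue is lost)
print(cut_final_star(with_star))      # MTAIVIGAK
print(cut_final_star(without_star))   # MTAIVIGAK
```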
235 |
236 |
237 | ------------------------
238 | Filtering by record name
239 | ------------------------
240 |
241 | A very common task is pulling out particular sequences from a large sequence
242 | file. Membership testing with Python lists (or sets) is one neat way to do
243 | this. Recap:
244 |
245 | .. sourcecode:: pycon
246 |
247 | >>> wanted_ids = ["PGSC0003DMP400019313", "PGSC0003DMP400020381", "PGSC0003DMP400020972"]
248 | >>> "PGSC0003DMP400067339" in wanted_ids
249 | False
250 | >>> "PGSC0003DMP400020972" in wanted_ids
251 | True
252 |
253 | **Exercise**: Guided by the ``length_filter.py`` script, write a new script
254 | starting as follows which writes out the potato proteins on this list:
255 |
256 | .. sourcecode:: python
257 |
258 | from Bio import SeqIO
259 | wanted_ids = ["PGSC0003DMP400019313", "PGSC0003DMP400020381", "PGSC0003DMP400020972"]
260 | input_filename = "PGSC_DM_v3.4_pep_representative.fasta"
261 | output_filename = "wanted_potato_proteins.fasta"
262 | count = 0
263 | total = 0
264 | output_handle = open(output_filename, "w")
265 | # ...
266 | # Your code here
267 | # ...
268 | output_handle.close()
269 | print(str(count) + " records selected out of " + str(total))
270 |
271 | The sample solution is called ``filter_wanted_id.py``, and the output should be:
272 |
273 | .. sourcecode:: console
274 |
275 | $ python filter_wanted_id.py
276 | 3 records selected out of 39031
277 |
278 | **Advanced Exercise**: Modify this to read the list of wanted identifiers from
279 | a plain text input file (one identifier per line).
280 |
281 | **Advanced Exercise**: What is the advantage of using a Python set instead of
282 | a Python list for the wanted identifiers?
283 |
284 | **Discussion**: What happens if a wanted identifier is not in the input file?
285 | What happens if an identifier appears twice? What order is the output file in?
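
For the first advanced exercise above, one possible way to load the identifiers
is sketched here. The function name ``load_wanted_ids`` and the filename
``wanted_ids.txt`` are assumptions made for this example, and the demonstration
writes its own small input file to a temporary folder:

```python
import os
import tempfile


def load_wanted_ids(filename):
    """Return the identifiers in the file (one per line) as a set."""
    wanted = set()
    with open(filename) as handle:
        for line in handle:
            identifier = line.strip()
            if identifier:  # skip blank lines
                wanted.add(identifier)
    return wanted


# Build a small example input file (hypothetical name and location)
tmp_dir = tempfile.mkdtemp()
ids_filename = os.path.join(tmp_dir, "wanted_ids.txt")
with open(ids_filename, "w") as handle:
    handle.write("PGSC0003DMP400019313\n")
    handle.write("PGSC0003DMP400020381\n")
    handle.write("\n")
    handle.write("PGSC0003DMP400020972\n")

wanted_ids = load_wanted_ids(ids_filename)
print(len(wanted_ids))
print("PGSC0003DMP400020972" in wanted_ids)
```

The resulting ``wanted_ids`` can be used with ``in`` exactly like the list in
the script above.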
286 |
287 | ------------------------
288 | Selecting by record name
289 | ------------------------
290 |
291 | In the previous example, we used ``SeqIO.parse(...)`` to loop over the input
292 | FASTA file. This means the output order will be dictated by the input sequence
293 | file's order. What if you want the records in the specified order (regardless
294 | of the order in the FASTA file)?
295 |
296 | In this situation, you can't make a single for loop over the FASTA file. For
297 | a tiny file you could load everything into memory (e.g. as a Python dictionary),
298 | but that won't work on larger files. Instead, we can use Biopython's
299 | ``SeqIO.index(...)`` function which lets us treat a sequence file like a
300 | Python dictionary:
301 |
302 | .. sourcecode:: pycon
303 |
304 | >>> from Bio import SeqIO
305 | >>> filename = "PGSC_DM_v3.4_pep_representative.fasta"
306 | >>> fasta_index = SeqIO.index(filename, "fasta")
307 | >>> print(str(len(fasta_index)) + " records in " + filename)
308 | >>> "PGSC0003DMP400019313" in fasta_index
309 | True
310 | >>> record = fasta_index["PGSC0003DMP400019313"]
311 | >>> print(record)
312 | ID: PGSC0003DMP400019313
313 | Name: PGSC0003DMP400019313
314 | Description: PGSC0003DMP400019313 PGSC0003DMT400028369 Protein
315 | Number of features: 0
316 | Seq('MSKSLYLSLFFLSFVVALFGILPNVKGNILDDICPGSFFPPLCFQMLRNDPSVS...LK*', SingleLetterAlphabet())
317 |
318 | **Exercise**: Write a new version of your ``count_fasta.py`` script using
319 | ``SeqIO.index(...)`` instead of ``SeqIO.parse(...)`` and a for loop.
320 | Which is faster?
321 |
322 | **Exercise**: Complete the following script by using ``SeqIO.index(...)``
323 | to make a FASTA file with records of interest *in the given order*:
324 |
325 | .. sourcecode:: python
326 |
327 |     from Bio import SeqIO
328 |     wanted_ids = ["PGSC0003DMP400019313", "PGSC0003DMP400020381", "PGSC0003DMP400020972"]
329 |     input_filename = "PGSC_DM_v3.4_pep_representative.fasta"
330 |     output_filename = "wanted_potato_proteins_in_order.fasta"
331 |     fasta_index = SeqIO.index(input_filename, "fasta")
332 |     count = 0
333 |     total = # Your code here, get total from fasta_index
334 |     output_handle = open(output_filename, "w")
335 |     for identifier in wanted_ids:
336 |         # ...
337 |         # Your code here, get the record for the identifier, and write it out
338 |         # ...
339 |     output_handle.close()
340 |     print(str(count) + " records selected out of " + str(total))
341 |
342 | I called this script ``filter_wanted_id_in_order.py`` and the output should be:
343 |
344 | .. sourcecode:: console
345 |
346 | $ python filter_wanted_id_in_order.py
347 | 3 records selected out of 39031
348 |
349 |
350 | Now compare the output files from the two approaches:
351 |
352 | .. sourcecode:: console
353 |
354 | $ grep "^>" wanted_potato_proteins.fasta
355 | >PGSC0003DMP400020381 PGSC0003DMT400029984 Protein
356 | >PGSC0003DMP400020972 PGSC0003DMT400030871 Protein
357 | >PGSC0003DMP400019313 PGSC0003DMT400028369 Protein
358 | $ grep "^>" wanted_potato_proteins_in_order.fasta
359 | >PGSC0003DMP400019313 PGSC0003DMT400028369 Protein
360 | >PGSC0003DMP400020381 PGSC0003DMT400029984 Protein
361 | >PGSC0003DMP400020972 PGSC0003DMT400030871 Protein
362 |
363 | The second file has the order specified in the Python list.
364 |
--------------------------------------------------------------------------------
/writing_sequence_files/convert_gb_to_fasta.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | input_filename = "NC_000913.gbk"
3 | output_filename = "NC_000913_converted.fasta"
4 | count = SeqIO.convert(input_filename, "gb", output_filename, "fasta")
5 | print(str(count) + " records converted")
6 |
--------------------------------------------------------------------------------
/writing_sequence_files/cut_final_star.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | input_filename = "PGSC_DM_v3.4_pep_representative.fasta"
3 | output_filename = "PGSC_DM_v3.4_pep_rep_no_stars.fasta"
4 | output_handle = open(output_filename, "w")
5 | for record in SeqIO.parse(input_filename, "fasta"):
6 |     if record.seq.endswith("*"):
7 |         record = record[:-1] # remove last letter (the star)
8 |     SeqIO.write(record, output_handle, "fasta")
9 | output_handle.close()
10 |
--------------------------------------------------------------------------------
/writing_sequence_files/cut_star_dangerous.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | input_filename = "PGSC_DM_v3.4_pep_representative.fasta"
3 | output_filename = "PGSC_DM_v3.4_pep_rep_no_stars.fasta"
4 | output_handle = open(output_filename, "w")
5 | for record in SeqIO.parse(input_filename, "fasta"):
6 |     cut_record = record[:-1] # remove last letter
7 |     SeqIO.write(cut_record, output_handle, "fasta")
8 | output_handle.close()
9 |
--------------------------------------------------------------------------------
/writing_sequence_files/filter_wanted_id.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | wanted_ids = ["PGSC0003DMP400019313", "PGSC0003DMP400020381", "PGSC0003DMP400020972"]
3 | input_filename = "PGSC_DM_v3.4_pep_representative.fasta"
4 | output_filename = "wanted_potato_proteins.fasta"
5 | count = 0
6 | total = 0
7 | output_handle = open(output_filename, "w")
8 | for record in SeqIO.parse(input_filename, "fasta"):
9 |     total = total + 1
10 |     if record.id in wanted_ids:
11 |         count = count + 1
12 |         SeqIO.write(record, output_handle, "fasta")
13 | output_handle.close()
14 | print(str(count) + " records selected out of " + str(total))
15 |
--------------------------------------------------------------------------------
/writing_sequence_files/filter_wanted_id_in_order.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | wanted_ids = ["PGSC0003DMP400019313", "PGSC0003DMP400020381", "PGSC0003DMP400020972"]
3 | input_filename = "PGSC_DM_v3.4_pep_representative.fasta"
4 | output_filename = "wanted_potato_proteins_in_order.fasta"
5 | fasta_index = SeqIO.index(input_filename, "fasta")
6 | count = 0
7 | total = len(fasta_index)
8 | output_handle = open(output_filename, "w")
9 | for identifier in wanted_ids:
10 |     record = fasta_index[identifier]
11 |     SeqIO.write(record, output_handle, "fasta")
12 |     count = count + 1
13 | output_handle.close()
14 | print(str(count) + " records selected out of " + str(total))
15 |
--------------------------------------------------------------------------------
/writing_sequence_files/length_filter.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | input_filename = "NC_000913.faa"
3 | output_filename = "NC_000913_long_only.faa"
4 | count = 0
5 | total = 0
6 | output_handle = open(output_filename, "w")
7 | for record in SeqIO.parse(input_filename, "fasta"):
8 |     total = total + 1
9 |     if 100 <= len(record):
10 |         count = count + 1
11 |         SeqIO.write(record, output_handle, "fasta")
12 | output_handle.close()
13 | print(str(count) + " records selected out of " + str(total))
14 |
--------------------------------------------------------------------------------
/writing_sequence_files/length_filter_naive.py:
--------------------------------------------------------------------------------
1 | from Bio import SeqIO
2 | input_filename = "NC_000913.faa"
3 | output_filename = "NC_000913_long_only.faa"
4 | count = 0
5 | total = 0
6 | for record in SeqIO.parse(input_filename, "fasta"):
7 |     total = total + 1
8 |     if 100 <= len(record):
9 |         count = count + 1
10 |         SeqIO.write(record, output_filename, "fasta")
11 | print(str(count) + " records selected out of " + str(total))
12 |
--------------------------------------------------------------------------------